Data Science Project Life Cycle: A Must for Every Data Scientist

Data science projects follow a definite flow in design and implementation. When we connect the steps in order, they form a life cycle for the project (the Data Science Project Life Cycle). The duration of each step varies from project to project. Let's jump straight into the steps.

Data Science Project Life Cycle Steps –

1. Business Understanding –

Before starting any project, you must understand the business. Domain expertise helps a data scientist understand the data. At this stage the data scientist can use several techniques, such as interviewing clients, talking to other departments, and preparing questionnaires. Sometimes they have to sit with the actual operations team and do some hands-on work as well.

This matters because the effort you invest here largely determines the overall outcome of your project.

2. Extract the Business Problems (Hidden or Known) –

In my recent interviews with chief data scientists at big MNCs, I noticed something unexpected. Nowadays companies are not hiring data scientists just to apply data analytics to well-known functional targets, such as building a recommendation system. That is more of a machine learning engineer's job. Instead, they expect a data scientist to find gaps in their business that even top management does not know about. That is why the job title includes the word "scientist".

So this step in the Data Science Project Life Cycle is one of the most challenging parts, because this is where you decide the outcome of your project.

3. Data Acquisition –

In this step the data scientist needs to acquire the data. Sometimes a small amount of data is sufficient, and sometimes even a huge amount is not enough. Acquisition can be either manual or automated. If you need to extract a huge amount of data, you need an automated process, for example a web crawler.
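As a rough illustration of the automated route, here is a minimal sketch of the link-extraction part of a web crawler, using only the Python standard library. The names `LinkCollector`, `extract_links` and `fetch_links`, and the example.com URLs, are made up for this example. A real crawler would also need a queue of pages to visit, rate limiting, and respect for each site's robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch_links(url, timeout=10):
    """Download one page and return the links on it (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, url)


# Offline demonstration on a hard-coded page (hypothetical URLs).
sample_html = (
    '<p><a href="/data/2019.csv">2019</a> '
    '<a href="https://example.com/about">About</a></p>'
)
found = extract_links(sample_html, "https://example.com/index.html")
print(found)
```

Calling `fetch_links` on each newly found URL, while remembering which pages were already visited, would turn this into a simple crawler.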

4. Data Preparation –

In this stage we do most of the preprocessing, such as cleaning the data and handling missing values. Noise can mislead a machine learning model, and training on misleading data may result in low accuracy. Sometimes these basic cleaning and noise-removal steps do not give good results. In that case, data transformation is performed.

Do not move on from this stage until you are confident about questions such as –

Is the data set balanced for training?
Are you handling missing values properly?
Are outliers handled properly?
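The three questions above can each be turned into a quick numeric check. The sketch below is one possible way to do it with the Python standard library. The toy `rows` data and the function names are invented for illustration, and the 1.5 x IQR rule used for outliers is a common rule of thumb rather than something this article prescribes.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy data set: each row is (feature value, class label).
rows = [
    (12.0, "spam"), (15.5, "ham"), (None, "ham"), (14.2, "ham"),
    (13.1, "ham"), (98.0, "spam"), (16.4, "ham"), (None, "ham"),
    (15.0, "ham"), (14.8, "spam"),
]


def class_balance(rows):
    """Return the share of each label, to spot an unbalanced training set."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def missing_ratio(rows):
    """Return the fraction of rows whose feature value is missing."""
    return sum(1 for value, _ in rows if value is None) / len(rows)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


present = [value for value, _ in rows if value is not None]
print("class balance:", class_balance(rows))
print("missing ratio:", missing_ratio(rows))
print("outliers:", iqr_outliers(present))
```

What counts as "balanced enough" or "too many missing values" depends on the project, so the thresholds are left to you.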

5. Data Modelling –

We cannot ignore any of the stages above, but this one is the core. Here you decide which statistical technique or machine learning model to apply. In machine learning there are usually several algorithms for the same objective. They differ in implementation, accuracy, and other performance metrics, so you need to choose the one that works best for your problem.
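One simple way to choose between candidates is to train each on the same training split and compare them on held-out data. The sketch below does this for two deliberately tiny models written from scratch: a majority-class baseline and a one-feature threshold rule. The synthetic data and all names are assumptions made for this example. In a real project the candidates would be proper algorithms from a machine learning library.

```python
import random


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(1 for x, y in rows if model(x) == y) / len(rows)


def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in rows)
    majority = 1 if ones * 2 >= len(rows) else 0
    return lambda x: majority


def fit_stump(rows):
    """One-feature threshold rule: predict 1 when x is above the best cut."""
    cuts = [float("-inf")] + sorted(x for x, _ in rows)
    best_cut = max(cuts, key=lambda c: accuracy(lambda x: int(x > c), rows))
    return lambda x: int(x > best_cut)


# Hypothetical synthetic data: label is 1 when a noisy copy of x exceeds 5.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, int(x + rng.gauss(0, 1) > 5)))
train, test = data[:150], data[150:]

# Train every candidate on the same split and score it on held-out rows.
candidates = {"majority": fit_majority, "stump": fit_stump}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Accuracy is used here only because it is the simplest metric. The same loop works with any other score you care about.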

6. Evaluation and Interpretation –

Once you are done with modelling and hypothesis building, you need to evaluate the model. To do this we compute statistical performance metrics. Their purpose is to check how well the model fits the given data set overall. For example, in spam detection or email classification we can use average accuracy and log loss. To make the results clearer and easier to understand, we sometimes visualize these metrics.

There are many measures of a model's performance. Most of you have probably heard of precision and recall. These and related measures can be presented in different ways, such as –

Confusion matrix.
Receiver operating characteristic (ROC) curve.
Area under the curve (AUC).
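To make these measures concrete, here is a small sketch that computes them by hand for a binary spam classifier. The labels and predicted probabilities are invented for the example, and the 0.5 decision threshold is an assumption. AUC is computed here as the share of (spam, not-spam) pairs that the model ranks in the right order, which is equivalent to the area under the ROC curve.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn); 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under y_prob."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical spam-filter output: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.3, 0.1, 0.7, 0.8, 0.4]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)
print("confusion (tp, fp, fn, tn):", (tp, fp, fn, tn))  # (3, 1, 1, 3)
print("precision:", precision, "recall:", recall, "accuracy:", accuracy)
print("log loss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc(y_true, y_prob))  # 0.8125
```

On this toy data precision, recall and accuracy all come out to 0.75. That is a coincidence of the numbers chosen and not something to expect in general.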

Note (Do Not Miss It) –

You should never expect good results in a single iteration, so steps 4 to 6 need to be repeated. Keep repeating them until the model is mature enough in terms of accuracy and performance. Start step 7 only once this loop over steps 4 to 6 has ended.

7. Deployment –

Machine learning models usually involve a lot of computation and sometimes big data as well. We can look at deployment in two parts: the first is infrastructure and the second is the underlying integration technology.

Let's start with the first, infrastructure. Data science projects usually consume a lot of resources, by which I mean memory and processing cycles, so we need a fast deployment environment. Here you again have to decide between cloud and dedicated servers. Each has advantages and disadvantages.

The second part is the underlying technology integration. Suppose you are developing a recommendation system for a music website. The website plays songs chosen by the user, and now you need to show song recommendations on the portal as well. Let's assume the whole website backend is written in Java Spring MVC and your recommendation system is in Python. You then need to package your recommendation engine either as a REST web service or as a script that the Java process calls, or in any other way that suits the end-to-end flow.
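As an illustration of the REST option, the sketch below wraps a stand-in recommendation function in a small HTTP service using only the Python standard library. The `/recommend` path, the `song_id` query parameter, the port, and the `SIMILAR_SONGS` lookup table are all assumptions made for this example. A production service would more likely use a web framework and a real trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for a trained recommendation model.
SIMILAR_SONGS = {
    "song-1": ["song-7", "song-3"],
    "song-2": ["song-5"],
}


def recommend(song_id):
    """Return recommended song ids for a song (empty list if unknown)."""
    return SIMILAR_SONGS.get(song_id, [])


def handle_path(path):
    """Map a request path to (HTTP status, JSON-serialisable body)."""
    parsed = urlparse(path)
    if parsed.path != "/recommend":
        return 404, {"error": "not found"}
    song_ids = parse_qs(parsed.query).get("song_id")
    if not song_ids:
        return 400, {"error": "song_id is required"}
    song_id = song_ids[0]
    return 200, {"song_id": song_id, "recommendations": recommend(song_id)}


class RecommendationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_path(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer((host, port), RecommendationHandler).serve_forever()
```

After calling `serve()`, the Java backend could fetch recommendations with a plain HTTP GET to `http://127.0.0.1:8000/recommend?song_id=song-1` and parse the JSON response. Neither side then needs to know what language the other is written in.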

8. Operations/Maintenance –

Do not think the work is over after deployment. Operations and maintenance issues can come up at any time. Let's understand this with an example. Suppose you have deployed your code on a server and it is working fine. Suddenly, due to a hardware failure, the system has to be restarted. In some scenarios a backend service then has to be started manually, if its startup is not automated on every restart. Issues of this kind can arise at any time, so we need to keep monitoring the deployed application.
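The restart scenario can be automated with a small watchdog: check whether the service is healthy, and if it is not, try to start it again. The sketch below shows the idea with a fake in-memory service. `watch` and `FakeService` are hypothetical names for this example. In practice the check might be an HTTP health endpoint, and the restart would usually be delegated to the operating system's service manager.

```python
import time


def watch(check, restart, attempts=3, delay_seconds=0.0):
    """Run check() up to `attempts` times; call restart() after each failure.

    Returns True as soon as a check passes, False if every attempt fails.
    """
    for _ in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # treat an exception as a failed check
        restart()
        time.sleep(delay_seconds)
    return False


class FakeService:
    """Hypothetical backend service that stays down after a reboot until started."""

    def __init__(self):
        self.running = False
        self.restarts = 0

    def is_healthy(self):
        return self.running

    def start(self):
        self.running = True
        self.restarts += 1


service = FakeService()
recovered = watch(service.is_healthy, service.start)
print("recovered:", recovered, "restarts:", service.restarts)
```

Running a check like this on a schedule, and alerting a person when it returns False, covers the case where the service does not come back on its own.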

9. Optimization –

This applies especially to machine learning projects, where we keep training the model with new data, either generated or collected. The project needs to be optimized over many iterations at different time intervals. It is an ongoing process.
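One possible way to decide when another round of training is due is to track how the deployed model performs on recent data and flag it when accuracy drops. The sketch below assumes you eventually learn the true label for each prediction. The class name, window size and accuracy threshold are illustrative choices and not something this life cycle prescribes.

```python
from collections import deque


class RetrainMonitor:
    """Tracks recent prediction outcomes and flags when accuracy drops."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # keeps only the newest results
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.min_accuracy


monitor = RetrainMonitor(window=5, min_accuracy=0.8)
for _ in range(5):
    monitor.record(predicted=1, actual=1)  # five correct predictions
before = monitor.needs_retraining()

for _ in range(2):
    monitor.record(predicted=1, actual=0)  # two recent mistakes
after = monitor.needs_retraining()
print(before, after, monitor.accuracy())
```

Waiting for a full window before deciding keeps a single early mistake from triggering a retrain.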

Conclusion –

Following a systematic workflow or life cycle is always best practice, whether in data science or in other types of projects, because it keeps you from missing steps. In practice, people tend to start from step 3. We usually focus on the technical side rather than domain knowledge, which is a major cause of data science project failure. So please spend enough time understanding the business environment before jumping to the technical steps.

Thanks

Data Science Learner Team