One of the questions I always get from students and trainees who are beginners in data science is: “Where do I start my data science project?” As we know, there is no data science without data, and many think that gathering data is the very first step. But gathering data is a very challenging task if you are not clear about your data science objective, or if you do not know exactly what problem you are going to solve. Thus, I suggest to students that “problem definition” is the very first step. Once you are clear about your objectives, you will have an overview of the types of data that may be needed to achieve that objective, and gathering the data and finding the relevant sources for it become much easier.
Real-life data is always raw, which means it can look very random and messy, coming from different sources in different formats, sizes and features. No information can be extracted from such a random heap of digital mess. It is a daunting task for data scientists and data engineers to convert this mess into something a machine can understand before applying learning algorithms. And applying the learning algorithms alone is not enough if the results are not properly deployed and presented in the relevant places. Hence, this whole approach of data science after data collection can be divided into three phases:
As shown in the figure below, the pre-processing phase is the phase of sensing, feeling, observing, understanding, visualizing and, if necessary, modifying the data. What does this mean for digital data? Sensing and feeling mean that the core essence of the data should be preserved while the inconsistencies and noise observed in the data are removed. Understanding means that all the features in the data should be clear and sensible. One should be able to visualize specific statistical and/or non-statistical properties and modify the results and data if needed. Data pre-processing steps can be roughly categorized as:
The data pre-processing phase gives us “processed data”. Then comes the processing phase, where we train our system with a machine learning model. Before that, we divide the available processed dataset into training data, validation data and test data. We then select the machine learning algorithms to train our model. In the training phase, we use only the training dataset.
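As a minimal sketch of this split, assuming a tabular dataset loaded with pandas (the file name, column name and 60/20/20 ratio below are only illustrative), two calls to scikit-learn's train_test_split are enough:

```python
# A minimal sketch of a train/validation/test split with scikit-learn.
# "processed_data.csv", the "target" column and the 60/20/20 ratio are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("processed_data.csv")
X = df.drop(columns=["target"])   # feature columns
y = df["target"]                  # label column (assumed name)

# First split off the test set, then carve a validation set out of the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# Resulting proportions: 60% train, 20% validation, 20% test.
```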
Different learning algorithms are suitable for different types of data. Hence, care should be taken to choose the algorithms best suited to your data. There are many approaches to machine learning, and the most popular among them are supervised learning and unsupervised learning.
If we have a continuous labeled dataset, such as when forecasting the price of a house or predicting the buy rate of logistic goods, we use regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, LASSO, Elastic Net, Least Angle Regression (LARS), Orthogonal Matching Pursuit (OMP), Bayesian Ridge Regression, Automatic Relevance Determination Regression, and some robust estimators such as Random Sample Consensus (RANSAC), the Theil-Sen estimator and Huber Regression.
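For instance, one of these regressors can be fitted in a few lines with scikit-learn. The sketch below uses Ridge Regression on synthetic data that merely stands in for something like house prices; the feature names and alpha value are assumptions for illustration only:

```python
# A small sketch of fitting one of the regression models listed above (Ridge).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))      # three synthetic features, e.g. area, rooms, age
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Ridge(alpha=1.0)           # alpha controls the regularization strength
model.fit(X[:150], y[:150])        # train on the first 150 samples
preds = model.predict(X[150:])     # predict on the held-out 50
print("MSE:", mean_squared_error(y[150:], preds))
```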
On the other hand, if we have discrete labels, such as separating cats from dogs or recognizing handwritten digits and handwritten letters, we use classification algorithms such as Logistic Regression, Decision Trees, Random Forests, Support Vector classifiers, K Nearest Neighbors, Neural Networks etc.
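A classifier from this family is trained the same way. As a sketch, the small digits dataset bundled with scikit-learn plays the role of the handwritten-digit example; the choice of Random Forest and its settings are only illustrative:

```python
# A minimal sketch of training a classifier on the handwritten-digit example.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```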
We should know how well the trained model works. We measure the performance of our model on the validation dataset using different metrics such as accuracy, the confusion matrix, the Area Under the Curve (AUC) etc. Sometimes there are internal parameters which need to be tuned in order to improve the performance of the model; this is called hyperparameter tuning, and the validation dataset is also used for that purpose. We use cross-validation techniques in order to fully utilize all the available data for both the training and validation sets.
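As a sketch of hyperparameter tuning combined with cross-validation, scikit-learn's GridSearchCV tries each candidate parameter combination with k-fold cross-validation and reports the best one. The parameter grid, the SVC model and the 5-fold setting below are illustrative assumptions:

```python
# A sketch of hyperparameter tuning with 5-fold cross-validation (GridSearchCV).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}   # example search space
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)   # across the folds, every sample is used for both training and validation

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```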
While choosing a model, one should always keep in mind the following two theorems:
Hence, selecting a machine learning model requires trying out different models on the given dataset and picking the simplest one that performs best.
Once everything (hyperparameters and other parameters) is set for the chosen model, it can be deployed on the given system, which is the post-processing phase.
The post-processing phase involves deploying our trained machine learning model. There are many approaches to deploying a model. One can use relevant APIs or other platforms to present the results and deploy the model. Deploying the results in a real-time environment can be challenging because of the huge amount of data that must be processed very fast. In that case, we can use approaches and architectures that run all three phases simultaneously; such distributed computing approaches are used for big data analysis.
If you are a Python freak and want to try Django for deployment, it is a very cool tool but might require a lot of effort at the beginning. Amazon Web Services also does a very good job at the moment.
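To give a flavour of what deployment looks like, here is a very small sketch of serving a trained model behind an HTTP endpoint. It uses Flask rather than Django purely because it fits in a few lines; the file name "model.joblib" and the request format are assumptions, and a Django or AWS setup would follow the same idea:

```python
# A minimal sketch of exposing a trained model as a prediction API.
# Flask is used here only for brevity; "model.joblib" is an assumed file name.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")   # the model trained in the processing phase

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```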
Below, I have provided lists of some of the tools, environments, databases and deployment methods used in data science.
As a data scientist, be very clear on each of these three phases. It is not necessary for a single person to be perfect in all three phases, but one has to clearly understand all of them. The rest are all tools and techniques that you have to know. I compare each of these phases to our body parts. For example, the pre-processing phase is like our sense organs, which visualize, understand, sense and, if needed, modify the data; the main phase is the processing phase, which is like our brain; and the post-processing phase is like our mouth or hands, which actually show what we are doing. It is the way of expressing our actions and the results obtained from the previous two phases. Each phase needs to be in harmony with the others.