Okay, you have decided to build your own machine learning model. You are using Sklearn, one of the most popular machine learning libraries for modeling. But wait: do you know the common machine learning modeling challenges faced by every data scientist? If not, you have come to the right place. Here you will learn about each modeling challenge you may face while building a model.
Machine Learning Modeling Challenges
Imbalance of the Target Categories
When your dataset has a categorical target, it is easy to overlook class imbalance during the data preprocessing phase, and that imbalance leads to misleading model scores. For example, say you have 1000 binary values of the categorical target variable. If 60% of them are 0 and 40% are 1, the classes are imbalanced. You should aim for roughly 50% of the 0 values and 50% of the 1 values. Keep in mind that this balancing applies only to the training dataset, not the test dataset: the test data gives realistic predictions for the model trained on the balanced dataset.
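The balancing step above can be sketched with `sklearn.utils.resample`. The data here is synthetic and purely illustrative: 600 zeros and 400 ones, with the minority class upsampled to match.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced binary target: 600 zeros, 400 ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 600 + [1] * 400)

# Split by class, then upsample the minority class (1s)
# until it matches the majority class in size.
X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))  # both classes now have 600 samples
```

Remember to do this only on the training split, after the train/test split, so the test set keeps the real class distribution.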
Interactions between the Variables
This is another common challenge data scientists run into. In an interaction, the effect of one variable on the target depends on the value of a third variable. You can think of it in terms of a predictive model: if you want to predict a binary output, the prediction can be affected when one input is influenced by another input. For example, the salary of a person depends on experience and education level. Here the inputs are experience and education level and the output is salary. But a third attribute, such as the person's sex, can change how those inputs affect the output. You can use techniques like trees or discriminant analysis to uncover interactions. I generally prefer decision trees for finding the interactions between variables.
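As a sketch of the tree-based approach, you can fit a decision tree and check which inputs it actually splits on. The salary data below is made up for illustration, with a built-in experience-by-education interaction:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: salary depends on experience, education level,
# and their product (an interaction term).
rng = np.random.default_rng(0)
experience = rng.uniform(0, 20, 500)
education = rng.integers(1, 5, 500).astype(float)
salary = 30_000 + 2_000 * experience * education + rng.normal(0, 5_000, 500)

X = np.column_stack([experience, education])
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, salary)

# Features the tree splits on are candidates for interactions;
# an importance near zero suggests the variable goes unused.
print(dict(zip(["experience", "education"], tree.feature_importances_)))
```

Inspecting the fitted tree's split structure (e.g. with `sklearn.tree.plot_tree`) then shows whether splits on one variable appear under splits on the other, which is the tree-based signature of an interaction.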
Missing Values in the Dataset
Missing values in the dataset always affect your model's accuracy. Many statistical tools silently drop rows containing NaN (listwise deletion), while most scikit-learn estimators simply raise an error on them. You can safely remove rows with missing values if they make up less than about 5% of the dataset; if there are more, the best method is imputation. Imputation is a simple method that replaces the missing values with the mean, median, or mode. You can use sklearn's imputation tools for that. Sometimes imputation is more complex, for instance when there are many missing values or many features with missing values. In fact, you cannot always rely on imputation. In that case, a practical solution is to use decision trees on the incomplete data and other algorithms on the complete dataset.
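A minimal sketch of mean imputation with sklearn's `SimpleImputer`, on a toy matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small illustrative matrix with missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
# (the NaNs themselves are ignored when computing the mean).
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
# column means: (1 + 7) / 2 = 4.0 and (2 + 3) / 2 = 2.5
```

Swapping `strategy="mean"` for `"median"` or `"most_frequent"` gives the other two imputation variants mentioned above.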
High Bias (Underfitting) and High Variance (Overfitting)
If your machine learning model has high bias or high variance, its prediction score will be poor. With high bias, the model is not flexible enough to capture the signal in the data. For example, Linear Discriminant Analysis can only fit linear relationships.
With high variance, the model is sensitive to noise: it fits the noise in the training data, which leads to a performance gap between the training and test scores. Therefore, for a better prediction model, you have to balance bias and variance. The solutions are below.
If you have high bias, do the following things to increase accuracy:
- Add more inputs
- Tackle the Interactions and Curvilinearity
- Change the settings of the model.

In the case of high variance, remove weak and redundant inputs, and use a simpler algorithm.
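The bias/variance gap described above can be seen by comparing training and test scores of a deliberately shallow versus a deliberately deep model. The data and depths here are synthetic and illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical nonlinear data to illustrate the train/test gap.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 20):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # depth=1 underfits (both scores low: high bias); depth=20 overfits
    # (training score near 1, test score noticeably lower: high variance).
    print(depth, model.score(X_tr, y_tr), model.score(X_te, y_te))
```

A large gap between the training and test scores points to high variance; two similarly low scores point to high bias.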
You can read more about this in The Signal and the Noise.
Variable Reduction
This is one of the major challenges faced by data scientists. You have many variables in a dataset and want to reduce them for a better model. To do so, you identify the poor and redundant predictors and remove them. How can you identify them?
1. You can do a bivariate analysis to find the relationship between the target and each input variable.
2. Use one machine learning model to identify the relevant input variables. For example, if I want to build a neural network, I first build a decision tree model and identify the variables that the tree actually uses. Then I use only those variables as inputs to the neural network.
3. Identify redundancy in the dataset. You can use the correlation matrix of the variables, or sklearn's Factor Analysis and Principal Component Analysis.
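Step 2 above can be sketched with sklearn's `SelectFromModel`, here using a random forest instead of a single tree to rank the inputs. The data is synthetic and the `threshold="median"` choice is just an illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, of which only a few are informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, n_redundant=2,
                           random_state=0)

# Let a tree ensemble rank the inputs, then keep only the variables
# with importance at or above the median importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(forest, prefit=True, threshold="median")
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

The reduced matrix `X_reduced` can then be fed to a model such as a neural network, as described in step 2.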
Solving machine learning modeling challenges is an important part of the data preprocessing steps. You should not jump directly to the model creation phase without understanding and analyzing the dataset. The challenges described above always come up when you build a learning model, so keep in mind how to solve them to build a successful model.