
How to Improve the Accuracy of a Random Forest? Tune the Classifier in 7 Steps

Random Forest is one of the most popular algorithms built on top of decision trees. You can think of it as a collection of independent decision trees: each tree produces a prediction, and the forest aggregates those predictions into a final score. But did you know you can improve the accuracy of the model by tuning the parameters of the Random Forest? Yes, rather than depending entirely on adding new data to improve accuracy, you can tune the hyperparameters. In this "how to" tutorial, you will learn how to improve the accuracy of a Random Forest classifier.

How Does Random Forest Work?

In a Random Forest, the algorithm selects random subsets of the training dataset and builds a decision tree on each subset. It then aggregates the predictions of all the trees to determine the output for a test object: the Random Forest classifier takes a majority vote across the trees, while the Random Forest regressor averages their predicted values.
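To make that concrete, here is a minimal sketch of the bootstrap-and-vote idea on toy data (a real Random Forest also randomizes the features considered at each split, which this sketch omits):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data purely for illustration (not the tutorial's dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample rows with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Each tree votes; the majority vote is the forest's class prediction
votes = np.stack([tree.predict(X[:5]) for tree in trees])
print((votes.mean(axis=0) >= 0.5).astype(int))  # predicted classes for 5 samples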

When to Use Random Forest?

There are many machine learning algorithms, and choosing the best one requires some knowledge. Here are the things you should remember before using the Random Forest algorithm:

1. Random Forest works very well on both categorical variables (Random Forest Classifier) and continuous variables (Random Forest Regressor).

2. Use it to build a quick benchmark model, as it is fast to train.

3. It is very useful if your dataset has many outliers, missing values, or skewed data.

In the background, a Random Forest consists of hundreds of trees. Because of this, it takes more time to predict, so you should not use it for real-time predictions.
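You can get a rough feel for this cost with a small timing sketch on toy data (the exact numbers depend on your machine):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for n in (10, 100, 500):
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    start = time.perf_counter()
    model.predict(X)  # prediction time grows with the number of trees
    print(f"{n} trees: {time.perf_counter() - start:.4f} s to predict")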

Hyperparameter Tuning of Random Forest

Step 1: Import the necessary libraries.

import numpy as np   # numerical computing
import pandas as pd  # data loading and manipulation
import sklearn       # scikit-learn machine learning library

Step 2: Import the dataset.

train_features = pd.read_csv("train_features.csv")
train_label = pd.read_csv("train_label.csv")

You can download the dataset here. It is the same dataset used in the Support Vector Machine tuning tutorial.
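Assuming the two CSV files loaded above, a quick sanity check on their shapes looks like this:

# Quick sanity check on the loaded data
print(train_features.shape)  # (n_samples, n_features)
print(train_label.shape)     # (n_samples, 1) -- a single label column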

Step 3: Import the Random Forest algorithm from scikit-learn.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Printing an estimator displays its default hyperparameters
print(RandomForestClassifier())
print(RandomForestRegressor())
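If you prefer the defaults as a dictionary, scikit-learn estimators also expose a get_params() method:

# get_params() returns the same defaults as a dictionary
for name, value in sorted(RandomForestClassifier().get_params().items()):
    print(f"{name} = {value}")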


Step 4: Choose the parameters to be tuned.

On running step 3, you will see a lot of parameters for both the Random Forest Classifier and Regressor. I am choosing the important ones: the number of estimators/trees (n_estimators) and the maximum depth of each tree (max_depth).
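Before running the full grid search, you can sanity-check a single combination with cross_val_score (a sketch using the data loaded in step 2; the values 100 and 8 are arbitrary picks from the grid below):

from sklearn.model_selection import cross_val_score

# Score one candidate combination with 5-fold cross-validation;
# ravel() flattens the single label column to a 1-D array (see step 6)
model = RandomForestClassifier(n_estimators=100, max_depth=8)
scores = cross_val_score(model, train_features, train_label.values.ravel(), cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")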

Step 5: Call the classifier constructor and build a dictionary of all the parameter values.

You will make a dictionary of all the parameter values you chose in step 4, as in this example.

rfc = RandomForestClassifier()
parameters = {
    "n_estimators": [5, 10, 50, 100, 250],
    "max_depth": [2, 4, 8, 16, 32, None]
}

Step 6: Use GridSearchCV for cross-validated model selection.

You will pass the classifier, the parameter dictionary, and the number of cross-validation folds to the GridSearchCV method. In this example, I am using 5-fold cross-validation (cv=5). Then you will fit the GridSearchCV object to the training features and training labels.

Please note that you have to convert the label values into a one-dimensional array; that is why we use the ravel() method.

from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(rfc, parameters, cv=5)
cv.fit(train_features, train_label.values.ravel())
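Note that this grid trains 5 × 6 = 30 parameter combinations, each fitted 5 times, i.e. 150 forests in total, so the search can take a while. One common speed-up is GridSearchCV's n_jobs parameter, which parallelizes the fits across CPU cores:

# Same search, parallelized across all CPU cores
cv = GridSearchCV(rfc, parameters, cv=5, n_jobs=-1)
cv.fit(train_features, train_label.values.ravel())
print(cv.best_score_)  # mean cross-validated score of the best combination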


Step 7: Print the best parameters.

GridSearchCV provides this feature: you can use cv.best_params_ to see the best parameters. However, it does not print what the search did internally. That is why we define a method that prints the score of every parameter combination the search tried.

def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    # Print the mean and standard deviation of the test score
    # for every parameter combination tried by the search
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean, 3)} +/- {round(std, 3)} for the {param}')

display(cv)


It will print the results of every combination evaluated by the search, as defined in the function above, and you can clearly see the best score and parameters. In this example, the best parameters are:

{'max_depth': 8, 'n_estimators': 250}

Use them in your Random Forest classifier for the best score.
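For example, you can refit a final model with these parameters (GridSearchCV also refits the best model on the full training set by default, available as cv.best_estimator_):

# Refit the final model with the best parameters found above
best_rfc = RandomForestClassifier(max_depth=8, n_estimators=250)
best_rfc.fit(train_features, train_label.values.ravel())

# Equivalently, reuse the model GridSearchCV already refitted:
# best_rfc = cv.best_estimator_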

Conclusion

Parameter tuning is one of the best ways to improve the accuracy of a model. There are also other ways, like adding more data, but they come with extra cost and time. Therefore, I recommend that you first try parameter tuning if you have sufficient data, and only then move on to adding more data.

That’s all for now. If you want to get featured on the Data Science Learner page, then contact us to learn the requirements. If you have any query, then message us. You can also message us on our official Facebook Page.

 