The answer to the question "Does random forest need normalization?" is no. Random Forest is a tree-based approach, and trees do not rely on a distance metric. In fact, normalization (or any other kind of feature scaling) mainly matters for ML algorithms that compute distances between samples or optimize via gradients. In this article we will see how Random Forest is unaffected by scaling. We will also look at algorithms where scaling is mandatory.
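Before the practical comparison, here is a minimal sketch of the intuition (our own example, using the sklearn iris dataset purely for illustration): multiplying a feature by a constant only rescales the split thresholds inside the tree, so the predictions do not change.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit one tree on the raw features and one on features scaled by 1000.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X * 1000, y)

# The split thresholds differ by a factor of 1000, but the predicted
# classes are identical, so scaling has no effect on the tree.
print((tree_raw.predict(X) == tree_scaled.predict(X * 1000)).all())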
Does random forest need normalization? (Practical Scenario) –
Let's construct two Random Forest models: one without scaling (using the absolute feature values) and one with scaling (normalization). We will then compare their accuracy to validate our hypothesis.
Scenario 1: Random Forest without Feature Scaling (Normalization) –
Firstly, to demonstrate this, let's build a simple random forest on a sklearn dataset. We will keep all parameters and syntax generic, since our main focus is comparing the two hypotheses. Here is the complete code.
from sklearn import metrics
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this needs an older version
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Boston housing data and round the continuous target
# so that it can be treated as class labels for a classifier.
data = load_boston()
X = data.data
y = data.target
y_round = y.round()

rfc = RandomForestClassifier(random_state=20)
X_train, X_test, y_train, y_test = train_test_split(X, y_round, test_size=0.20, random_state=20)

# Train on the raw (unscaled) features and evaluate accuracy.
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print(f'Accuracy with Absolute Value is : {metrics.accuracy_score(y_test, y_pred)}')
Here we used sklearn's Boston housing dataset. We trained the random forest model, and running the above code gives the accuracy below.

Scenario 2: Random Forest with Feature Scaling (Normalization) –
Secondly, in this scenario we will keep everything identical: the parameter values, the dataset, etc. The only addition is a step to scale the data. Then we will train the model and check the accuracy of the random forest again. Please run the code below.
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this needs an older version
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data preparation as in Scenario 1.
data = load_boston()
X = data.data
y = data.target
y_round = y.round()

rfc = RandomForestClassifier(random_state=20)
X_train, X_test, y_train, y_test = train_test_split(X, y_round, test_size=0.20, random_state=20)

# Normalize the features: fit the scaler on the training split only,
# then apply the same transform to the test split.
sc = MinMaxScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)

# Train and evaluate on the normalized features.
rfc.fit(X_train_norm, y_train)
y_pred = rfc.predict(X_test_norm)
print(f'Accuracy After Normalization is : {metrics.accuracy_score(y_test, y_pred)}')
Here you can see that we scaled the data using MinMaxScaler before training the model. Finally, we check the accuracy of this model.
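As an extra sanity check (our own addition, reusing the variables defined in the code above), you can train two identically seeded forests, one on the raw features and one on the normalized ones, and compare their predictions directly. Since min-max scaling is a monotonic transform of each feature, both forests learn equivalent splits.
import numpy as np

# Reuses X_train, X_train_norm, X_test, X_test_norm and y_train from above.
rfc_raw = RandomForestClassifier(random_state=20).fit(X_train, y_train)
rfc_norm = RandomForestClassifier(random_state=20).fit(X_train_norm, y_train)

# Equivalent splits should give matching predictions (barring rare
# floating-point edge cases right at the split thresholds).
print(np.array_equal(rfc_raw.predict(X_test), rfc_norm.predict(X_test_norm)))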

Final Comparison –
When we compare both scenarios, we find that the accuracy is almost the same in both cases. This means feature scaling does not impact the performance of tree models such as Random Forest. Trees split on feature thresholds using criteria like the Gini index and information gain, and these splits are unaffected by monotonic scaling. In contrast, algorithms like k-nearest neighbors, SVMs, and neural networks do require feature scaling, since they rely on distance metrics or gradient-based optimization.
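To see the contrast, here is a short illustration (our own example, using sklearn's wine dataset) with k-nearest neighbors, a distance-based algorithm whose accuracy typically changes noticeably once the features are normalized:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=20)

# Without scaling, large-valued features (e.g. proline) dominate the
# Euclidean distance and drag the accuracy down.
knn = KNeighborsClassifier()
print('KNN without scaling :', knn.fit(X_train, y_train).score(X_test, y_test))

# With min-max scaling, every feature contributes comparably.
sc = MinMaxScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)
print('KNN with scaling    :', knn.fit(X_train_norm, y_train).score(X_test_norm, y_test))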
Thanks
Data Science Learner Team