Decision trees are one of the most commonly used techniques among business analysts. They help with prediction and classification, and they are also an effective tool for understanding how different variables behave. A decision tree is a supervised learning algorithm that is mostly used in classification problems, but it works for both categorical and continuous input and output variables. In this article, we consider a regression problem: predicting the price of a house.
Terminology related to decision trees.
Root Node: It represents the entire population or sample, which is then divided into two or more homogeneous sets.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: A node that does not split further is called a leaf or terminal node.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are the children of the parent node.
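To make these terms concrete, the sketch below fits a very small regression tree and reads the node roles back from scikit-learn's fitted `tree_` attribute. The six data points are made up purely for illustration and are not taken from the housing dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up dataset: one feature and a continuous target (illustrative only).
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)
t = toy_tree.tree_

# scikit-learn marks a leaf by setting its left child id to -1.
is_leaf = t.children_left == -1
leaf_ids = np.where(is_leaf)[0].tolist()
decision_ids = np.where(~is_leaf)[0].tolist()

# Node 0 is always the root and covers the whole training sample.
print("root covers", t.n_node_samples[0], "samples")
print("decision nodes (root included):", decision_ids)
print("leaf / terminal nodes:", leaf_ids)
# The children of the root are the parents of the next level of nodes.
print("children of root:", t.children_left[0], t.children_right[0])
```

Every id listed under decision nodes is a parent, and the ids stored in `children_left` and `children_right` are its child nodes.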
Regression trees are used when the target variable is continuous. In our example, we want to predict the sale price of a house. In a regression tree, the value of a terminal node is the mean of the training observations that fall in that region. Therefore, if an unseen data point falls in that region, we predict that mean value for it.
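The leaf-mean rule described above can be checked directly. This is a minimal sketch on made-up numbers (areas and prices in arbitrary units, not from the real dataset): it finds the leaf a new point lands in and compares the tree's prediction with the mean of the training targets in that same leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up areas and prices, for illustration only.
area = np.array([[600.0], [650.0], [700.0], [1500.0], [1600.0], [1700.0]])
price = np.array([30.0, 32.0, 34.0, 80.0, 84.0, 88.0])

# A depth-1 tree has exactly one split and two leaves.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(area, price)

# apply() returns the id of the leaf each sample ends up in.
train_leaves = stump.apply(area)
new_house = np.array([[680.0]])
new_leaf = stump.apply(new_house)[0]

# Mean of the training prices that share the new point's leaf.
leaf_mean = price[train_leaves == new_leaf].mean()
prediction = stump.predict(new_house)[0]
print(leaf_mean, prediction)
```

The two printed numbers agree, because the prediction for a regression tree is simply the stored mean of the leaf.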
Now let's start implementing the decision tree in a Jupyter notebook.
First, import all the necessary libraries.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
Now load the dataset into the Jupyter notebook.
df = pd.read_csv('chennai_houseprediction')
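Next, perform data exploration on the dataset. After exploring the data, separate the dependent variable (the target to predict) from the independent variables.
# 'Survived' is the column name used here; replace it with the sale price column of your dataset
y = df['Survived']
X = df.drop(['Survived'], axis=1)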
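The exploration step mentioned above is not shown in the article, so here is one possible sketch of it. The small DataFrame and its column names (`AREA_SQFT`, `N_BEDROOM`, `SALES_PRICE`) are hypothetical stand-ins, not the real columns of the dataset, and filling gaps with the median is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; column names are assumptions.
df_demo = pd.DataFrame({
    "AREA_SQFT": [1000, 1200, 850, 1500, None],
    "N_BEDROOM": [2, 3, 1, 3, 2],
    "SALES_PRICE": [50.0, 62.0, 40.0, 80.0, 55.0],
})

# Basic exploration: size, column types, summary statistics, missing values.
print(df_demo.shape)
print(df_demo.dtypes)
print(df_demo.describe())
missing = df_demo.isnull().sum()
print(missing)

# One simple way to handle gaps: fill numeric columns with their median.
df_clean = df_demo.fillna(df_demo.median(numeric_only=True))

# Separate the target from the features once the data is clean.
y_demo = df_clean["SALES_PRICE"]
X_demo = df_clean.drop(columns=["SALES_PRICE"])
```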
Now divide the dataset into two parts: a training set and a validation set.
#importing train_test_split to create validation set
from sklearn.model_selection import train_test_split
Note that a test size of 0.25 means that 25% of the data is held out for validation. Stratified splitting is meant for class labels, so it is not used here with a continuous target.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=101, test_size=0.25)
We are going to use X_train and y_train obtained above to train our decision tree regression model. We create an instance of the model and fit it to the training data.
dt_model = DecisionTreeRegressor(random_state=10)
dt_model.fit(X_train, y_train)
Now that the model is trained, the next step is to make predictions on the validation set.
y_pred = dt_model.predict(X_valid)
Finally, we need to check how well our model performs on the validation data. For this, we evaluate the model by computing its root mean squared error (RMSE).
mse = mean_squared_error(y_valid, y_pred)
rmse=np.sqrt(mse)
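As a quick sanity check on what RMSE measures, the sketch below computes it by hand on three made-up values and compares the result with the `mean_squared_error` route used above. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices, for illustration only.
y_true_demo = np.array([50.0, 60.0, 70.0])
y_hat_demo = np.array([52.0, 57.0, 70.0])

# Errors are -2, 3 and 0; their squares are 4, 9 and 0; the mean is 13/3.
rmse_manual = np.sqrt(np.mean((y_true_demo - y_hat_demo) ** 2))

# Same quantity via scikit-learn's MSE followed by a square root.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_hat_demo))
print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same units as the target, it can be read as a typical size of the prediction error in price terms.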
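We can check the RMSE for different values of tree depth with the following code.
train_rmse = []
validation_rmse = []
for depth in range(1, 10):
    # Fit a regression tree limited to the current depth
    dt_model = DecisionTreeRegressor(max_depth=depth, random_state=10)
    dt_model.fit(X_train, y_train)
    # score() returns R^2 for a regressor, so compute the RMSE explicitly
    train_rmse.append(np.sqrt(mean_squared_error(y_train, dt_model.predict(X_train))))
    validation_rmse.append(np.sqrt(mean_squared_error(y_valid, dt_model.predict(X_valid))))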
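Once the errors for each depth are collected, a natural next step is to pick the depth with the lowest validation error. The sketch below does this on synthetic data from `make_regression`, which is an assumption made only to keep the example self-contained; it is not the housing dataset, and the chosen depth will differ on real data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real dataset.
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=101)

depths = list(range(1, 10))
train_scores, valid_scores = [], []
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=10).fit(X_tr, y_tr)
    train_scores.append(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))))
    valid_scores.append(np.sqrt(mean_squared_error(y_va, model.predict(X_va))))

# The depth with the smallest validation RMSE is the candidate to keep.
best_depth = depths[int(np.argmin(valid_scores))]
print("best depth:", best_depth)
```

Training error keeps falling as the tree gets deeper, while validation error usually stops improving at some point; a growing gap between the two is a sign of overfitting.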
Dataset for the above problem