The main aim of any machine learning model is to perform well on unseen data. It is important that the model performs well on the training data as well as on testing, or previously unseen, data.
A difference in performance between training and testing data is caused by overfitting or underfitting. Let us understand these with an example. In an algebra class, there are three students, A, B, and C, and a teacher, D. D teaches different methods and expressions to the students. Student A wants to top the class, so he memorizes the material. Student B has no interest in studying; he prefers designing. Student C is interested in learning new concepts and focuses on understanding them.
After one week, the teacher conducts a test based on the class questions. Student A scores 97%, student B scores 40% by guessing, and student C scores 92%. The teacher sees that students A and C are performing well, and now wants to know whether they have really learned the concepts. D conducts a second test on the same concepts, but with questions slightly different from the classwork. In this test, student A scores 65%, student C scores 90%, and student B scores 38%.
A is a case of overfitting: he performed well on the classwork but not on the unseen questions.
B is a case of underfitting: he did not perform well on either the classwork or the unseen questions.
C is a case of a good fit: he performed well on the classwork as well as on the unseen questions.
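The same reasoning can be written as a small rule of thumb. The sketch below is only an illustration: the function name `diagnose_fit` and the thresholds `good_score` and `max_gap` are assumptions chosen so that the three students fall into the categories above. They are not standard values.

```python
def diagnose_fit(train_score, test_score, good_score=0.8, max_gap=0.1):
    """Label a (train, test) score pair as underfitting, overfitting, or good fit.

    good_score and max_gap are illustrative thresholds, not standard values.
    """
    # Poor score even on seen data: the concept was never learned.
    if train_score < good_score:
        return "underfitting"
    # Good on seen data but much worse on unseen data: memorized, not learned.
    if train_score - test_score > max_gap:
        return "overfitting"
    return "good fit"


# (classwork test, unseen test) scores from the example, as fractions
students = {"A": (0.97, 0.65), "B": (0.40, 0.38), "C": (0.92, 0.90)}
labels = {name: diagnose_fit(train, test) for name, (train, test) in students.items()}

for name, label in labels.items():
    print(name, label)
```

For a model, the classwork test plays the role of the training score and the second test plays the role of the test score.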
We will now look at these concepts with a machine learning model on a classification problem (the Titanic problem). To demonstrate this, we will build a KNN model.
First import all the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Loading the dataset
data = pd.read_csv('titanic.csv')
Segregating independent and dependent variables on the titanic dataset.
# separating independent and dependent variables
x = data.drop(['Survived'], axis=1)
y = data['Survived']
Scaling the data and dividing the dataset into train and test
# scaling the data
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x = ss.fit_transform(x)
# splitting into train and test sets, stratified on the target
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=96, stratify=y)
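Note that the code above fits the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to compute the scaling statistics on the training rows only and reuse them for the test rows. The sketch below shows the idea in plain Python; the helper names and the age values are hypothetical and are not taken from the Titanic data.

```python
from statistics import mean, pstdev


def fit_standardizer(train_column):
    """Return (mean, std) computed from the training values only."""
    mu = mean(train_column)
    # Population standard deviation; fall back to 1.0 for a constant column
    sigma = pstdev(train_column) or 1.0
    return mu, sigma


def standardize(column, mu, sigma):
    """Scale values using statistics that were fitted elsewhere."""
    return [(value - mu) / sigma for value in column]


# Hypothetical feature values, already split into train and test
train_ages = [22.0, 38.0, 26.0, 35.0]
test_ages = [54.0, 2.0]

mu, sigma = fit_standardizer(train_ages)            # learn from train only
train_scaled = standardize(train_ages, mu, sigma)
test_scaled = standardize(test_ages, mu, sigma)     # reuse train statistics

print(train_scaled)
print(test_scaled)
```

With scikit-learn, the equivalent pattern is to split first, then call `fit_transform` on the training features and `transform` on the test features.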
Implementing KNN
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import f1_score
Go through this article to know more about the KNN algorithm. Here, we fit the model on the training data, make predictions on both the train and test datasets, and calculate the F1 score for each.
# Creating instance of KNN
clf = KNN(n_neighbors = 3)
# Fitting the model
clf.fit(train_x, train_y)
# Predicting over the train set and calculating F1
train_predict = clf.predict(train_x)
k = f1_score(train_y, train_predict)
print('Training F1 Score', k)
# Predicting over the test set and calculating F1
test_predict = clf.predict(test_x)
k = f1_score(test_y, test_predict)
print('Test F1 Score    ', k)
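For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for a binary problem so that the number printed above is easier to interpret. The function name `f1_from_labels` and the label lists are made up for illustration; in practice `f1_score` from scikit-learn does this work.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the binary F1 score from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: report 0
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

score = f1_from_labels(y_true, y_pred)
print("F1 score:", score)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and so is the F1 score.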
The F1score function below takes a list of k values and returns two lists, train_f1 and test_f1, which contain the F1 scores on the train and test sets for each value of k.
def F1score(K):
    '''
    Takes an iterable of K values as input and returns
    train_f1 = list of train F1 scores corresponding to K
    test_f1 = list of test F1 scores corresponding to K
    '''
    train_f1 = []
    test_f1 = []
    for i in K:
        # Instance of KNN with i neighbors
        clf = KNN(n_neighbors=i)
        clf.fit(train_x, train_y)
        # Appending F1 scores, calculated from the predictions, to the lists
        tmp = clf.predict(train_x)
        tmp = f1_score(train_y, tmp)
        train_f1.append(tmp)
        tmp = clf.predict(test_x)
        tmp = f1_score(test_y, tmp)
        test_f1.append(tmp)
    return train_f1, test_f1
Define the range of k and compute the train and test scores
k = range(1,150)
train_f1, test_f1 = F1score(k)
Now plot the graph to see the difference between the train and test scores for different values of k. After running this code in a Jupyter notebook, you will see which value of k gives the smallest difference between the train and test scores. That is the value of k we choose.
plt.figure(figsize=(6,3), dpi=150)
plt.plot(k[0:60], test_f1[0:60], color = 'red' , label = 'test')
plt.plot(k[0:60], train_f1[0:60], color = 'green', label = 'train')
plt.xlabel('K Neighbors')
plt.ylabel('F1 Score')
plt.title('F1 Curve')
plt.ylim(0.5,1)
plt.legend()
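Instead of reading the value off the plot, the k with the smallest train/test gap can also be picked programmatically. This is a minimal sketch: the helper name `k_with_smallest_gap` is an assumption, and the scores are hypothetical numbers used only to show the mechanics, not results from the Titanic model.

```python
def k_with_smallest_gap(ks, train_scores, test_scores):
    """Return (k, gap) for the k whose train and test scores are closest."""
    gaps = [abs(train - test) for train, test in zip(train_scores, test_scores)]
    best_index = min(range(len(gaps)), key=gaps.__getitem__)
    return ks[best_index], gaps[best_index]


# Hypothetical scores for four values of k
example_ks = [1, 5, 9, 13]
example_train = [0.98, 0.84, 0.80, 0.76]
example_test = [0.70, 0.74, 0.77, 0.75]

best_k, gap = k_with_smallest_gap(example_ks, example_train, example_test)
print("k with smallest gap:", best_k, "gap:", round(gap, 3))
```

With the variables from this article, the call would be `k_with_smallest_gap(list(k), train_f1, test_f1)`. A small gap alone does not guarantee a good model, since both scores can be low at the same time, so it is worth checking that the test score at the chosen k is also acceptably high.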