
Selecting the right model - Machine learning

TECH BUDDY



The main aim of any machine learning model is to perform well on unseen data. It is important that the model performs well on the training data as well as on testing, or previously unseen, data.

A gap between performance on training and testing data is usually a sign of overfitting, while poor performance on both is a sign of underfitting. Let us understand these with an example. In an algebra class, there are three students, A, B, and C, and a teacher, D. D teaches different methods and expressions to A, B, and C. Student A wants to top the class, so he memorizes the material without understanding it. Student B has no interest in studying; he likes designing. Student C is interested in learning new concepts and focuses on understanding them. After one week, the teacher conducts a test based on the classwork questions. Student A scores 97%, student B scores 40% by guessing, and student C scores 92%. The teacher sees that students A and C are performing well and now wants to know whether they have actually learned the concepts. D conducts a second test on the same concepts, but with questions slightly different from the classwork. In this test, student A scores 65%, student C scores 90%, and student B scores 38%.






A is a case of overfitting: he performed well on the classwork but not on the unseen questions.

B is a case of underfitting: he did not perform well on either the classwork or the unseen questions.

C is a case of a good fit (best fit): he performed well on the classwork as well as on the unseen questions.
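The same reasoning can be written down as a small rule of thumb. The sketch below is only an illustration: the function name `diagnose_fit` and the thresholds `good` and `max_gap` are assumptions chosen to fit this story, not standard values.

```python
def diagnose_fit(train_score, test_score, good=0.80, max_gap=0.10):
    """Label a learner from its train and test scores (both between 0 and 1).

    `good` and `max_gap` are illustrative thresholds, not standard values.
    """
    if train_score < good:
        return "underfitting"   # poor even on the material already seen
    if train_score - test_score > max_gap:
        return "overfitting"    # good on seen material, much worse on unseen
    return "good fit"           # good on both


# Scores of the three students: (classwork test, unseen-question test)
students = {"A": (0.97, 0.65), "B": (0.40, 0.38), "C": (0.92, 0.90)}
for name, (train, test) in students.items():
    print(name, diagnose_fit(train, test))
```

In practice the thresholds depend on the problem and the metric, so treat the labels as a reading aid rather than a definition.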



We will now look at these concepts in a machine learning model for a classification problem (the Titanic problem). To demonstrate this, we will build a KNN model.

First import all the necessary libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Loading the dataset


data = pd.read_csv('titanic.csv')

Segregating independent and dependent variables on the titanic dataset.

#separating independent and dependent variables

x = data.drop(['Survived'], axis=1)
y = data['Survived']

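Dividing the dataset into train and test, then scaling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# splitting first, so the scaler never sees the test rows
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=96, stratify=y)

# scaling the data: fit on the training set only, then apply to both sets
ss = StandardScaler()
train_x = ss.fit_transform(train_x)
test_x = ss.transform(test_x)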
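StandardScaler standardizes each feature by subtracting a mean and dividing by a standard deviation. A common precaution is to learn those two numbers from the training rows only and reuse them on the test rows, so that no information from the test set leaks into preprocessing. Below is a minimal standard-library sketch of that idea; the helper names `fit_scaler` and `transform` and the values are made up for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


train_age = [22.0, 38.0, 26.0, 35.0]   # made-up training values
test_age = [54.0, 2.0]                 # made-up test values

mu, sigma = fit_scaler(train_age)                # statistics come from train only
train_scaled = transform(train_age, mu, sigma)   # mean 0 and std 1 by construction
test_scaled = transform(test_age, mu, sigma)     # reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled test values are not forced to have mean 0 and standard deviation 1, which is expected: they are measured against the training distribution.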

Implementing KNN


from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import f1_score
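For intuition about what a KNN classifier does with `n_neighbors`, here is a tiny from-scratch sketch. It is not how scikit-learn implements it; the function name `knn_predict` and the data points are made up, and it assumes Euclidean distance with a simple majority vote.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature points with labels 0 and 1
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 1.0), k=3))
```

A small k lets single noisy neighbours decide the vote, while a very large k averages over points that are far away; this is the trade-off explored in the rest of the post.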


KNN (k-nearest neighbours) classifies a point by a majority vote among its k closest training points. Here we fit the model on the training data, make predictions on both the train and test sets, and calculate the F1 score for each.

# Creating instance of KNN
clf = KNN(n_neighbors = 3)

# Fitting the model
clf.fit(train_x, train_y)

# Predicting over the Train Set and calculating F1
train_predict = clf.predict(train_x)
train_score = f1_score(train_y, train_predict)
print('Training F1 Score', train_score)

# Predicting over the Test Set and calculating F1
test_predict = clf.predict(test_x)
test_score = f1_score(test_y, test_predict)
print('Test F1 Score    ', test_score)
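To make the metric less of a black box, the sketch below computes the binary F1 score by hand as the harmonic mean of precision and recall. The function name `f1_from_scratch` and the labels are made up for illustration; it assumes class 1 is the positive class.

```python
def f1_from_scratch(y_true, y_pred, positive=1):
    """Binary F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_from_scratch(y_true, y_pred))
```

With these labels, precision and recall are both 3/4, so the F1 score is 0.75.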

This F1score function takes a list of k values and returns two lists, train_f1 and test_f1, which contain the F1 scores on the train and test sets for each value of k.


def F1score(K):
    '''
    Takes an iterable of K values and returns two lists:
    train_f1 = list of train F1 scores corresponding to K
    test_f1  = list of test F1 scores corresponding to K
    '''
    train_f1 = []
    test_f1 = []

    for i in K:
        # Instance of KNN with i neighbours
        clf = KNN(n_neighbors=i)
        clf.fit(train_x, train_y)

        # Appending the train F1 score calculated from the predictions
        tmp = clf.predict(train_x)
        tmp = f1_score(train_y, tmp)
        train_f1.append(tmp)

        # Appending the test F1 score calculated from the predictions
        tmp = clf.predict(test_x)
        tmp = f1_score(test_y, tmp)
        test_f1.append(tmp)

    return train_f1, test_f1

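Define the range of k and compute the scores

k = range(1,150)

# Calculating train and test F1 scores for every value of k
train_f1, test_f1 = F1score(k)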

Now plot the graph to see the difference between the train and test scores for different values of k. After running this code in a Jupyter notebook, you can see which value of k gives the smallest difference between the train and test scores. We choose the value of k for which this difference is minimal, provided the test score itself is still high.



plt.figure(figsize=(6,3), dpi=150)
plt.plot(k[0:60], test_f1[0:60], color = 'red' , label = 'test')
plt.plot(k[0:60], train_f1[0:60], color = 'green', label = 'train')
plt.xlabel('K Neighbors')
plt.ylabel('F1 Score')
plt.title('F1 Curve')
plt.ylim(0.5,1)
plt.legend()
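The choice of k described above can also be read off programmatically instead of by eye. The sketch below is one possible way to do it: the helper name `pick_k` and the scores are made up, and ties in the gap are broken in favour of the higher test score.

```python
def pick_k(k_values, train_f1, test_f1):
    """Return the k with the smallest train/test gap; ties go to the higher test F1."""
    candidates = zip(k_values, train_f1, test_f1)
    # Sort key: (absolute gap, negative test score), so min() prefers a
    # small gap first and a high test score second
    best = min(candidates, key=lambda c: (abs(c[1] - c[2]), -c[2]))
    return best[0]


# Made-up scores for illustration only
ks = [1, 3, 5, 7, 9]
train_scores = [0.98, 0.88, 0.84, 0.81, 0.78]
test_scores = [0.70, 0.76, 0.79, 0.79, 0.75]

print(pick_k(ks, train_scores, test_scores))
```

With real output you would call it as pick_k(k, train_f1, test_f1). The gap alone can favour a k where both scores are low, so it is worth checking the result against the plotted curves.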


Titanic dataset


