Splitting a dataset into training and test sets is a quick way to evaluate how well an algorithm performs on a problem. The training dataset is used to fit, or train, the model.
We treat the test dataset as new data whose output values are withheld from the algorithm. We gather the trained model's predictions on the test inputs and compare them with the withheld output values of the test set.
Different techniques are used to train a model and evaluate its performance, depending on the type of problem. Some of them are discussed below.
K-nearest neighbors:- In the kNN algorithm, the prediction for a new data point is based on the behavior of its nearest neighbors. A kNN model has no real training phase of its own; it stores the training data and looks up the closest points at prediction time. Consider an example: suppose you go to a restaurant where the people seated on the left side are non-vegetarian and those on the right side are vegetarian. Unaware of this, you sit on the left side. The waiter hands you the non-veg menu without asking whether you are vegetarian, because everyone sitting around you is non-vegetarian.
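The restaurant analogy can be sketched in a few lines of plain Python. This is only a minimal illustration of the idea, not how scikit-learn implements it: the function name knn_predict, the seat coordinates and the choice of k=3 are all made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical seat positions (x, y): left side is non-veg, right side is veg.
seats = [(1, 1), (1, 2), (2, 1), (8, 1), (9, 2), (8, 2)]
diets = ["non-veg", "non-veg", "non-veg", "veg", "veg", "veg"]

# A new guest sits on the left side; the vote of the nearest seats decides the menu.
print(knn_predict(seats, diets, query=(2, 2)))
```

Notice that there is no fitting step: all the work happens when a prediction is requested.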
Before applying the algorithm, make sure that you have imported these libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Pandas and NumPy are used for data manipulation and cleaning. Matplotlib is used to plot graphs, and %matplotlib inline displays them inside the Jupyter notebook itself.
The first problem is the Titanic problem, which is a classification task. After importing the libraries, load the dataset into the Jupyter notebook and then separate the dependent and independent variables.
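df = pd.read_csv("titanic.csv")
# separate the independent variables (x) from the dependent variable (y)
# assumes the remaining columns are already cleaned and numeric
x = df.drop(['Survived'], axis=1)
y = df['Survived']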
Scaling the data is important when we are dealing with a distance-based algorithm such as kNN, because features with a larger range would otherwise dominate the distance.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
x_scaled = scaler.fit_transform(x)
x = pd.DataFrame(x_scaled, columns=x.columns)
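To see why scaling matters for a distance-based method, here is a small sketch in plain Python. The three passengers and their (age, fare) values are made up, and min_max_scale is a hand-written helper that mimics the default behavior of MinMaxScaler (each column rescaled to the range 0 to 1); it is not the library code.

```python
import math

# Hypothetical passengers described by (age, fare); the values are made up.
a = (22.0, 7.25)
b = (60.0, 8.05)    # very different age, almost the same fare
c = (23.0, 80.00)   # almost the same age, very different fare

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((value - lo) / (hi - lo) for value, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Before scaling, the fare column dominates the Euclidean distance.
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

# After scaling, both features contribute on the same footing.
sa, sb, sc = min_max_scale([a, b, c])
scaled_ab, scaled_ac = math.dist(sa, sb), math.dist(sa, sc)

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, passenger c looks almost twice as far from a as b does, purely because fares span a wider range than ages. After scaling, the two distances are nearly equal.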
Divide the dataset into two parts: training and testing.
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y= train_test_split(x,y, random_state=56, stratify=y)
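The stratify=y argument keeps the class proportions the same in both parts of the split, which helps when one class is much rarer than the other. A small sketch on made-up labels (80 zeros and 20 ones, not the Titanic data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 zeros and 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, random_state=56, stratify=y
)

# With the default test_size of 0.25 the split is 75 training rows and 25 test rows.
print(len(train_y), len(test_y))
# Share of the positive class in each part.
print(train_y.mean(), test_y.mean())
```

Both parts keep the original 20% share of the positive class.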
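from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import f1_score
# create a kNN classifier with default settings and fit it on the training data
clf = KNN()
clf.fit(train_x, train_y)
# predict on the test set and compute the F1 score (true labels first)
test_predict = clf.predict(test_x)
k = f1_score(test_y, test_predict)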
In the same way, we can handle a regression problem by importing KNeighborsRegressor and following the same steps.
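As a rough sketch of the regression variant, the example below uses synthetic data invented for the illustration (y is roughly 3 times x plus noise) and n_neighbors=5, which is an arbitrary choice. Only the estimator and the metric change compared with the classification steps.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic data, made up for illustration: y is roughly 3 * x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=56)

# The prediction for a point is the average target of its 5 nearest neighbours.
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(train_X, train_y)
test_predict = reg.predict(test_X)

print("Test MAE:", mean_absolute_error(test_y, test_predict))
```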
Linear Regression:- The linear regression algorithm is used when there is a linear relationship between the independent and dependent variables. Let us take the example of predicting a fare from the distance travelled: as the distance increases, the fare also increases. The quickest model to build is a benchmark model. For a benchmark, we can predict that every fare equals the mean fare of the whole dataset. This model is not very useful, because it ignores the relationship we know exists between fare and distance. We can use that linear relationship to improve the model. It can be written in mathematical form:-
y = Bx + b
where y = dependent variable, b = intercept, B = slope of the line, x = independent variable.
The values of the parameters B and b that best represent the linear relationship are found by adjusting them until the mean squared error is as small as possible.
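For a single input variable, the B and b that minimize the mean squared error have a closed-form solution. The sketch below computes them in plain Python on five made-up trips (the distances and fares are invented for illustration) and compares the fitted line with the mean-fare benchmark.

```python
# Made-up trips: distance in km and fare paid.
distance = [1.0, 2.0, 3.0, 4.0, 5.0]
fare = [12.0, 19.0, 31.0, 42.0, 48.0]

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(fare) / n

# Closed-form least-squares estimates of the slope B and the intercept b.
B = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, fare)) / sum(
    (x - mean_x) ** 2 for x in distance
)
b = mean_y - B * mean_x

def mse(predictions, targets):
    """Mean squared error between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Benchmark: always predict the mean fare. Line: predict with y = Bx + b.
benchmark_mse = mse([mean_y] * n, fare)
line_mse = mse([B * x + b for x in distance], fare)

print(f"B={B:.2f}, b={b:.2f}")
print(f"benchmark MSE={benchmark_mse:.2f}, line MSE={line_mse:.2f}")
```

The fitted line has a far lower error than the benchmark, which is the improvement we expect from using the fare and distance relationship.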
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import mean_absolute_error as mae
lr = LR()
# Fitting the model
lr.fit(train_x, train_y)
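# Error on the training set (true values first)
train_predict = lr.predict(train_x)
k = mae(train_y, train_predict)
print('Training Mean Absolute Error ', k)
# Error on the test set
test_predict = lr.predict(test_x)
k = mae(test_y, test_predict)
print('Test Mean Absolute Error ', k)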
Logistic regression:- Logistic regression is a classification algorithm. Let us take the example of predicting churn from a person's salary. Suppose we first try to fit a straight line and check whether it can separate the classes. We want the line to give a value between zero and one, but when a new data point is added, the slope of the line changes, so it does not give a reliable way to classify. Another problem is that the line can produce values greater than 1, which makes the model hard to interpret. Logistic regression solves these problems. Consider a function that converts any value into a value between zero and one. This function is called the sigmoid function, or logistic function. Its advantage is that it gives a continuous prediction value.
Z = Bx + b
Y* = Q(Z), where Q is the sigmoid function
Q(Z) = 1 / (1 + e^(-Z)), so Y* = 1 / (1 + e^(-(Bx + b)))
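The formula is easy to try out directly. In this small sketch the slope B = 0.8 and intercept b = -2.0 are arbitrary values picked for illustration; the point is only that every output lands strictly between 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical slope and intercept, chosen only for illustration.
B, b = 0.8, -2.0

for x in [-10, 0, 2.5, 10]:
    z = B * x + b
    print(f"x={x:>5}  Z={z:>6.2f}  Y*={sigmoid(z):.4f}")
```

When Z is 0 the output is exactly 0.5, which is the usual cut-off between the two classes.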
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.metrics import f1_score
# create the model before fitting it
logreg = LogReg()
logreg.fit(train_x, train_y)
train_predict = logreg.predict(train_x)
# true labels first, predictions second
k = f1_score(train_y, train_predict)
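The score above is measured on the training data, so it is worth checking the held-out test set as well. Here is a sketch on synthetic churn data; the salary values, the noise level and the cut-off of 55 are all invented for the illustration, and max_iter=1000 is just a generous setting so the solver converges on unscaled input.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic churn data, made up for illustration: higher salary, lower churn.
rng = np.random.default_rng(0)
salary = rng.uniform(20, 100, size=400)  # in thousands
churn = (salary + rng.normal(0, 10, size=400) < 55).astype(int)
X = salary.reshape(-1, 1)

train_X, test_X, train_y, test_y = train_test_split(
    X, churn, random_state=56, stratify=churn
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(train_X, train_y)

# predict_proba returns one column per class; column 1 is the probability of churn.
churn_probability = logreg.predict_proba(test_X)[:, 1]
# predict returns the class labels themselves.
test_predict = logreg.predict(test_X)

print("Test F1:", f1_score(test_y, test_predict))
```

The probabilities from predict_proba are the sigmoid outputs, so they always lie between 0 and 1.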