The classification problem is a type of supervised problem in which the target variable is categorical in nature. In this article, we will discuss an algorithm used to solve simple classification problems effectively using Machine Learning. Also, we will analyze a hypothetical Binary Class problem involving business sourced outcomes based on the features given in a dataset
Logistic regression is a classification algorithm. We can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using a log of odds as a dependent variable. where p/1-p is the odd ratio.
I have given a dataset at the end of the article. First import all the libraries and load CSV file on jupyter notebook .
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('Train_pjb2QcD.csv')
df.shape
Before applying the algorithm, check whether there is a missing value with the help of isnull function.
df.isnull().sum()
If there is a missing value present in the categorical, in data exploration part we have studied that we can impute missing values of a categorical variable with its mode. To check the frequency of a variable we use value_counts.
df['Applicant_Gender'].value_counts()
df['Applicant_Gender'].fillna('M', inplace = true)
Impute the missing values for all the variables present in the dataset.
Machine learning algorithms cannot work with categorical data directly. Categorical data must be converted in a number. There are two ways to convert categorical data into numerical, integer, and one-hot encoding. In this problem, we will use a one-hot encoding.
df = pd.get_dummies(df)
For implementing logistic regression first separate the dependent and independent variable.
x = df.drop(['Business_Sourced'],axis=1)
y = df['Business_Sourced']
Use the train test split function to divide the dataset into two parts.
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y= train_test_split(x, y, test_size = 0.3, random_state=1)
Import the logistic regression from sklearn.linear model and import other evaluation metrics.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, auc
from sklearn.metrics import roc_auc_score
Now create an instance and fit the train dataset.
logreg = LogisticRegression()
logreg.fit(train_x, train_y)
pred_train = logreg.predict_proba(train_x)
pred_valid = logreg.predict_proba(valid_x)
Finally, evaluate the performance of train and test dataset using roc_auc_score metrics.
roc_auc_score(train_y, pred_train[:,1])
roc_auc_score(valid_y, pred_valid[:,1])
Dataset and jupyter notebook file of the above problem.
Kommentarer