Reasons for outliers
An outlier is a data point that differs significantly from other observations. Outliers appear in a data set for several reasons. One is data entry error: for example, in the salary variable of a churn prediction dataset, extra zeroes may have been added by mistake. Another is measurement error: for example, some values may have been recorded in a different unit from the rest. Other causes are processing errors and changes in the underlying population.
Types of outliers
Univariate outliers:- A univariate outlier is an outlier found when analyzing a single variable.
Bivariate outliers:- A bivariate outlier is an outlier found when analyzing two variables at a time.
How to identify outliers
Outliers can be identified by different methods, depending on the type of outlier.
Graphical method:- Scatterplots and boxplots are graphical methods for finding outliers. A boxplot reveals univariate outliers, and a scatterplot reveals bivariate outliers. The picture below shows both methods; the points circled in red are outliers.
Formula method:- An outlier is any value less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR, where Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1 is the interquartile range.
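The formula can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's code: the fares list is made-up sample data, and the "inclusive" quartile method is an assumption (other quartile conventions give slightly different cutoffs).

```python
import statistics

# Hypothetical fares; the last value is deliberately extreme.
fares = [7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.33]

# quantiles(n=4) returns the three cut points Q1, Q2 (the median) and Q3.
# method="inclusive" interpolates linearly between the sorted data points.
q1, _median, q3 = statistics.quantiles(fares, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the formula: anything outside [lower, upper] is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in fares if x < lower or x > upper]
print("bounds:", lower, upper)
print("outliers:", outliers)
```

Only the extreme fare falls outside the fences here; the remaining values stay within them.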
How to treat outliers
Deleting observations:- First, identify the values that are outliers. For example, suppose a scatter plot of fare against age shows that one fare is larger than 500 while all the others are below it. You can remove that observation with the following code.
# Keep only the rows whose fare is below 500
df = df[df['fare'] < 500]
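Transforming and binning values:- Taking the log of a variable compresses large values, which reduces the influence of outliers rather than deleting them. Binning the variable into ranges has a similar effect, because an extreme value simply falls into the top or bottom bin.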
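As a rough sketch of both ideas, the snippet below log-transforms and bins a made-up list of fares. The data, the bin edges (10 and 50) and the labels are hypothetical, and log1p is used instead of a plain log as an assumption, so that a fare of 0 would not raise an error.

```python
import bisect
import math

# Hypothetical fares; the last value is deliberately extreme.
fares = [7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.33]

# Log transform: log1p(x) = log(1 + x), which is defined for x = 0 as well.
log_fares = [math.log1p(x) for x in fares]

# Binning: hypothetical cut points that split fares into three labelled groups.
edges = [10, 50]
labels = ["low", "medium", "high"]
binned = [labels[bisect.bisect_right(edges, x)] for x in fares]

print([round(v, 2) for v in log_fares])
print(binned)
```

After the transform the largest fare is only a few times the smallest instead of roughly seventy times, and after binning it is simply one member of the "high" group.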
Imputing outliers:- Treat the outlying values as if they were missing and impute them in the same way as we did for missing values, for example with the median of the variable.
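One possible way to do this is sketched below, assuming median imputation (the choice of the median, the IQR fences used to flag outliers, and the sample fares are all assumptions made for illustration).

```python
import statistics

# Hypothetical fares; the last value is deliberately extreme.
fares = [7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.33]

# Flag outliers with the 1.5 * IQR fences.
q1, _median, q3 = statistics.quantiles(fares, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x):
    """Return True when x lies outside the IQR fences."""
    return x < lower or x > upper


# Compute the replacement from the non-outlying values only,
# so the extreme fare does not influence its own replacement.
typical = statistics.median(x for x in fares if not is_outlier(x))

# Replace each outlier with the typical value; keep everything else as is.
imputed = [typical if is_outlier(x) else x for x in fares]
print(imputed)
```

Computing the median from the non-outlying values is a design choice; using the median of the full column is also common, since the median is only weakly affected by a few extreme values.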
Dataset and Jupyter notebook file for house prediction