TECH BUDDY

Outlier treatment - Data Exploration







Reason for outliers


An outlier is a data point that differs significantly from the other observations. Outliers appear in a data set for several reasons. One is data entry error: for example, in the salary variable of a churn prediction dataset, extra zeroes may have been added by mistake. Another is measurement error, such as recording a value in a different unit from the rest of the data. Outliers can also come from processing errors or from a change in the underlying population.



Types of outliers


Univariate outliers:- A univariate outlier is an outlier found when analyzing a single variable.


Bivariate outliers:- A bivariate outlier is an outlier found when analyzing two variables at a time.



How to identify outliers


Outliers can be identified by different methods, depending on the type of outlier.


Graphical method:- Scatterplots and boxplots are graphical ways to find outliers. A boxplot reveals univariate outliers, and a scatterplot reveals bivariate outliers. The picture below shows both methods; the points circled in red are outliers.
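As a rough illustration of the graphical method, the sketch below draws a boxplot and a scatterplot side by side with matplotlib. The age and fare values are made up for the example, and the column names are only assumed to match the dataset used later in this post.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data; the fare of 512.0 is deliberately extreme.
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 35],
    "fare": [7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.0],
})

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Boxplot: univariate view of a single variable.
ax_box.boxplot(df["fare"])
ax_box.set_title("Boxplot of fare")

# Scatterplot: bivariate view of two variables together.
ax_scatter.scatter(df["age"], df["fare"])
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("fare")
ax_scatter.set_title("fare vs. age")

fig.tight_layout()
```

In a Jupyter notebook the figure is displayed automatically; in a plain script, call plt.show() afterwards. With matplotlib's default settings, points beyond the boxplot whiskers are drawn as individual markers, which is where outliers show up.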


Formula method:- A value is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR, where Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1 is the interquartile range.
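The formula can be applied directly with pandas. The following is a minimal sketch: the helper name iqr_bounds and the sample fare values are invented for illustration, and the quartiles use pandas' default (linear interpolation) definition, which may differ slightly from other tools.

```python
import pandas as pd


def iqr_bounds(values: pd.Series) -> tuple[float, float]:
    """Return the (lower, upper) limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical sample data; 512.0 is deliberately extreme.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.0])

lower, upper = iqr_bounds(fares)

# Values outside the limits are flagged as outliers.
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```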


How to treat outliers



Deleting observations:- First, identify the values that are outliers. For example, suppose a scatter plot of fare against age shows that one fare value is larger than 500 while all the others are below it. You can remove that observation with the following code.

# Keep only the rows whose fare is below 500
df = df[df['fare'] < 500]

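Transforming and binning values:- Taking the log of a variable compresses large values, which can reduce the influence of outliers. Binning the values into ranges has a similar effect, since an extreme value simply falls into the highest or lowest bin.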
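Both ideas can be sketched in a few lines. The fare values, bin edges, and bin labels below are assumptions chosen only for illustration; np.log1p (the log of 1 + x) is used instead of a plain log so that a fare of zero would not cause an error.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 512.0 is deliberately extreme.
df = pd.DataFrame({"fare": [7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.0]})

# Log transform: large values are pulled much closer to the rest.
df["log_fare"] = np.log1p(df["fare"])

# Binning: the extreme value just lands in the open-ended top bin.
df["fare_bin"] = pd.cut(
    df["fare"],
    bins=[0, 10, 50, np.inf],          # illustrative bin edges
    labels=["low", "medium", "high"],  # illustrative labels
)

print(df)
```

After the transform, the largest log_fare is roughly 6.2 instead of 512, so it no longer dominates the scale.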

Imputing outliers:- Impute outliers in the same way as missing values, for example by replacing them with the mean, median, or mode of the variable.
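One possible way to do this is to flag outliers with the 1.5*IQR rule and replace them with the median of the remaining values. This is only a sketch under those assumptions; the sample data is made up, and the mean or mode could be used instead of the median.

```python
import pandas as pd

# Hypothetical sample data; 512.0 is deliberately extreme.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.0])

# Flag outliers using the 1.5*IQR rule.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
is_outlier = (fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)

# Median of the values that are not outliers.
median = fares[~is_outlier].median()

# Replace each flagged value with that median; other values are unchanged.
imputed = fares.mask(is_outlier, median)
print(imputed)
```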



Dataset and Jupyter notebook file for house prediction











