
Data Exploration

TECH BUDDY

Updated: Jul 24, 2020




What is data exploration?


Data exploration is the fourth stage of predictive modeling. It helps you gain insights from the data before any algorithm is applied to the dataset: it identifies the types of variables, shows which of your hypotheses hold, and reveals recurring patterns in the variables. A good analyst knows his or her data very well, whereas a bad analyst relies only on tools and libraries.


Before moving ahead, install Anaconda on your computer. Open Jupyter Notebook and upload the dataset given in this link.


Steps involved in Data Exploration.


Reading the data:- The first step in data exploration is reading the data into the analysis system or software, for example reading CSV or Excel files into pandas. Execute the following code in Jupyter Notebook to read a CSV file.

First, import pandas under its usual alias pd

import pandas as pd

Read the CSV file

df = pd.read_csv('filename')
df

Use the shape attribute to check the dimensions of the dataset (number of rows, number of columns)

df.shape

Now use the columns attribute to check the names of the variables

df.columns

Now if you want to check the first five rows of the dataset, use the head method

df.head()
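The reading steps above can be tried without any file on disk. The following is a minimal sketch, assuming a small made-up CSV held in a string; the column names and values are hypothetical and only stand in for the real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """PassengerId,Survived,Sex,Age,Fare
1,0,male,22,7.25
2,1,female,38,71.28
3,1,female,26,7.92
4,1,female,35,53.10
5,0,male,35,8.05
6,0,male,,8.46
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape          # shape is an attribute, so no parentheses
column_names = list(df.columns)    # names of the variables
first_rows = df.head(3)            # head() returns 5 rows by default; 3 requested here

print(n_rows, n_cols)
print(column_names)
print(first_rows)
```

With a real file, replace io.StringIO(csv_text) with the path to the CSV file.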

Variable identification:- Variable identification is the process of identifying whether a variable is dependent or independent, and whether it is continuous or categorical. It is necessary because some techniques require the dependent variable to be identified, and because different algorithms are used to deal with categorical and continuous variables.


Difference between independent and dependent variables


The dependent variable is the variable we are trying to predict, whereas an independent variable is a variable that helps in predicting the dependent variable. The dependent and independent variables can be identified from the problem statement.

For example:- if we are trying to predict customer churn, churn is the dependent variable, and transaction history and annual income are the independent variables.
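In pandas, this distinction usually shows up as a split between a target column and the remaining columns. Here is a minimal sketch, assuming a made-up customer table; the column names annual_income, transactions_last_year, and churn are hypothetical.

```python
import pandas as pd

# Hypothetical customer data; the column names are assumptions for illustration.
customers = pd.DataFrame({
    "annual_income": [42000, 58000, 31000, 76000],
    "transactions_last_year": [12, 3, 25, 1],
    "churn": [0, 1, 0, 1],
})

target = "churn"                        # dependent variable, taken from the problem statement
y = customers[target]                   # what we want to predict
X = customers.drop(columns=[target])    # independent variables used for the prediction

print(list(X.columns))
print(y.tolist())
```

drop returns a new DataFrame, so the original table still contains the churn column.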


Difference between continuous and categorical variables


A continuous variable can take any value within a range, so it has an infinite number of possible values. For example:- Fare, Age.

A categorical variable takes one of a limited set of discrete values. For example:- Survived, gender.


The dtypes attribute is used to identify the type of each variable.

df.dtypes

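pandas typically stores text-valued categorical variables as object (or as a dedicated string type in recent versions) and numeric variables as int or float. A categorical variable that is coded with numbers (for example, Survived as 0 and 1) will therefore appear as int, so check such columns by hand.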
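Because dtypes alone cannot tell a numeric code from a true measurement, it can help to combine it with a count of distinct values. The sketch below is one possible heuristic, not a rule: the data is made up, and the threshold max_levels is an assumption you would tune for your own dataset.

```python
import pandas as pd

# Small made-up sample; values are for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Split columns by storage type first.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

# Heuristic (an assumption): a numeric column with very few distinct
# values is probably a coded category rather than a measurement.
max_levels = 2
coded_categoricals = [c for c in numeric_cols if df[c].nunique() <= max_levels]

continuous_cols = [c for c in numeric_cols if c not in coded_categoricals]
categorical_cols = non_numeric_cols + coded_categoricals

print("continuous:", continuous_cols)
print("categorical:", categorical_cols)
```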



Univariate analysis:- Univariate analysis is the process of exploring one variable at a time, summarising the variable, and checking whether an anomaly is present.


We can perform univariate analysis either by the tabular method or graphical method.

The tabular method is used to analyze the mean, median, standard deviation, etc. The describe method returns these statistics for the numeric variables in tabular form; the median is the row labeled 50%.

df.describe()
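To see what the tabular summary contains, here is a minimal sketch on made-up values. The use of value_counts for the categorical column is an addition for illustration, since describe on its own summarises only the numeric columns by default.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 35],
    "Sex": ["male", "female", "female", "female", "male"],
})

summary = df["Age"].describe()       # count, mean, std, min, 25%, 50%, 75%, max
median_age = summary["50%"]          # the 50% row is the median
mean_age = summary["mean"]

# For a categorical variable, a frequency table plays the same role.
sex_counts = df["Sex"].value_counts()

print(summary)
print(sex_counts)
```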

Before using the graphical method, make sure you have imported matplotlib into the Jupyter notebook. The %matplotlib inline magic command displays plots inside the notebook itself.

import matplotlib.pyplot as plt
%matplotlib inline

The graphical method is used to analyze the anomalies, distribution of the variable, etc. To check the distribution of variables, you can use the hist function.

df['variable name'].plot.hist()

A box plot is used to detect the outliers present in the data.

df['variablename'].plot.box()


Bivariate analysis:- Bivariate analysis is used to identify whether two variables are associated with each other. It helps in prediction: when two variables are associated, one may be used to infer the other. Bivariate analysis also helps in detecting anomalies.


Types of bivariate analysis


Continuous-continuous Analysis

Continuous-categorical Analysis

Categorical-categorical Analysis
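Each of the three types above has a simple tabular counterpart in pandas. The following is a rough sketch on made-up data; the choice of correlation, group means, and a cross-tabulation is one common option for each case, not the only one.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 54, 2],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Continuous-continuous: Pearson correlation, a value between -1 and 1.
age_fare_corr = df["Age"].corr(df["Fare"])

# Continuous-categorical: mean of the continuous variable within each category.
mean_fare_by_survival = df.groupby("Survived")["Fare"].mean()

# Categorical-categorical: two-way frequency table.
sex_by_survival = pd.crosstab(df["Sex"], df["Survived"])

print(age_fare_corr)
print(mean_fare_by_survival)
print(sex_by_survival)
```

A correlation close to 0 suggests no linear association, while clearly different group means or uneven cells in the frequency table suggest that the two variables are related.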




Missing value treatment:- Missing values in the dataset can be due to errors in data collection, nonresponse, or errors in reading the data.


Types of missing values.

Missing completely at random

Missing at random

Missing not at random


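The describe or isnull method is used to identify whether there are missing values. In the describe output, a count lower than the number of rows indicates missing values in that variable; isnull().sum() gives the number of missing values per variable.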

df.describe()
df.isnull().sum()
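Once the missing values are counted, they have to be handled in some way. The sketch below shows a few common options on made-up data: filling a numeric column with its median, filling a categorical column with its most frequent value, or dropping incomplete rows. Which option is appropriate depends on the type of missingness and is an assumption here, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up values with gaps, for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Option 1: impute a numeric column with its median.
age_filled = df["Age"].fillna(df["Age"].median())

# Option 2: impute a categorical column with its most frequent value.
embarked_filled = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Option 3: keep only the rows that have no missing values.
complete_rows = df.dropna()

print(missing_per_column)
print(age_filled.tolist())
print(embarked_filled.tolist())
print(len(complete_rows))
```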


Outlier treatment:- Outliers in the dataset can be due to data entry errors, measurement errors, or a change in the underlying population.


Types of outliers

Univariate outliers

Bivariate outliers


A graphical method such as a box plot or scatter plot is used to identify outliers.


To identify univariate outliers, use the following code

df['var1'].plot.box()

To identify bivariate outliers, use the following command

df.plot.scatter('var1','var2')
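The points a box plot draws beyond its whiskers can also be listed numerically. By default the whiskers extend 1.5 times the interquartile range (IQR) beyond the quartiles, so the sketch below applies that same rule to made-up values. Treat the 1.5 multiplier as a convention rather than a fixed law.

```python
import pandas as pd

# Made-up fares with one extreme value, for illustration only.
fares = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 21.08, 26.55, 512.33], name="Fare")

q1 = fares.quantile(0.25)      # first quartile
q3 = fares.quantile(0.75)      # third quartile
iqr = q3 - q1                  # interquartile range

# Values outside these fences are flagged as univariate outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(lower, upper)
print(outliers.tolist())
```

What to do with the flagged values (remove, cap, or keep them) depends on whether they are errors or genuine observations.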

Variable transformation:- Variable transformation is the process of replacing a variable with some function of that variable. It changes the distribution of the variable or its relationship with other variables. Variable transformation is often used to turn a nonlinear relationship into a linear one.

Different methods are used for variable transformation. One method is the logarithm: taking the log of a variable reduces its right skewness. The log is defined only for positive values.


import numpy as np
np.log(df['var1']).plot.hist()
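The effect of the log transformation can be checked with a number as well as a histogram. This is a small sketch on made-up, right-skewed values, using the skew method of a pandas Series as the measure; a larger positive value means a longer right tail.

```python
import numpy as np
import pandas as pd

# Made-up, strictly positive, right-skewed values for illustration.
prices = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

skew_before = prices.skew()
log_prices = np.log(prices)          # natural log; requires values > 0
skew_after = log_prices.skew()

print(skew_before, skew_after)
```

If the variable contains zeros, np.log1p, which computes log(1 + x), is a commonly used alternative.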

The dataset and Jupyter notebook for data exploration of house price prediction are given below:-























