What is data exploration?
Data exploration is the fourth stage of predictive modeling. It is how you gain insight into the data before modeling it. A good analyst knows his or her data very well, whereas a bad analyst relies only on tools and libraries. Before any algorithm is applied to the data set, this process identifies the types of the variables, which of your hypotheses hold, and recurring patterns in the variables.
Before moving ahead, install Anaconda on your computer or laptop. Open Jupyter Notebook and upload the dataset given in this link.
Steps involved in Data Exploration.
Reading the data:- The first step in data exploration is reading the data into the analysis system or software, for example reading CSV or Excel files into pandas. Execute the following code in Jupyter Notebook to read a CSV file.
First, import pandas under its usual alias pd, which the rest of the code relies on
import pandas as pd
Read the CSV file
df = pd.read_csv('filename')
df
Use the shape attribute to check the dimensions of the dataset (number of rows, number of columns)
df.shape
Now use the columns attribute to check the names of the variables
df.columns
Now, if you want to check the first five rows of the dataset, use the head method
df.head()
Variable identification:- Variable identification is the process of identifying whether a variable is dependent or independent, and whether it is continuous or categorical. It is necessary to identify the type of each variable because some techniques require you to name the dependent variable, and different algorithms are used to deal with categorical and continuous variables.
Difference between an independent and dependent variable
The dependent variable is the variable we are trying to predict, whereas an independent variable is a variable that helps in predicting the dependent variable. The dependent and independent variables can be found from the problem statement.
For example:- If we are trying to predict customer churn, churn is the dependent variable, and transaction history and annual income are independent variables.
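As a minimal sketch of the churn example, the dependent and independent variables can be separated like this. The DataFrame, its column names, and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical churn data; column names and values are made up for illustration.
df = pd.DataFrame({
    "annual_income": [40000, 85000, 52000],
    "transactions_last_year": [12, 3, 27],
    "churn": [0, 1, 0],
})

# Dependent variable: the column we want to predict.
y = df["churn"]

# Independent variables: every other column.
X = df.drop(columns=["churn"])

print(X.columns.tolist())
print(y.tolist())
```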
Difference between continuous and categorical variables
A continuous variable can take an infinite number of possible values within a range. For example:- Fare, Age.
A categorical variable takes one of a limited set of discrete values. For example:- Survived, gender.
The dtypes attribute is used to identify how each variable is stored.
df.dtypes
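pandas typically stores text-valued categorical variables with the object dtype and numeric variables as int or float. A categorical variable that is coded as numbers (a 0/1 flag, for instance) therefore shows up as an int, so check the values as well as the dtype.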
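A minimal sketch of splitting the columns by storage type, using the select_dtypes method. The toy DataFrame is hypothetical; its column names are borrowed from the examples above and its values are made up. The result is only a first hint: a categorical variable coded as numbers lands in the numeric group.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Age": [22, 38, 26],
    "Gender": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})

# Columns stored as numbers (int or float).
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Columns stored as anything else (here, text).
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # Survived is listed here although it is categorical
print(non_numeric_cols)
```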
Univariate analysis:- Univariate analysis is the process of exploring one variable at a time, summarising the variable, and checking whether an anomaly is present.
We can perform univariate analysis either by the tabular method or graphical method.
The tabular method is used to analyze the mean, median, standard deviation, etc. The describe function returns these statistics for every numeric variable in tabular form; the median appears as the 50% row.
df.describe()
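The sketch below shows how to pull a single statistic out of the describe table, and uses value_counts as a tabular summary for a categorical variable, which describe skips by default. The data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54],
    "Gender": ["male", "female", "female", "female", "male"],
})

# describe() summarises the numeric columns only by default.
summary = df.describe()

# The median is the row labelled "50%".
median_age = summary.loc["50%", "Age"]

# Frequency table for a categorical variable.
gender_counts = df["Gender"].value_counts()

print(summary)
print(median_age)
print(gender_counts)
```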
Before using the graphical method, make sure you have imported matplotlib into the Jupyter notebook. %matplotlib inline is used to display the plots inside the notebook itself.
import matplotlib.pyplot as plt
%matplotlib inline
The graphical method is used to analyze the anomalies, distribution of the variable, etc. To check the distribution of variables, you can use the hist function.
df['variable name'].plot.hist()
A boxplot is used to detect outliers present in the data.
df['variablename'].plot.box()
Bivariate analysis:- Bivariate analysis is used to identify whether two variables are associated with each other. It helps in prediction: when two variables are associated, one may be used to infer the other. Bivariate analysis also helps in detecting anomalies.
Types of bivariate analysis
Continuous-continuous Analysis
Continuous-categorical Analysis
Categorical-categorical Analysis
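A minimal sketch with one common tabular check for each of the three types. The choice of checks (correlation, group means, cross-tabulation) is an assumption rather than a method prescribed above, and the data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Gender": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Continuous-continuous: Pearson correlation between two numeric variables.
age_fare_corr = df["Age"].corr(df["Fare"])

# Continuous-categorical: mean of the continuous variable within each category.
mean_fare_by_gender = df.groupby("Gender")["Fare"].mean()

# Categorical-categorical: table of counts for each pair of categories.
gender_by_survived = pd.crosstab(df["Gender"], df["Survived"])

print(age_fare_corr)
print(mean_fare_by_gender)
print(gender_by_survived)
```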
Missing value treatment:- Missing values in a dataset can be due to errors in data collection, nonresponse, or errors in reading the data.
Types of missing values.
Missing completely at random
Missing at random
Missing not at random
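The describe or isnull function is used to identify whether there are missing values. In the describe output, a count lower than the number of rows means the variable has missing values; isnull().sum() gives the number of missing values per variable directly.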
df.describe()
df.isnull().sum()
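Once missing values are found, they have to be treated. The sketch below shows two simple options, dropping the affected rows or imputing with the median (numeric) and the most frequent value (categorical). These are assumptions chosen for illustration, and the right choice depends on why the values are missing. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Gender": ["male", "female", None, "female"],
})

# Number of missing values per variable.
missing_counts = df.isnull().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: impute. Median for the numeric column,
# most frequent category for the categorical column.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode().iloc[0])

print(missing_counts)
print(dropped)
print(filled)
```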
Outlier treatment:- Outliers in a dataset can be due to data entry errors, measurement errors, or a change in the underlying population.
Types of outliers
Univariate outliers
Bivariate outliers
A graphical method such as a boxplot or scatter plot is used to identify outliers.
To identify univariate outliers, use the following code
df['var1'].plot.box()
To identify bivariate outliers, use the following command
df.plot.scatter('var1','var2')
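The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below applies the usual 1.5 x IQR rule, which matches matplotlib's default whisker length, to a hypothetical column var1 with made-up values.

```python
import pandas as pd

# Hypothetical toy data: one value is far away from the rest.
df = pd.DataFrame({"var1": [10, 12, 11, 13, 12, 11, 95]})

# First and third quartiles and the interquartile range.
q1 = df["var1"].quantile(0.25)
q3 = df["var1"].quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df[(df["var1"] < lower) | (df["var1"] > upper)]
print(outliers)
```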
Variable transformation:- Variable transformation is the process of replacing a variable with some function of that variable, which changes its distribution or its relationship with other variables. It is also used to turn a nonlinear relationship into a linear one.
Different methods are used for variable transformation. One method is the logarithm: taking the log of the variable reduces its right skewness.
import numpy as np
np.log(df['var1']).plot.hist()
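The effect of the log transform can also be checked numerically by comparing the skewness before and after. The sketch below uses a hypothetical right-skewed column var1; note that np.log needs strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: most values are small, a few are very large.
df = pd.DataFrame({"var1": [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]})

# Skewness of the raw variable (positive means right-skewed).
skew_before = df["var1"].skew()

# Skewness after taking the natural log.
log_var1 = np.log(df["var1"])
skew_after = log_var1.skew()

print(skew_before, skew_after)
```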
The data set and Jupyter notebook for data exploration of house price prediction are given below:-