Univariate Analysis
What is a univariate analysis? What is the use of univariate analysis?
Univariate analysis focuses on a single variable at a time: we summarise the variable and use that summary to discover insights and anomalies. The exploration method depends on the type of variable. Let's discuss univariate analysis for two types of variable: continuous and categorical.
Univariate analysis for continuous variables
Univariate analysis of a continuous variable describes its central tendency (mean, median, mode) and its dispersion (for example, standard deviation and range). It also shows whether the distribution is symmetric, right-skewed, or left-skewed, and it helps identify missing values and outliers.
Methods of performing univariate analysis of continuous variables
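Tabular methods:- Mean, median, standard deviation and missing values
The describe() function returns the count of non-missing values, mean, standard deviation, minimum, quartiles (the 50% row is the median) and maximum in tabular form. Missing values can be inferred by comparing the count with the number of rows.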
df.describe()
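The tabular summary can be tried on a small made-up column. The sketch below is only an illustration: the data are invented, and 'var1' is the same placeholder column name used above. It also counts missing values explicitly with isnull().sum(), since describe() reports only the non-missing count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'var1' stands in for any continuous column.
df = pd.DataFrame({"var1": [2.0, 4.0, 4.0, 5.0, np.nan, 9.0]})

# count, mean, std, min, quartiles and max of the non-missing values
summary = df["var1"].describe()

# describe() does not list missing values directly, so count them separately
missing = df["var1"].isnull().sum()

# the median is also the 50% row of describe()
median = df["var1"].median()

print(summary)
print("missing values:", missing)
print("median:", median)
```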
Graphical methods:- Distribution of variables, presence of outliers
A histogram shows the distribution of a continuous variable. A box plot helps identify outliers.
df['var1'].plot.hist()
df['var1'].plot.box()
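The same two questions, skewness and outliers, can also be checked numerically. The following is a minimal sketch on invented data. It assumes the common 1.5 × IQR rule, which is the convention box plots usually draw their whiskers with; other thresholds are possible.

```python
import pandas as pd

# Hypothetical toy data with one obviously extreme value
s = pd.Series([10, 12, 12, 13, 14, 15, 16, 95], name="var1")

# Sample skewness: > 0 suggests a right-skewed distribution, < 0 a left-skewed one
skewness = s.skew()

# Interquartile range and the 1.5 * IQR fences (an assumed, conventional rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = s[(s < lower) | (s > upper)]

print("skewness:", skewness)
print("fences:", lower, upper)
print(outliers)
```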
Univariate analysis for categorical variables
Univariate analysis of a categorical variable is used to find the absolute frequency of each category. Sometimes it is more useful to find the proportion of each category instead. Suppose you want to know how many houses in a house price prediction dataset have a parking facility, or what percentage of houses have one. Both can be found through univariate analysis.
Methods of performing univariate analysis of categorical variables
Tabular method:- Frequency tables
The value_counts() function returns the frequency table.
df['var1'].value_counts()
Graphical method:- Bar plots
A bar plot is used to visualise the frequency table.
df['var1'].value_counts().plot.bar()
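For the proportion of each category, value_counts() accepts normalize=True. The sketch below uses an invented 'parking' column to mirror the house example; the column name and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: whether each house has a parking facility
df = pd.DataFrame(
    {"parking": ["yes", "no", "yes", "yes", "no", "yes", "yes", "yes"]}
)

# Absolute frequency of each category
counts = df["parking"].value_counts()

# Relative frequency, converted from a fraction to a percentage
shares = df["parking"].value_counts(normalize=True) * 100

print(counts)
print(shares)
```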
Bivariate Analysis
What is a bivariate analysis? What is the use of bivariate analysis?
Bivariate analysis explores two variables together to study their empirical relationship, that is, to check whether the two variables are associated with each other. It helps in prediction, because when two variables are associated, one may be used to infer the other. It also helps in detecting outliers.
Types of Bivariate analysis
Continuous-continuous variables:- This type is used to identify the relationship between two continuous variables. Example: does a person's weight increase with their height? This can be checked with a correlation test. Correlation measures the strength and direction of the linear relationship between two continuous variables. The coefficient ranges from -1 to 1: a positive value means the variables tend to increase together, and a negative value means one tends to decrease as the other increases.
df['var1'].corr(df['var2'])
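As a rough illustration of the height and weight example, the sketch below computes the correlation on invented numbers. Series.corr() uses the Pearson coefficient by default; the Spearman rank correlation is shown alongside it as an optional alternative for monotonic but non-linear relationships. The column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data: height and weight of six people
df = pd.DataFrame(
    {
        "height_cm": [150, 160, 165, 170, 180, 185],
        "weight_kg": [52, 58, 63, 66, 77, 80],
    }
)

# Pearson correlation (the default): strength of the linear relationship
r = df["height_cm"].corr(df["weight_kg"])

# Spearman correlation: based on ranks, so it captures any monotonic trend
rho = df["height_cm"].corr(df["weight_kg"], method="spearman")

print("Pearson r:", r)
print("Spearman rho:", rho)
```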
Categorical-continuous analysis:- This type is used to identify the relationship between a continuous and a categorical variable. Example: is the mean age of males different from the mean age of females? A two-sample t-test is used to answer this question.
df.groupby('sex')['Age'].mean().plot.bar()
#importing the scipy library for ttest
from scipy.stats import ttest_ind
males= df[df['sex']=='male']
females= df[df['sex']=='female']
ttest_ind(males['Age'], females['Age'])
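One practical caveat: by default ttest_ind propagates NaN, so a column with missing ages yields a NaN statistic unless the missing values are dropped first. The sketch below shows one way to handle this on invented data. Using equal_var=False (Welch's t-test) is an assumption made here to avoid requiring equal variances; it is not part of the original snippet.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy data with a missing age in each group
df = pd.DataFrame(
    {
        "sex": ["male"] * 5 + ["female"] * 5,
        "Age": [22, 35, np.nan, 54, 40, 26, 38, 27, np.nan, 14],
    }
)

# Select each group's ages and drop missing values before testing
males = df.loc[df["sex"] == "male", "Age"].dropna()
females = df.loc[df["sex"] == "female", "Age"].dropna()

# Welch's t-test: does not assume the two groups have equal variance
result = ttest_ind(males, females, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print("t statistic:", t_stat)
print("p-value:", p_value)
```

A small p-value (commonly below 0.05) is usually taken as evidence that the two group means differ.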
Categorical-categorical analysis:- This type is used to identify the relationship between two categorical variables. Example: does gender have any effect on survival rate in the Titanic problem? The chi-square test of independence is used to answer this question.
pd.crosstab(df['sex'],df['survived'])
from scipy.stats import chi2_contingency
chi2_contingency(pd.crosstab(df['sex'],df['survived']))
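chi2_contingency returns four values: the test statistic, the p-value, the degrees of freedom and the table of expected frequencies. The sketch below unpacks them on an invented sample. The data are made up and far smaller than the real Titanic dataset, so the numbers it prints carry no meaning beyond illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 8 females (6 survived) and 8 males (2 survived)
df = pd.DataFrame(
    {
        "sex": ["female"] * 8 + ["male"] * 8,
        "survived": [1] * 6 + [0] * 2 + [1] * 2 + [0] * 6,
    }
)

# Contingency table of observed counts
table = pd.crosstab(df["sex"], df["survived"])

# For a 2x2 table SciPy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print("chi-square:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts:\n", expected)
```

With samples this small the chi-square approximation is unreliable; the expected counts should generally be at least 5 in each cell.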
The data set and Jupyter notebook for data exploration of the Titanic problem are given below:-