Reasons for missing values in data
Values can be missing from data for many reasons. One is non-response: when age is collected in a survey, for example, some respondents prefer not to answer. Another is an error in data collection, which can be caused by a fault in the system that records the data. A third is an error in reading the data, for instance when the file contains a special character that the reader interprets differently from what was intended, so the value is loaded as missing.
Types of missing values
Missing completely at random:- The missingness has no relation to the variable that contains the missing values or to any other variable.
For example, in the Titanic problem, every fare value is equally likely to be missing, regardless of the fare itself or of the passenger's age group.
Missing at random:- The missingness has no relation to the variable that contains the missing values, but it is related to some other variable.
For example, fare values are missing mainly for passengers whose age is less than 60, but among those passengers every fare value is equally likely to be missing.
Missing not at random:- The missingness is related to the variable that contains the missing values. For example, lower fare values are the ones that tend to be missing.
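The three mechanisms can be illustrated by removing values from a toy dataset in three different ways. This is only a sketch: the data is randomly generated rather than taken from the real Titanic dataset, and the thresholds and probabilities (20%, 30%, 50%, fare below 50) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data standing in for the Titanic age and fare columns
df = pd.DataFrame({
    "age": rng.integers(1, 80, size=n),
    "fare": rng.uniform(5, 500, size=n),
})

# MCAR: every fare has the same 20% chance of being missing
mcar = df["fare"].mask(rng.random(n) < 0.2)

# MAR: fares go missing only for passengers younger than 60
# (missingness depends on age, not on the fare itself)
mar = df["fare"].mask((df["age"] < 60) & (rng.random(n) < 0.3))

# MNAR: low fares are the ones that go missing
# (missingness depends on the fare itself)
mnar = df["fare"].mask((df["fare"] < 50) & (rng.random(n) < 0.5))

# Share of missing values produced by each mechanism
print(mcar.isnull().mean(), mar.isnull().mean(), mnar.isnull().mean())
```

In real data the mechanism is usually not known in advance, so comparing missingness against other variables, as done here with age, is one way to look for hints.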
How to identify and deal with missing values
The describe and isnull functions are used to identify missing values. By default, describe summarizes the continuous (numeric) variables, and a count lower than the number of rows indicates missing values. The isnull function works for both categorical and continuous variables.
# identify missing values of continuous variables: a count below len(df) means values are missing
df.describe()
# identify missing values of all variables: True marks a missing entry
df.isnull()
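The boolean result of isnull is usually easier to read once it is summarized. The sketch below counts the missing values per column, computes their share, and selects the rows that contain at least one missing value. The small DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in numeric and categorical columns
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, np.nan],
    "embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column
missing_count = df.isnull().sum()

# Fraction of missing values in each column
missing_share = df.isnull().mean()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_count)
print(missing_share)
print(rows_with_missing)
```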
The two methods to deal with missing values are imputation and deletion.
Imputation:- Different methods are used for imputing missing values depending on the type of variable. The mean, the median, or a regression model is used for imputing missing values of continuous variables. The mode or a classification model is used for imputing missing values of categorical variables.
# impute missing values of var1 with its mean; fillna returns a new Series, so assign it back
df['var1'] = df['var1'].fillna(df['var1'].mean())
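The same fillna pattern extends to the median and the mode. The sketch below imputes a continuous column with its median and a categorical column with its mode. The DataFrame and its values are invented for illustration, and the model-based approaches (regression, classification) are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one continuous and one categorical column
df = pd.DataFrame({
    "fare": [7.25, 71.28, np.nan, 8.05, 512.33],
    "embarked": ["S", "C", None, "S", "S"],
})

# Continuous variable: the median is less sensitive to outliers than the mean
df["fare"] = df["fare"].fillna(df["fare"].median())

# Categorical variable: mode() returns a Series (there can be ties),
# so take its first entry as the fill value
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

print(df)
```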
Deletion:- Any row or column that contains a missing value is deleted entirely. This results in a loss of data.
# drop all rows where any missing values are present (returns a new DataFrame; df is unchanged)
df.dropna()
# drop all columns where any missing values are present
df.dropna(axis=1)
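Because deleting every row or column with a missing value can discard a lot of data, dropna also accepts parameters that narrow what is removed. The sketch below uses subset and thresh. The DataFrame and the choice of at least 3 non-missing values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample where 'cabin' is mostly missing
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "cabin": [None, None, None, "C85"],
})

# Drop a row only when 'fare' is missing; gaps in other columns are tolerated
rows_kept = df.dropna(subset=["fare"])

# Keep only the columns that have at least 3 non-missing values
cols_kept = df.dropna(axis=1, thresh=3)

print(rows_kept)
print(cols_kept)
```

Both calls return new DataFrames and leave df unchanged.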