
Variable transformation - Data Exploration

TECH BUDDY



What is variable transformation?


Variable transformation is the process of replacing a variable with a function of that variable, for example replacing a variable x with its logarithm. A transformation changes the distribution of a variable or its relationship with other variables.


Use of variable transformation


Variable transformation is used to change the scale of a variable. For example, suppose 20 variables are measured in km and 2 are measured in miles. We can convert those two to km with a variable transformation. This kind of transformation does not change the shape of the variable's distribution.
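The rescaling case can be sketched in a few lines. This is a minimal illustration, assuming a pandas DataFrame with hypothetical column names (`distance_km`, `commute_miles`) and made-up values. It also checks that skewness, one measure of shape, is the same before and after the conversion.

```python
import pandas as pd

MILES_TO_KM = 1.609344  # one international mile expressed in kilometres

# Hypothetical data: one column already in km, one recorded in miles
df = pd.DataFrame({
    "distance_km": [5.0, 12.5, 40.0, 3.2, 21.1],
    "commute_miles": [1.0, 10.0, 26.2, 2.5, 4.0],
})

# Linear rescaling: every value is multiplied by the same positive constant
df["commute_km"] = df["commute_miles"] * MILES_TO_KM

# Skewness is unaffected by multiplying by a positive constant
skew_before = df["commute_miles"].skew()
skew_after = df["commute_km"].skew()
print(skew_before, skew_after)
```

Because only the units change, a histogram of `commute_km` would look the same as one of `commute_miles` apart from the axis labels.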



Variable transformation is used to convert non-linear relationships into linear relationships. A linear relationship is easier to comprehend than a non-linear one, and such transformations can improve a model, particularly one that assumes linearity. The log transformation is one of the most commonly used techniques.
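As a rough illustration of linearizing a relationship, the sketch below builds a hypothetical exponential relationship (the constants 2.0 and 0.5 are made up) and compares the Pearson correlation before and after taking the log of y. It is an example under those assumptions, not a general recipe.

```python
import numpy as np

# Hypothetical non-linear relationship: y grows exponentially with x
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)

# Pearson correlation measures the strength of a *linear* relationship
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

# After the log transform, log(y) = log(2.0) + 0.5 * x is a straight line,
# so a degree-1 fit recovers the slope and intercept
slope, intercept = np.polyfit(x, np.log(y), 1)
print(r_raw, r_log, slope, intercept)
```

Here the log works because the assumed relationship is exactly exponential. For real data the improvement is usually partial and worth checking with a scatter plot.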





Variable transformation is used to create symmetric distributions from skewed distributions. A symmetric distribution is easier to interpret, and some modeling techniques require variables to be normally distributed.
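One way to see the effect on symmetry is to compare skewness and the gap between mean and median before and after a transformation. The sketch below uses a small, made-up income sample (in thousands) and a log transform. The data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are very large
incomes = pd.Series(
    [18, 22, 25, 27, 30, 33, 38, 45, 60, 95, 180, 520], dtype=float
)
log_incomes = np.log(incomes)

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = incomes.skew()
skew_log = log_incomes.skew()

# In a symmetric distribution the mean and the median are close together
gap_raw = incomes.mean() - incomes.median()
gap_log = log_incomes.mean() - log_incomes.median()
print(skew_raw, skew_log, gap_raw, gap_log)
```

The two gaps are on different scales (raw units versus log units), so the skewness values are the fairer comparison.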


Common methods of variable transformation



Logarithm:- This method is commonly used to change the shape of a distribution. Taking the log of a variable reduces its right skewness. It cannot be applied to zero or negative values.

import numpy as np
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame; var1 must be strictly positive
np.log(df['var1']).plot.hist()
plt.show()
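The restriction to positive values can be checked directly. In this sketch the sample values are hypothetical, and dropping non-positive entries before taking the log is only one possible way to handle them, shown here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical values including a zero and a negative number
s = pd.Series([0.0, -2.0, 1.0, 10.0, 100.0])

# NumPy does not raise an error here: log(0) gives -inf and log(negative) gives NaN
with np.errstate(divide="ignore", invalid="ignore"):
    raw_log = np.log(s)

# One simple guard (an assumption): keep only strictly positive values
positive = s[s > 0]
safe_log = np.log(positive)
print(raw_log.tolist())
print(safe_log.tolist())
```

Whether dropping such rows is acceptable depends on the data. The point is only that the log silently produces `-inf` or `NaN` for them instead of failing.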



Square root:- Used for a right-skewed variable with zero or positive values. It cannot be applied to negative values.

np.sqrt(df['var1']).plot.hist()

Cube root:- Used for a right-skewed variable that may take positive or negative values. Its effect on skewness is weaker than that of the log transformation.

np.cbrt(df['var1']).plot.hist()
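The three transformations can be compared side by side. The sketch below first shows that the cube root accepts negative inputs while the square root does not, then measures skewness on a synthetic right-skewed sample. The lognormal sample, seed, and size are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# The cube root is defined for negative numbers; the square root is not
mixed = np.array([-8.0, -1.0, 0.0, 27.0])
cube_roots = np.cbrt(mixed)
with np.errstate(invalid="ignore"):
    square_roots = np.sqrt(mixed)  # NaN for the negative entries

# Synthetic, strictly positive, right-skewed sample (assumed for the demo)
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Skewness after each transformation; values closer to 0 are more symmetric
skews = {
    "raw": sample.skew(),
    "sqrt": np.sqrt(sample).skew(),
    "cbrt": np.cbrt(sample).skew(),
    "log": np.log(sample).skew(),
}
print(cube_roots, square_roots)
print(skews)
```

On this kind of sample the skewness should fall in the order raw, square root, cube root, log, which matches the statement that the cube root is a milder correction than the log.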

Binning:- This method converts a continuous variable into a categorical variable. The choice of bins is usually based on business understanding. For example, we can classify income into three categories: high, average, and low. We can also perform co-variate binning, which depends on the values of more than one variable.

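import pandas as pd

# Ages in (0, 20] are labelled 'children', ages in (20, 80] are labelled 'Adult'
bins = [0, 20, 80]
group = ['children', 'Adult']
df['type'] = pd.cut(df['Age'], bins, labels=group)
df['type'].value_counts()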
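The income example and co-variate binning mentioned above can be sketched as follows. The cut-off values, column names, and segment labels are all hypothetical. In practice they would come from business understanding, and `np.select` is just one convenient way to combine conditions on several variables.

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Age": [15, 34, 52, 67, 29],
    "income": [0, 28000, 95000, 41000, 150000],
})

# Single-variable binning: three income bands with assumed cut-offs.
# Intervals are right-closed: (-inf, 30000], (30000, 100000], (100000, inf)
income_bins = [-np.inf, 30000, 100000, np.inf]
income_labels = ["low", "average", "high"]
df["income_band"] = pd.cut(df["income"], bins=income_bins, labels=income_labels)

# Co-variate binning: the category depends on both Age and income.
# Conditions are checked in order and the first match wins.
conditions = [
    df["Age"] < 18,
    (df["Age"] >= 18) & (df["income"] >= 100000),
]
choices = ["minor", "high-earning adult"]
df["segment"] = np.select(conditions, choices, default="other adult")

print(df)
```

Calling `df["income_band"].value_counts()` afterwards gives the number of rows in each band, as in the age example.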

Dataset and Jupyter notebook file for practicing variable transformation
















