This white paper discusses some techniques for choosing one transformation of a variable, among many candidates, when building a regression model. It also discusses the sanity checks involved.
How to Choose Which Transformation to Use in a Regression Model

Sandeep Das, Senior Consultant, Genpact, Kolkata, India

Abstract: Transforming variables is quite common practice when building a model. In real-life scenarios it sometimes becomes necessary to transform a variable to improve model performance. One can try whatever transformation one wants, but it should be logical and meaningful in business terms, and the value added by the transformation needs to be significant compared with using no transformation. That is, if the transformation adds only marginal explanatory power, the original variable is recommended. In this paper we discuss some techniques for choosing one transformation among many candidates. We also discuss the sanity checks that we need to go through.

The Concept: Take a variable X and let f be the transformation function applied to it, so that the transformed variable is f(X). The first-order derivative f' of this function can be positive or negative. For example, with the square transformation f'(X) = 2X, so f' > 0 (assuming the values of X are greater than 0), whereas with the inverse transformation f' < 0 (provided all values of X are non-zero).

Example and Interpretation of Transformation:

Variable: Outstanding Balance (Nature: Continuous)
Log: A change in the log of the variable can be interpreted as the rate of change in outstanding balance when the variable changes by 1 unit.
Square Root: For variables like income or balance, which can take huge values, it helps to scale down the huge values.
Inverse: Not recommended, as the outstanding balance variable may take the value zero for a number of accounts.

Variable: Utilization (Nature: Percentage)
Log: Will shift the transformed value to negative if the variable is a proper fraction, that is, if the percentage is less than 100%.
Square Root: Helps to address skewness in the distribution.
Inverse: Helps to assign higher weight to lower values.

Variable: Inquiries (Nature: Number)
Log: Helps to scale down fluctuations and range.
Square Root: The effect depends on whether the value of the variable is less than, equal to, or greater than 1.
Inverse: Helps to assign higher weight to lower values.
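The notes on each variable above largely come down to where a transformation is defined and in which direction it moves the values. A minimal Python sketch of those checks follows. The function names and the sample values are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import math

def log_transform(x):
    """Natural log; defined only for x > 0."""
    if x <= 0:
        raise ValueError("log needs x > 0")
    return math.log(x)

def sqrt_transform(x):
    """Square root; defined only for x >= 0."""
    if x < 0:
        raise ValueError("square root needs x >= 0")
    return math.sqrt(x)

def inverse_transform(x):
    """Inverse 1/x; undefined at x == 0."""
    if x == 0:
        raise ValueError("inverse needs x != 0")
    return 1.0 / x

def is_applicable(transform, values):
    """Return True if the transform is defined for every value."""
    try:
        for v in values:
            transform(v)
    except ValueError:
        return False
    return True

# Hypothetical sample values, for illustration only.
outstanding_balance = [0.0, 1500.0, 250000.0]  # zero balances do occur
utilization = [0.25, 0.80, 1.10]               # fractions; 1.10 means 110%

# Inverse (and log) are ruled out for a balance that can be zero.
balance_inverse_ok = is_applicable(inverse_transform, outstanding_balance)
balance_sqrt_ok = is_applicable(sqrt_transform, outstanding_balance)

# Log of a utilization below 100% is negative, above 100% positive.
log_utilization = [log_transform(u) for u in utilization]
```

Running such a check before any modeling is one way to drop candidates that are undefined for a variable, such as the inverse of a balance that contains zeros.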
The Problem: In theory we can try as many types of transformation as we like. If we have K transformations in mind and N variables, we generate N x K derived variables. The question then becomes how to cut this long list down, that is, how to choose the best among them.

Way Out - Proposed Solution: To solve this problem we first need to decide the basis on which transformations will be eliminated, which depends on the type of modeling we are doing.

Step 1: The following questions need to be answered.

Q1: Objective - Transformation of Y (Dependent) or X (Independent). This is crucial because the objective of transformation generally differs for the dependent (Y) and independent (X) variables. When we transform Y, in most cases the objective is to minimize fluctuation, scale values down (or up), or tackle a skewed distribution. In this case it is advisable to choose the transformation by looking at the univariate distribution alone. A Box-Cox transformation can help arrive at a transformed variable that is approximately normally distributed. When choosing a transformation for independent variables, on the other hand, bivariate or multivariate analysis should take priority over univariate analysis, because there the objective is normally to choose a transformation that increases the predictive power of the model, in other words one that minimizes the residuals. For continuous independent variables, we could look at PROC GAM plots to see whether non-linear or piecewise linear fits are appropriate. This would also sometimes improve the fit. However, the challenge is to unearth the actual pattern and make business sense of it.

Q2: Nature of Variable. The X variables can be discrete or continuous. If a variable is discrete, binning or creating dummy variables is the only option, whereas for continuous variables we can try out different transformations. For continuous variables it is advisable to perform pre-modeling steps such as outlier detection and missing value treatment before choosing a transformation.
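One way to cut down the candidates for a single independent variable is to compare the fit of each transformed version against the untransformed one, and to keep a transformation only when the gain is more than marginal. The sketch below is an illustration under stated assumptions, not the author's procedure: it uses the R-squared of a one-variable least-squares fit as the measure of explanatory power, the set of K = 4 transformations is hypothetical, and the `min_gain` threshold of 0.02 is an arbitrary placeholder.

```python
import math

def r_squared(x, y):
    """R^2 of a one-variable least-squares fit of y on x (squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series explains nothing
    return sxy * sxy / (sxx * syy)

# Hypothetical candidate set (K = 4), for illustration only.
TRANSFORMS = {
    "log": math.log,
    "sqrt": math.sqrt,
    "inverse": lambda v: 1.0 / v,
    "square": lambda v: v * v,
}

def rank_transformations(x, y, min_gain=0.02):
    """Return (chosen name, its R^2, all candidate results sorted by R^2).

    min_gain is an assumed threshold: a transformation is kept only if it
    beats the untransformed variable by at least this much R^2; otherwise
    "none" is returned, meaning the original variable is kept.
    """
    base = r_squared(x, y)
    results = []
    for name, transform in TRANSFORMS.items():
        try:
            transformed = [transform(v) for v in x]
        except (ValueError, ZeroDivisionError):
            continue  # transformation undefined for this variable
        results.append((name, r_squared(transformed, y)))
    results.sort(key=lambda pair: pair[1], reverse=True)
    if not results or results[0][1] - base < min_gain:
        return "none", base, results
    return results[0][0], results[0][1], results

# Hypothetical data in which y is linear in log(x).
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
best_name, best_r2, all_results = rank_transformations(x, y)
```

Calling the function once per variable covers the full N x K list. In a multivariate model the same comparison would be made on the full model fit rather than on a one-variable fit.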
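For the dependent variable, the Box-Cox idea mentioned under Q1 can be sketched as a search over the power parameter lambda using only the univariate distribution of Y. This is a minimal illustration, assuming strictly positive values and a coarse grid from -2 to 2. The helper names and sample values are hypothetical, and a statistical package would normally be used for this in practice.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or ln(y) when lam == 0."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_log_likelihood(y, lam):
    """Normal profile log-likelihood of lam, up to an additive constant.

    Assumes y is not constant, so the transformed variance is positive.
    """
    n = len(y)
    z = box_cox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)

def best_lambda(y, grid=None):
    """Pick the grid value of lam with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
    return max(grid, key=lambda lam: profile_log_likelihood(y, lam))

# Hypothetical right-skewed values, for illustration only.
y = [1.0, 2.0, 4.0, 8.0, 16.0]
lam = best_lambda(y)
y_transformed = box_cox(y, lam)
```

A chosen lambda close to 0 points to a log transformation and one close to 0.5 to a square root, which ties the result back to the named transformations discussed earlier.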