Variable Reduction in Logistic & Choosing the correct transformations

Arijit Das, Senior Consultant of Genpact, Kolkata, INDIA

ABSTRACT: This paper presents a tool for the problem of plenty, where we have a large number of candidate variables and the regression function used is logistic. The method works through a large number of transformations of each variable, helps you choose the best form of each variable, and quickly reduces the set of variables under consideration, with the aim of a robust and statistically sound model that stands the test of time.

UNDERLINED BENEFITS: The method discussed here, choosing transformations and reducing the many transformed variables to the vital few, has proved very effective in the highly competitive US card environment, where external data from agencies and internal data are robust and plentiful. The technique saves time and searches through various transformations to find the best fit of a variable to the dependent variable. In a comparison with a widely accepted method from a reputed firm, the new technique compared very favorably (Figure 1).
Figure 1

INTRODUCTION: As students of statistics know, logistic regression, given its equation form, poses two major problems: first, choosing the right variables, and second, getting around multicollinearity (MC). This paper is concerned with cases where there are a huge number of observations and a huge number of variables, typically the problem of plenty. In such situations MC is not a problem at all; as Kent Leahy aptly says, "A common solution to get rid of MC, therefore, has been to delete one or more of the offending collinear model variables or to use factor or principal components analysis to reduce the amount of redundant variation present in the data. MC, however, is not always harmful, and deleting a variable or variables under such circumstances can be the real problem. Unfortunately, this is not well understood by many in the industry, even among those with substantial statistical backgrounds." It should be appreciated that if there were no correlation between predictors, this regression form would reduce to a mere series of bivariate regressions; it is these relationships between variables that actually give life to logistic regression.

FIT THE CURVE TO THE DATA: When the number of variables is very high, choosing the elite few is a problem, and factor or principal component techniques are not useful here. What I propose is this: if we know that logistic is the functional form for the predicted variable, which is often the case in marketing response problems with 1/0 data, then the best way to find out which variables are significant is to use logistic regression itself. For this, I have built simple code using standard SAS options that can be used as an effective tool. It has delivered very high-impact projects and has performed extremely well in very competitive credit card acquisition campaigns and lifecycle campaigns.
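To make the idea of "using logistic itself" concrete, the single-predictor model that such a screen would fit can be written as below. This is only a sketch: the symbols f (a candidate form of a variable x) and the coefficients β0 and β1 are notation introduced here for illustration and do not come from the paper.

```latex
\[
P(Y = 1 \mid x) \;=\; \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1 f(x))\bigr)},
\qquad
\log\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} \;=\; \beta_0 + \beta_1 f(x).
\]
```

Under this reading, a candidate form f is worth keeping when the estimate of β1 is significantly different from zero, and competing forms of the same variable can be compared by how strongly each one rejects β1 = 0.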
STEP TO EFFECTIVE TRANSFORMATIONS: This step takes a variable and makes 19 transformations, including squares, cubes, their roots, inverses, logarithms, and some sine and cosine transformations. The idea is to make the set as exhaustive as possible; these are transformations that I have seen come out significant in the projects and models I have built. Consider a dataset called sample, which holds all the variables and the Y variable, called response. In the first step, make transformations of the variable concerned, say var2 (Ref: Figure 2).

    /* Part One of Code */
    data test1;
      set sample(keep = response var2);
      var2_sq   = var2**2;                   /*squared*/
      var2_cu   = var2**3;                   /*cubed*/
      var2_sqrt = sqrt(var2);                /*square root*/
      var2_curt = var2**.3333;               /*cube root*/
      var2_log  = log(max(.0001,var2));      /*log*/
      var2_exp  = exp(max(.0001,var2));      /*exponent*/
      var2_tan  = tan(var2);                 /*tangent*/
      var2_sin  = sin(var2);                 /*sine*/
      var2_cos  = cos(var2);                 /*cosine*/
      var2_inv  = 1/max(.0001,var2);         /*inverse*/
      var2_sqi  = 1/max(.0001,var2**2);      /*squared inverse*/
      var2_cui  = 1/max(.0001,var2**3);      /*cubed inverse*/
      var2_sqri = 1/max(.0001,sqrt(var...
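Once candidate columns such as var2_sq and var2_log exist in test1, one possible way to compare them is to fit a one-predictor logistic model per column and rank the columns by the Wald chi-square of the slope. The sketch below is an assumption about how such a screen could look, not the paper's own next step: the macro name screen_one and the datasets pe_... and screen_results are hypothetical, while test1, response, and the var2 columns follow the code above.

```sas
/* Hypothetical helper (not from the paper): fit response on ONE      */
/* candidate column of test1 and keep the slope's Wald chi-square.    */
%macro screen_one(col);
  /* Capture the ParameterEstimates table that PROC LOGISTIC produces */
  ods output ParameterEstimates=pe_&col;
  proc logistic data=test1;
    model response(event='1') = &col;
  run;

  /* Keep only the slope row and label it with the candidate's name */
  data pe_&col;
    length candidate $32;
    set pe_&col;
    where Variable ne 'Intercept';
    candidate = "&col";
    keep candidate Estimate WaldChiSq ProbChiSq;
  run;

  /* Accumulate one row per candidate (base is created if absent) */
  proc append base=screen_results data=pe_&col force;
  run;
%mend screen_one;

/* Example calls for a few of the columns built in Part One */
%screen_one(var2);
%screen_one(var2_sq);
%screen_one(var2_log);

/* Strongest single-predictor fit first */
proc sort data=screen_results;
  by descending WaldChiSq;
run;
```

Two cautions apply to a screen like this. Transformations such as sqrt(var2) return missing values for negative inputs, and PROC LOGISTIC drops observations with a missing predictor, so different candidates may be fitted on different subsets of rows. Also, screen_results accumulates across runs, so it should be deleted before the screen is repeated.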