Why transform variables in regression

[Figure: residual plot from a straight-line fit, showing a strong curvilinear pattern in the residuals]

A residual plot like this means that you weren't paying any attention at all to the data before conducting the analysis. The first step to ANY analysis should always be to plot and examine the data. The pattern of residuals seen above is the result of trying to fit a straight line to a curvilinear relationship. It might seem like this would be a difficult mistake to make, but I have seen it done more than once.

The data, a relationship between sunfish mass and sunfish length, are plotted below with the ill-advised regression line from the above residual plot. Clearly, the assumption of a linear relationship is violated in this example.

Unlike transformations that seek to stabilize the variance or improve normality, transformations intended to make a relationship linear are generally applied to the independent variable X. This is an important point. I have seen a lot of cases where transformations were applied for no particular reason, or simply because they were common transformations.

Transformation of the data should be done only to correct a known issue with the data. For regression, it is the independent variable X that is first transformed to try to meet the linearity assumption. If this fails, transformation of the dependent variable may also be attempted (i.e., a double log transformation).

If transformation succeeds in producing a linear relationship, then problems with normality or homoscedasticity are addressed by transformation of the dependent variable Y. We dealt with issues of normality and homogeneity first in this lesson because they apply to all of the analyses we have done thus far, but hopefully it is an obvious point that one first has to determine whether the relationship can be transformed to be linear before addressing other issues with the data set.

Relationships that are not linear, but can be transformed to become linear, are referred to as intrinsically linear. My approach to addressing transformations for linearity is to transform the independent variable in several ways, and simply plot the data to see which relationship appears to be the most linear, as in the sketch below.
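As a rough illustration of that approach, here is a minimal sketch in Python. The sunfish data are not reproduced here, so the `length` and `mass` arrays are simulated stand-ins; the idea is simply to plot Y against several candidate transformations of X side by side and look for the panel that appears most linear.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins for the sunfish measurements described above.
rng = np.random.default_rng(1)
length = rng.uniform(40, 200, 100)                      # hypothetical lengths
mass = 1e-5 * length**3.1 * rng.lognormal(0, 0.1, 100)  # hypothetical masses

# Common candidate transformations of the independent variable X.
transforms = {
    "X (untransformed)": length,
    "sqrt(X)": np.sqrt(length),
    "log10(X)": np.log10(length),
    "1/X": 1 / length,
}

# Plot Y against each transformed X and eyeball which panel looks most linear.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, x) in zip(axes, transforms.items()):
    ax.scatter(x, mass, s=10)
    ax.set_xlabel(name)
    ax.set_ylabel("mass")
plt.tight_layout()
plt.show()
```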

For this data set the most common transformations of the independent variable failed to linearize the relationship, and so a double log plot was employed. The double log transformation did make the relationship linear. If you tilt your head to the left, it may look as though there are variance issues, but the apparent wedge pattern in this case is simply the result of having very few observations of larger fish (it's a demography thing). The relationship determined for this analysis takes the form:

log10(mass) = b0 + b1 · log10(length)

The equation takes this form because we did the regression analysis on the logarithm of Y and the logarithm of X instead of on the actual values of Y and X. Make certain that you understand this before going further!

The slope and Y-intercept that were derived are for the relationship between the log of X and the log of Y. In order to calculate an estimate of Y (mass) for each value of X (length), we just need a little algebra: make each side of the equation an exponent of 10 to solve for Y, which gives

Y = 10^b0 · X^b1

This will allow us to plot the data at the original scale, and give us a curved regression line that will make us look oh so clever.
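Here is a minimal sketch of that workflow, again using simulated stand-ins for the sunfish data: fit the straight line on the double-log scale, then back-transform with 10^b0 · X^b1 to draw the curved line on the original axes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins for the sunfish data (hypothetical values).
rng = np.random.default_rng(2)
length = rng.uniform(40, 200, 100)
mass = 1e-5 * length**3.1 * rng.lognormal(0, 0.1, 100)

# Fit the straight line on the double-log scale:
#   log10(mass) = b0 + b1 * log10(length)
b1, b0 = np.polyfit(np.log10(length), np.log10(mass), 1)

# Back-transform by making each side an exponent of 10:
#   mass_hat = 10**b0 * length**b1
x_grid = np.linspace(length.min(), length.max(), 200)
mass_hat = 10**b0 * x_grid**b1

# Plot the data at the original scale with the curved regression line.
plt.scatter(length, mass, s=10, label="data")
plt.plot(x_grid, mass_hat, color="red", label="back-transformed fit")
plt.xlabel("length")
plt.ylabel("mass")
plt.legend()
plt.show()
```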

In the preceding example, transformation of both the dependent variable and independent variable was required to achieve linearity. As mentioned previously, one should first attempt to make the relationship linear by transformation of the independent variable.

The figures below show examples of curvilinear relationships that can be made linear by transformation of the independent variable, so that you can get an idea of what transformations to try for specific patterns. The first example shows an intrinsically linear function that can be made linear through square root transformation of the independent variable.

If the animation does not work, or if you want to examine the individual graphs, they can be viewed HERE. Do not be concerned that the direction of the relationship changes with inverse transformation. Once you have completed the regression analysis and completed the back transformation (by solving for Y), your curved line should fit the data nicely.
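For instance, here is a sketch of an inverse transformation of X, using simulated data with hypothetical names. The relationship flips direction on the transformed scale, but once predictions are made on the original scale the fitted curve tracks the data as expected.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data with a hypothetical inverse relationship: Y = 2 + 5/X + noise.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 80)
y = 2 + 5 / x + rng.normal(0, 0.2, 80)

# Regress Y on the transformed predictor 1/X. The direction of the
# relationship flips on the transformed scale -- that is expected.
b1, b0 = np.polyfit(1 / x, y, 1)

# "Back-transformation" here just means predicting with 1/X again,
# which gives a curved line on the original scale.
x_grid = np.linspace(x.min(), x.max(), 200)
y_hat = b0 + b1 / x_grid

plt.scatter(x, y, s=10)
plt.plot(x_grid, y_hat, color="red")
plt.xlabel("X (original scale)")
plt.ylabel("Y")
plt.show()
```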

One of the most common transformations is the log transformation. It is so popular that it is often applied without any real reason for doing so! We have covered the steps out of order, so now would be a good time to summarize the steps involved in transformation for least-squares linear regression. The first step is always to graph the data. Suppose, for example, we repeat the analysis using a quadratic model to transform the dependent variable. For a quadratic model, we use the square root of y, rather than y, as the dependent variable.

Using the transformed data, our regression equation takes the form:

sqrt(y) = b0 + b1 · x

The residual plot above shows residuals based on predicted raw scores from the transformed regression equation. The plot suggests that the transformation to achieve linearity was successful. The pattern of residuals is random, suggesting that the relationship between the independent variable x and the transformed dependent variable (square root of y) is linear. And the coefficient of determination was 0.
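A minimal sketch of that check, with simulated data in place of the source's data set: fit sqrt(y) as the dependent variable, square the fitted values to get predicted raw scores, then examine the raw-scale residuals and coefficient of determination.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data in which sqrt(y) is (approximately) linear in x.
rng = np.random.default_rng(4)
x = np.linspace(1, 20, 60)
y = (1.5 + 0.8 * x + rng.normal(0, 0.4, 60))**2

# Fit the line on the transformed scale: sqrt(y) = b0 + b1 * x
b1, b0 = np.polyfit(x, np.sqrt(y), 1)

# Predicted raw scores: square the fitted values to undo the transform.
y_hat = (b0 + b1 * x)**2
residuals = y - y_hat

# Coefficient of determination on the raw scale.
r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
print(f"raw-scale R^2: {r2:.3f}")

# Residuals vs. predicted raw scores: a random scatter suggests
# the transformation to achieve linearity was successful.
plt.scatter(y_hat, residuals, s=10)
plt.axhline(0, color="gray", lw=1)
plt.xlabel("predicted y (raw scale)")
plt.ylabel("residual")
plt.show()
```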

The transformed data resulted in a better model. In the context of regression analysis, which of the following statements is true?

(A) A linear transformation increases the linear relationship between variables.
(B) A logarithmic model is the most effective transformation method.
(C) A residual plot reveals departures from linearity.

The correct answer is (C).

The results of fitting a simple regression model to the logged variables are shown below. The model has been given the name "Log-log model" rather than the default "Model 2". The slope coefficient of the log-log model, along with the rest of the output, appears below. In addition to its theoretical support and the simple and unit-free interpretation of its slope coefficient, this model fits the data much better than the original one in a number of ways.

First, this model cannot make illogical negative predictions for sales: when the forecasts for log sales are transformed back into real units of cases by applying the exponential (EXP) function to them, they are necessarily positive numbers at all price levels. Second, the forecast errors of this model in real units are smaller on average than those of the original model, in root-mean-square terms, as shown by some additional calculations in the accompanying Excel file.

See columns M and O there. Third, there is less of a time pattern in the errors of this model: the lag-1 autocorrelation of the errors is only 0. If there is a strong time pattern in the errors, it means the model has overlooked or misidentified some property of the time pattern in the data, leaving room for improvement. Fourth, as seen in the line fit plot and residual-vs-predicted plot, the variance of the errors in log units is approximately the same for large and small predictions.
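The accompanying Excel file is not reproduced here, but as a sketch, the same checks (positive back-transformed forecasts, RMSE in real units, lag-1 autocorrelation of the errors) can be computed as below. All series are simulated stand-ins, not the actual sales data.

```python
import numpy as np

# Simulated stand-ins for observed sales and the two models' forecasts.
rng = np.random.default_rng(5)
sales = np.exp(4 + rng.normal(0, 0.3, 50))               # observed sales (cases)
log_forecast = np.log(sales) + rng.normal(0, 0.1, 50)    # log-log model, log units
linear_forecast = sales * (1 + rng.normal(0, 0.2, 50))   # original model, real units

# Back-transform log forecasts into real units -- necessarily positive.
forecast = np.exp(log_forecast)

# Compare forecast errors in real units, in root-mean-square terms.
rmse_log_model = np.sqrt(np.mean((sales - forecast)**2))
rmse_linear = np.sqrt(np.mean((sales - linear_forecast)**2))
print(f"RMSE, log-log model:  {rmse_log_model:.2f}")
print(f"RMSE, original model: {rmse_linear:.2f}")

# Lag-1 autocorrelation of the errors: values near zero suggest no
# leftover time pattern for the model to exploit.
errors = sales - forecast
lag1 = np.corrcoef(errors[:-1], errors[1:])[0, 1]
print(f"lag-1 autocorrelation of errors: {lag1:.2f}")
```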


