The idea here is to quickly run through the basics of regression; consider this a revision of the core concepts.

Revision Series

The idea behind this series is to quickly run through some basic concepts of Statistics and Data Science. The following article is a collection of quick notes that will help you revise the topic.

Regression Analysis

  • Form of Predictive Modeling
  • Investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables.

Simple Linear Regression

  • Mathematically, Y = a + bX where
    • Y = Dependent (target) variable
    • X = Independent (predictor) variable
    • a = Intercept (constant)
    • b = Slope coefficient
    • In other words, a one-unit change in X corresponds to a change of b units in Y.
  • Line of Best Fit:
    • The regression line that minimizes the total squared difference between the actual and predicted values.
    • Let Y1 be the vector of predicted target values; then the error (residual) e = Y – Y1.
    • In other words, SSE (Sum of Squared Errors) = Sum(e^2) is what we minimize; when the model includes an intercept, the residuals automatically sum to 0.
    • Total sum of squares (SST) = sum of squares due to regression (SSR) + sum of squared errors (SSE), where SST = Total Variation = Sum((Y – mean(Y))^2).
  • Method Of OLS
    • The method of Ordinary Least Squares (OLS) is a technique used to estimate the coefficients in the regression equation.
    • The least squares principle states that the sum of the squared distances between the observed values of the dependent variable and the predicted values is minimized, giving the line of best fit.
    • Mathematical properties that make OLS worthwhile:
      • Regression line always passes through the sample means of Y and X.
      • Mean of the estimated (predicted) Y value is equal to the mean value of the actual Y.
      • Most important, however, is the required distribution of the error terms, discussed further in the assumptions section below.
  • R Square/ Adjusted R Square
    • R Square: the coefficient of determination, indicating how well the model explains the variation. R Square = SSR/SST = 1 – SSE/SST.
      • It never decreases when a predictor is added to a model, which can be completely misleading.
      • Too many terms make the model prone to over-fitting.
    • Adjusted R Square: penalizes the model if adding a new variable does not have a significant impact in explaining the variation. (A worked numerical sketch follows this list.)
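
To make the formulas above concrete, here is a minimal sketch in Python (NumPy only) that fits Y = a + bX by OLS on synthetic data and computes SSE, SST, SSR, R Square and Adjusted R Square. The data-generating process and variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Synthetic data (illustrative assumption): Y = 2 + 3*X plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)
Y = 2 + 3 * X + rng.normal(0, 2, size=100)

# OLS estimates for Y = a + b*X
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

Y1 = a + b * X        # predicted target values
e = Y - Y1            # residuals; with an intercept they sum to (almost) zero

SSE = np.sum(e ** 2)                  # sum of squared errors
SST = np.sum((Y - Y.mean()) ** 2)     # total sum of squares
SSR = SST - SSE                       # sum of squares due to regression

r2 = SSR / SST                        # R Square = 1 - SSE/SST
n, k = len(Y), 1                      # k = number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.3f}, adjusted R^2={adj_r2:.3f}")
```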

  • Assumptions of Regression (a combined diagnostics sketch in Python follows this list):
    • Linearity (of Parameters) and Additivity of the relationship between dependent and independent variables:
      • Expected value of dependent variable is a straight-line function of each independent variable, holding other variables constant.
      • Effects of different independent variables on the expected value of the dependent variable are additive.
      • Diagnose:
      • Plot observed versus predicted values or residuals versus predicted values.
      • Points should be symmetrically distributed around a diagonal line in the former plot or around horizontal line in the latter plot, with a roughly constant variance. In multiple regression models, non-linearity or non-additivity may also be revealed by systematic patterns in plots of the residuals versus individual independent variables.
      • Treatment:
      • Apply a nonlinear transformation to the dependent and/or independent variables, for example a log transformation for a variable with strictly positive data.
      • If a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows (or decays) exponentially as a function of the independent variables.
      • If a log transformation is applied to both the dependent variable and the independent variables, this is equivalent to assuming that the effects of the independent variables are multiplicative rather than additive.
      • Consider adding a non-linear regressor, that is, try a polynomial regression to see if it gives a better fit. Such a transformation is possible even for negative values.
      • If the relationship between Y and X depends on another regressor Z, you can try a product of X and Z (an interaction term) as a regressor, assuming business logic permits.
    • Statistical Independence of the errors
      • When the residuals are autocorrelated, it means that the current value depends on the previous (historic) values and that there is a definite unexplained pattern in the Y variable that shows up in the disturbances (the random component of the time series).
      • Diagnose:
      • Plot residual time series plot (residuals vs. row number) and a table or plot of residual autocorrelations. We want the autocorrelations to fall within the 95% Confidence Interval.
      • The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1. A rule of thumb is that values in the range of 1.5 to 2.5 are relatively normal, while a value above 2.5 (below 1.5) points to negative (positive) autocorrelation.
      • For non-time-series data, residuals should be randomly and symmetrically distributed around zero under all conditions; in particular, there should be no correlation between consecutive errors.
      • Treatment:
      • Consider adding lags of the dependent variable and/or lags of some of the independent variables to correct the minor cases of autocorrelation.
      • Differencing drives autocorrelations in the negative direction, and too much differencing may lead to artificial patterns of negative correlation that lagged variables cannot correct for. In case of significant negative autocorrelation, check if the variables have been over-differenced.
      • If autocorrelation is noticed for lags at a seasonal periodicity, then either seasonally adjust the variables, or use seasonal lags / seasonally differenced variables or add seasonal dummy variables.
      • Autocorrelation in a time series may also reflect a structural problem: some time series models are based on the assumption of stationarity, which can be achieved by the correct combination of differencing, logging, and/or deflating.
      • For non-time-series violations of independence, check whether the linearity assumption holds and whether any bias was introduced by omitted variables.
    • Homoscedasticity
      • Means constant variance, that is, a situation in which the variance of the error term is the same across all values of the independent variables.
      • Heteroscedasticity (violation of homoscedasticity) might give too wide/narrow confidence intervals (C.I.). With increasing variance of errors, the C.I. will be unrealistically narrow. It may also result in inaccurate coefficients due to too much weight being given to a small subset of the data. To put it statistically, OLS estimates are no longer BLUE; that is, among all the unbiased estimators, OLS no longer provides the estimate with the smallest variance.
      • Diagnose:
      • Plot residuals versus predicted values or, for time series data, residuals versus time. Check whether the residuals follow an increasing trend over time or with the predicted values; heteroscedasticity shows up as errors that systematically get larger in one direction by a significant amount.
      • The Breusch-Pagan test checks whether the variance of the errors from a regression depends on the values of the independent variables; if so, heteroscedasticity is present. The Non-constant Variance Score Test (NCV) can be used as well.
      • Goldfeld–Quandt test involves dividing a dataset into two parts or groups, and hence the test is sometimes called a two-group test.
      • Treatment:
      • Respecify the Model/Transform the Variables. Some combination of logging and/or deflating will often stabilize the variance in case of Time-series data. A log transformation applied to the dependent variable may be appropriate if the residual-versus-predicted plot shows that the size of the errors is proportional to the size of the predictions.
      • OLS assumes that errors are both independent and identically distributed. Robust standard errors relax these assumptions; they do not change the coefficient estimates, but because the standard errors are corrected, the test statistics give reasonably accurate p-values. Further, Weighted Least Squares (WLS) also corrects the bias in the standard errors and gives more efficient estimates, but WLS is harder to implement, leaving robust standard errors as the more commonly used solution.
      • Seasonal patterns in the data are a common source of heteroscedasticity in the errors. Larger errors will exist in seasons where activity is greater, that is, a seasonal pattern of changing variance will be visible in the residual-vs-time plot. A log transformation is often used to address this problem. Another option is to carry out seasonal adjustments to the data.
    • Normality of residuals
      • Sometimes the error distribution is “skewed” by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates.
      • Calculation of confidence intervals and various significance tests for coefficients are all based on the assumptions of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
      • Diagnose:
      • Consider a normal probability plot or a normal quantile plot of the residuals. If the distribution is normal, the points on a normal quantile plot should fall close to the diagonal reference line. We can also evaluate the skewness and kurtosis of the variable in question.
      • There are also a variety of statistical tests for normality, including the Kolmogorov-Smirnov test, the Shapiro-Wilk test, the Jarque-Bera test, and the Anderson-Darling test.
      • In the case of testing for normality of the distribution, Kolmogorov-Smirnov test takes samples which are standardized and compared with a standard normal distribution. Studies suggest that the test is less powerful for testing normality than the Shapiro–Wilk test or Anderson–Darling test. However, these other tests have their own disadvantages such as the Shapiro–Wilk test is known not to work well in samples with many identical values.
      • Treatment:
      • A nonlinear transformation of variables might cure the non-normality/non-linearity. The dependent and independent variables in a regression model do not need to be normally distributed by themselves; only the prediction errors need to be normally distributed. Variables that are random but extremely asymmetric or long-tailed might not fit into a linear model whose errors are normally distributed. Normality of the errors (an idea based on the Central Limit Theorem (CLT)) will not hold if the underlying sources of randomness do not interact additively.
      • The problem with the error distribution could be due to one or two very large errors. Check if these are genuine data points.
      • The data may contain two or more subsets with different statistical properties, in which case separate models should be built or one subset removed entirely.
    • Multicollinearity
      • When one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
      • In the presence of multicollinearity, the estimate of one variable’s impact on the dependent variable Y while controlling for the others tends to be less precise. Also, small changes to the input data can lead to large changes in the model, even resulting in changes of sign of parameter estimates.
      • Multicollinearity might lead to over-fitting in Regression Models.
      • Diagnose:
      • Consider the Variance Inflation Factor (VIF), a metric computed for every variable that goes into a linear model. If the VIF of a variable is high, the information in that variable is already explained by other variables currently present in the model, that is, the variable is largely redundant. The lower the VIF (< 2), the better.
      • A statistical test such as Farrar–Glauber test can be used which checks for orthogonality of the variables. If the variables are orthogonal then there is no multicollinearity.
      • Treatment:
      • Either iteratively remove the variable with the highest VIF or see correlation between all variables and keep only one of all highly correlated pairs.
      • Obtain more data, if possible.
      • Mean-center the predictor variables.
      • Ridge regression or principal component regression or partial least squares regression can be used.
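
The diagnostics mentioned above (residual inspection, Durbin-Watson, Breusch-Pagan, a normality test, and VIF) can all be run with standard Python tooling. Below is a minimal sketch using statsmodels and scipy on synthetic data; the data-generating process and the variable names (x1, x2, y) are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data (illustrative assumption): y depends linearly on x1 and x2,
# with x2 mildly correlated with x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.9, size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
model = sm.OLS(y, X).fit()
resid = model.resid

# 1. Linearity/additivity: plot resid vs. model.fittedvalues (e.g. with matplotlib);
#    points should scatter symmetrically around zero with no systematic curvature.

# 2. Independence of errors: Durbin-Watson, ~2 means little lag-1 autocorrelation.
print("Durbin-Watson:", durbin_watson(resid))

# 3. Homoscedasticity: Breusch-Pagan test, a small p-value suggests heteroscedasticity.
bp_lm, bp_pvalue, _, _ = het_breuschpagan(resid, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)

# 4. Normality of residuals: Shapiro-Wilk test, a small p-value suggests non-normality.
sw_stat, sw_pvalue = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", sw_pvalue)

# 5. Multicollinearity: VIF per predictor, a high VIF flags redundant information.
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.2f}")

# One treatment from the notes: robust (heteroscedasticity-consistent) standard errors.
robust_model = sm.OLS(y, X).fit(cov_type="HC3")
```

The rules of thumb quoted in the notes (Durbin-Watson between 1.5 and 2.5, VIF below 2) can be applied directly to the printed values.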

What’s Next?

Regression is a vast topic, and although some of the basic concepts are touched upon above, there is still a lot to discuss to develop a better understanding of regression. In the next section we will talk about the different types of regression.
