Linear regression is one of the most frequently used supervised learning techniques. It is useful when the dependent variable is continuous (ratio or interval scale) and there is a linear relationship between the dependent and independent variables. This post is a quick guide to performing linear regression in R and interpreting the model results.
In this example, the “Longley” dataset is used to illustrate linear regression in R. The “Longley” dataset contains 7 economic variables observed yearly from 1947 to 1962; six of them are used to predict the seventh, the number of people employed each year.
    ## Load the dataset. We will be using the longley dataset for analysis.
    data("longley")
    ## Check the data (first 5 rows)
    head(longley, 5)
    ##      GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
    ## 1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
    ## 1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
    ## 1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
    ## 1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
    ## 1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
    ## Check the structure of the data
    str(longley)
    ## 'data.frame':    16 obs. of  7 variables:
    ##  $ GNP.deflator: num  83 88.5 88.2 89.5 96.2 ...
    ##  $ GNP         : num  234 259 258 285 329 ...
    ##  $ Unemployed  : num  236 232 368 335 210 ...
    ##  $ Armed.Forces: num  159 146 162 165 310 ...
    ##  $ Population  : num  108 109 110 111 112 ...
    ##  $ Year        : int  1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
    ##  $ Employed    : num  60.3 61.1 60.2 61.2 63.2 ...
    ## Summarize the data
    summary(longley)
    ##   GNP.deflator         GNP          Unemployed     Armed.Forces
    ##  Min.   : 83.00   Min.   :234.3   Min.   :187.0   Min.   :145.6
    ##  1st Qu.: 94.53   1st Qu.:317.9   1st Qu.:234.8   1st Qu.:229.8
    ##  Median :100.60   Median :381.4   Median :314.4   Median :271.8
    ##  Mean   :101.68   Mean   :387.7   Mean   :319.3   Mean   :260.7
    ##  3rd Qu.:111.25   3rd Qu.:454.1   3rd Qu.:384.2   3rd Qu.:306.1
    ##  Max.   :116.90   Max.   :554.9   Max.   :480.6   Max.   :359.4
    ##    Population         Year         Employed
    ##  Min.   :107.6   Min.   :1947   Min.   :60.17
    ##  1st Qu.:111.8   1st Qu.:1951   1st Qu.:62.71
    ##  Median :116.8   Median :1954   Median :65.50
    ##  Mean   :117.4   Mean   :1954   Mean   :65.32
    ##  3rd Qu.:122.3   3rd Qu.:1958   3rd Qu.:68.29
    ##  Max.   :130.1   Max.   :1962   Max.   :70.55
The output above shows the data, its structure, and a summary of its variables. We will now build a regression model using the ‘lm()’ function in R. The dependent variable in this model will be ‘Employed’, and the remaining 6 variables will be the independent variables. The ‘summary()’ function prints the model summary.
    ## Fit the regression model to the data
    mod <- lm(formula = Employed ~ ., data = longley)
    ## Model Summary
    summary(mod)
    ##
    ## Call:
    ## lm(formula = Employed ~ ., data = longley)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.41011 -0.15767 -0.02816  0.10155  0.45539
    ##
    ## Coefficients:
    ##                Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
    ## GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
    ## GNP          -3.582e-02  3.349e-02  -1.070 0.312681
    ## Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
    ## Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
    ## Population   -5.110e-02  2.261e-01  -0.226 0.826212
    ## Year          1.829e+00  4.555e-01   4.016 0.003037 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.3049 on 9 degrees of freedom
    ## Multiple R-squared:  0.9955, Adjusted R-squared:  0.9925
    ## F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10
Let’s analyze each component in the model output one by one:
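The call in the output shows the function call that was used in R to fit the model. The ‘lm()’ function is fairly simple: its first argument is the formula, which names the dependent variable on the left of ‘~’ and the independent variables on the right, and its second argument is the dataset being used. In ‘Employed ~ .’, the dot stands for every other column in the dataset.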
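To make the formula shorthand concrete, here is a minimal sketch that fits the model twice, once with the dot and once with the six predictors written out, and checks that the estimates agree. The names `mod_dot`, `mod_explicit` and `same_coefficients` are made up for this illustration.

```r
# Fit the model with the `.` shorthand and with every predictor spelled out.
# Both formulas should describe the same model.
data("longley")

mod_dot <- lm(Employed ~ ., data = longley)
mod_explicit <- lm(
  Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + Year,
  data = longley
)

# TRUE when the two sets of coefficient estimates agree
same_coefficients <- isTRUE(all.equal(coef(mod_dot), coef(mod_explicit)))
print(same_coefficients)
```

Writing the predictors out is handy when only a subset of the columns should go into the model.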
Residuals are the next component in the model summary. Residuals are the differences between the actual values in the dataset and the values predicted by the model. For the model to be good, the residuals should be approximately normally distributed around zero. The output gives only a five-number summary: the median (-0.028) is close to zero and the minimum and maximum (-0.41 and 0.46) are of similar size, so the residuals look roughly symmetric, but a summary like this cannot confirm or rule out normality on its own. We can check further by plotting the residuals (Q-Q plot, histogram with a normal curve).
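The residual checks mentioned above could be sketched as follows. The plots use base R graphics; the Shapiro-Wilk test at the end is an extra check that is not part of the original walkthrough, and with only 16 observations none of these diagnostics is very powerful.

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)
res <- residuals(mod)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(res)
qqline(res)

# Histogram on the density scale, with a normal curve that uses the
# mean and standard deviation of the residuals
hist(res, freq = FALSE, main = "Residuals", xlab = "Residual")
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE)

# Shapiro-Wilk normality test (an addition for illustration)
sw <- shapiro.test(res)
print(sw)
```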
The next component in the model summary is one of the most important, as it gives us the model equation, which can be used for predicting future values. The equation for a multiple regression model is given by:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

$y$ → dependent variable
$\beta_0$ → value of the intercept
$\beta_1$ → slope for independent variable 1
$\beta_2$ → slope for independent variable 2
$\beta_n$ → slope for independent variable n
$x_1$ → independent variable 1
$x_2$ → independent variable 2
$x_n$ → independent variable n
Coefficient – Estimate
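The coefficient Estimate is the value of the coefficient that is used in the equation. The coefficient of each independent variable has a meaning. For example, 1.506e-02 for ‘GNP.deflator’ means that for every 1 unit increase in ‘GNP.deflator’, the value of ‘Employed’ increases by 1.506e-02 on average, holding the other variables constant. Based on the coefficient estimates, the equation for our model is:

$$\text{Employed} = -3482 + 0.01506\,\text{GNP.deflator} - 0.03582\,\text{GNP} - 0.02020\,\text{Unemployed} - 0.01033\,\text{Armed.Forces} - 0.05110\,\text{Population} + 1.829\,\text{Year}$$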
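As a rough check of how the estimates turn into a prediction, the sketch below applies the coefficients by hand to the first row of the data (1947) and compares the result with `predict()`. The variable names `by_hand` and `from_predict` are made up for this illustration.

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)

b <- coef(mod)               # named vector: intercept first, then the slopes
first_row <- longley[1, ]    # the 1947 observation

# Intercept plus the sum of (slope * predictor value)
predictors <- names(b)[-1]
by_hand <- b[["(Intercept)"]] +
  sum(b[predictors] * unlist(first_row[predictors]))

# The same prediction from R's built-in method
from_predict <- unname(predict(mod, newdata = first_row))

print(c(by_hand = by_hand, from_predict = from_predict))
```

Using the full-precision values from `coef()` rather than the rounded values printed in the summary matters here, because the large intercept and the ‘Year’ term nearly cancel each other.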
Coefficient – Standard Error
The coefficient Standard Error estimates how much the coefficient estimate would vary if the model were fitted to different samples of data. We want it to be small relative to the estimate itself, because a large standard error means the coefficient is not estimated precisely.
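For readers who want to see where these numbers come from, here is a small sketch that recovers the standard errors from the estimated covariance matrix of the coefficients and compares them with the summary table. The names `se_summary` and `se_vcov` are made up for this illustration.

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)

# Standard errors as reported in the summary table
se_summary <- coef(summary(mod))[, "Std. Error"]

# The same values: square roots of the diagonal of the covariance matrix
se_vcov <- sqrt(diag(vcov(mod)))

print(cbind(se_summary, se_vcov))
```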
Coefficient – t value
The coefficient t value is the estimate divided by its standard error, so it measures how many standard errors the estimate is away from 0. We want its absolute value to be large so that we can reject the null hypothesis (H0), which is ‘there is no relationship between the dependent variable and this independent variable’.
Coefficient – Pr(>|t|)
Pr(>|t|) is the p-value computed from the t value. It is used for deciding whether to reject the null hypothesis (H0) stated above. Normally, a value below 0.05 (5%) is taken as the cut-off for rejecting H0.
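The t values and p-values in the coefficients table can be reproduced from the estimates and standard errors. The sketch below assumes the usual two-sided t test on the residual degrees of freedom; the names `t_by_hand`, `p_by_hand` and `significant` are made up for this illustration.

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)
tab <- coef(summary(mod))    # matrix with the four coefficient columns

# t value = estimate / standard error
t_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual df
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(mod), lower.tail = FALSE)

print(cbind(t_by_hand, p_by_hand))

# Terms whose p-value falls below the 0.05 cut-off
significant <- names(p_by_hand)[p_by_hand < 0.05]
print(significant)
```

In the output shown earlier, the terms below the cut-off are the intercept, ‘Unemployed’, ‘Armed.Forces’ and ‘Year’.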
Residual Standard Error:
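This is a measure of the quality of the fit of the regression model. Every linear model has an error term, which is why the model's predictions are never 100% accurate. The residual standard error is roughly the typical distance between the actual and the predicted values: the square root of the sum of squared residuals divided by the residual degrees of freedom. For our model it is 0.3049 on 9 degrees of freedom (16 observations minus 7 estimated coefficients).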
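A short sketch of that calculation, compared with the value stored in the model summary (the names `rss`, `df` and `rse_by_hand` are made up for this illustration):

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)

rss <- sum(residuals(mod)^2)    # residual sum of squares
df  <- df.residual(mod)         # observations minus estimated coefficients
rse_by_hand <- sqrt(rss / df)

print(c(by_hand = rse_by_hand, from_summary = summary(mod)$sigma))
```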
Multiple R-squared measures how well the model fits the data. It tells us what percentage of the variance in the dependent (response) variable is explained by the independent (predictor) variables. For our model, the R-squared value of 0.9955 means that 99.55% of the variance in the ‘Employed’ variable is explained by the six independent variables.
Adjusted R-squared is used for evaluating the model when the number of independent variables is greater than 1. It adjusts R-squared for the number of variables in the model, so adding a variable that does not help lowers it, and it is the preferred measure for evaluating the goodness of the model.
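Both statistics can be recomputed from the residuals. The sketch below uses the standard definitions for a model with an intercept, where `n` is the number of observations and `p` the number of predictors; the variable names are made up for this illustration.

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)

y   <- longley$Employed
rss <- sum(residuals(mod)^2)     # variation left unexplained by the model
tss <- sum((y - mean(y))^2)      # total variation in the response

n <- nrow(longley)               # 16 observations
p <- length(coef(mod)) - 1       # 6 predictors (intercept excluded)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```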
The F-statistic tests whether there is any relationship between the independent (predictor) variables taken together and the dependent (response) variable. The null hypothesis is H0: there is no relationship between ‘Employed’ and the independent variables. An F-statistic well above 1 is evidence against H0; how far above 1 it needs to be depends on the number of data points and predictors, which is why the p-value is used for the decision. For our model, the F-statistic of 330.3 on 6 and 9 degrees of freedom is very high. The p-value for the F-statistic is read the same way as the Pr(>|t|) values in the coefficients output: here it is 4.984e-10, far below 0.05, so we can reject the null hypothesis (H0).
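The overall p-value is not stored directly in the summary object, but it can be recovered from the F-statistic and its degrees of freedom. A minimal sketch, where `p_value` and `f_crit` are names made up for this illustration:

```r
data("longley")
mod <- lm(Employed ~ ., data = longley)

# Named vector with elements "value", "numdf" and "dendf"
f <- summary(mod)$fstatistic

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Smallest F that would be significant at the 5% level for these df
f_crit <- qf(0.95, f[["numdf"]], f[["dendf"]])

print(c(F = f[["value"]], critical_F = f_crit, p_value = p_value))
```

Comparing the statistic with `f_crit` shows why "greater than 1" alone is not a sufficient rule: the threshold depends on both degrees of freedom.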
Relationship between R-squared and p-value in Regression
When evaluating model results, R-squared and the p-value are the two statistics used most often, and they are often confused with each other. To clear up the confusion: there is no established relationship between the two.
R-squared tells us how much of the variation in the response variable is explained by the predictor variables, while the p-value tells us whether the relationship between the predictors and the response is statistically significant. If the p-value is < 0.05 (95% confidence level), the model is considered significant.
Based on this, there are four possible combinations of the two:
- low R-squared and low p-value (p-value <= 0.05): The model doesn’t explain much of the variation in the response variable, but it is significant as per the p-value, so it is still considered better than having no model at all.
- low R-squared and high p-value (p-value > 0.05): The model doesn’t explain much of the variation in the data and is not significant. This is the worst scenario, and such a model should be discarded.
- high R-squared and low p-value (p-value <= 0.05): The model explains a lot of the variation in the data and is also significant. This is the best of the four scenarios, and the model is considered good.
- high R-squared and high p-value (p-value > 0.05): The model explains a lot of the variance in the data, but it is not significant. Such a model should not be used for predictions.
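The first combination in the list above is easy to reproduce. The sketch below uses simulated data, which is an assumption made purely for illustration and is not part of the Longley example: a weak effect measured on many observations gives a small R-squared together with a very small p-value.

```r
# Simulated data (hypothetical): a weak linear effect plus a lot of noise
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

fit <- lm(y ~ x)
s <- summary(fit)

r2 <- s$r.squared
p_value <- pf(s$fstatistic[["value"]], s$fstatistic[["numdf"]],
              s$fstatistic[["dendf"]], lower.tail = FALSE)

# Low R-squared, yet the relationship is clearly significant
print(c(r2 = r2, p_value = p_value))
```

Shrinking `n` while keeping the same weak effect pushes the p-value up without changing what R-squared is estimating, which is the sense in which the two statistics answer different questions.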