Understand Linear Regression Results in R

data science
R
statistics
machine learning
supervised learning
regression
Author

Flávia E. Rius

In this post I will write about the statistics and metrics of linear regression so that you can finally understand all of the output of the summary() function applied to an lm() object in R, in a very practical way.

Disclaimer: I am NOT a statistician, so to all statisticians out there, I apologize for any misused term. My intention here is just to help people interpret their own regression models and understand what is going on with their data in a practical way.

About linear regression

There are two main uses of linear regression: statistical inference and prediction.

In the statistical inference line, a linear regression is performed to find a relationship between a continuous dependent variable and one or more variables of any type (continuous or categorical). You can then explain the relationship between the variables.

From a predictive point of view, regression is part of supervised learning, a type of machine learning, and can be used to predict new data based on the relationship found between the variables.

These two approaches are not mutually exclusive; they can be used together. You can both explain the relationship between variables and use your model to make predictions on new data.

For more insights on this difference, I recommend this brief Nature discussion and example on RNA-seq data, especially the first four paragraphs.

Concerning the prior assumptions for fitting a linear model, mainly linearity and normality of the variables, this is a much-discussed topic among statisticians and data scientists. On the machine learning side, the assumptions do not matter that much if the model performs well at predicting the variable of interest. On the inference side, the assumptions are more important for the inference to be correct, but they are not written in stone. For example, variables that are not normally distributed might still work well in a t-test or linear regression with a large sample size. At the same time, to perform linear regressions on gene expression data from RNA-seq, several transformations are applied to the data so that it becomes normal, linear, and homoskedastic.

As I said, it is a much-discussed topic, and I suggest that you read more about your area of application to decide whether you should check the assumptions or not. I won’t dive into assumptions in this post.

Running the linear regression

First of all, we need an analysis.

We’ll use the most classic dataset in R: mtcars. It contains measures such as weight, miles per gallon, number of cylinders, etc., of 32 models of cars from 1974. You can explore the dataset further by typing ?mtcars in your R console.

Let’s check the overall distribution of the variables:

summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

 

They are all numeric, but am and vs seem to be binary, and some seem to be categorical, such as carb and cyl.

In our analysis to demonstrate the linear regression metrics and statistics, our question will be:

Does the weight of a car influence how many miles it can go per gallon of gas?

To answer that, we will use weight (wt) as the predictor (independent variable) and miles per gallon (mpg) as the predicted (dependent) variable.

Let’s fit the model:

fit <- lm(mpg ~ wt, mtcars)

 

To obtain all statistics and some of the metrics for our model, we need to use the function summary() in R.

 

summary(fit) 

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

 

Okay, so we have this ocean of information; how can we answer our question? The p-value is low, and both coefficients have three stars, which seems nice.

Let’s see one by one, by order of appearance in the output of summary(fit).

Metrics and statistics

While metrics are values that measure the performance of the model as a whole, statistics are the estimates and hypothesis-test results for each variable. Both are explained below.

To visualize some parts of the explanation, it helps to have the regression plot (mpg against wt with the fitted line) in front of you.
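A minimal base R sketch to draw it (a ggplot2 version would work just as well):

# scatter plot of the data with the fitted regression line on top
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "blue")  # the fitted regression line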

1. Residuals

 

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.5432 -2.3647 -0.1252  1.4096  6.8727 

 

“Residual” was a term invented because an executive in the drug industry did not want to admit to having “error” in the data sent to the FDA (you can find this story on page 239 of the book The Lady Tasting Tea). So yes, these are the “errors” of your model. Not that something is wrong, but the fitted line is not perfect: there are deviations between it and your real data, because other factors and randomness also affect your predicted variable (mpg). Those deviations are the residuals.

They are calculated by subtracting the predicted data from the observed data. In our case, observed miles per gallon (from mtcars) minus predicted miles per gallon (obtained using the model).
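As a quick sanity check, just a sketch, you can compute them by hand:

# residuals by hand: observed mpg minus the model's predictions
manual_resid <- mtcars$mpg - predict(fit)
# these are the same values that resid(fit) returns below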

To see all of them, instead of a summary, run:

 

resid(fit)
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
         -2.2826106          -0.9197704          -2.0859521           1.2973499 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
         -0.2001440          -0.6932545          -3.9053627           4.1637381 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
          2.3499593           0.2998560          -1.1001440           0.8668731 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
         -0.0502472          -1.8830236           1.1733496           2.1032876 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
          5.9810744           6.8727113           1.7461954           6.4219792 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
         -2.6110037          -2.9725862          -3.7268663          -3.4623553 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
          2.4643670           0.3564263           0.1520430           1.2010593 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         -4.5431513          -2.7809399          -3.2053627          -1.0274952 

 

It is possible to explore the residuals further, checking them against some of the assumptions of a linear regression and looking for hidden patterns in the data. This is a very long topic, so it will be left for another post.

2. Coefficients

Overall, the coefficients contain information about the predictors of your dependent variable. Below you can see a detailed explanation.

 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
# wt           -5.3445     0.5591  -9.559 1.29e-10 ***

 

a. Intercept

The intercept is the mean value of y when x is 0 in our model: in our example, how many miles per gallon a car weighing zero lbs would make, on average. It is also represented as β0 in the linear regression equation:

y = β0 + β1x + ε

In our example it makes no sense at all: there is no car weighing 0 lbs. Therefore we can just ignore it and interpret the coefficient of interest, β1.

b. wt

Represented by the variable name, this is the slope coefficient, β1 (sometimes called the angular coefficient). It is the slope of the line in the linear regression plot: how much y increases or decreases for each one-unit increase in x. In our example, it is how many miles per gallon are lost (the estimate is negative, -5.3) for each additional 1000 lb of car weight (the unit of measure of wt).
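To make this concrete, a small sketch using predict() with two made-up weights (2000 lb and 3000 lb, chosen only for illustration):

# predicted mpg for two hypothetical cars with wt = 2 and wt = 3 (2000 lb and 3000 lb)
predict(fit, newdata = data.frame(wt = c(2, 3)))

# the difference between the two predictions is exactly the slope, about -5.3 mpg
diff(predict(fit, newdata = data.frame(wt = c(2, 3))))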

The columns for the coefficients are:

I. Estimate

Estimates are the fitted values themselves of β0 (Intercept) and β1 (wt), explained above.

II. Std Error

Standard Error measures how far the estimate is expected to deviate from its real value. The smaller this value, the more precise the estimate. It tends to be smaller when you have a large number of observations, since it scales with the standard deviation divided by the square root of n.

III. t value

This is the t value of a Student’s t-test applied to the coefficient. The hypothesis tested is whether your estimate is different from zero, and it relies on the test statistic following a t distribution, which is very close to a normal distribution. It is calculated as Estimate/Std. Error.
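You can verify this directly from the coefficient table stored in the summary object (a quick sketch; coefs is just a local name):

# t value = Estimate / Std. Error for each coefficient
coefs <- summary(fit)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]
# should reproduce 19.858 for the intercept and -9.559 for wt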

IV. Pr(>|t|)

This is the p-value for the t-test applied to your estimate. If it is less than your chosen alpha - the error rate you agree to accept (generally 0.05) - then you can call the estimate statistically significant.

3. Signif. codes

This is just a legend for what the asterisks and dot mean concerning level of significance for the p values.

4. Degrees of freedom

# 30 degrees of freedom

 

Degrees of freedom (DF) are how many units of your data are “free to vary”.

Practically, this is how many observations you have minus the number of parameters estimated in your model (in this case, intercept and weight). This means it grows with the number of samples and shrinks as you add parameters and variables to your analysis. Degrees of freedom are crucial to determine the shape of your t distribution and, therefore, the significance (p value) of the estimates.
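In our example this is easy to check (a quick sketch):

# 32 cars minus 2 estimated parameters (intercept and wt slope)
nrow(mtcars) - length(coef(fit))
[1] 30

# R stores the same number in the fitted model
df.residual(fit)
[1] 30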

5. Residual standard error

 

# Residual standard error: 3.046

 

Residual standard error (RSE) is the average deviation of predicted values (from the model) from observed values (the ones in your dataset).

It can be calculated with:

y <- mtcars$mpg
y_hat <- predict(fit)
df <- 30 # degrees of freedom

sqrt(sum((y - y_hat)^2)/df)
[1] 3.045882

 

Statistically speaking, it is an estimate of the standard deviation of the residuals (how far predicted values fall from the real values); a measure of variation around the regression line.

Confidence intervals around predicted values are generated using the RSE, so it is an important metric of your model. A large RSE can lead to inaccurate predictions.
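If you don’t want to compute it by hand as above, the sigma() function extracts the same value from the fitted model:

sigma(fit)
[1] 3.045882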

6. Multiple R-squared

 

# Multiple R-squared:  0.7528

 

This is the famous R squared (R²): the proportion of the variance of your dependent variable, y, that is explained by your model. Simplifying: the proportion of variance explained.

Since it is a proportion, its value is between 0 and 1.

Our model explains about 75% of the variance in miles per gallon.
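For the curious, the same number can be reproduced by hand from the residuals (just a sketch; ss_res and ss_tot are local names):

# R² = 1 - residual sum of squares / total sum of squares
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - ss_res / ss_tot
# matches the 0.7528 reported by summary(fit)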

This does not mean that an R² of, say, 0.3 indicates a poor regression analysis. Some variables are only partly explained by the predictor analyzed, and if your intention is to interpret the influence of a predictor on a response variable, the model may still be useful even with a low proportion of variance explained. This happens in genomics with polygenic scores, for example, where the genetics analyzed explains only part of a trait or disease, while the rest is explained by other factors such as the environment.

Also, R² is not the best way to judge whether your model is good or not. The RMSE (Root Mean Squared Error), which is just the residuals squared, averaged, and square-rooted, is generally a better way to evaluate how error-prone your model is. The smaller the RMSE, the better your model.

 

RMSE <- sqrt(mean(resid(fit)^2))

RMSE
[1] 2.949163

 

That means that, on average, there is an error of 2.9 miles per gallon in our model’s predictions.

Keep in mind what my boss always repeats: “All models are wrong, but some are useful”, a classic phrase by George Box.

7. Adjusted R-Squared

 

# Adjusted R-squared:  0.7446

 

When there is more than one predictor in your linear regression (multiple linear regression), R² always increases when a new predictor is added, regardless of whether the model actually explains more variance. The adjusted R² corrects for that.

To compare multiple linear regression models, or models with different numbers of predictors, it is recommended to check the adjusted R-squared instead of the multiple R-squared for variance explained.
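The adjustment itself is a simple formula, shown here as a sketch (n is the number of observations, p the number of predictors; r2, n, and p are just local names):

# adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
r2 <- summary(fit)$r.squared
n  <- nrow(mtcars)  # 32 observations
p  <- 1             # one predictor (wt)
1 - (1 - r2) * (n - 1) / (n - p - 1)
# matches the 0.7446 reported by summary(fit)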

8. F-statistic

 

# F-statistic: 91.38 on 1 and 30 DF

 

F is the statistic from the F-test performed on your model. The F value is used to check whether there is any relationship at all between the response and the predictors, which is especially useful in multiple linear regression. In the example discussed here, a simple linear regression, the t-test of whether β1 is zero already answers that question; in fact, the F-statistic is just the square of the t value for wt.
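You can check that relationship yourself (a quick sketch):

# in a simple linear regression, F is the square of the slope's t value
summary(fit)$coefficients["wt", "t value"]^2

# the F-statistic stored in the summary object, with its degrees of freedom
summary(fit)$fstatistic
# both should be approximately 91.38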

9. p-value

 

# p-value: 1.294e-10

 

This is the p-value of the F-test, which tells you whether the F-test is significant. A p-value below your chosen alpha (for example, 0.05) indicates that at least one of your predictors is likely associated with your dependent variable. It is most useful for multiple regression models, which may be covered in another blog post.
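summary() prints this p-value but does not store it as a ready-made element, so if you need it programmatically it can be recomputed from the F-statistic and its degrees of freedom (a sketch; fstat is just a local name):

# recompute the overall p-value from the F-statistic
fstat <- summary(fit)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
# should give approximately 1.294e-10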

Interpretation concerns

Some transformations can be used in the linear regression to make your model more interpretable.

For example, let’s say that, like me, you are not from a country that uses pounds for weight, miles for distance, or gallons for volume.

Instead, you would like to interpret your model using kilograms, kilometers, and liters. Is this possible?

Yes!

Using the I() operator around the terms of the formula, so that the transformation happens inside the lm() call, we can convert the units and get an interpretation in the new scale. See below:

 

miles_to_kilometers <- 1.61
gallons_to_liters <- 3.79
lb_to_kg <- 0.45

fit_not_english <- lm(I(mpg*miles_to_kilometers/gallons_to_liters) ~ I(wt*lb_to_kg), mtcars)

summary(fit_not_english)

Call:
lm(formula = I(mpg * miles_to_kilometers/gallons_to_liters) ~ 
    I(wt * lb_to_kg), data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92994 -1.00453 -0.05318  0.59878  2.91954 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       15.8388     0.7976  19.858  < 2e-16 ***
I(wt * lb_to_kg)  -5.0452     0.5278  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.294 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

 

The interpretation is:

  • There is a reduction of about 5 kilometers per liter for each 1000 kg increase in the car’s weight.

Much better now, right?

;)

Closing remarks

I hope you have liked this post, and that it has shed some light on how to interpret regression model outputs in R.

I really missed this kind of content when I was applying linear regression in R, so I really hope it helps you as much as it would have helped me back then!
