Regression with Two Independent Variables Using R

In giving a numerical example to illustrate a statistical technique, it is nice to use real data. In many situations, the reader can see how the technique can be used to answer questions of real interest. Our example here, however, uses real data to illustrate a number of regression pitfalls. Some of the data come from Yule and Kendall, An Introduction to the Theory of Statistics (14th ed.), Charles Griffin, London, 1950. Another variable was added in Tufte, Data Analysis for Politics and Policy, Prentice Hall, Englewood Cliffs, NJ, 1974. (Click on the title of the second book to get a free copy on-line.) Yule and Kendall had pointed out a nonsense correlation between the number of radio licenses (in millions) and the number of mental defectives (per 10,000 of estimated population) in Great Britain during the years 1924-1937. (In Great Britain, when you buy a radio you buy a license to use it, with the revenue going to support public broadcasting.) You can find the data at our site as a plain text file, or as an Excel spreadsheet. Download the text file now and save it to the directory where you installed R. To use the file in R you must define a variable to be equal to the contents of this file.

> radios <- read.table("radios.txt",header=TRUE)

Here radios <- can be read as "create a variable radios and in it store the results of...". The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. You can then get a table of contents for what you have created in R with

> radios
   Year licenses defects lengths
1  1924    1.350       8       6
2  1925    1.960       8       6
3  1926    2.270       9       6
4  1927    2.483      10       6
5  1928    2.730      11       6
6  1929    3.091      11       7
7  1930    3.647      12       7
8  1931    4.620      16       7
9  1932    5.497      18       7
10 1933    6.260      19       8
11 1934    7.012      20       8
12 1935    7.618      21       8
13 1936    8.131      22       8
14 1937    8.593      23       8

First, are there any correlations here?

> cor(radios)
              Year  licenses   defects   lengths
Year     1.0000000 0.9862573 0.9815811 0.9434564
licenses 0.9862573 1.0000000 0.9920525 0.9425656
defects  0.9815811 0.9920525 1.0000000 0.9332053
lengths  0.9434564 0.9425656 0.9332053 1.0000000

We found it easiest to just apply the cor function to the entire data set. Here we can see, for example, that the number of licenses is highly correlated with Year (r=0.9862573) as is the number of defectives (r=0.9815811). This is very common with time series data. Any two variables that move together in a time series will show correlation, but there may be no real relationship between them other than that they happened to both be changing steadily during that time period. To sharpen the point, Tufte added a third variable: the number of letters in the first name of the person who was President of the Untied States during (most of) that year. As it happened, that variable was steadily increasing during this time period (r=0.9425656). As a result, we can predict the number of mental defectives using licenses or lengths.

> lm1 <- lm(defects ~ licenses, data=radios)
> summary(lm1)

lm(formula = defects ~ licenses, data = radios)

    Min      1Q  Median      3Q     Max 
-0.9024 -0.5181 -0.2144  0.4317  1.3014 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.5822     0.4233   10.82 1.51e-07 ***
licenses      2.2042     0.0807   27.31 3.58e-12 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.7262 on 12 degrees of freedom
Multiple R-Squared: 0.9842,     Adjusted R-squared: 0.9828 
F-statistic:   746 on 1 and 12 DF,  p-value: 3.577e-12 

Here lm1 is a variable in which to store the results of fitting the model. The lm function (linear model) fits the model. The arguments are the dependent variable followed by "~`" followed by a list of independent variables. The additional (optional) argument data=radios tells R which data set to look in to find these variables.

> lm2 <- lm(defects ~ lengths, data=radios)
> summary(lm2)

lm(formula = defects ~ lengths, data = radios)

    Min      1Q  Median      3Q     Max 
-3.8571 -0.9571  0.1429  1.2179  3.1429 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -26.4429     4.6242  -5.718 9.63e-05 ***
lengths       5.9000     0.6558   8.996 1.11e-06 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.074 on 12 degrees of freedom
Multiple R-Squared: 0.8709,     Adjusted R-squared: 0.8601 
F-statistic: 80.93 on 1 and 12 DF,  p-value: 1.109e-06 

We see, for example, that 87.09% of the variability in the number of mental defectives is "explained by" the number of letters in the President's first name. Because this "explanation" lacks credibility, some prefer to refer to the variability as "shared" between the two variables. During this period, the two variables showed pretty much the same trend. (We leave it to the reader to explore the impact of Harry S. Truman becoming President in 1949.) If this all seems a little silly, just remember that the next time you see a high correlation, it could be equally silly.

For our purposes today, though, we will continue and illustrate the use of two independent variables in a single model.

> lm3 <- lm(defects ~ lengths + licenses, data=radios)
> summary(lm3)

lm(formula = defects ~ lengths + licenses, data = radios)

    Min      1Q  Median      3Q     Max 
-0.9135 -0.5461 -0.2124  0.4299  1.2721 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.1601     3.9387   1.310    0.217    
lengths      -0.1059     0.7174  -0.148    0.885    
licenses      2.2393     0.2521   8.882 2.38e-06 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.7577 on 11 degrees of freedom
Multiple R-Squared: 0.9842,     Adjusted R-squared: 0.9813 
F-statistic: 342.6 on 2 and 11 DF,  p-value: 1.238e-10 

The F-test (sample F=342.6, p=1.238e-10=1.238X10-10=0.0000000001238) indicates that the model as a whole reflects a real association between the dependent variable and the independent variables as a group. The t-test for each coefficient indicate that licenses makes a strong contribution to the model (t=8.882, p=0.00000238) while lengths does not (t=-0.148, p=0.885). Note that when we fit a model with lengths alone as independent variable we got a very different result (t=8.996, p=0.00000111). This is due to the conditional nature of the coefficients in multiple regression. The variable lengths by itself shares much of its variability with defects, but once you include licenses there is very little additional shared variability. This is a common problem when we have lots a variability shared among the independent variables, something called "multicollinearity". Here we have r=0.9425656 between lengths and licenses. When we have multicollinearity, it is common for individual variables to change in significance and their coefficients to alter and even reverse sign from one model to the next, as happened here.

Regression analysis involves a number of assumptions. Checking those here is especially enlightening because doing so shows how to detect nonsense regressions that are not so obviously nonsense.

First, let's talk about the design of the study and the nature of the variables. This is not a random sample of years, it is fourteen consecutive years. Time series data such as this rarely constitutes a random sample. We also rarely have independence of observations for time series data. Instead, a given time period is likely to give results similar to the preceding and the next time period -- at least more similar than the results for distant time periods. This problem is called "autocorrelation". Nor is this any kind of experiment. We cannot control the number of radios licensed nor the number of letters in the President's name..

There are also some standard checks we can make of the data once it has been collected. The models above all assume that we have constant variance in the residuals. This is checked by plotting them against predicted y. First, extract these from the results of the fitted model and store them in variables.

> res <- residuals(lm3)
> fits <- fitted(lm3)

Then make a plot.

> plot(res ~ fits)

We'd like to see random scatter, and maybe we do on the left, but to the right the points make a wandering path -- probably because this is time series data.

Another assumption is that the relationship between the dependent variable and each independent variable is linear. We usually plot residuals against each independent variable to see if there are signs of curvature.

> plot(res ~ licenses, data=radios)
> plot(res ~ lengths, data=radios)

The first plot looks about the same as the previous one but here suggests possible curvature in the relationship with licenses. The second plot looks very different because there are only three different values of lengths during these years. There are very slight signs of curvature here (higher in the middle) and very slight signs of different variances at the three levels. However, with only four or five observations in each group, we should not take these too seriously.

Normality of the residuals can be assessed informally with simple univariate displays.

> stem(res)
  The decimal point is at the |

  -0 | 96665
  -0 | 4310
   0 | 4
   0 | 57
   1 | 23

This shows the residuals skewed toward high values, suggesting some sort of transformation for the dependent variable. This shows up as curvature in a normal probability plot.

> qqnorm(res)
> qqline(res)

This creates the plot and then adds a line to it to guide the eye in estimating whether the points follow a straight line.

The skewness could also be related to curvature in the relationship withlicenses. Note that in that residual plot we had clumps of low values at each end and a few high values in the middle. These high values are the tails on the stem and leaf.