In giving a numerical example to illustrate a statistical technique, it
is nice to use real data. In many situations, the reader can see how the
technique can be used to answer questions of real interest. Our example
here, however, uses real data to illustrate a number of regression
pitfalls. Some of the data come from Yule and Kendall, *An Introduction
to the Theory of Statistics* (14th ed.), Charles Griffin, London, 1950.
Another variable was added in Tufte, *Data
Analysis for Politics and Policy*, Prentice Hall, Englewood
Cliffs, NJ, 1974. (Click on the title of the second book to get a free
copy on-line.) Yule and Kendall had pointed out a nonsense correlation
between the number of radio licenses (in millions) and the number of
mental defectives (per 10,000 of estimated population) in Great Britain
during the years 1924-1937. (In Great Britain, when you buy a radio you
buy a license to use it, with the revenue going to support public
broadcasting.) You can find the data at our site as a plain
text
file, or as an
Excel
spreadsheet. Download the text file now and save it to the directory
where you installed R. To use the file in R you must define a variable to
be equal to the contents of this file.

> radios <- read.table("radios.txt",header=TRUE)

Here radios `<-` can be read as "create a variable `radios`
and in it store the results of...". The argument `header=TRUE`
tells R that the first row of the file should be interpreted as variable
names. You can then get a table of contents for what you have created in R
with

> radios Year licenses defects lengths 1 1924 1.350 8 6 2 1925 1.960 8 6 3 1926 2.270 9 6 4 1927 2.483 10 6 5 1928 2.730 11 6 6 1929 3.091 11 7 7 1930 3.647 12 7 8 1931 4.620 16 7 9 1932 5.497 18 7 10 1933 6.260 19 8 11 1934 7.012 20 8 12 1935 7.618 21 8 13 1936 8.131 22 8 14 1937 8.593 23 8

First, are there any correlations here?

> cor(radios) Year licenses defects lengths Year 1.0000000 0.9862573 0.9815811 0.9434564 licenses 0.9862573 1.0000000 0.9920525 0.9425656 defects 0.9815811 0.9920525 1.0000000 0.9332053 lengths 0.9434564 0.9425656 0.9332053 1.0000000

We found it easiest to just apply the `cor` function to the
entire data set. Here we can see, for example, that the number of
licenses is highly correlated with Year (*r*=0.9862573) as is the
number of defectives (*r*=0.9815811). This is very common with time
series data. Any two variables that move together in a time series will
show correlation, but there may be no real relationship between them other
than that they happened to both be changing steadily during that time
period. To sharpen the point, Tufte added a third variable: the number of
letters in the first name of the person who was President of the Untied
States during (most of) that year. As it happened, that variable was
steadily increasing during this time period (*r*=0.9425656). As a
result, we can predict the number of mental defectives using `licenses`
or `lengths`.

> lm1 <- lm(defects ~ licenses, data=radios) > summary(lm1) Call: lm(formula = defects ~ licenses, data = radios) Residuals: Min 1Q Median 3Q Max -0.9024 -0.5181 -0.2144 0.4317 1.3014 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.5822 0.4233 10.82 1.51e-07 *** licenses 2.2042 0.0807 27.31 3.58e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.7262 on 12 degrees of freedom Multiple R-Squared: 0.9842, Adjusted R-squared: 0.9828 F-statistic: 746 on 1 and 12 DF, p-value: 3.577e-12

Here `lm1` is a variable in which to store the results of
fitting the model. The `lm` function (linear model) fits the
model. The arguments are the dependent variable followed by "~`"
followed by a list of independent variables. The additional (optional)
argument `data=radios` tells R which data set to look in to find
these variables.

> lm2 <- lm(defects ~ lengths, data=radios) > summary(lm2) Call: lm(formula = defects ~ lengths, data = radios) Residuals: Min 1Q Median 3Q Max -3.8571 -0.9571 0.1429 1.2179 3.1429 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -26.4429 4.6242 -5.718 9.63e-05 *** lengths 5.9000 0.6558 8.996 1.11e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.074 on 12 degrees of freedom Multiple R-Squared: 0.8709, Adjusted R-squared: 0.8601 F-statistic: 80.93 on 1 and 12 DF, p-value: 1.109e-06

We see, for example, that 87.09% of the variability in the number of mental defectives is "explained by" the number of letters in the President's first name. Because this "explanation" lacks credibility, some prefer to refer to the variability as "shared" between the two variables. During this period, the two variables showed pretty much the same trend. (We leave it to the reader to explore the impact of Harry S. Truman becoming President in 1949.) If this all seems a little silly, just remember that the next time you see a high correlation, it could be equally silly.

For our purposes today, though, we will continue and illustrate the use of two independent variables in a single model.

> lm3 <- lm(defects ~ lengths + licenses, data=radios) > summary(lm3) Call: lm(formula = defects ~ lengths + licenses, data = radios) Residuals: Min 1Q Median 3Q Max -0.9135 -0.5461 -0.2124 0.4299 1.2721 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.1601 3.9387 1.310 0.217 lengths -0.1059 0.7174 -0.148 0.885 licenses 2.2393 0.2521 8.882 2.38e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.7577 on 11 degrees of freedom Multiple R-Squared: 0.9842, Adjusted R-squared: 0.9813 F-statistic: 342.6 on 2 and 11 DF, p-value: 1.238e-10

The *F*-test (sample *F*=342.6, *p*=1.238e-10=1.238X10^{-10}=0.0000000001238)
indicates that the model as a whole reflects a real association between
the dependent variable and the independent variables as a group. The *t*-test
for each coefficient indicate that licenses makes a strong contribution to
the model (*t*=8.882, *p*=0.00000238) while lengths does not (*t*=-0.148,
*p*=0.885). Note that when we fit a model with `lengths`
alone as independent variable we got a very different result (*t*=8.996,
*p*=0.00000111). This is due to the conditional nature of the
coefficients in multiple regression. The variable `lengths` by
itself shares much of its variability with `defects`, but once you
include `licenses` there is very little additional shared
variability. This is a common problem when we have lots a variability
shared among the independent variables, something called "multicollinearity".
Here we have *r*=0.9425656 between `lengths` and `licenses`.
When we have multicollinearity, it is common for individual variables to
change in significance and their coefficients to alter and even reverse
sign from one model to the next, as happened here.

Regression analysis involves a number of assumptions. Checking those here is especially enlightening because doing so shows how to detect nonsense regressions that are not so obviously nonsense.

First, let's talk about the design of the study and the nature of the variables. This is not a random sample of years, it is fourteen consecutive years. Time series data such as this rarely constitutes a random sample. We also rarely have independence of observations for time series data. Instead, a given time period is likely to give results similar to the preceding and the next time period -- at least more similar than the results for distant time periods. This problem is called "autocorrelation". Nor is this any kind of experiment. We cannot control the number of radios licensed nor the number of letters in the President's name..

There are also some standard checks we can make of the data once it has
been collected. The models above all assume that we have constant variance
in the residuals. This is checked by plotting them against predicted *y*.
First, extract these from the results of the fitted model and store them
in variables.

> res <- residuals(lm3) > fits <- fitted(lm3)

Then make a plot.

> plot(res ~ fits)

We'd like to see random scatter, and maybe we do on the left, but to the right the points make a wandering path -- probably because this is time series data.

Another assumption is that the relationship between the dependent variable and each independent variable is linear. We usually plot residuals against each independent variable to see if there are signs of curvature.

> plot(res ~ licenses, data=radios)

> plot(res ~ lengths, data=radios)

The first plot looks about the same as the previous one but here suggests possible curvature in the relationship with licenses. The second plot looks very different because there are only three different values of lengths during these years. There are very slight signs of curvature here (higher in the middle) and very slight signs of different variances at the three levels. However, with only four or five observations in each group, we should not take these too seriously.

Normality of the residuals can be assessed informally with simple univariate displays.

> stem(res) The decimal point is at the | -0 | 96665 -0 | 4310 0 | 4 0 | 57 1 | 23

This shows the residuals skewed toward high values, suggesting some sort of transformation for the dependent variable. This shows up as curvature in a normal probability plot.

> qqnorm(res) > qqline(res)

This creates the plot and then adds a line to it to guide the eye in estimating whether the points follow a straight line.

The skewness could also be related to curvature in the relationship with`licenses`.
Note that in that residual plot we had clumps of low values at each end
and a few high values in the middle. These high values are the tails on
the stem and leaf.