
Linear regression is used to predict a quantitative response Y on the basis of one
or more predictor variables X. It is a well-known and very popular method for
modeling the statistical relationship between a response and its explanatory variables.
When modeling this relationship, the values of the explanatory variables are known
and are used to describe the response variable as well as possible. A model
that accurately captures the relationship between these variables can be used
to make predictions on data that were not used to build the model. Simple linear
regression is a model that contains only one explanatory variable. The formula of
simple linear regression is:
$Y = \beta_0 + \beta_1 X + e$,
where
Y is the response variable,
X is the predictor variable,
$\beta_0$ is the intercept, i.e. the expected value of the response variable when X = 0,
$\beta_1$ is the slope term, i.e. the average change in Y associated with a one-unit
increase in X,
e is the error term, which represents the deviation of the observed response from
the regression line.
$\beta_0$ and $\beta_1$ are unknown constants that have to be estimated from the
data (James et al., 2013).
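To make the roles of the terms concrete, the following minimal sketch simulates data from the simple linear regression model. The parameter values, the sample size, the range of X, and the normal distribution of the error term are assumptions chosen only for illustration; they do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" parameters, chosen only for illustration.
beta0 = 1.0   # intercept: expected value of Y when X = 0
beta1 = 2.0   # slope: average change in Y per one-unit increase in X
sigma = 0.5   # standard deviation of the error term (assumed)

n = 200
x = rng.uniform(0.0, 10.0, size=n)   # predictor values
e = rng.normal(0.0, sigma, size=n)   # error term, here assumed normal with mean 0
y = beta0 + beta1 * x + e            # response generated by the model
```

In practice only x and y are observed; beta0, beta1 and the errors e are unknown and have to be estimated.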
The goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model
fits the available data well, that is, $y_i \approx \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
for $i = 1, \dots, n$, where $\hat{y}_i$
is the prediction for the i-th observation. In other words, the aim is to find
an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the
data points. The most common approach involves minimizing the least
squares criterion (see section 2.3.1).
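As an illustration of the least squares idea, the sketch below computes the intercept and slope estimates for a small made-up data set using the standard closed-form solution for simple linear regression (slope = sum of cross-deviations divided by sum of squared deviations of x). The data values and variable names are hypothetical.

```python
import numpy as np

# Small made-up data set, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares estimates for simple linear regression.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and the residual sum of squares that the criterion minimizes.
y_hat = beta0_hat + beta1_hat * x
rss = np.sum((y - y_hat) ** 2)

print(f"intercept = {beta0_hat:.2f}, slope = {beta1_hat:.2f}, RSS = {rss:.3f}")
```

The same estimates can be obtained with `np.polyfit(x, y, deg=1)`, which returns the slope first and the intercept second.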
Linear models rest on several assumptions: the residuals are independent,
the residuals are normally distributed, the residuals have a mean of 0 at all values of
X, the residuals have constant variance, and the model is linear in the parameters. When
applying linear models it is important to verify that these assumptions are
met; otherwise the statistical inference based on the results may not be valid.
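A few of these assumptions can be inspected informally from the residuals of a fitted model. The sketch below is a rough numerical check on simulated data, not a formal test: the data-generating values are assumptions, and the split into two halves and the lag-1 correlation are simple illustrative diagnostics rather than a recommended procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data that satisfy the assumptions (illustrative values only).
n = 500
x = np.sort(rng.uniform(0.0, 10.0, size=n))
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Least-squares fit; np.polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Mean of the residuals (zero by construction when an intercept is fitted).
mean_resid = residuals.mean()

# Constant variance: compare the residual spread for small and large x.
half = n // 2
sd_low = residuals[:half].std(ddof=1)
sd_high = residuals[half:].std(ddof=1)

# Independence: correlation between neighbouring residuals (ordered by x).
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(f"mean residual: {mean_resid:.2e}")
print(f"residual SD (low x / high x): {sd_low:.2f} / {sd_high:.2f}")
print(f"lag-1 correlation: {lag1_corr:.3f}")
```

Markedly different spreads in the two halves would hint at non-constant variance, and a lag-1 correlation far from zero would hint at dependence, provided the ordering of the observations is meaningful for the data at hand.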
2 Chapter 1. Theoretical background
The variance-covariance matrix of the residuals makes the assumptions of homogeneous
variance and independence explicit:
$$
V = \operatorname{cov}(e) =
\begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}
$$
In the variance-covariance matrix the diagonal values are the variances; if they
are all equal, the variance is homogeneous. The off-diagonal zeros
indicate that there is no correlation (dependence) between the residuals.
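The structure of this matrix can be illustrated numerically. The following sketch builds the matrix for a hypothetical standard deviation and a small number of observations, and compares it with the empirical covariance of simulated independent errors; all values are assumptions made for illustration.

```python
import numpy as np

sigma = 1.5   # hypothetical common standard deviation of the errors
n = 4         # number of observations (kept small for readability)

# Homogeneous variance on the diagonal, zeros elsewhere.
V = sigma**2 * np.eye(n)

# Empirical check: simulate many independent error vectors with equal variance
# and estimate their variance-covariance matrix.
rng = np.random.default_rng(2)
draws = rng.normal(0.0, sigma, size=(100_000, n))   # each row is one error vector
V_emp = np.cov(draws, rowvar=False)                 # columns are the variables

print(V)
print(np.round(V_emp, 2))
```

The estimated matrix should be close to V: roughly equal values on the diagonal and values near zero off the diagonal.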
Although the standard linear model assumes, among other things, independent errors and homoscedastic
variance, data do not always satisfy these assumptions,
so more complex models are also available. In particular, it is possible to
add an appropriate variance or correlation structure to the model so that the correlation
and/or variance structure is explicitly modeled. Next, we introduce the mathematical
notation for different variance and correlation structures and show how to include
them in linear models.