Linear regression is used for predicting a quantitative response Y on the basis of a single predictor variable or multiple predictor variables X. It is a well-known and very popular method for modeling the statistical relationship between the response and the explanatory variables. When modeling this relationship, the values of the explanatory variables are known and are used to describe the response variable as well as possible. A model that accurately captures the relationship between these variables can be used to make predictions on data that has not been used to build the model. Simple linear regression is a model that contains only one explanatory variable. The formula of simple linear regression is:

Y = β₀ + β₁ × X + ε,

where

Y is the response variable,

X represents the predictor variable,

β₀ is called the overall intercept or the overall population mean of the response variable,

β₁ is the average effect on Y of a one-unit increase in X, holding all other predictors fixed. It is also called the slope term,

ε is the error term that represents the deviation between the observed and predicted values.

β₀ and β₁ are the unknown constants that the model has to estimate based on the data (James et al., 2013).
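To make the notation concrete, data from this model can be simulated in a few lines of Python; the parameter values and noise standard deviation below are arbitrary illustrative choices, not values from the text:

```python
import random

# Simulate data from the simple linear regression model Y = b0 + b1*X + e,
# with illustrative (arbitrary) true values b0 = 1.0, b1 = 2.0 and
# normally distributed errors e with mean 0 and standard deviation 0.5.
random.seed(1)
beta0_true, beta1_true, sigma = 1.0, 2.0, 0.5

x = [i / 10.0 for i in range(100)]              # predictor values
errors = [random.gauss(0.0, sigma) for _ in x]  # error terms e
y = [beta0_true + beta1_true * xi + e for xi, e in zip(x, errors)]
```

Fitting a model to (x, y) should then recover estimates close to the true intercept and slope.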

The goal is to obtain coefficient estimates β̂₀ and β̂₁ so that the linear model fits the available data well, such that ŷᵢ ≈ β̂₀ + β̂₁ × xᵢ for i = 1, . . . , n, where ŷᵢ indicates the prediction for the i-th observation. In other words, the aim is to find an intercept β̂₀ and slope β̂₁ such that the resulting line is as close as possible to the data points. The most common approach to do this involves minimizing the least squares criterion (see Section 2.3.1).
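As a sketch of how these estimates are computed, the least squares solution for simple linear regression has a closed form: β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄. A minimal Python illustration, with a small invented data set:

```python
# Least squares estimates for simple linear regression, computed
# from their closed-form expressions. The data are invented
# purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# slope: sum of cross-deviations over sum of squared x-deviations
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
# intercept: forces the fitted line through (x_bar, y_bar)
beta0 = y_bar - beta1 * x_bar

y_hat = [beta0 + beta1 * xi for xi in x]                # predictions
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]     # observed minus predicted
```

By construction the residuals of a least squares fit with an intercept sum to zero.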

Several assumptions are made in linear models: the residuals are independent, the residuals are normally distributed, the residuals have a mean of 0 at all values of X, the residuals have constant variance, and the model is linear in the parameters. When applying linear models it is important to make sure that these assumptions are met; otherwise the statistical inference based on these results may not be adequate.
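An informal way to look at the first of these assumptions is to inspect the residuals directly. The sketch below, using invented residual values, checks that the residual mean is near zero and compares the spread in the two halves of the data as a crude look at constant variance; it is a rough illustration, not a substitute for formal diagnostics:

```python
import statistics

# Rough residual diagnostics (illustrative values, not real model output):
# the mean should be near zero, and the spread should be similar in the
# first and second half of the data if the variance is roughly constant.
residuals = [0.2, -0.1, 0.05, -0.15, 0.1, -0.05, 0.12, -0.17]

mean_resid = statistics.mean(residuals)
half = len(residuals) // 2
var_first = statistics.pvariance(residuals[:half])    # spread in first half
var_second = statistics.pvariance(residuals[half:])   # spread in second half
```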

2 Chapter 1. Theoretical background

A variance-covariance matrix is a convenient way to represent homogeneous variance and independence of the residuals:

V = cov(ε) =

    ⎡ σ²  0   …  0  ⎤
    ⎢ 0   σ²  …  0  ⎥
    ⎢ ⋮   ⋮   ⋱  ⋮  ⎥
    ⎣ 0   0   …  σ² ⎦

In the variance-covariance matrix the diagonal values are the variances, and if they are all the same then this represents variance homogeneity. The zeros in the matrix indicate that there is no correlation or dependence between the residuals.
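Since the matrix above is simply σ² times the identity matrix, it can be built directly; a small Python sketch with arbitrary illustrative values σ² = 2.5 and n = 4:

```python
# Variance-covariance matrix of the residuals under the standard
# assumptions: sigma^2 on the diagonal (constant variance) and
# zeros elsewhere (no correlation between residuals).
# sigma2 = 2.5 and n = 4 are arbitrary illustrative values.
sigma2 = 2.5
n = 4
V = [[sigma2 if i == j else 0.0 for j in range(n)] for i in range(n)]
```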

Although the standard linear model assumes independent errors and homoscedastic (constant) variance, among other assumptions, data cannot always satisfy these assumptions, so more complex models are also available. Moreover, it is possible to add the correct variance or correlation structure to the model so that the correlation and/or variance structure is explicitly modeled. Next, we will introduce the mathematical notation of different variance and correlation structures and how to include them in linear models.
