 Rock Street, San Francisco

Supervised
Learning Coursework 2, GI01/M055, 2017/2018

Solution to
Question 1 :

We Will Write a Custom Essay Specifically
For You For Only \$13.90/page!

order now

(a)

·
A prior distribution – probability distribution over a function dp(f)

·
An additive noise model – inaccuracy in the output observations.

The Noise for each measure assumed to be independent.

(b)

According
to the Bayes theorem – Posterior probability of a parameters (f) given an input
data (X) is the product of the Likelihood function and Prior probability dP(f)  (prior knowledge before performing the
experiment) divide by probability of input data.

Let
us assume the inputs data x = {Xi, where i = 1,2,3…..N} and its
corresponding  random function of f  = { fi, where i = 1,2,3…..N }

dPpost(f|X,
y) = (dP(f) P(y|X, f))/p(y|X)

We
can eliminate the denominator p(y|X) since there is no dependency on the
parameter ‘f’.

Therefore:

dPpost(f|X,
y) = (dP(f) P(y|X, f))/p(y|X)  ? dP(f) P(y|X, f)

dPpost(f)
= dP(f) P(y|X, f))

(c)

We
know that in the case of Gaussian process regression the prior and additive
noise model are Gaussian.

Let
=
, where

P(y|X,
f)) =   ,
where X Î  and y Î

The
data corrupted by likelihood function with zero mean and some variance

Prior
is given by dP(f) for symmetrical Gaussian distribution – centred at the origin
over the weight vector w.

(d)

For
a finite collections of random variable stochastic Gaussian process has
multivariate normal distribution while for infinite random variable it has a
joint distribution.  In our case, it is a
1-dimensional Gaussian(normal) distribution know as marginal distribution.

(e)

Marginal
distribution is the integral of prior multiplied by additive noise model.

Let
us assume the inputs data x = {Xi, where i = 1,2,3…..N}

its
corresponding random function of f = { fi, where i = 1,2,3…..N } and
output v    alues yi …… yn

P(y|X,
f)) =

The
posterior distribution:

By
manipulating the above expression, we can get the mean and covariance matrix.

The
argmin of the weight vector( ) would maximise the posterior i.e it would
give us the mean of the distribution.

is the mean distribution

,

where  is standard optimiser,

is the identity matrix which is

And the covariance  can be expressed as:

(f)

The
full Bayesian inference adds up all the posterior distribution output and then
takes the average as its output. The maximum a posteriori probability (MAP) picks
a weight vector which maximises the posterior distribution.  If the prior and additive noise model are
Gaussian then the Posterior is also a Gaussian. In linear Gaussian, the
convergence of the posterior mean and the MAP estimator coincide.  The sample mean of random linear function is equivalent
to the MAP mean of Gaussian for any value of standard deviation.

Since
we have prior knowledge about pervious posterior distribution, we can use it to
enhance the MAP hence we can get the complete posterior distribution – as shown
above in part ‘e’ the Gaussian have a mean of  in high dimension which is more accurate than sample mean
from the linear function.

Solution to
Question 2 :

(a)
When the value of a =b =1, the Beta distribution
is equivalent to uniform distribution. If we integrate B(a,b) and compute in interval of
0,1 we get 0.5 mean and a /a +b gives 0.5 mean as well.

(b)

Let X be the data X =
{1,….n}

H ={h:i?X ?0,1}=0,1X

Prior given as, P(h) =

and noise model P(y = 1|h,(i,y)) = h(i) and P(y = 0|h,(i,y)) = 1?
h(i)

Posterior distribution is
the proportional to the product of prior and noise model.

Prior of the beta
distribution is given by

P(h) =

Ppost =

,

This means
Beta is conjugate to the Bernoulli.

(c)

The mode is equivalent to the
maximum a posterior distribution, so

Mode of the posterior

The maximum
likelihood estimate is the empirical probability, i.e. ,

Mean of the
posterior E(p|X) =

For uniform distribution
where a =b =1

P(y = 1|h,(i,y)) = h(i) = 0.3

P(y = 0|h,(i,y)) = 1-h(i) = 0.7

Therefore,

Mean = 0.4333

Mode = = 0.3

Solution to
Question 3 :

(a)

Overfitting or High variance is when the learned hypothesis fits
training data well but fails to generalize for unseen dataset. The model might
be too wiggly which is not a good model to predict future(test) dataset. In
other words, the predictor is too complex and not consistent for unseen data.

(b)

Empirical risk minimization(ERM) – the learner algorithm gets X training
data input and model function H which has an output predictor H(x) from unknown
distribution D. The goal of the learner algorithm is to minimize hypothesis H*
for fixed function H for which E(L(H(x), y)) is minimum.  For such unknown distribution, we minimize
the empirical risk over the given training dataset such technique in statistical
learning algorithm known as Empirical risk minimization(ERM.)

(c)

Structural Risk Minimization(SRM) is on finite dataset, the task
of selecting best model class function within a nested family of space .  The model may be too complex at HQ  and poor at generalizing it to a new
dataset which would consequent an overfitting problem. The SRM addresses
overfitting problem by solving such problem by balancing the model’s complexity
and training data error.

Let  in Hq minimizer be

As we increase q, the   gets reduced, the   function increase in between we get a
balanced model.

(d)

Regularization addresses overfitting by penalizing complex learning
algorithm – it selects one complex hypothesis class and adds regularizer function  to empirical error to prevent overfitting.

Our goal is to choose best learning algorithm among many possible
hypothesis space – the regularization parameter can be chosen best   in case of ridge regression or best  in .

If we have large dataset – we can split the dataset into three
different parts:

Training dataset – where we model our functions using different
learning algorithms.

Development or Validation dataset – on this we tune the parameters
on held out dataset (development or validation set.)

Test
dataset – to test best model we have developed on the training dataset.

(e)

The kernel trick enables us to work with linear function in
infinite dimensional feature space.

subject to yi?w,xi??1??i,?i ?0, ?i = 1,…,m.??

Hinge loss function is defined as:

Basically, we minimize the hinge loss while minimizing the norm of
the weight vector(w) that corresponds to maximising the margin. Maximising the
margin gives a good generalization learning algorithm even though we have a
high dimensional feature space. The nonnegative xi(?i) measures the hinge loss – computes
the amount by which fails to meet a margin of 1.

(f)

By changing the hinge loss function, we can create a linear
programming boosting.   The primal

subject to yiHia?p??i,?i ?0, ?
= 1,…,m.?

and the dual would have the
following form:

;

subject to

where  is the distribution,

Optimal solution can
be achieved by doing the dual programming iteratively. We assume the set of weak
learners are finite. The boosting algorithm would give optimum error bound.

We might have more than one weak learner’s optimal solution which
are approximate to the cost function.

Solution to
Question 4 :

(a)
In statistical learning theory generalisation error is a measure how
well a predictive of a learning algorithm perform on unseen dataset.  The expected error is approximately equivalent
to the empirical error and the error reduces when we increase the dataset. Since
the generalisation of the learning algorithm is done using randomly drawn from
a finite independent and identically distributed samples the model might be
sensitive. One way of overcoming this problem is by relating the
training(given) error to the generalisation error – which possibly would avoid
overfitting.

There are many theories for
generalisation error – each theory has strengths and weaknesses.

(b)

For a fixed size of ‘m’
training datasets

None linearity in the lost
function would give a small difference in the average of the distribution.

Analysing the mean using

,

If we use consistency of
classification method for big dataset

where   if p(x,1)>p(x,0), otherwise

This might not be a good solution when we have small sample size.

We are taking limit m tends to infinity so for small m the Bayes risk might
fail.

The 95th percentile is 0.15 which is above the mean value 0.08.

This means the probability we have been misled by 0.05 which is less than the
mean value.

(c)

Structural risk minimization(SRM) is a best learning algorithm selection
procedure – choosing the model that minimizes VC upper bound on risk.

,

For fixed datasets, as the complexity of the learning algorithm C(Hi)
increases, the training error (eˆrr(c)) will decrease but fails to generalize for unseen dataset.  On the other hand, as the VC dimension
increases,   will increase since this depends on the VC
dimension. SRC chooses a best model from the sequence of hypothesis ( which minimizes the right-hand side of the
above inequality.

As the complex models might over fit and at the same time the more
training set we have the lower empirical error will be. SRC resolves the trade-off
by providing a measure of characterization between complexity and empirical
error – by
balancing the model’s complexity and training set (empirical) error.

(d)

As discussed above(b)  is the probability that the training data misled.

Assuming
Hk is finite and confidence at least 1-, where  is small positive number, we can compute the likelihood
of being misled by evaluating training error equals to zero. In our case,

Since the functions are drawn
randomly from a distribution we must add prior weight. By applying Hoeffding’s inequality
to the randomly drawn probability we can get the generalization error is
bounded – for model k.

Reference:

I have used the lecture notes
for this assignment.

https://moodle.ucl.ac.uk/course/view.php?id=11543 