
Learning Coursework 2, GI01/M055, 2017/2018



Solution to Question 1:



A prior distribution – a probability distribution over functions, dP(f).

An additive noise model – accounts for inaccuracy in the output observations.

The noise for each measurement is assumed to be independent.




According to Bayes' theorem, the posterior probability of the parameters f given the input data X is the product of the likelihood function and the prior probability dP(f) (our knowledge before performing the experiment), divided by the probability of the data.


Let us assume input data X = {x_i : i = 1, 2, ..., N} with corresponding function values f = {f_i : i = 1, 2, ..., N} and outputs y. Then

dP(f | X, y) = dP(f) P(y | X, f) / p(y | X)


We can eliminate the denominator p(y | X), since it does not depend on the parameter f:

dP(f | X, y) = dP(f) P(y | X, f) / p(y | X) ∝ dP(f) P(y | X, f)




We know that in the case of Gaussian process regression both the prior and the additive noise model are Gaussian:

P(y | X, f) = N(y; f(X), σ²I), where X ∈ ℝ^(N×d) and y ∈ ℝ^N.

The observed data are corrupted by the noise model, which has zero mean and some variance σ².

The prior is given by dP(f), a symmetric Gaussian distribution over the weight vector w, centred at the origin.





Over any finite collection of random variables, a stochastic Gaussian process has a multivariate normal distribution, while over infinitely many random variables it defines a joint distribution. In our case it is a one-dimensional Gaussian (normal) distribution, known as the marginal distribution.




The marginal distribution is the integral of the prior multiplied by the additive noise model.

Let us assume input data X = {x_i : i = 1, 2, ..., N}, corresponding function values f = {f_i : i = 1, 2, ..., N}, and output values y_1, ..., y_N. Then

p(y | X) = ∫ P(y | X, f) dP(f)


The posterior distribution over the weight vector w is then proportional to the product of the Gaussian likelihood and the Gaussian prior:

dP(w | X, y) ∝ exp(−‖y − Xw‖² / (2σ²)) exp(−‖w‖² / 2)

By manipulating the above expression, we can obtain the mean and covariance matrix.


The argmin of the negative log-posterior over the weight vector w maximises the posterior, i.e. it gives us the mean of the distribution:

w̄ = (XᵀX + σ²I)⁻¹ Xᵀy is the mean of the distribution,

where (XᵀX + σ²I)⁻¹ Xᵀ is the standard ridge-regression optimiser and I is the d × d identity matrix.

And the covariance can be expressed as:

Σ_post = σ² (XᵀX + σ²I)⁻¹
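As a sanity check, the posterior mean and covariance can be computed directly. A minimal numpy sketch, assuming the standard Gaussian prior w ~ N(0, I) and noise variance σ² discussed above; the function name and toy data are illustrative, not from the coursework:

```python
import numpy as np

def posterior_params(X, y, sigma2=0.1):
    """Posterior mean and covariance for Bayesian linear regression,
    assuming prior w ~ N(0, I) and Gaussian noise with variance sigma2:
      cov  = (X^T X / sigma2 + I)^(-1)
      mean = cov @ X^T y / sigma2   (this is also the MAP estimate)
    """
    d = X.shape[1]
    A = X.T @ X / sigma2 + np.eye(d)   # posterior precision matrix
    cov = np.linalg.inv(A)
    mean = cov @ X.T @ y / sigma2
    return mean, cov

# toy data: y ≈ 2x plus small noise, so the posterior mean should be near 2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)
mean, cov = posterior_params(X, y)
```

With enough data the posterior concentrates: the covariance shrinks and the mean approaches the true weight.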


Full Bayesian inference averages the outputs over the whole posterior distribution, whereas the maximum a posteriori (MAP) estimate picks the single weight vector that maximises the posterior distribution. If the prior and the additive noise model are Gaussian, then the posterior is also Gaussian, and in the linear Gaussian case the posterior mean and the MAP estimator coincide. The sample mean of a random linear function is equivalent to the MAP mean of the Gaussian for any value of the standard deviation.

If we have prior knowledge from a previous posterior distribution, we can use it to enhance the MAP estimate and hence obtain the complete posterior distribution – as shown above in part 'e', the Gaussian has a mean in high dimension which is more accurate than the sample mean from the linear function.





Solution to Question 2:

When α = β = 1, the Beta distribution is equivalent to the uniform distribution. If we integrate B(α, β) over the interval [0, 1] we get a mean of 0.5, and α/(α + β) gives a mean of 0.5 as well.




Let X be the data, and let

H = {h : i ∈ X → [0, 1]} = [0, 1]^X,

with prior P(h) and noise model P(y = 1 | h, (i, y)) = h(i) and P(y = 0 | h, (i, y)) = 1 − h(i).

The posterior distribution is proportional to the product of the prior and the noise model.

The prior of the Beta distribution is given by

P(h) ∝ h^(α−1) (1 − h)^(β−1)

so, with n₁ observations of y = 1 and n₀ observations of y = 0,

P_post ∝ h^(α+n₁−1) (1 − h)^(β+n₀−1), i.e. the posterior is Beta(α + n₁, β + n₀).





This means
Beta is conjugate to the Bernoulli.  




The mode is equivalent to the maximum a posteriori estimate, so the mode of the posterior is

(α + n₁ − 1) / (α + β + n − 2), where n = n₁ + n₀.

The maximum likelihood estimate is the empirical probability, i.e. n₁ / n.

The mean of the posterior is E(p | X) = (α + n₁) / (α + β + n).


For the uniform prior, where α = β = 1:

P(y = 1 | h, (i, y)) = h(i) = 0.3

P(y = 0 | h, (i, y)) = 1 − h(i) = 0.7

Mean = 0.4333

Mode = 0.3
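The conjugate update can be sketched in a few lines of Python. The counts below (3 successes in 10 trials) are a hypothetical example chosen only to illustrate the Beta(α + n₁, β + n₀) update, not the data used above:

```python
def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior combined with Bernoulli
    observations gives a Beta(alpha + successes, beta + failures) posterior.
    Returns the posterior mean and mode."""
    a, b = alpha + successes, beta + failures
    mean = a / (a + b)                                   # (alpha+n1)/(alpha+beta+n)
    # mode (alpha+n1-1)/(alpha+beta+n-2) is defined only for a > 1 and b > 1
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, mode

# uniform prior (alpha = beta = 1) with 3 successes out of 10 trials
mean, mode = beta_posterior(1, 1, 3, 7)
```

Note that with a uniform prior the posterior mode equals the empirical frequency, i.e. the maximum likelihood estimate.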








Solution to Question 3:


Overfitting, or high variance, is when the learned hypothesis fits the training data well but fails to generalize to an unseen dataset. The model may be too wiggly, which makes it a poor model for predicting future (test) data. In other words, the predictor is too complex and not consistent on unseen data.






Empirical risk minimization (ERM) – the learning algorithm receives training data X drawn from an unknown distribution D and a fixed hypothesis class H of predictor functions h(x). The goal of the learner is to find the hypothesis h* in H for which the expected loss E(L(h(x), y)) is minimal. Since the distribution is unknown, we instead minimize the empirical risk over the given training dataset; this technique in statistical learning is known as empirical risk minimization (ERM).
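The ERM principle can be stated compactly: with m training pairs (x_i, y_i) and loss L, the learner outputs

```latex
\hat{h} \;=\; \operatorname*{arg\,min}_{h \in H} \hat{R}(h),
\qquad
\hat{R}(h) \;=\; \frac{1}{m}\sum_{i=1}^{m} L\big(h(x_i),\, y_i\big)
```

whereas the true goal is to minimize the expected risk R(h) = E[L(h(x), y)] over the unknown distribution D.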


Structural risk minimization (SRM) – on a finite dataset, the task of selecting the best model class within a nested family of hypothesis spaces H₁ ⊆ H₂ ⊆ ... ⊆ H_Q. The model may be too complex at H_Q and poor at generalizing to a new dataset, which would cause an overfitting problem. SRM addresses overfitting by balancing the model's complexity against the training error.

Let ĥ_q be the empirical risk minimizer in H_q. As we increase q, the empirical error gets reduced while the complexity penalty increases; in between we get a balanced model.



Regularization addresses overfitting by penalizing complex hypotheses – it selects one complex hypothesis class and adds a regularizer function to the empirical error to prevent overfitting.

Our goal is to choose the best learning algorithm among many possible hypothesis spaces – for example, the regularization parameter can be chosen to be the best λ in the case of ridge regression.

If we have a large dataset, we can split it into three parts:

Training dataset – where we fit our models using different learning algorithms.

Development or validation dataset – on this held-out set we tune the parameters.

Test dataset – to test the best model we developed on the training dataset.
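The three-way split above can be sketched as follows; the 60/20/20 fractions and the function name are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def three_way_split(n, train=0.6, val=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into
    train / validation / test subsets. The remainder after the
    train and validation fractions goes to the test set."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = round(train * n)
    n_val = round(val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = three_way_split(100)
```

Shuffling before splitting matters: it keeps all three subsets identically distributed when the original data is ordered.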




The kernel trick enables us to work with linear functions in an infinite-dimensional feature space.

The soft-margin SVM primal minimizes ‖w‖² plus the slack terms,

subject to yᵢ⟨w, xᵢ⟩ ≥ 1 − ξᵢ, ξᵢ ≥ 0, i = 1, ..., m.

The hinge loss function is defined as:

L(w; x, y) = max(0, 1 − y⟨w, x⟩)

Basically, we minimize the hinge loss while minimizing the norm of the weight vector w, which corresponds to maximising the margin. Maximising the margin gives a learning algorithm with good generalization even though we have a high-dimensional feature space. The nonnegative slack ξᵢ measures the hinge loss – it computes the amount by which yᵢ⟨w, xᵢ⟩ fails to meet a margin of 1.
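A minimal sketch of the hinge loss and the soft-margin objective it enters, assuming a linear predictor ⟨w, x⟩; the regularization weight lam and the toy points are illustrative:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Average hinge loss max(0, 1 - y_i <w, x_i>); the slack xi_i is
    exactly the amount by which the margin of 1 fails to be met."""
    margins = y * (X @ w)
    return np.maximum(0.0, 1.0 - margins).mean()

def svm_objective(w, X, y, lam=0.1):
    """Soft-margin objective: hinge loss plus a norm penalty on w
    (minimizing ||w|| corresponds to maximizing the margin)."""
    return hinge_loss(w, X, y) + lam * np.dot(w, w)

# first point has margin 0.5, so slack 0.5; second has margin 2, so no slack
X = np.array([[0.5], [2.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0])
```

Both terms are convex in w, which is what makes the SVM training problem tractable.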




By changing the hinge loss function, we can create linear programming boosting (LPBoost). The primal minimizes the weighted slack,

subject to yᵢ(Ha)ᵢ ≥ ρ − ξᵢ, ξᵢ ≥ 0, i = 1, ..., m,

and the dual has the following form:

minimize β subject to Σᵢ dᵢ yᵢ hⱼ(xᵢ) ≤ β for every weak learner hⱼ,

where d is a distribution over the training examples and each hⱼ is an additional weak learner. The optimal solution can be achieved by solving the dual program iteratively, adding one weak learner at a time. We assume the set of weak learners is finite. The boosting algorithm then gives an optimal error bound.

We might have more than one optimal combination of weak learners that approximately minimises the cost function.





Solution to Question 4:

In statistical learning theory, generalisation error is a measure of how well a predictive learning algorithm performs on unseen data. The expected error is approximately equal to the empirical error, and the gap shrinks as we increase the dataset size. Since the generalisation of the learning algorithm is assessed using a finite sample drawn independently and identically from the distribution, the model might be sensitive to the particular sample. One way of overcoming this problem is to relate the training (given) error to the generalisation error, which possibly would avoid being misled by the sample.
There are many theories for
generalisation error – each theory has strengths and weaknesses.



For a fixed size m of training data:

Non-linearity in the loss function would give a small difference in the average of the distribution.

We can analyse the mean using traditional statistics.



If we use consistency of the classification method for a big dataset, the classifier converges to the Bayes classifier

c*(x) = 1 if p(x, 1) > p(x, 0), and c*(x) = 0 otherwise.

This might not be a good solution when we have a small sample size: we are taking the limit as m tends to infinity, so for small m the Bayes risk might not be reached.

The 95th percentile is 0.15, which is above the mean value of 0.08. This means the probability that we have been misled is 0.05, which is less than the mean value.


Structural risk minimization (SRM) is a model-selection procedure – it chooses the model that minimizes the VC upper bound on the risk.

For a fixed dataset, as the complexity C(Hᵢ) of the learning algorithm increases, the training error êrr(c) will decrease but the model fails to generalize to unseen data. On the other hand, as the VC dimension increases, the complexity penalty term will increase, since it depends on the VC dimension. SRM chooses the best model from the sequence of hypothesis classes by minimizing the right-hand side of the above inequality.

Complex models might overfit, while at the same time the more training data we have, the lower the empirical error will be. SRM resolves this trade-off by balancing the model's complexity against the training-set (empirical) error.



As discussed above in (b), δ is the probability that the training data have misled us. If H_k is finite, then with confidence at least 1 − δ, where δ is a small positive number, we can bound the likelihood of being misled when the training error equals zero.

Since the functions are drawn randomly from a distribution, we must add a prior weight to each model. By applying Hoeffding's inequality with these weights, we get that the generalization error is bounded – for each model k.
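For a finite class, Hoeffding's inequality combined with a union bound gives the familiar sample-size requirement m ≥ ln(2|H|/δ) / (2ε²). A small sketch; the specific numbers are illustrative, not from the coursework:

```python
import math

def hoeffding_sample_size(num_hypotheses, eps, delta):
    """Smallest m with 2|H| exp(-2 m eps^2) <= delta, i.e. the number of
    i.i.d. samples needed so that every h in a finite class H has its
    empirical error within eps of its true error, with confidence 1 - delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

# e.g. |H| = 1000 hypotheses, accuracy eps = 0.05, confidence 95%
m = hoeffding_sample_size(1000, eps=0.05, delta=0.05)
```

The bound grows only logarithmically in |H|, which is why even large finite classes remain learnable with modest sample sizes.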





I have used the lecture notes
for this assignment.





