Optimisers sit at the heart of all neural network architectures, defining the nature of the traversal of the search space.
A relatively recent innovation is the Adam optimiser (Kingma & Ba 2014). Based on their findings, Adam compares favourably with other established optimisers such as the adaptive learning-rate family of algorithms (e.g. AdaDelta (Zeiler 2012)). It also demands relatively little tuning, with a default set of hyperparameters typically sufficient to avoid a vanishing learning rate or poor, but locally optimal, solutions. This reduced need for tuning is particularly attractive given the thesis’ anticipated need for experimentation.

2.5.4 Activation Functions

A significant challenge observed in all deep neural networks is that of a vanishing gradient.
The challenge often arises from the use of sigmoid-like squashing functions (and their relatives) within deep networks. If one examines the ‘S’-shaped plot of the popular sigmoid or tanh function, it is evident that the gradients of the curves rapidly tend to zero for large positive or negative input values; as a result, as a neuron’s inputs approach these limits, the network becomes an increasingly ineffectual learner. This phenomenon is compounded by the multiplicative nature of the weights in backpropagation, with the end result being, especially in a deep network, a vanishingly small gradient. This makes the search for a global minimum prohibitively time-consuming, if not practically impossible (Nair & Hinton 2010).

The rectified linear unit (ReLU) devised by Nair and Hinton seeks to address this issue by supplying a constant, non-saturating gradient for all positive input values (Dahl et al. 2013). However, ReLU activation functions are not without issue: their sharp inflection point presents a problem for optimisers during training. To address this issue, the ELU (exponential linear unit) was devised by Clevert et al. (2015). This innovation not only addresses the vanishing gradient problem but also smooths the gradient transition at the inflection point of the activation function, making gradient descent more efficient during training. For context, example plots of the activations are provided in figure 2.3 below.

2.5.5 Regularisation

Regularisation is a common machine learning technique used to penalise overfitting of a model, effectively formalising Occam’s razor mathematically. A variety of methods are available, with the classical L1 and L2 regularisation⁵ still readily employed within CNNs.
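
As an illustration of how the classical penalties enter a loss function, the following is a minimal sketch in plain Python. It assumes the usual textbook forms, in which the L1 penalty is a scaled sum of absolute weight values and the L2 penalty is a scaled sum of squared weight values. The function names, the example weights and the penalty strengths (`lam_l1`, `lam_l2`) are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularised_loss(data_loss, weights, lam_l1=0.0, lam_l2=0.0):
    """Add both penalties to an already computed data loss.

    With lam_l1 = lam_l2 = 0 the data loss is returned unchanged.
    """
    return data_loss + l1_penalty(weights, lam_l1) + l2_penalty(weights, lam_l2)


# Hypothetical values, for illustration only.
weights = [0.5, -1.0, 2.0]
loss = regularised_loss(1.0, weights, lam_l1=0.01, lam_l2=0.001)
print(loss)
```

Larger weights increase both penalties, so minimising the combined loss favours simpler models. The L1 term grows linearly with each weight's magnitude and tends to push individual weights to exactly zero, whereas the L2 term grows quadratically and shrinks all weights smoothly.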