Abstract
Automatic speech recognition (ASR) is the translation of human speech into text by machines and plays an important role nowadays. In this research review we examine four different artificial neural network architectures that are used in the speech recognition field and investigate their performance in different cases. We analyze the state-of-the-art deep neural networks (DNNs), which have evolved into complex structures and achieve significant results on a variety of speech benchmarks. Afterwards, we explain convolutional neural networks (CNNs) and recurrent neural networks (RNNs) and explore their potential in this field. Finally, we present the recent research in highway deep neural networks (HDNNs), which seem to be more flexible for resource-constrained platforms. Overall, we critically compare these methods and show their strengths and limitations. Each method has its own benefits and applications; from these we draw some conclusions and suggest potential future directions.
I. Introduction
Machine Learning (ML) is a field of computer science that gives computers the ability to learn through different algorithms and techniques without being explicitly programmed. ASR is closely related to ML because it uses ML methodologies and procedures [1, 2, 3]. ASR has been around for decades, but it was not until recently that it saw tremendous development, thanks to advances in both machine learning methods and computer hardware. New ML techniques made speech recognition accurate enough to be useful outside of carefully controlled environments, so it can easily be deployed in many electronic devices nowadays (e.g. computers, smart-phones) and be used in many applications such as identifying and authenticating a user via his/her voice.

Speech is the most important mode of communication between human beings, which is why, from the early part of the previous century, efforts have been made to make machines do what only humans could perceive. Research has been conducted over the past five decades, driven mainly by the desire to automate tasks using machines [2]. Many different theories, such as probabilistic modeling and reasoning, pattern recognition and artificial neural networks, have influenced researchers and helped to advance ASR.
The first major advance in the history of ASR occurred in the mid-1970s with the introduction of the expectation-maximization (EM) algorithm [4] for training hidden Markov models (HMMs). The EM technique made it possible to develop the first speech recognition systems using Gaussian mixture models (GMMs). Despite all the advantages of GMMs, they are unable to efficiently model data that lie on or near a nonlinear manifold in the data space (e.g. a sphere). This problem could be solved by artificial neural networks, because they can capture these non-linearities in the data, but the computer hardware of that era did not allow complex neural networks to be built. As a result, most early speech recognition systems were based on HMMs. Later the neural network and hidden Markov model (NN/HMM) hybrid architecture [5] was used for ASR systems. After the 2000s, improvements in computer hardware and the invention of new machine learning algorithms made the training of DNNs possible. DNNs with many hidden layers have been shown to achieve comparable, and sometimes much better, performance than GMMs on many different speech databases and in a range of applications [6]. After the huge success of DNNs, researchers have tried other artificial neural architectures such as recurrent neural networks with long short-term memory units (LSTM-RNNs) [7], deep belief networks and CNNs, and it seems that each one of them has its benefits and weaknesses.
In this literature review we present four types of artificial neural networks (DNNs, CNNs, RNNs and HDNNs). We analyze each method, explain how they are trained, and discuss their advantages and disadvantages. We then compare these methods in the context of ASR, identifying where each one is most suitable and what its limitations are. Finally, we draw some conclusions from these comparisons and carefully suggest some probable future directions.
II. Methods
 A. Deep Neural Networks
Deep neural networks (DNNs) are feed-forward artificial neural networks with more than one layer of hidden units. Each hidden layer has a number of units (or neurons), each of which takes all outputs of the lower layer as input and passes them through a linear transformation followed by a non-linear activation function (e.g. the sigmoid function, the hyperbolic tangent, some kind of rectified linear unit (ReLU) [8, 9], or the exponential linear unit (ELU) [10]). Moreover, when we want to deal with a multi-class classification problem, the posterior probability of each class can be estimated using an output softmax layer. For the training process of DNNs we usually use gradient descent and the back-propagation technique [11]. For large training sets, it is typically more convenient to compute derivatives on a mini-batch of the training set rather than the whole training set (this is called stochastic gradient descent). The cross-entropy (CE) is often used as the cost function, as a measure of the discrepancy between the network output and the target output, although the choice of cost function actually depends on the case.
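To make the above concrete, the following is a minimal sketch of one stochastic gradient descent step for a feed-forward DNN acoustic model with a softmax output and a cross-entropy loss. The library choice (PyTorch), the layer sizes and the placeholder data are illustrative assumptions, not details taken from the papers discussed.

```python
# Minimal sketch of a feed-forward DNN acoustic model (hypothetical sizes).
import torch
import torch.nn as nn

n_features, n_hidden, n_states = 440, 1024, 2000    # e.g. stacked frames -> output classes

model = nn.Sequential(
    nn.Linear(n_features, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_states),                  # logits; the softmax is folded into the loss
)
loss_fn = nn.CrossEntropyLoss()                     # cross-entropy over the softmax output
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One stochastic-gradient-descent step on a random mini-batch (placeholder data).
x = torch.randn(128, n_features)                    # 128 acoustic frames
y = torch.randint(0, n_states, (128,))              # target class labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)                         # forward pass + cross-entropy
loss.backward()                                     # back-propagation
optimizer.step()
```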
One impediment of complex DNN architectures is their optimization. In addition, we need to cope with overfitting, which appears due to the deep architecture of the network. These two problems forced researchers to invent alternative methodologies, known as pretraining methods. One popular method is based on restricted Boltzmann machines (RBMs) [12]. By stacking RBMs we can construct a deep belief network (DBN), which is different from the dynamic Bayesian network. The purpose of this is to add an initial stage of generative pretraining. Pretraining is very important for DNNs because it reduces overfitting and it also reduces the time required for discriminative fine-tuning with back-propagation (which is helpful for the optimization problem).
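As an illustration of the generative pretraining stage, the following is a minimal sketch of one contrastive-divergence (CD-1) update for a single RBM, the building block of a DBN. The layer sizes, the learning rate and the use of the binary-unit update on normalized real-valued inputs are simplifying assumptions made only for illustration.

```python
# Minimal sketch of one contrastive-divergence (CD-1) update for a single RBM.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 440, 512, 0.01

W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)                       # visible biases
b_h = np.zeros(n_hidden)                        # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.random((128, n_visible))               # a mini-batch of (normalized) inputs

# Positive phase: hidden probabilities given the data.
p_h0 = sigmoid(v0 @ W + b_h)
h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

# Negative phase: one step of Gibbs sampling (reconstruction).
p_v1 = sigmoid(h0 @ W.T + b_v)
p_h1 = sigmoid(p_v1 @ W + b_h)

# Update: data statistics minus reconstruction statistics.
W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
b_v += lr * (v0 - p_v1).mean(axis=0)
b_h += lr * (p_h0 - p_h1).mean(axis=0)
```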
DNNs play a major role in the context of ASR. Many architectures have been used by different research groups in order to gain better and better accuracy in acoustic models. Several methodologies are presented in [6], which reports significant results and shows that DNNs in general achieve higher speech recognition accuracy than GMMs on a variety of speech recognition benchmarks such as TIMIT and other large-vocabulary tasks. The main reason is that they can capture the non-linearities in the data and so they can learn much better models compared to GMMs. However, we have to mention that they use many model parameters in order to achieve good enough speech accuracy, which is sometimes a drawback. Furthermore, they are complex and need considerable computational resources. Finally, they have been criticized because they do not preserve any specific structure (different structures can be tried until a significant speech accuracy is achieved), they are difficult to interpret (because they lack a specific structure) and they possess limited adaptability (different approaches are used for different cases). Despite all of these disadvantages, they have remained the state of the art for speech recognition over the last few years and they have given us the most reliable and consistent results overall.
 B. Convolutional Neural Networks
Convolutional neural networks (CNNs) can be regarded as DNNs with the main difference that, instead of using fully connected hidden layers (as happens in DNNs, with full connections among the hidden layers), they use a special network structure which consists of convolution and pooling layers [13, 14, 15]. Furthermore, CNNs have three key properties: local receptive fields (the locality property), in which hidden units are connected to local patches of the layer below; weight sharing, which enables the construction of feature maps; and pooling, which condenses information from the previous layer.

Local receptive fields with shared weights result in a feature map, and we can have multiple feature maps in a hidden layer (depending on how complex we want the convolutional layer to be). So, firstly, the data need to be organized as a number of feature maps in order to be passed to each convolutional layer. We can stack many convolutional layers for more complex architectures, and each one of them is followed by a pooling layer in order to reduce the resolution of the feature maps. This process continues depending on how deep we want our network to be (we could perhaps achieve higher speech accuracy with an increased number of hidden layers in this structure). An example of the whole process and the usage of convolution and pooling layers can be found in [15]. We usually use stochastic gradient descent and the back-propagation technique to train the model. In the end, we typically apply a fully connected hidden layer (no weight sharing) and a softmax output layer.
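The following is a minimal sketch of such a structure: one convolution layer with local receptive fields and shared weights, one pooling layer along frequency, and a fully connected layer followed by the output logits. The input shape (one feature map of 40 mel bands by 11 context frames), the filter sizes and the number of output classes are illustrative assumptions.

```python
# Minimal sketch of a CNN acoustic model over time-frequency feature maps.
import torch
import torch.nn as nn

n_states = 2000                                    # hypothetical number of output classes

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(8, 3)),          # local receptive fields with shared weights
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(3, 1)),              # pooling along the frequency axis
    nn.Flatten(),
    nn.Linear(32 * 11 * 9, 1024),                  # fully connected layer (no weight sharing)
    nn.ReLU(),
    nn.Linear(1024, n_states),                     # logits; softmax is applied in the loss
)

x = torch.randn(128, 1, 40, 11)                    # batch of 40x11 feature-map patches
logits = model(x)                                  # shape: (128, n_states)
```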
CNNs were initially used in image recognition, but they have since been applied to the speech recognition field too. In the context of ASR, one significant problem arises when we want to transform our speech data into feature maps: we cannot use the conventional mel-frequency cepstral coefficient (MFCC) technique [16]. The reason is that this technique does not preserve the locality of the data, whereas for CNNs we want to preserve locality in both frequency and time. Hence, a solution is the use of mel-frequency spectral coefficients (MFSC features) [15]. Moreover, just as RBMs are used for DNNs, there is a corresponding procedure for CNNs, the convolutional RBM (CRBM) [17], which allows us to pretrain on our data in order to gain speech accuracy and reduce overfitting. In [15], the authors also examine the specific case of a CNN with limited weight sharing for ASR (LWS model) and propose to pretrain it by modifying the CRBM model.
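As an illustration of such features, the following sketch computes log mel-filterbank (MFSC-style) features and stacks a small context window so that each frame becomes a time-frequency patch for the convolutional layers. It assumes the librosa library and a hypothetical input file; the filterbank size and context width are illustrative.

```python
# Minimal sketch of extracting log mel-filterbank (MFSC-style) features,
# which preserve locality in frequency, unlike MFCCs.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)                     # shape: (40 mel bands, n_frames)

# Stack a context window of +/- 5 frames around each frame to form the
# time-frequency patches fed to the convolutional layers.
context = 5
padded = np.pad(log_mel, ((0, 0), (context, context)), mode="edge")
patches = np.stack([padded[:, t:t + 2 * context + 1]
                    for t in range(log_mel.shape[1])]) # shape: (n_frames, 40, 11)
```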
In conclusion, CNNs have three major properties: locality, weight sharing, and pooling. Each one of them has the potential to improve speech recognition performance. These properties can reduce the overfitting problem and add robustness against non-white noise. In addition, they can reduce the number of network weights to be learned. Both locality and weight sharing are significant factors for the property of pooling, which is very helpful in handling the small frequency shifts that are common in speech signals [15]. In general, CNNs seem to have relatively better performance in ASR, taking advantage of their special network structure.
 C. Recurrent Neural Networks
We often wish to model data that form a sequence or trajectory through time, such as audio signals, sequences of words or currency exchange rates. Modelling sequential data means handling invariances across time and the fact that the current state depends on the past. So, we need to share information across time, and this is achieved through the recurrent neural network (RNN) architecture.

RNNs are artificial neural networks that contain cyclic connections, which make the current state of the hidden units depend on their previous state (the connections work as memories for the system). We can train an RNN by unfolding it and back-propagating through time (the BPTT technique), summing the derivatives for each weight as we go through the sequence. However, BPTT involves taking the product of many gradients (as in a very deep network) and this can lead to vanishing or exploding gradients. As a result, training may not be effective. A solution is to use modified optimization algorithms such as RMSProp [18] (sketched below) or modified hidden unit transfer functions such as the long short-term memory (LSTM).
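The RMSProp update mentioned above scales each gradient component by a running estimate of its recent magnitude, which helps when gradients vary wildly across time steps. The following generic sketch uses illustrative hyper-parameters not tied to any specific paper.

```python
# Minimal sketch of the RMSProp parameter update (generic form).
import numpy as np

def rmsprop_step(theta, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """Scale each gradient component by a running RMS of its recent magnitudes."""
    cache = rho * cache + (1.0 - rho) * grad ** 2    # running average of squared gradients
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

theta = np.zeros(10)                 # parameters
cache = np.zeros_like(theta)         # per-parameter accumulator
grad = np.random.randn(10)           # gradient from BPTT (placeholder)
theta, cache = rmsprop_step(theta, grad, cache)
```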
The LSTM contains special units called memory blocks in the recurrent hidden layer. The memory blocks contain an internal recurrent state (the memory cell), which is combined with the previous state and the LSTM input. In addition, there are two gates, whose values depend on the current input and the previous state. The first is the input gate, which controls how much of the input to the unit is written to the internal state. The second is the forget gate, which controls how much of the previous internal state is carried over to the new internal state. Together they allow the network to control what information is stored and overwritten at each step.
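The following is a minimal sketch of one LSTM cell step. The text above describes the input and forget gates; a standard LSTM cell also contains an output gate, which is included here for completeness. The weight shapes and initialization are hypothetical.

```python
# Minimal sketch of a single LSTM cell step (numpy).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b hold the parameters of the input (i), forget (f) and
    output (o) gates and of the candidate state (g), concatenated."""
    H = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b                 # joint pre-activation, size 4*H
    i = sigmoid(z[..., 0*H:1*H])                 # input gate: how much new input is written
    f = sigmoid(z[..., 1*H:2*H])                 # forget gate: how much old state is kept
    o = sigmoid(z[..., 2*H:3*H])                 # output gate
    g = np.tanh(z[..., 3*H:4*H])                 # candidate memory content
    c_t = f * c_prev + i * g                     # internal recurrent state (memory cell)
    h_t = o * np.tanh(c_t)                       # hidden output passed to the next step
    return h_t, c_t

D, H = 40, 128                                   # hypothetical input and hidden sizes
x_t, h_prev, c_prev = np.random.randn(1, D), np.zeros((1, H)), np.zeros((1, H))
W, U, b = 0.01 * np.random.randn(D, 4*H), 0.01 * np.random.randn(H, 4*H), np.zeros(4*H)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, U, b)
```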
RNNs have been used in different architectures. Firstly, influenced by the NN/HMM hybrid architecture, the corresponding RNN/HMM architecture was tried, without any significant success compared to conventional DNNs [19]. Later, instead of the RNN/HMM architecture, end-to-end training of RNNs for speech recognition seemed more promising. In this direction, the combination with long short-term memory units showed that RNNs can outperform DNNs both on the TIMIT benchmark [20] and on large-vocabulary speech recognition [7].
 D. Highway Deep Neural Networks
Highway deep neural networks (HDNNs) are deep feed-forward neural networks [21]. They are distinguished from conventional DNNs for two main reasons: firstly, they use far fewer model parameters, and secondly, they use two types of gate functions to facilitate the information flow through the hidden layers.

HDNNs are multi-layer networks with many hidden layers. In each layer, the initial input or the output of the previous hidden layer is combined linearly with the parameters of the current layer, followed by a non-linear activation function (e.g. the sigmoid function). In the output layer we usually use the softmax function in order to obtain the posterior probability of each class given the initial inputs. The network is usually trained by gradient descent to minimize a loss function such as the cross-entropy (CE). So the architecture and the training process are the same as for the DNNs described in subsection A.
The difference from standard DNNs is that HDNNs were proposed to enable very deep networks to be trained by augmenting the hidden layers with gate functions [22]. This augmentation happens through the transform and carry gate functions: the former scales the original hidden activations and the latter scales the input before passing it directly to the next hidden layer [21].
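A single highway layer can be sketched as follows: the transform gate scales the hidden activation and the carry gate scales the input that is passed straight to the next layer. The layer width and initialization are hypothetical; this is an illustration of the gating idea rather than the exact parameterization used in the cited papers.

```python
# Minimal sketch of a single highway layer with transform and carry gates.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t, W_c, b_c):
    h = sigmoid(x @ W_h + b_h)          # ordinary hidden activation
    t = sigmoid(x @ W_t + b_t)          # transform gate: scales the hidden activation
    c = sigmoid(x @ W_c + b_c)          # carry gate: scales the input passed straight through
    return t * h + c * x                # gated combination sent to the next layer

d = 512                                 # layer width (input and output match so x can be carried)
x = np.random.randn(1, d)
params = [0.01 * np.random.randn(d, d) if i % 2 == 0 else np.zeros(d) for i in range(6)]
y = highway_layer(x, *params)           # only W_t, b_t, W_c, b_c are gate parameters
```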
Three main methods for training are presented in [21, 23, 24]: sequence training, the adaptation technique and teacher-student training. Combining these methodologies with the two gates demonstrates how important a role the carry and transform gates play in training. The main reason is that the gates are responsible for controlling the flow of information among the hidden layers. They allow us to achieve speech recognition accuracy comparable to classic DNNs but with far fewer model parameters, because the whole network can be handled through the parameters of the gate functions (which are far fewer than the parameters of the whole network). This outcome is crucial for platforms such as mobile devices (e.g. voice recognition on mobiles), where few resources are available.
III. Comparison of the Methods
The methods we presented have their benefits and limitations. In general, DNNs behave very well and in many cases they show considerably better performance than GMMs on a range of applications. The main reason is that they can handle the non-linearities in the data space much better. On the other hand, their biggest drawback compared with GMMs is that it is much harder to make good use of large cluster machines to train them on massive data [6].
As far as CNNs are concerned, they can handle frequency shifts, which are difficult to handle with other models such as GMMs and DNNs. Furthermore, it is also difficult to learn an operation such as max-pooling in standard artificial neural networks. Moreover, CNNs can handle the temporal variability in the speech features as well [15]. On the other hand, careful selection of the pooling size is very important, because otherwise we may cause phonetic confusion, especially at segment boundaries. Despite the fact that CNNs seem to achieve better accuracy than DNNs with fewer parameters, they are computationally much more expensive because of the complexity of the convolution operation.
CNNs model invariances across space and RNNs model invariances across time. The flexibility of sequential training in RNNs with long short-term memory units gives them an advantage in handling speech data over time. Thus, they perform better and achieve considerably higher speech accuracy compared to standard DNNs.
HDNNs are compact compared to conventional DNNs because they use many fewer model parameters; this is possible because, through the gate functions, they can control the behavior of the whole network using only the parameters of the gates (which are far fewer than the parameters of the whole network). Moreover, with HDNNs we can update the whole model by simply updating the gate functions (there is no need to update the parameters of the whole network), and in this way we can gain considerably in speech recognition accuracy. Hence, the two gate functions play a major role in this architecture. Although HDNNs are considered useful for resource-constrained platforms, their final number of model parameters is still rather large [21]. On the other hand, we cannot conclude much about their general performance because they are a recent proposal and more research is needed to see their overall benefits and limitations. However, the main idea is to use them in order to obtain ASR accuracy comparable to DNNs while simultaneously reducing the number of model parameters.
IV. Conclusions
Overall, we can say that DNNs are the state of the art today because they behave very well on a range of speech recognition benchmarks. However, other artificial neural network architectures such as CNNs have achieved comparable performance in the context of ASR. Besides that, research continues to be conducted in this field in order to find new methods, learning techniques and architectures that will allow us to train our models more efficiently. This may mean fewer parameters, less computational power, less complex models, or more structured models. Ideally we would like a single general model that covers many cases rather than many different models applied in different circumstances. On the other hand, this is probably difficult, so distinct methodologies and specific techniques for different cases may be our temporary, or even our only, solution. In this direction, HDNNs or other architectures may be used to deal with specific cases.

Many future directions for research have been suggested in the last few years in order to advance ASR. Some probable suggestions are the use of unsupervised learning (see, for example, [13, 17, 25]) or reinforcement learning for acoustic models [26]. Another potential direction is to search for new architectures or special structures in artificial neural networks (e.g. [27]), or to invent new learning techniques while at the same time improving our current algorithms.
