Recognizing characters and digits in documents such as photographs captured at street level is an important factor in developing digital maps. For example, Google Street View contains millions of geo-located images. By recognizing the text in these images, we can build more precise maps, which in turn improves navigation services. Although conventional character classification is largely considered solved in computer vision, recognizing digits or characters in natural scenes such as photographs remains a complex problem. The difficulty stems from non-contrasting backgrounds, low resolution, blurred images, font variation, lighting, and similar factors. The traditional approach to this task was a two-step process.
First, the image is sliced to isolate each character, and then recognition is performed on each extracted image. This was done using multiple hand-crafted features and template matching [1]. The main purpose of this project is to recognize street view house numbers using a deep convolutional neural network. For this work, I considered the digit classification dataset of house numbers extracted from street-level images [5]. This dataset is similar in flavor to the MNIST dataset but has more labeled data. It has more than 600,000 digit images, which contain color information and various natural backgrounds [5]. To achieve the goal, I developed an application that detects the number in an image. A convolutional neural network model with multiple layers is trained on the dataset and used to detect the house-number digits.
I used a traditional convolutional architecture with different pooling methods and multistage features, and finally reached almost 92% accuracy.
Street view number detection is a natural scene text recognition problem, which is quite different from printed character or handwriting recognition. Research in this field started in the 1990s, but it is still considered an unsolved problem. As mentioned earlier, the difficulties arise from variation in fonts, scale, and rotation, from low light, and so on. In earlier years, natural scene text was identified sequentially: first, character classification was performed, mainly using sliding windows or connected components [4]. After that, words were predicted by applying the character classifier from left to right. More recently, segmentation methods guided by a supervised classifier have been used, in which words are recognized through a sequential beam search [4]. However, none of these approaches solves the street view recognition problem.
In recent work, convolutional neural networks have proven more accurate at object recognition tasks [4]. Some research has applied CNNs to scene text recognition [4]. Studies show that CNNs have a large capacity to represent the many types of character variation found in natural scenes, and they continue to cope with this high variability. Research on convolutional neural networks started in the early 1980s, and they were successfully applied to handwritten digit recognition in the 1990s [4]. With recent advances in computing resources, training sets, algorithms, and dropout training, deep convolutional neural networks have become more effective at recognizing natural scene digits and characters [3]. Previously, CNNs were mainly used to detect a single object in an input image. It was quite difficult to isolate each character in a single image and identify it. Goodfellow et al. solved this problem by using a large, deep CNN to model the whole image directly, with a simple graphical model as the top inference layer [4].
The rest of the paper is organized as follows: Section III describes the convolutional neural network architecture, Section IV presents the experiment, results, and discussion, and Section V gives the future work and conclusion.
A convolutional neural network (CNN) is a multilayer network designed to handle complex, high-dimensional data; its overall architecture is the same as that of a typical neural network [8]. Each layer contains neurons that carry weights and biases. Because the inputs are images, the architecture can exploit their structure, which makes the forward pass more efficient to implement and reduces the number of parameters in the network [7]. The first layer is a convolutional layer. Here the input is convolved with a set of filters to extract features from it. The size of the feature maps depends on three parameters: the number of filters, the stride, and the padding. After each convolutional layer, a non-linear operation, ReLU, is applied. It converts all negative values to zero.
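The convolution and ReLU steps described above can be illustrated with a minimal sketch in plain Python. It is not the code used in this project: the 4×4 input, the 3×3 filter, and the function names conv2d_valid and relu are made up for illustration, and the filter is applied without padding. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the feature map."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(0, n_rows - k_rows + 1, stride):
        row = []
        for c in range(0, n_cols - k_cols + 1, stride):
            # Weighted sum of the window whose top-left corner is (r, c).
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical single-channel input and a vertical-edge filter.
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, kernel)  # [[0, 2], [-1, -1]]
activated = relu(feature_map)              # [[0, 2], [0, 0]]
```

With an input of side n, a filter of side f, padding p, and stride s, the output side is (n - f + 2p) / s + 1, which gives (4 - 3 + 0) / 1 + 1 = 2 here.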
Next is the pooling, or sub-sampling, layer, which reduces the size of the feature maps. Pooling can be of different types: max, average, or sum. Max pooling is the most commonly used. Down-sampling also helps control overfitting.
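As a rough illustration of max pooling, the sketch below keeps the largest value in each window of a feature map. The 2×2 window with a stride of 2 is an assumption chosen for the example (the text does not state the window size), and max_pool2d and the sample values are hypothetical.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, n_rows - size + 1, stride):
        row = []
        for c in range(0, n_cols - size + 1, stride):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map; pooling halves each spatial dimension.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

pooled = max_pool2d(feature_map)  # [[6, 2], [2, 9]]
```

Replacing max with an average or a sum over the window gives the other two pooling types mentioned above.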
Together, the convolution and pooling layers act as a feature extractor that retrieves selected features from the input images. Their output is then passed to the fully connected layers (FCL) and the output layer. In a CNN, the output of each layer serves as the input to the next. The exact architecture differs from one type of problem to another.
The main objective of this project is to detect and identify house-number signs in street view images. The dataset considered for this project is the Street View House Numbers (SVHN) dataset taken from [5], which has similarities with the MNIST dataset. The SVHN dataset has more than 600,000 labeled characters, and the images are in .png format. After extracting the dataset, I resize all images to 32×32 pixels with three color channels. There are 10 classes, one for each digit.
Digit ‘1’ is labeled 1, ‘9’ is labeled 9, and ‘0’ is labeled 10 [5]. The dataset is divided into three subsets: a train set, a test set, and an extra set. The extra set is the largest subset, with 531,131 images. The train set has 73,257 images and the test set has 26,032 images. Figure 3 shows an example of the original, variable-resolution, color house-number images, in which each digit is marked by a bounding box.
Bounding box information is stored in the digitStruct.mat file instead of being drawn directly on the images in the dataset. The digitStruct.mat file contains a struct called digitStruct with the same length as the number of original images. Each element in digitStruct has the following fields: “name”, a string containing the filename of the corresponding image, and “bbox”, a struct array that contains the position, size, and label of each digit bounding box in the image. For example, digitStruct(100).bbox(1).height is the height of the first digit bounding box in the 100th image [5]. It is clear from Figure 3 that most house-number signs in the SVHN dataset are printed signs and are easy to read [2]. Nevertheless, the large variation in font, size, and color makes detection very difficult.
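The digitStruct layout and the label convention described above can be mirrored in Python roughly as follows. This is only a sketch: it does not read the .mat file, the field names left, top, and width are assumed from the published SVHN format (only height is named in the text), and the sample entry and the helpers house_number and enclosing_box are invented for illustration. Reading digits in left-to-right order by sorting on left is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BBox:
    left: int    # assumed field name
    top: int     # assumed field name
    width: int   # assumed field name
    height: int
    label: int   # 1-9 for digits '1'-'9', 10 for digit '0'


@dataclass
class DigitStructEntry:
    name: str          # filename of the corresponding image
    bbox: List[BBox]   # one box per digit in the image


def house_number(entry):
    """Read the digits from left to right; label 10 encodes the digit 0."""
    ordered = sorted(entry.bbox, key=lambda box: box.left)
    return "".join(str(box.label % 10) for box in ordered)


def enclosing_box(entry):
    """Return (left, top, width, height) of the box around all digits."""
    left = min(box.left for box in entry.bbox)
    top = min(box.top for box in entry.bbox)
    right = max(box.left + box.width for box in entry.bbox)
    bottom = max(box.top + box.height for box in entry.bbox)
    return left, top, right - left, bottom - top


# Made-up entry standing in for one element of digitStruct.
entry = DigitStructEntry(
    name="100.png",
    bbox=[
        BBox(left=40, top=12, width=10, height=28, label=2),
        BBox(left=52, top=10, width=11, height=30, label=10),
    ],
)

first_height = entry.bbox[0].height  # 28, like digitStruct(100).bbox(1).height
number = house_number(entry)         # "20"
crop = enclosing_box(entry)          # (40, 10, 23, 30)
```

An enclosing box of this kind is one plausible way to crop the whole house number before resizing it to 32×32 pixels.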
The variation in resolution is also large (median: 28 pixels; max: 403 pixels; min: 9 pixels) [2]. The graph below shows the large variation in character height, measured as the height of the bounding box in the original street view dataset.
This means that the size of the characters, their placement, and their resolution are not evenly distributed across the dataset. Because the data are not uniformly distributed, correct house-number detection is difficult.
In my experiment, I train a multilayer CNN for street view house number recognition and check its accuracy on the test data. The code is written in Python using TensorFlow, a powerful library for implementing and training deep neural networks.
The central unit of data in TensorFlow is the tensor. A tensor consists of a set of primitive values shaped into an array of any number of dimensions. A tensor’s rank is its number of dimensions [9]. Along with TensorFlow, I used other libraries such as NumPy, Matplotlib, and SciPy. Because of limited technical resources, I perform my analysis using only the train and test sets and omit the extra set, which is almost 2.7 GB. To simplify the analysis, I delete all data points that have more than 5 digits.
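To make the notion of tensor rank introduced above concrete, the following sketch computes the shape and rank of nested Python lists. It does not use TensorFlow; tensor_shape and tensor_rank are hypothetical helpers, and they assume that the nested lists are rectangular.

```python
def tensor_shape(value):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def tensor_rank(value):
    """The rank is the number of dimensions."""
    return len(tensor_shape(value))


scalar = 3.0                      # rank 0
vector = [1.0, 2.0, 3.0]          # rank 1
# A 32x32 image with three color channels, filled with zeros.
image = [[[0, 0, 0] for _ in range(32)] for _ in range(32)]

image_shape = tensor_shape(image)  # (32, 32, 3)
image_rank = tensor_rank(image)    # 3
```

A batch of such images adds one more dimension, giving a rank-4 tensor.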
Preprocessing the original SVHN dataset produces a pickle file, which is used in my experiment. For the implementation, I randomly shuffle the validation dataset, then use the pickle file to train a 7-layer convolutional neural network. At the start of the network, the first convolution layer has 16 feature maps with 5×5 filters and produces a 28×28×16 output.
A ReLU layer is added after each convolution layer to add more non-linearity to the decision-making process. After the first sub-sampling, the output size decreases to 14×14×16. The second convolution has 512 feature maps with 5×5 filters and produces a 10×10×32 output. Applying sub-sampling a second time gives an output size of 5×5×32.
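The spatial sizes quoted so far (28, 14, 10, and 5) can be checked with the standard output-size formula. The sketch below tracks only the width and height, not the number of feature maps. It assumes a 32×32 input, 5×5 filters with a stride of 1 and no padding, and 2×2 sub-sampling with a stride of 2; the sub-sampling window is an assumption, since it is not stated in the text.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Side length after a convolution: (n - f + 2p) // s + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=2):
    """Side length after sub-sampling with the given window and stride."""
    return (size - window) // stride + 1


input_size = 32
after_conv1 = conv_output_size(input_size, kernel=5)   # 28
after_pool1 = pool_output_size(after_conv1)            # 14
after_conv2 = conv_output_size(after_pool1, kernel=5)  # 10
after_pool2 = pool_output_size(after_conv2)            # 5
```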
Finally, the third convolution has 2048 feature maps with the same filter size. Note that in my experiment the stride is 1 and the padding is zero. During the experiment, I use the dropout technique to reduce overfitting. A softmax regression layer produces the final output. Weights are initialized randomly using Xavier initialization, which keeps the weights in the right range by automatically scaling the initialization according to the number of input and output neurons. After the model is built, I start training the network and log the accuracy, loss, and validation accuracy every 500 steps.
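The idea behind Xavier initialization can be sketched in plain Python. This is an illustration rather than the initializer used in the experiment: it assumes the uniform variant, which draws each weight from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), and the layer sizes (64 inputs, 10 outputs) are hypothetical.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=random):
    """Draw a fan_in x fan_out weight matrix from the Xavier uniform range."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


# Hypothetical layer: 64 inputs feeding 10 output classes.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = xavier_uniform(fan_in=64, fan_out=10, rng=rng)
limit = math.sqrt(6.0 / (64 + 10))
```

Because the limit shrinks as the number of input and output neurons grows, wide layers start with smaller weights, which keeps the scale of the activations roughly constant from layer to layer.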
Once training is done, I compute the test set accuracy. The Adagrad optimizer is used to minimize the loss. After reaching a suitable accuracy level, I stop training the network and save the learned parameters in a checkpoint file. When detection is needed, the program loads the checkpoint file without training the model again. Initially, the model produced an accuracy of 89% with just 3000 steps. This is a good starting point, and after a few more rounds of training the accuracy reaches 90%. However, I added some additional features to increase accuracy. First, I added a dropout layer between the third convolution layer and the fully connected layer.
This makes the network more robust and helps prevent overfitting. Second, I introduced exponential decay of the learning rate, with an initial rate of 0.05. The rate decays every 10,000 steps with a base of 0.95. This lets the network take bigger steps at first, so that it learns fast, and smaller steps over time as it moves closer to a minimum of the loss. With these changes, the model produces an accuracy of 91.9% on the test set. Since the training and test sets are large, there is a chance of further improvement if the model is trained for longer. During my analysis, I reached an accuracy of almost 92%.
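The learning-rate schedule and the Adagrad optimizer mentioned above can be sketched as follows. The values 0.05, 10,000, and 0.95 come from the text; everything else is an assumption. In particular, the text does not say whether the rate decays continuously or in discrete jumps, so the staircase flag covers both readings, and adagrad_step is the textbook update rule rather than TensorFlow's implementation.

```python
import math


def exponential_decay(step, initial_rate=0.05, decay_steps=10_000,
                      decay_base=0.95, staircase=False):
    """Learning rate after `step` training steps."""
    exponent = step / decay_steps
    if staircase:
        # Decay in discrete jumps, once every `decay_steps` steps.
        exponent = math.floor(exponent)
    return initial_rate * decay_base ** exponent


def adagrad_step(weights, grads, accum, learning_rate, eps=1e-8):
    """One textbook Adagrad update; returns new weights and accumulators."""
    # Accumulate the squared gradient of every weight.
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    # Weights with a large gradient history take smaller steps.
    new_weights = [
        w - learning_rate * g / (math.sqrt(a) + eps)
        for w, g, a in zip(weights, grads, new_accum)
    ]
    return new_weights, new_accum


rate_at_start = exponential_decay(0)        # 0.05
rate_after_10k = exponential_decay(10_000)  # about 0.0475

# Hypothetical single weight with gradient 2.0.
weights, accum = adagrad_step([1.0], [2.0], [0.0], rate_at_start)
```

With the continuous form the rate falls slightly at every step, while with staircase=True it stays at 0.05 until step 10,000 and then drops to 0.0475.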
After the first round of training, the accuracy was 89%. After several rounds of training, it reached 92%. As mentioned earlier, the saved checkpoint file can be restored later to continue training or to detect digits in new images. The use of dropout helps ensure that the model generalizes and can predict most images. The model is tested over a wide range of inputs from the test dataset and can recognize the house numbers in most images. In Figure 5, it correctly recognizes seven of the ten house numbers.
However, the model still gives incorrect output when the images are blurry or contain other noise. Because of limited resources, I trained the model only a few times, as each run takes a long time. I believe accuracy could well increase if the whole dataset were used. Better hardware and a GPU would also let the model run faster.
In this experiment, I proposed a multilayer deep convolutional neural network to recognize street view house numbers. The model was trained and tested on the train and test subsets of a dataset of more than 600,000 images and achieved almost 92% accuracy. The analysis shows that the model produces correct output for most images. However, detection may fail if the image is blurry or contains noise.
The most exciting part of the project was discovering how applied tricks such as dropout and exponential learning rate decay perform on real data. Because many variations of the CNN architecture can be implemented, it is very difficult to know which architecture will work best for a specific dataset. Determining the most appropriate CNN architecture was a very challenging aspect of this experiment. The model implemented in this project is relatively simple but does the job well and is quite robust.
However, some work still needs to be done to optimize the accuracy. As future work, I will extend my experiment using other CNN architectures along with hybrid techniques and algorithms, to find out which one gives better accuracy with minimum cost and lower loss. I will also try to incorporate the whole dataset in the next experiment.