
A Review of Generative Image Modeling using Style and Structure Adversarial Networks

Introduction


3D image reconstruction using unsupervised learning has faced many setbacks (such as occlusion, unrealistic artefacts, jarring rendering artefacts and the elimination of fine structure) since the inception of computer vision, despite the state-of-the-art approaches employed [1, 2, 3]. Most of the approaches adopted yielded little or no improvement. The underlying issue lies in the approaches themselves, because most researchers do not consider the theory of image formation [1]. Image formation can be decomposed into the image's structure (the basic geometry of the scene) and the image's style (the texture of the objects and the illumination). This paper presents a new approach to 3D image reconstruction called the Style and Structure Generative Adversarial Network (S2-GAN). S2-GAN consists of two components: the Structure-GAN, which generates a surface normal map of an image, and the Style-GAN, which takes both the surface normal map and noise (z) as input to produce an equivalent reconstructed image as output; both networks are trained in a competition (adversarial) setting [1].

The Style and Structure Generative Adversarial Network is a modified unsupervised Generative Adversarial Network whose main influence lies in robotics and graphics applications. The S2-GAN model can be used as a learned rendering engine (LRE) that renders a corresponding image whenever a 3D input is given to the model. The S2-GAN model also has the capability of reconstructing a new image by modifying the structure of a 3D input image [1, 2].

Style and Structure Generative Adversarial Network Architecture

S2-GAN consists of two Generative Adversarial Networks. Fig 1 depicts the generative pipeline of S2-GAN. In this architecture, training of the model is split into two stages: each network is first trained separately, and the two networks are then trained jointly [1, 2].

Fig 1: Generative pipeline. A sample ẑ drawn from a uniform noise distribution is given to the Structure-GAN as input, which generates a surface normal map as Output 1. The output of the Structure-GAN (the surface normal map), together with a second noise sample z̃, is given to the Style-GAN as input to generate a reconstructed natural indoor scene as Output 2.

 

The Structure-GAN consists of two networks, namely a Generator and a Discriminator. The Generator (G) takes an input sample drawn from a uniform noise distribution and generates a surface normal map. The Discriminator (D) learns to classify surface normal maps as real (acquired from depth data) or generated [3, 4, 5]. The purpose of the Structure-GAN framework is to learn how to generate surface normal maps; training uses ground-truth surface normals obtained from Kinect depth.

Given real samples X = (X_1, ..., X_M) and a set of samples ẑ = (ẑ_1, ..., ẑ_M) drawn from the uniform distribution, the loss function maximized with respect to D can be written as:

L^{D}(X, \hat{z}) = \sum_{i=1}^{M} \left[ \log D(X_i) + \log\left(1 - D(G(\hat{z}_i))\right) \right]                (1)

Equation (1) trains D to determine whether a surface normal map comes from the real maps or from the generated samples, while the loss function minimized with respect to G can be written as follows:

L^{G}(\hat{z}) = \sum_{i=1}^{M} \log\left(1 - D(G(\hat{z}_i))\right)                (2)

Equation (2) trains G to fool D into classifying the generated surface normal maps as real.
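As a concrete illustration, the two objectives in eqs. (1) and (2) can be sketched with the commonly used non-saturating binary cross-entropy form in PyTorch. This is a minimal sketch, not the authors' released implementation: `structure_G` and `structure_D` are placeholder modules for the Structure-GAN generator and discriminator, and the 100-dimensional uniform noise is an assumption.

```python
import torch
import torch.nn.functional as F

def structure_d_loss(structure_D, real_normals, fake_normals):
    """Eq. (1): D learns to score real surface normal maps as 1 and generated maps as 0."""
    real_logits = structure_D(real_normals)
    fake_logits = structure_D(fake_normals.detach())  # do not backprop into G on this step
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def structure_g_loss(structure_D, fake_normals):
    """Eq. (2): G tries to fool D into labelling its generated normal maps as real."""
    fake_logits = structure_D(fake_normals)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

# usage sketch (z_hat: a batch of uniform noise vectors, e.g. torch.rand(128, 100) * 2 - 1):
# fake_normals = structure_G(z_hat)
# d_loss = structure_d_loss(structure_D, real_normals, fake_normals)
# g_loss = structure_g_loss(structure_D, fake_normals)
```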

 

Style-GAN

Style-GAN is a conditional Generative Adversarial Network [6, 7]. It is trained in parallel to generate images conditioned on surface normal maps: the surface normal maps generated by the Structure-GAN are used as conditional information for both the generator and the discriminator.

 

Given real RGB images X = (X_1, ..., X_M), their corresponding surface normal maps C = (C_1, ..., C_M) and samples z̃ = (z̃_1, ..., z̃_M) drawn from the noise distribution as inputs to the model, the loss functions for the discriminator and the generator are reformulated as follows:

L^{D}(X, C, \tilde{z}) = \sum_{i=1}^{M} \left[ \log D(X_i, C_i) + \log\left(1 - D\left(G(C_i, \tilde{z}_i), C_i\right)\right) \right]                (3)

L^{G}(C, \tilde{z}) = \sum_{i=1}^{M} \log\left(1 - D\left(G(C_i, \tilde{z}_i), C_i\right)\right)                (4)
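A minimal sketch of how the conditioning in eqs. (3)-(4) can be realised: the surface normal map is concatenated with the RGB image along the channel dimension before being scored by the discriminator. The module below is an illustrative stand-in under that assumption, not the architecture listed in Table 1.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Scores an RGB image as real/fake given its surface normal map as the condition."""
    def __init__(self):
        super().__init__()
        # 3 RGB channels + 3 surface-normal channels are stacked at the input
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, rgb, normals):
        return self.net(torch.cat([rgb, normals], dim=1))

# usage: logits = D(generated_rgb, normal_maps) vs. logits = D(real_rgb, normal_maps)
```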

 

(a) Structure-GAN (Generator)

Layer           fc          uconv   conv   conv   conv   conv   uconv   conv   uconv   conv
Input size      -           9       18     18     18     18     18      36     36      72
Kernel number   9 x 9 x 64  128     128    256    512    512    256     128    64      3
Kernel size     -           4       3      3      3      3      4       3      4       5
Stride          -           2(up)   1      1      1      1      2(up)   1      2(up)   1

(b) Structure-GAN (Discriminator)

Layer           conv   conv   conv   conv   conv   fc
Input size      72     36     36     18     9      -
Kernel number   64     128    256    512    128    1
Kernel size     5      5      3      3      3      1
Stride          2      1      2      2      1      -

(c) Style-GAN (Discriminator)

Layer           conv   conv   conv   conv   conv   fc
Input size      128    64     32     16     8      -
Kernel number   64     128    256    512    128    1
Kernel size     5      5      3      3      3      1
Stride          2      2      2      2      1      -

Table 1: (a), (b) and (c) show the network architectures of the Structure-GAN generator, the Structure-GAN discriminator and the Style-GAN discriminator respectively. "conv" denotes a convolutional layer, "uconv" a deconvolutional (fractionally-strided convolutional) layer and "fc" a fully connected layer; a stride of 2(up) indicates a 2x upsampling of resolution.
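Reading Table 1(a) directly, the Structure-GAN generator can be sketched as below. The layer widths, kernel sizes and strides come from the table; the activations, padding values and the 100-dimensional noise input are assumptions filled in to make the spatial sizes (9 → 18 → 36 → 72) work out, since the table does not list them.

```python
import torch
import torch.nn as nn

class StructureGenerator(nn.Module):
    """Maps a uniform noise vector to a 72x72 surface normal map, following Table 1(a)."""
    def __init__(self, z_dim=100):                                # noise dimension assumed
        super().__init__()
        self.fc = nn.Linear(z_dim, 9 * 9 * 64)                    # fc: 9 x 9 x 64
        self.body = nn.Sequential(
            nn.ConvTranspose2d(64, 128, 4, stride=2, padding=1),  # uconv: 9 -> 18
            nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), # uconv: 18 -> 36
            nn.ReLU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # uconv: 36 -> 72
            nn.ReLU(),
            nn.Conv2d(64, 3, 5, padding=2),                       # 3-channel normal map
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 9, 9)
        return torch.tanh(self.body(x))                           # range [-1, 1] assumed
```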

 

 

 

 

 

 

Fig 2: The architecture of the generator in Style-GAN

 

 

 

 

Fig 3: The Style-GAN. The generator G learns to generate RGB images from the ground truth surface normals and the noise input z̃. The discriminator D takes the generated images, real images and their corresponding normal maps as inputs to perform classification. The FCN takes the generated images as input and predicts the surface normal maps.

Given the ground truth surface normals and z̃ as inputs, the generator G learns to generate RGB images. The supervision comes from two networks:

i. the discriminator network, which takes the generated images, real images and their corresponding normal maps as inputs to perform classification, and

ii. the FCN, which takes the generated images as input and predicts the surface normal maps.

 

 

 

Fig 4: Full model of S2-GAN. The model generates RGB images given ẑ and z̃ as inputs. During joint training, the loss from the Style-GAN is also passed down to the Structure-GAN.

Multi-task Learning with Pixel-wise Constraints

In order to remove noise and align the edges of the images generated by the Style-GAN, an FCN is attached to the generator network to serve as a pixel-wise constraint that guides the generator to align its output with the input surface normal maps [8, 9, 10].

The constraint rests on the assumption that if a generated image is realistic enough, the surface normal map can be reconstructed from it; enforcing this assumption pushes every pixel towards an accurate surface normal.

A loss function is attached to the FCN to carry out surface normal estimation. The loss function can be written as:

L^{FCN}(X, C) = \frac{1}{K \times K} \sum_{i=1}^{M} \sum_{k=1}^{K \times K} L_s\left(F_k(X_i), C_{i,k}\right)                (5)

where:

L_s is the softmax loss, and the output surface normal map has K x K dimensions;

K = 128 is the same size as the input image;

F_k(X_i) is the output of the k-th pixel in the i-th sample;

C_{i,k} is the label for the k-th pixel in sample i.

The generator of the Style-GAN can then be trained by combining the loss functions in eq. (4) and eq. (5):

L^{G}_{multi}(C, \tilde{z}) = L^{G}(C, \tilde{z}) + L^{FCN}\left(G(C, \tilde{z}), C\right)                (6)

where G(C, z̃) represents the generated images given a batch of surface normal maps C and noise z̃.
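A sketch of how the pixel-wise constraint of eq. (5) and the combined objective of eq. (6) might look, assuming the ground-truth normals are quantised into discrete per-pixel classes (the softmax formulation above implies such a quantisation, whose details are not reproduced here). `fcn`, `style_G` and `style_D` are hypothetical module names.

```python
import torch
import torch.nn.functional as F

def fcn_pixelwise_loss(fcn, generated_rgb, normal_labels):
    """Eq. (5): per-pixel softmax loss between FCN-predicted and target normal classes.
    normal_labels: (N, K, K) integer class per pixel; fcn output: (N, num_classes, K, K)."""
    logits = fcn(generated_rgb)
    return F.cross_entropy(logits, normal_labels)  # averaged over the K x K pixels

def style_generator_loss(style_G, style_D, fcn, normal_maps, normal_labels, z_tilde):
    """Eq. (6): adversarial term (eq. 4) plus the pixel-wise FCN constraint (eq. 5)."""
    generated_rgb = style_G(normal_maps, z_tilde)
    fake_logits = style_D(generated_rgb, normal_maps)
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    return adv + fcn_pixelwise_loss(fcn, generated_rgb, normal_labels)
```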

Training the Style-GAN requires three steps per iteration, compared to the two steps of a standard generative adversarial network (a sketch of one such iteration follows the list):

i. Fix the generator G and optimize the discriminator D with eq. (3).

ii. Fix the FCN and the discriminator D, and optimize the generator G with eq. (6).

iii. Fix the generator G and fine-tune the FCN using generated and real images.
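A minimal sketch of one such iteration as an alternating optimisation loop, reusing the hedged loss helpers from the previous sketch; the data batch, optimizers and module names are illustrative placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def style_gan_step(batch, style_G, style_D, fcn, d_opt, g_opt, fcn_opt):
    """One Style-GAN training iteration following steps (i)-(iii) above (sketch)."""
    real_rgb, normal_maps, normal_labels, z_tilde = batch

    # (i) fix G, optimise D with eq. (3)
    d_opt.zero_grad()
    fake_rgb = style_G(normal_maps, z_tilde).detach()          # G is frozen for this step
    real_logits = style_D(real_rgb, normal_maps)
    fake_logits = style_D(fake_rgb, normal_maps)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_loss.backward()
    d_opt.step()

    # (ii) fix D and the FCN, optimise G with eq. (6)
    g_opt.zero_grad()
    g_loss = style_generator_loss(style_G, style_D, fcn, normal_maps, normal_labels, z_tilde)
    g_loss.backward()
    g_opt.step()

    # (iii) fix G, fine-tune the FCN on both generated and real images
    fcn_opt.zero_grad()
    fcn_loss = (fcn_pixelwise_loss(fcn, style_G(normal_maps, z_tilde).detach(), normal_labels)
                + fcn_pixelwise_loss(fcn, real_rgb, normal_labels))
    fcn_loss.backward()
    fcn_opt.step()
```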

Joint Learning for S2-GAN

The Structure-GAN and the Style-GAN are first trained independently and then trained jointly, as shown in Fig 4.

In the joint architecture, the Style-GAN discriminator takes a generated normal map and the corresponding generated image as a negative sample, and a ground truth normal map with its real image as a positive sample.

In addition, for better RGB image generation, the Structure-GAN generator receives gradients not only from its own discriminator but also gradients passed back through the Style-GAN.

The loss function for the Structure-GAN generator under joint training can be written by combining eq. (2) and eq. (4) as follows:

L^{G}_{joint}(\hat{z}, \tilde{z}) = L^{G}_{struct}(\hat{z}) + \lambda \, L^{G}_{style}(\tilde{z})                (7)

where ẑ and z̃ represent the two sets of samples drawn from the uniform distribution for the Structure-GAN and the Style-GAN respectively;

L^{G}_{struct} is the adversarial loss from the discriminator of the Structure-GAN (eq. 2);

L^{G}_{style} is the adversarial loss from the discriminator of the Style-GAN (eq. 4);

λ is a small weighting factor that prevents the generated normals from overfitting to the task of generating RGB images via the Style-GAN.
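A sketch of this joint objective for the Structure-GAN generator: its own adversarial term plus the loss back-propagated through the Style-GAN, weighted by λ. The module names and the λ value (0.1) are placeholders and assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_structure_generator_loss(structure_G, structure_D, style_G, style_D,
                                   z_hat, z_tilde, lam=0.1):
    """Structure-GAN generator loss under joint training: eq. (2) term plus lam * eq. (4) term.
    lam is a small weight (value assumed) so RGB generation does not dominate the normals."""
    normals = structure_G(z_hat)                        # generated surface normal maps
    # adversarial term from the Structure-GAN discriminator
    struct_logits = structure_D(normals)
    struct_loss = F.binary_cross_entropy_with_logits(
        struct_logits, torch.ones_like(struct_logits))
    # adversarial term back-propagated through the Style-GAN, conditioned on the
    # generated normals so that gradients also reach structure_G
    rgb = style_G(normals, z_tilde)
    style_logits = style_D(rgb, normals)
    style_loss = F.binary_cross_entropy_with_logits(
        style_logits, torch.ones_like(style_logits))
    return struct_loss + lam * style_loss
```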

Experiments

Qualitative and quantitative evaluations of the quality of the generated images and of the learned unsupervised representation were conducted to show how reliable the S2-GAN model is. The S2-GAN model was tested on the NYUv2 dataset [10] alongside baseline models such as DCGAN, DCGAN+LAPGAN, DCGANv2 and DCGANv2+LAPGAN, and produced convincing results. The model was trained using the Adam optimizer with momentum terms β1 = 0.5 and β2 = 0.999 and a batch size of M = 128 [3]. The learning rate was set to 0.0002, and both the Structure-GAN and the Style-GAN were trained for 25 epochs, with 5 additional epochs used to fine-tune the FCN. For joint learning, separate, reduced learning rates were used for the Style-GAN and the Structure-GAN over 5 epochs of training. Figures 5 to 10 show the results of the experiments. Fig 11 presents the results when the SUN RGB-D dataset is used to compare the S2-GAN model with other methods on scene classification and object detection. Table 2 shows detection results on the NYU test set.
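The optimizer settings quoted above map directly onto PyTorch. Only the hyper-parameters (β1 = 0.5, β2 = 0.999, learning rate 0.0002, batch size 128) come from the text; the module names refer back to the earlier sketches and are otherwise placeholders.

```python
import torch

lr, betas, batch_size = 2e-4, (0.5, 0.999), 128     # values reported above

structure_G = StructureGenerator()                   # generator sketched after Table 1
opt_structure_G = torch.optim.Adam(structure_G.parameters(), lr=lr, betas=betas)
# structure_D, style_G, style_D and the FCN each get their own Adam optimizer with the
# same settings; 25 epochs of separate training precede 5 epochs of joint training.
```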

Fig 5: Left: four generated surface normal maps. Right: two pairs of rendering results on ground truth surface normal maps using the Style-GAN without the pixel-wise constraint.

Fig 6: Results of the Style-GAN conditioned on ground truth surface normals (first 3 rows) and synthetic scenes (last 2 rows).

Fig 7: Comparison between models with and without the pixel-wise constraint.

Fig 8: Result comparison: (a) S2-GAN, (b) DCGAN, (c) DCGAN+LAPGAN.

 

Fig 9: Walking the latent space: results when fixing the style and changing the structure, and vice versa.

Fig 10: Nearest-neighbour test on generated images.

 

 

 

Fig 11: (a) Maximum norm of classification results on generated images. (b) Number of fires over different thresholds for object detection on generated images. (c) Scene classification on SUN RGB-D with S2-GAN and other methods (no fine-tuning).

 

Table 2: Detection results on NYU test set

Conclusion

The S2-GAN approach to generative image modeling has shown reliable and realistic results in image reconstruction compared to some of the state-of-the-art approaches used as baselines in this research work.

 

Evaluation of the S2-Generative Adversarial Network against Other Models

 

3D Image Reconstruction

Though the architecture of this work is sound, the researchers did justice to the aims and objectives of the research, and the work lays emphasis on the basic principles of image formation, specifically the structure of the 3D model and the texture mapped onto that structure, the results are not 3D images but 2D images. The researchers extend the work of Goodfellow et al. (2014) [2] by simplifying the image generation process, factoring it into the generation of 3D structure and style; this gives them the opportunity to leverage depth estimation to reconstruct 3D images, yet the architecture only produces 2D images.

 

Liu et al. (2015) [11] proposed a deep convolutional neural network for depth estimation from a single image. The researchers developed a deep convolutional neural field model which estimates depth from a single image. The work jointly explores the capacity of a deep CNN and a continuous CRF in determining the depth of an image, with no geometric prior and no injection of extra information, which is similar in spirit to the S2-Generative Adversarial Network approach. Both approaches work on single images using the same NYUv2 dataset. Liu et al. were able to perform 3D image reconstruction without laying much emphasis on the basic principles of image formation.

 

 

Scene Classification

Liu et al. (2015) [11] and S2-GAN were compared on scene classification, although the two approaches use different datasets with depth values. Considering the sophisticated nature of the S2-GAN architecture and the features used to train the SVM classifier, the result of Liu et al. (0.86) is far better than the S2-GAN result (0.35).

 

 

A New Approach to 3D Reconstruction using the S2-Generative Adversarial Network

The S2-Generative Adversarial Network architecture is a sophisticated approach to generative image modeling that deserves to be explored fully. In view of its importance for 3D reconstruction systems, it could generate good and reliable results if the model were applied to multiple images.

 

I would therefore restate the problem as: Generative Image Modeling using Style and Structure Adversarial Networks on Multiple Images.

 

The reconstruction problem can then be viewed as passing multiple images into an S2-Generative Adversarial Network model to generate a 3D reconstructed image.

 

With this, an experiment can be carried out on the S2-GAN architecture using any 3D dataset. The output would be a 3D reconstructed image, with emphasis on both the structure and the texture of the image.

 

References

[1] Wang, X. and Gupta, A. (2016). Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision (pp. 318-335). Springer International Publishing.

[2] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672-2680).

[3] Radford, A., Metz, L. and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

[4] Maas, A. L., Hannun, A. Y. and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML (Vol. 30, No. 1).

[5] Xu, B., Wang, N., Chen, T. and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.

[6] Denton, E. L., Chintala, S. and Fergus, R. (2015). Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems (pp. 1486-1494).

[7] Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

[8] Wang, X., Fouhey, D. and Gupta, A. (2015). Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 539-547).

[9] Mathieu, M., Couprie, C. and LeCun, Y. (2015). Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.

[10] Silberman, N., Hoiem, D., Kohli, P. and Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. Computer Vision - ECCV 2012, 746-760.

[11] Liu, F., Shen, C. and Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5162-5170).

 

 

 
