Link to the paper
Contribution
- This paper introduces a new framework called 3D-GAN, which generates 3D objects from a probabilistic latent space using volumetric convolutions and generative adversarial networks (GANs).
Background
- Generative Adversarial Nets (GAN): A generator network learns to produce samples that a discriminator network cannot distinguish from real data; the two are trained adversarially.
- Variational Autoencoder: An autoencoder is a pair of connected networks, an encoder and a decoder. The encoder compresses an input into a smaller, dense latent representation, from which the decoder reconstructs the original input. A variational autoencoder additionally makes the encoder output a distribution (typically Gaussian) over the latent space, regularized toward a prior, so that new samples can be generated by sampling latent vectors.
- Volumetric Convolutions: Convolution layers that slide a 3D kernel over 3D (voxel) input.
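To make the volumetric-convolution idea concrete, here is a minimal NumPy sketch of a single-channel "valid" 3D convolution over a voxel grid (a naive loop, not the optimized layers an actual deep-learning framework would use):

```python
import numpy as np

def conv3d(volume, kernel):
    """Naive 'valid' volumetric convolution: slide a 3D kernel
    over a voxel grid and sum elementwise products.
    volume: (D, H, W) array; kernel: (kd, kh, kw) array."""
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                patch = volume[i:i + kd, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(patch * kernel)
    return out

# A 4x4x4 volume convolved with a 2x2x2 kernel yields a 3x3x3 output.
voxels = np.ones((4, 4, 4))
result = conv3d(voxels, np.ones((2, 2, 2)))
```

Real volumetric layers add input/output channels, strides, and padding on top of this same sliding-window operation.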
Description
- 3D object understanding and generation is an important problem in the graphics and vision community.
- With the help of adversarial training, the generator captures the object structure implicitly and synthesizes high-quality 3D objects.
- The generator maps a low-dimensional probabilistic latent space to the space of 3D objects, so 3D objects can be sampled without reference images or CAD models.
- This network, when combined with a variational autoencoder, can directly reconstruct a 3D object from a 2D image.
- The discriminator provides a powerful 3D shape descriptor which, learned without supervision, has wide applications in 3D object recognition.
Methodology
- 3D-GAN
- In 3D-GAN, a 200-dimensional latent vector z, randomly sampled from a probabilistic latent space, is mapped by the generator G to a 64 × 64 × 64 cube representing an object G(z) in 3D voxel space.
- The discriminator D takes a 3D object x as input and outputs a confidence value D(x) indicating whether the input is real or synthetic.
- Binary cross entropy is used as the loss function.
- The discriminator usually learns faster, and this makes it hard for the generator to improve, as all samples it generates are correctly identified as synthetic with high confidence.
- Therefore, to keep the training of both networks in pace, for each batch, the discriminator is updated only if its accuracy in the previous batch is less than 80%.
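The two pieces above, the binary cross-entropy loss and the 80% accuracy gate on discriminator updates, can be sketched as follows (a minimal NumPy illustration of the training rule, not the paper's full training loop):

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross entropy over discriminator confidences in (0, 1)."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def should_update_discriminator(prev_batch_accuracy, threshold=0.8):
    """3D-GAN's adaptive training rule: skip the discriminator update
    when it classified 80% or more of the previous batch correctly,
    giving the slower-learning generator room to catch up."""
    return prev_batch_accuracy < threshold

# D was 90% accurate on the previous batch -> hold its weights fixed.
update_d = should_update_discriminator(0.9)
# Loss for a real sample (target 1) predicted with confidence 0.9.
loss = bce_loss(np.array([0.9]), np.array([1.0]))
```

In a full training loop, `should_update_discriminator` would simply gate the discriminator's optimizer step each batch while the generator is always updated.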
- 3D-VAE-GAN
- The 3D-VAE-GAN consists of three components: an image encoder E, a generator G, and a discriminator D.
- The image encoder E takes 2D image as input and outputs the latent representation vector z.
- Further operations are similar to that of 3D-GAN.
- The loss function consists of three parts: an object reconstruction loss Lrecon, a cross entropy loss L3D-GAN, and a KL divergence loss LKL.
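The three-part loss can be sketched as a weighted sum. The closed-form KL term below assumes a Gaussian encoder against a standard normal prior, and the weights `a1` and `a2` are illustrative placeholders, not values taken from this summary:

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Closed-form KL divergence between the encoder's Gaussian
    q(z|x) = N(mu, sigma^2) and the standard normal prior N(0, I)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def vae_gan_loss(l_3dgan, l_kl, l_recon, a1=1.0, a2=1.0):
    """Total 3D-VAE-GAN objective: L = L3D-GAN + a1*LKL + a2*Lrecon.
    The weights a1, a2 are hypothetical; the paper tunes such
    coefficients to balance the three terms."""
    return l_3dgan + a1 * l_kl + a2 * l_recon

# Encoder output matching the prior exactly contributes zero KL loss.
kl = kl_divergence(np.zeros(3), np.zeros(3))
total = vae_gan_loss(1.0, kl, 0.5, a1=1.0, a2=2.0)
```

At inference time only the encoder and generator are kept: a 2D image is encoded to z and decoded to a voxel grid, which is why the reconstruction and KL terms act on E while the cross-entropy term trains G and D.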
Areas of Application
- 3D Object Generation
- 3D Object Classification
- Single Image 3D Reconstruction
Related Papers
Reference