Robots are currently deployed in industries where repetitive manual work causes injuries or long-term illness among labourers. Robots that can collaborate with people, however, require far more intelligence. Artificial intelligence, a rapidly growing field, has shown promising results in natural language processing (NLP), computer vision and reinforcement learning. Yet exploring the creative and artistic world, starting from nothing and arriving at something meaningful, was far from reality before the introduction of the generative adversarial network (GAN).
A GAN consists of two deep networks: a generator, which creates candidate samples, and a discriminator, which acts as a critic. We first sample noise from a multivariate normal distribution and feed it to the generator network, which turns it into an image. We do not control the semantic meaning of this noise; intuitively, it represents the latent features of the images. In image GANs, the generator is typically a stack of transposed convolutions that upsample the noise input into an image. On its own, however, an untrained generator can only produce random noise as output. This is where the discriminator comes into play: its feedback is what guides the generator towards realistic images.
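To make the upsampling step concrete, here is a minimal NumPy sketch of a single-channel transposed convolution applied to latent noise. It is an illustration under stated assumptions, not a real generator: the function name `transposed_conv2d`, the 4x4 latent grid, the square 4x4 kernel and the stride of 2 are all hypothetical choices, and the random kernel stands in for learned weights.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Naive single-channel transposed convolution with a square kernel, no padding."""
    h, w = x.shape
    k = kernel.shape[0]
    # Each input pixel "stamps" a scaled copy of the kernel onto a larger canvas.
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 4))             # latent noise, arranged as a 4x4 grid (assumption)
kernel = 0.1 * rng.standard_normal((4, 4))  # stand-in for learned weights (assumption)
image = transposed_conv2d(z, kernel, stride=2)
print(image.shape)  # (10, 10), since (4 - 1) * 2 + 4 = 10
```

A real generator stacks several such layers with many channels and learns the kernels during training; the sketch only shows how a small noise grid grows into a larger image.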
For example, suppose we use a GAN to generate portraits painted in the style of the famous artist Leonardo da Vinci. The generator learns to output portrait paintings from input noise. The discriminator, on the other hand, looks at both real portraits by Leonardo da Vinci and generated images, and learns which features distinguish the real paintings from the generated ones. The discriminator tells the generator whether each image it receives appears to be a generated fake or a real portrait, and we train the two networks alternately until the generator can fool the discriminator into accepting a generated image as a portrait by Leonardo da Vinci.
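The feedback loop described above is usually expressed as two loss functions. The sketch below assumes the discriminator outputs a probability that an image is real and uses binary cross-entropy with the common non-saturating generator loss; the function names and the example values are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def bce(probs, target):
    """Binary cross-entropy between predicted probabilities and a constant target (0 or 1)."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs)))

def discriminator_loss(d_real, d_fake):
    # The critic should score real paintings as 1 and generated images as 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # The generator is rewarded when the critic scores its images as real (1).
    return bce(d_fake, 1.0)

# Hypothetical discriminator outputs for a batch of 8 images.
undecided = np.full(8, 0.5)
print(discriminator_loss(undecided, undecided))  # 2 * ln(2), about 1.386
print(generator_loss(undecided))                 # ln(2), about 0.693
```

In alternating training, one step updates the discriminator to lower `discriminator_loss` and the next updates the generator to lower `generator_loss`, with the other network held fixed each time.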
Deep Convolutional Generative Adversarial Network (DCGAN) is one of the best-known GAN variants, with two models trained simultaneously through an adversarial process. The DCGAN generator takes random noise as input and upsamples it to the desired image size; in the variant described here, each layer uses a LeakyReLU activation except the output layer, which uses tanh (the original DCGAN paper uses ReLU in the generator and LeakyReLU in the discriminator). The discriminator, on the other hand, is a CNN-based image classifier that separates real images from fake ones.
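The two activations mentioned here can be sketched in a few lines. Because tanh outputs values in [-1, 1], training images are commonly rescaled to that range so that real and generated images are comparable; the helper names and the LeakyReLU slope of 0.2 below are assumptions chosen for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """LeakyReLU: pass positives through, scale negatives by alpha (0.2 is an assumed slope)."""
    return np.where(x > 0, x, alpha * x)

def to_tanh_range(pixels):
    """Rescale 8-bit pixels from [0, 255] to [-1, 1] to match a tanh output layer."""
    return pixels.astype(np.float64) / 127.5 - 1.0

def from_tanh_range(x):
    """Map generator outputs in [-1, 1] back to 8-bit pixels."""
    return np.clip((x + 1.0) * 127.5, 0, 255).round().astype(np.uint8)

hidden = leaky_relu(np.array([-2.0, 0.0, 3.0]))  # negatives shrink instead of vanishing
output = np.tanh(np.array([-5.0, 0.0, 5.0]))     # squashed into (-1, 1)
pixels = from_tanh_range(output)                 # displayable 8-bit values
print(hidden, output, pixels)
```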
Pix2pix, a conditional generative adversarial network (cGAN), learns a mapping between two image domains by conditioning on an input image and generating the corresponding output image. The generator is a U-Net-based architecture, while the discriminator is a convolutional PatchGAN classifier. This idea is especially interesting in the robotics domain because it applies to a wide range of tasks, including synthesising images from label maps, colourising grayscale images, mapping images taken under different weather conditions across datasets to obtain robust reconstructions, privacy-preserving applications, generating aerial images from maps, and enabling a robotic arm to create pictures from sketches.
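A PatchGAN discriminator scores a grid of image patches rather than the whole image at once. The sketch below shows how a generator objective could combine that patch-wise adversarial term with an L1 reconstruction term, as in the pix2pix paper (Isola et al., 2017). The function names, the 30x30 patch grid, the 256x256 image size and the L1 weight of 100 are assumptions used for illustration.

```python
import numpy as np

def patch_bce(patch_probs, target):
    """Binary cross-entropy averaged over every patch decision in the grid."""
    p = np.clip(patch_probs, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def pix2pix_generator_loss(patch_probs_fake, generated, target_image, l1_weight=100.0):
    # Adversarial term: every patch of the generated image should look real.
    adversarial = patch_bce(patch_probs_fake, 1.0)
    # Reconstruction term: the output should stay close to the paired target image.
    l1 = float(np.mean(np.abs(target_image - generated)))
    return adversarial + l1_weight * l1

# Hypothetical values: an undecided 30x30 patch grid and a slightly-off 256x256 RGB output.
patch_probs = np.full((30, 30), 0.5)
generated = np.zeros((256, 256, 3))
target = np.full((256, 256, 3), 0.1)
print(pix2pix_generator_loss(patch_probs, generated, target))
```

The L1 term is only available because pix2pix trains on paired examples, where each input image has a known target image.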
Cycle-Consistent Generative Adversarial Network (CycleGAN) is another GAN variant, similar to pix2pix but with an additional loss function and the use of unpaired training data. CycleGAN introduces unpaired image-to-image translation: it learns the features of one image domain and maps them to another image domain without any paired training examples, using a cycle-consistency loss to make training possible. Because it needs only a source dataset and a target dataset rather than a one-to-one mapping, it enables not only imaging applications such as photo enhancement, image colourisation and style transfer but also robotics applications.
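The cycle-consistency idea can be sketched as follows: translating an image to the other domain and back should return the original image. Here `G` and `F` are hypothetical stand-ins for the two generators (simple arithmetic functions rather than networks), and the weight of 10 is an assumed value.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F, weight=10.0):
    """L1 penalty for failing to recover x after X -> Y -> X and y after Y -> X -> Y."""
    forward = np.mean(np.abs(F(G(x)) - x))   # x -> G(x) -> F(G(x)) should equal x
    backward = np.mean(np.abs(G(F(y)) - y))  # y -> F(y) -> G(F(y)) should equal y
    return float(weight * (forward + backward))

# Toy "generators": G maps domain X to Y; F maps Y back to X.
def G(img):
    return img * 2.0 + 1.0

def F_exact(img):   # exact inverse of G
    return (img - 1.0) / 2.0

def F_poor(img):    # forgets the offset, so cycles do not close
    return img / 2.0

rng = np.random.default_rng(0)
x = rng.random((8, 8))  # sample from the source domain
y = rng.random((8, 8))  # unpaired sample from the target domain

print(cycle_consistency_loss(x, y, G, F_exact))  # close to 0
print(cycle_consistency_loss(x, y, G, F_poor))   # close to 15
```

Note that `x` and `y` are never compared with each other, which is why no paired examples are needed.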
While the GAN is a more general model, other approaches use a similar set of processing steps to obtain interesting results in the imaging domain. Neural style transfer is one of the best-known examples: it takes two input images, a content image and a style reference image, and renders the content image in the style of the reference. This enables new kinds of image editing and filters for social media applications such as Instagram and Snapchat. The method is a pure optimisation technique that adjusts the output image until its statistics match those of the content image and the style reference image, as extracted from the intermediate layers of a convolutional network.
The activations of the first few layers after the input represent low-level features such as edges and textures, while the final few layers represent high-level features such as object parts. Neural style transfer uses VGG19, a pretrained image classification network, so intuitively the network's intermediate layers capture an understanding of the image. The content of an image is represented by the values of the intermediate feature maps, whereas its style can be described by the means and correlations across different feature maps, which are computed as a Gram matrix.
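As a rough illustration of the Gram matrix, the sketch below computes channel-by-channel correlations of a feature map and a simple style loss between two feature maps. The random array stands in for VGG19 activations, and the function names and the normalisation by the number of spatial positions are assumptions.

```python
import numpy as np

def gram_matrix(features):
    """Correlations between channels of a (height, width, channels) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)  # one row per spatial position
    return flat.T @ flat / (h * w)     # (channels, channels), averaged over positions

def style_loss(features_a, features_b):
    """Mean squared difference between the Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(features_a) - gram_matrix(features_b)) ** 2))

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 4, 3))  # stand-in for one intermediate layer's output

# Shuffling spatial positions changes the content layout but not the Gram matrix.
shuffled = rng.permutation(features.reshape(16, 3)).reshape(4, 4, 3)

print(gram_matrix(features).shape)     # (3, 3)
print(style_loss(features, shuffled))  # close to 0
```

The shuffle example shows why the Gram matrix captures style rather than content: it records which features occur together, not where they occur.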
Deep Dream is not a GAN but an experiment that visualises the patterns learned by a neural network by over-interpreting and enhancing them. It forwards an image through the network (InceptionV3 here), calculates the gradient of the activations of a chosen layer with respect to the image, and then modifies the image to increase those activations, which amplifies the patterns the network sees. Concretely, one or more of the 11 concatenation layers of the architecture are chosen and the loss, computed from their activations, is maximised; the features enhanced in the dream image vary depending on which layers are chosen.
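The gradient-ascent loop can be sketched without a deep-learning framework by replacing InceptionV3 with a toy layer. Everything below is an assumption made for illustration: a single random linear layer with ReLU stands in for the chosen network layer, its gradient is written out by hand, and the step count and step size are arbitrary.

```python
import numpy as np

def loss_and_grad(image, weights):
    """Mean ReLU activation of a toy layer, and its gradient with respect to the image."""
    pre = weights @ image.ravel()
    loss = np.maximum(pre, 0.0).mean()
    # d(mean(relu(W x)))/dx = W^T 1[pre > 0] / n
    grad = weights.T @ (pre > 0).astype(float) / pre.size
    return float(loss), grad.reshape(image.shape)

def dream(image, weights, steps=20, step_size=0.05):
    """Gradient ascent on the image: nudge pixels so the layer activates more strongly."""
    image = image.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(image, weights)
        grad /= np.abs(grad).mean() + 1e-8  # normalise so the step size is comparable
        image += step_size * grad           # ascend: increase the activations
    return image

rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 64))  # toy stand-in for a pretrained layer
image = rng.standard_normal((8, 8))      # toy stand-in for the input photo

before, _ = loss_and_grad(image, weights)
dreamed = dream(image, weights)
after, _ = loss_and_grad(dreamed, weights)
print(before, after)
```

Unlike ordinary training, the weights stay fixed and only the image changes, so the loop amplifies whatever the layer already responds to.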
The use of GANs in robotic imaging seems inevitable, because many of the interesting ideas in computer science revolve around robotic applications. Self-driving cars currently face difficulties caused by changing weather conditions. A tree in spring does not look like a tree in winter, which can confuse the navigation of a robotic system. Similarly, lighting conditions, shadows and visual challenges such as noise and blur introduce artefacts, which a GAN can remove by mapping all images to one specific condition. GANs could also be used to map a city with its best features, or to find statistically plausible unknowns in places we wonder about, such as space and underwater. Together, GANs and robotics form a powerful tool for the future.
- Gatys, L.A., Ecker, A.S. and Bethge, M., 2016. Image style transfer using convolutional neural networks. Computer Vision and Pattern Recognition, pp. 2414-2423.
- Isola, P., Zhu, J.Y., Zhou, T. and Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks. Computer Vision and Pattern Recognition, pp. 1125-1134.
- Radford, A., Metz, L. and Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- Zhu, J.Y., Park, T., Isola, P. and Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. International Conference on Computer Vision, pp. 2223-2232.