In this post we are going to build a deep learning Java application using deeplearning4j for the purpose of generating art. Besides being an attractive and fascinating topic, neural style transfer gives great insight into what deep convolutional layers are learning. Feel free to run the application and try it with your own images.
What Is Neural Style Transfer?
Neural style transfer is the process of creating a new image by mixing two images together. Let's suppose we have the two images below:
and the generated art image will look like this:
Since we like the art in the image on the right, we would like to transfer that style onto our own memory photos. Of course, we would prefer to preserve the photos' content as much as possible and at the same time transform them according to the style of the art image. This may look like:
We need to find a way to capture the content and style image features so we can mix them together in a way that the output looks pleasing to the eye.
Deep convolutional neural networks like VGG-16 already capture these features in a way, considering that they are able to classify/recognize a large variety of images (millions) with quite high accuracy. We just need to look deeper into the neural layers and understand or visualize what they are doing.
What Are Convolutional Networks Learning?
A great paper already offers the insight: Zeiler and Fergus, 2013, Visualizing and Understanding Convolutional Networks. They developed quite a sophisticated way to visualize internal layers by using deconvolutional networks and other specific methods. Here we will focus only on the high-level intuition of what neural layers are doing.
Let's first bring into focus the VGG-16 architecture we saw in the Cat Image Recognition Application:
While training with images, let's suppose we pick the first layer and start monitoring the activation values of some of its units/neurons (usually 9 to 12 of them). For each of the chosen units, we pick the 9 maximum activation values. For each of these 9 values we visualize the image patch that caused that activation to be maximal; in a few words, the part of the image that makes that neuron fire the largest values.
Since we are just in the first layer, the units capture only a small part of the images and rather low-level features, as below:
It looks like the 1st neuron is interested in diagonal lines, while the 3rd and 4th respond to vertical and diagonal lines, and the 8th clearly likes the color green. Notice that all of these are really small parts of the images; the layer is capturing rather low-level features.
Let's move a bit deeper and look at the 2nd layer:
The neurons of this layer start to detect more features: the 2nd detects thin vertical lines, the 6th and 7th start capturing round shapes, and the 14th is obsessed with the color yellow.
Deeper into the 3rd layer:
This layer certainly starts to detect more interesting things: the 6th neuron is most activated by round shapes that look like tires, the 10th is not easy to explain but likes orange and round shapes, while the 11th even starts detecting some humans.
Even deeper… into layers 4 and 5:
The deeper we go, the bigger the part of the image the neurons detect, therefore capturing high-level features (the second neuron in layer 5 is really into dogs), in contrast to the lower layers, which capture rather small parts of the image.
This gives great intuition about what deep convolutional layers are learning, and coming back to our style transfer, it also gives us the insight for how to generate art while keeping the content from two images. We just need to generate a new image which, when fed to the neural network as input, produces more or less the same activation values as the content (photo) and style (art painting) images.
Transfer Learning
One great thing about deep learning in general is that it is highly portable between applications and even between different programming languages and frameworks. The reason is simply that what a deep learning algorithm produces is just weights, which are plain decimal values and can easily be transported and imported into different environments.
Anyway, for our case we are going to use the VGG-16 architecture pre-trained on ImageNet. Usually VGG-19 is used, but unfortunately it turns out to be too slow on a CPU; on a GPU it may do better. Below is the Java code:
private ComputationGraph loadModel() throws IOException {
    ZooModel zooModel = new VGG16();
    ComputationGraph vgg16 = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
    vgg16.initGradientsView();
    log.info(vgg16.summary());
    return vgg16;
}
Load Images
At the beginning we have only the content image and the style image, so the combined image starts out as a rather noisy image. Loading the images is a fairly easy task:
private static final DataNormalization IMAGE_PRE_PROCESSOR = new VGG16ImagePreProcessor();
private static final NativeImageLoader LOADER = new NativeImageLoader(HEIGHT, WIDTH, CHANNELS);

INDArray content = loadImage(CONTENT_FILE);
INDArray style = loadImage(STYLE_FILE);

private INDArray loadImage(String contentFile) throws IOException {
    INDArray content = LOADER.asMatrix(new ClassPathResource(contentFile).getFile());
    IMAGE_PRE_PROCESSOR.transform(content);
    return content;
}
Please note that after loading the pixels we normalize them (IMAGE_PRE_PROCESSOR) with the mean values of all images used during training of VGG-16 on the ImageNet dataset. Normalization helps speed up training and is something that is more or less always done.
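For intuition, this preprocessing roughly boils down to subtracting a per-channel mean from every pixel. The sketch below is only an illustration; the channel order and the mean values are assumptions on my part, and in the application this work is done by VGG16ImagePreProcessor:

// Illustration only: what VGG-style mean normalization boils down to.
// The mean values and channel order here are assumptions; the application
// relies on VGG16ImagePreProcessor instead. Shape assumed: [1, CHANNELS, HEIGHT, WIDTH].
double[] assumedChannelMeans = {123.68, 116.779, 103.939};
for (int c = 0; c < CHANNELS; c++) {
    content.get(NDArrayIndex.point(0), NDArrayIndex.point(c), NDArrayIndex.all(), NDArrayIndex.all())
           .subi(assumedChannelMeans[c]);
}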
Now it is time to generate a noisy image:
private INDArray createCombinationImage() throws IOException {
    INDArray content = LOADER.asMatrix(new ClassPathResource(CONTENT_FILE).getFile());
    IMAGE_PRE_PROCESSOR.transform(content);
    INDArray combination = createCombineImageWithRandomPixels();
    combination.muli(NOISE_RATION).addi(content.muli(1.0 - NOISE_RATION));
    return combination;
}
As we can see from the code, the combined image is not pure noise; part of it is taken from the content image (NOISE_RATION controls the percentage). The idea is taken from this TensorFlow implementation and is done to speed up training, therefore getting good results faster. The algorithm would eventually produce more or less the same result starting from a pure noise image, but it would take longer and need more iterations.
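The createCombineImageWithRandomPixels() helper is not shown above; a minimal sketch of what it could look like is below. The exact scaling used in the original implementation may differ:

// Hypothetical sketch: random pixels with the same shape as the VGG-16 input.
// The scaling is an assumption, chosen to roughly match the range of the
// mean-subtracted pixel values.
private INDArray createCombineImageWithRandomPixels() {
    return Nd4j.rand(new int[]{1, CHANNELS, HEIGHT, WIDTH}).subi(0.5).muli(256.0);
}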
Content Image Cost Function
As we mentioned earlier, we will use the intermediate-layer activation values produced by the neural network as a metric of how similar two images are. First, let's get those layer activations by doing a forward pass for the content and combined images through the pre-trained VGG-16 model:
Map<String, INDArray> activationsContentMap = vgg16FineTune.feedForward(content, true);
Map<String, INDArray> activationsCombMap = vgg16FineTune.feedForward(combination, true);
Now, for each image, we have a map with the layer name as key and the activation values on that layer as value. We will choose a deep layer (conv4_2) for our content cost function because we want to capture features that are as high level as possible. The reason we choose a deep layer is that we would like the combined (generated) image to retain the look and shape of the content. At the same time, we choose only one layer because we don't want the combined image to look exactly like the content, but rather to leave some room for the art.
Once we have activation’s for chosen layer for both images content and combine is time to compare them together and see how similar they are. In order to measure their similarity we will use their squared difference divided by activation dimensions as described by this paper:
F_ij denotes the combined image's layer activation values and P_ij the content image's layer activation values. Basically, it is just the squared Euclidean distance between the two activations in that particular layer.
Ideally, we want this difference to be zero; in a few words, to minimize as much as possible the difference between the images' features on that layer. In this way we transfer the features captured by that layer from the content image to the combined image.
The Java implementation of the cost function looks like this:
public double contentLoss(INDArray combActivations, INDArray contentActivations) {
    return sumOfSquaredErrors(contentActivations, combActivations)
            / (4.0 * CHANNELS * WIDTH * HEIGHT);
}

public double sumOfSquaredErrors(INDArray a, INDArray b) {
    INDArray diff = a.sub(b);                    // element-wise difference
    INDArray squares = Transforms.pow(diff, 2);  // element-wise squaring
    return squares.sumNumber().doubleValue();
}
The only non-essential difference from the mathematical formula in the paper is that we divide by the activation dimensions rather than by 2.
Style Image Cost Function
The approach for the style image is quite similar to the content image, in that we still use the difference of the neural layers' activation values as a measure of image similarity. However, the style cost function differs in how the activation values are processed and calculated.
Recalling the previous convolution layers post and the cat recognition application, a typical convolution operation results in an output with several channels (3rd dimension) besides height and width (e.g. 16 x 20 x 356, w x h x c). Usually convolution shrinks width and height and increases the number of channels.
Style is defined as the correlation between the units of different channels in a specific chosen layer. E.g. if we have a layer with shape 12x12x32 and we pick the 10th channel, then all 12x12=144 units of the 10th channel are correlated with all 144 units of each of the other channels (1, 2, 3, …, 9, 11, 12, …, 32).
Mathematically, this is called the Gram matrix (G), and it is calculated as the multiplication of the unit values across channels in a layer. If the values are almost the same, the Gram matrix outputs a big value, in contrast to when the values are completely different. So the Gram matrix captures how related different channels are to each other (similar to the intuition behind correlation). From the paper it looks like below:
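Reconstructed in the notation described just below (k* is the fixed channel, k iterates over the other channels, and i and j index the units within a channel), a Gram matrix entry is:

G^{l}_{k k^{*}} = \sum_{i,j} F^{l}_{ijk} \, F^{l}_{ijk^{*}}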
Here l is the chosen layer, k is an index that iterates over the channels in the layer (notice that k* is not iterating, because it is the channel we compare with all the other channels), and i and j refer to the units within a channel.
The Java implementation looks like this:
public double styleLoss(INDArray style, INDArray combination) {
    INDArray s = gramMatrix(style);
    INDArray c = gramMatrix(combination);
    int[] shape = style.shape();
    int N = shape[0];
    int M = shape[1] * shape[2];
    return sumOfSquaredErrors(s, c) / (4.0 * (N * N) * (M * M));
}

public INDArray gramMatrix(INDArray x) {
    INDArray flattened = flatten(x);
    INDArray gram = flattened.mmul(flattened.transpose());
    return gram;
}
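The flatten helper used above is not shown; a minimal sketch, assuming the batch dimension has already been dropped so that the activations have shape [channels, height, width], could look like this:

// Hypothetical sketch of the flatten helper: reshape [channels, height, width]
// activations into a [channels, height * width] matrix, so that the Gram matrix
// computed above ends up with shape [channels, channels].
private INDArray flatten(INDArray x) {
    int[] shape = x.shape();
    return x.reshape(shape[0], shape[1] * shape[2]);
}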
Once we have the Gram matrices, we do the same as for the content: calculate the squared (Euclidean) difference between the Gram matrices of the combined and style images' activation values.
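From the paper, the contribution of a single layer l to the style cost is (with N_l the number of channels and M_l the number of units per channel, matching N and M in the code above):

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}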
G^l_ij denotes the combined image's Gram values and A^l_ij the style image's Gram values at a specific layer l.
There is one last detail about the style cost function: usually, choosing more than one layer gives better results. So for the final style cost function we are going to choose a handful of layers (five in the code below) and add their weighted contributions together:
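The combined style cost is then simply the weighted sum over the chosen layers:

L_{style}(\vec{a}, \vec{x}) = \sum_{l} w_{l} E_{l}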
E_l is just the single-layer style equation from above, and w_l denotes a weight per layer, so we control the impact or contribution of each layer. We may want lower layers to contribute less than upper layers, but still include them.
Finally, in Java it looks like this:
private static final String[] STYLE_LAYERS = new String[]{
        "block1_conv1,0.5",
        "block2_conv1,1.0",
        "block3_conv1,1.5",
        "block4_conv2,3.0",
        "block5_conv1,4.0"};

private Double allStyleLayersLoss(Map<String, INDArray> activationsStyleMap, Map<String, INDArray> activationsCombMap) {
    Double styles = 0.0;
    for (String styleLayers : STYLE_LAYERS) {
        String[] split = styleLayers.split(",");
        String styleLayerName = split[0];
        double weight = Double.parseDouble(split[1]);
        styles += styleLoss(activationsStyleMap.get(styleLayerName).dup(),
                activationsCombMap.get(styleLayerName).dup()) * weight;
    }
    return styles;
}
Total Cost Function
The total cost measures how far or different the combined image is from the content image features on the selected layer, and from the style image features selected from multiple layers. In order to control how much we want the combined image to look like the content or the style, two weights are introduced, α and β:
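Putting it together, the total cost from the paper is:

L_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, L_{content}(\vec{p}, \vec{x}) + \beta \, L_{style}(\vec{a}, \vec{x})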
Increasing α causes the combined image to look more like the content, while increasing β causes it to carry more style. Usually we decrease α and increase β.
Updating the Combined Image
By now we have a great way to measure how similar the combined image is to the content and the style. What we need now is to react to the comparison result in order to change the combined image, so that next time the difference, or the cost function value, is lower. Step by step, we will change the combined image to come closer and closer to the content and style images' layer features.
The amount of change is computed using the derivative of the total cost function. The derivative simply gives a direction to move in. We then multiply the derivative value by a coefficient (the learning rate), which simply defines how much we want to progress or change per step. We don't want values that are too small, as it would take a lot of iterations to improve, while values that are too big may make the algorithm never converge or produce unstable values (see here for more).
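In other words, each iteration nudges the pixels of the combined image x a small step against the gradient of the total cost, with lr denoting the learning rate:

x \leftarrow x - lr \cdot \frac{\partial L_{total}}{\partial x}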
If this were TensorFlow, we would be done by now, since it computes the derivative of the cost function automatically for us. deeplearning4j requires manual calculation of the derivatives (ND4J offers some autodiff, so feel free to experiment with automatic differentiation) and is not designed to work the way style transfer requires, but it has all the flexibility and pieces needed to build the algorithm.
Thanks to Jacob Schrum we were able to build the derivative implementation in Java; please find the details in the class implementation in the deeplearning4j-examples project on GitHub, which originally started in the MM-NEAT repository.
The last step is to update the combined image with the derivative value (multiplied by the learning rate as well):
AdamUpdater adamUpdater = createADAMUpdater();
for (int iteration = 0; iteration < ITERATIONS; iteration++) {
    log.info("iteration " + iteration);
    Map<String, INDArray> activationsCombMap = vgg16FineTune.feedForward(combination, true);

    INDArray styleBackProb = backPropagateStyles(vgg16FineTune, activationsStyleGramMap, activationsCombMap);
    INDArray backPropContent = backPropagateContent(vgg16FineTune, activationsContentMap, activationsCombMap);
    INDArray backPropAllValues = backPropContent.muli(ALPHA).addi(styleBackProb.muli(BETA));

    adamUpdater.applyUpdater(backPropAllValues, iteration);
    combination.subi(backPropAllValues);

    log.info("Total Loss: " + totalLoss(activationsStyleMap, activationsCombMap, activationsContentMap));
    if (iteration % SAVE_IMAGE_CHECKPOINT == 0) {
        // the saved image can be found at target/classes/styletransfer/out
        saveImage(combination.dup(), iteration);
    }
}
So each iteration we simply subtract the derivative value from the combined image's pixels, and the cost function guarantees that with each iteration we come closer to the image we want. In order for the algorithm to be effective, we update using ADAM, which simply helps gradient descent converge more stably. A simpler updater would basically work fine as well, but it would take slightly more time.
What we have described so far is gradient descent, more specifically stochastic gradient descent, since we are updating only one sample at a time. Usually L-BFGS is used for style transfer, but with deeplearning4j that would be harder and I did not have an insight into how to approach it.
Application
Originally the case was implemented in MM-NEAT together with Jacob Schrum, but it was later contributed to the deeplearning4j-examples project, so feel free to download it from either source (the dl4j version is slightly refactored).
Basically, the class code can easily be copied and run in a different project, as it has no other dependencies besides deeplearning4j, of course.
Usually, to get decent results you need to run a minimum of 500 iterations, but 1000 is more often recommended, while 5000 iterations produce really high-quality images. Anyway, expect to let the algorithm run for a couple of hours (3-4) for 1000 iterations.
There are a few parameters we can play with in order to push the combined image toward what looks best to us (a sketch of typical values follows this list):
- Change the loss weights α (impacts content) and β (impacts style), which simply affect how much you want your image to look like the content or the style. Some values suggested in other implementations are (0.025, 5), (5, 100), (10, 40), but feel free to experiment; there is plenty of room for optimization.
- Change the style layer weights; currently we have bigger values for higher layers and smaller values for lower layers. There are other implementations showing very good results with equal weights, e.g. since we have 5 style layers, the weights would all be 0.2. It would also be quite interesting to try increasing the impact of the low-level layers and notice how the image is transformed.
- Change the content layer or the style layers to lower or deeper layers and notice how greatly the image is affected.
- There are other parameters, related mostly to the algorithm itself, which are worth considering in the context of speeding up execution, like the ADAM beta and beta_2 momentum constants or the learning rate.
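For reference, such knobs typically live as constants at the top of the class. The sketch below only illustrates the idea; the values (and some of the names) are assumptions rather than the exact ones used in the example class, although NOISE_RATION, ALPHA, BETA and ITERATIONS do appear in the snippets above:

// Illustrative tuning knobs; the values here are assumptions, pick your own.
private static final double ALPHA = 0.025;           // content weight
private static final double BETA = 5.0;              // style weight
private static final double LEARNING_RATE = 2.0;     // step size of each update (assumed name)
private static final double NOISE_RATION = 0.1;      // fraction of noise in the starting image
private static final int ITERATIONS = 1000;          // 500 minimum, 1000+ recommended
private static final String CONTENT_LAYER_NAME = "block4_conv2"; // assumed name for the content layer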
Showcases
Please find below some showcases. Feel free to share more of them, since there are many more interesting art mixtures out there.
Results & Future Work
- In general, the style transfer algorithm is slow because each iteration requires a forward pass and several back-propagation passes (a couple of hours at 800x600 resolution with TensorFlow). I didn't personally perform any comparison of deeplearning4j with other frameworks like TensorFlow, but at first look I have the impression that it is slower. Especially if you try to run at a high resolution like 800x600, it becomes almost not computable on a CPU. Running on a GPU will probably help, but again I did not try it, so feel free to suggest new insights or experiments.
- There is a new paper which suggests a state-of-the-art technique to make style transfer faster. The implementation is so efficient that it can also be applied to videos, so it would be quite interesting to find a way to implement it in Java. Please find below a few demos and an implementation in TensorFlow: