14/2/2019

In this article, we continue our investigation of Variational Autoencoders (see our previous posts on the regularization effect of the Kullback-Leibler divergence, and the sparsity phenomenon). In particular, we shall point out an interesting stationary condition induced by the Kullback-Leibler component of the objective function.

Let us first of all observe that trying to compute relevant statistics for the posterior distribution of latent variables without some kind of regularization constraint does not make much sense. As a matter of fact, given a network computing a mean μ(X) and a variance σ²(X) for a given latent variable z, we can easily build another one with precisely the same behavior by scaling mean and standard deviation by some constant γ (for all data, uniformly), and then downscaling the generated samples by γ in the next layer of the network. This kind of linear transformation is easily performed by any neural network (for the same reason, it does not make much sense to add a batch-normalization layer right before a linear layer).
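As a quick sanity check, here is a minimal numpy sketch of this invariance (the values of μ, σ and γ are arbitrary): scaling mean and standard deviation by γ, and dividing by γ after sampling, leaves the sampled values unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.7, 0.4               # statistics produced by the encoder for one input
gamma = 3.0                        # arbitrary scaling factor
eps = rng.standard_normal(10_000)  # reparameterization noise

# original network: sample z and feed it to the next layer
z = mu + sigma * eps

# rescaled network: the encoder outputs gamma*mu and gamma*sigma,
# and the next layer compensates by dividing its input by gamma
z_scaled = (gamma * mu + gamma * sigma * eps) / gamma

assert np.allclose(z, z_scaled)    # identical downstream behavior
```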

Let’s see how the KL-divergence helps to choose a solution. In the following, we suppose to work on a specific latent variable z, omitting to specify it. Starting from the assumption that it is easy for the network to keep a fixed ratio ρ(X) = σ²(X)/μ²(X) between the variance and the squared mean of z, we can push this value into the closed form of the Kullback-Leibler divergence from the prior N(0, 1), that is

KL(N(μ(X), σ²(X)) || N(0, 1)) = ½ ( μ²(X) + σ²(X) − log(σ²(X)) − 1 )

getting the following expression:

KL = ½ ( σ²(X)/ρ(X) + σ²(X) − log(σ²(X)) − 1 )    (1)

In Figure 1, we plot the previous function in terms of the variance σ²(X), for a few given values of ρ(X).
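The behavior shown in Figure 1 is easy to reproduce numerically; the following sketch (a plain grid search, with illustrative values of ρ) locates the minimum of expression (1) and compares it with the closed-form minimizer ρ/(1+ρ) of equation (2).

```python
import numpy as np

def kl(sigma2, rho):
    """KL term of expression (1): 0.5*(sigma2/rho + sigma2 - log(sigma2) - 1)."""
    return 0.5 * ((1 + 1/rho) * sigma2 - np.log(sigma2) - 1)

sigma2 = np.linspace(0.01, 2.0, 2000)   # grid of candidate variances
for rho in (0.1, 1.0, 10.0):
    i = np.argmin(kl(sigma2, rho))
    print(f"rho={rho:5.1f}  argmin sigma2 ~ {sigma2[i]:.3f}  "
          f"(closed form: {rho/(1+rho):.3f})")
```

For small ρ the minimizing variance sits near 0, for large ρ it approaches 1, matching the plot.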

The above function has a minimum at

σ̂²(X) = ρ(X)/(1 + ρ(X))    (2)

which is close to 0 when ρ(X) is small, and close to 1 when ρ(X) is high. Of course ρ(X) depends on X, while the rescaling operation after sampling must be uniform; still, the network will have a propensity to synthesize variances close to σ̂²(X) (below we shall average over all X).

Substituting the definition of ρ(X) in equation (2), we expect to reach a minimum when

σ²(X) = σ²(X) / (μ²(X) + σ²(X))

that, by trivial computations, implies the following simple **stationary condition:**

μ²(X) + σ²(X) = 1
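For the skeptical reader, the trivial computations can also be checked symbolically; a small sketch (assuming sympy is available) differentiates expression (1), solves for the variance, and verifies that μ² + σ² equals 1 at the minimum.

```python
import sympy as sp

rho, s2 = sp.symbols('rho sigma2', positive=True)
kl = sp.Rational(1, 2) * (s2/rho + s2 - sp.log(s2) - 1)  # expression (1)

s2_hat = sp.solve(sp.diff(kl, s2), s2)[0]  # minimizing variance
mu2_hat = s2_hat / rho                     # since rho = sigma2/mu2

print(sp.simplify(s2_hat))                 # rho/(rho + 1), as in equation (2)
print(sp.simplify(mu2_hat + s2_hat))       # 1: the stationary condition
```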

Let us now average together the KL components for all data X:

KL = ½ ( ⟨μ²(X)⟩ + ⟨σ²(X)⟩ − ⟨log(σ²(X))⟩ − 1 )

We use the notation ⟨f(X)⟩ to abbreviate the average of f(X) on all data X.

The global ratio ρ̄ = ⟨σ²(X)⟩/⟨μ²(X)⟩ can really (and easily) be kept constant by the net. Let us also observe that, assuming the mean of the latent variable to be 0, ⟨μ²(X)⟩ is just the (global) variance of the latent variable.

Pushing ρ̄ into the previous equation, we get

KL = ½ ( ⟨σ²(X)⟩/ρ̄ + ⟨σ²(X)⟩ − ⟨log(σ²(X))⟩ − 1 )

Now we perform a somewhat rough approximation. The average of the logarithms ⟨log(σ²(X))⟩ is the logarithm of the geometric mean of the variances. If we replace the geometric mean with the arithmetic mean ⟨σ²(X)⟩, we get an expression essentially equivalent to expression (1), namely

KL ≈ ½ ( ⟨σ²(X)⟩/ρ̄ + ⟨σ²(X)⟩ − log(⟨σ²(X)⟩) − 1 )
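To get a feeling for the quality of this approximation, the sketch below compares the mean of the logarithms with the logarithm of the arithmetic mean on synthetic, moderately concentrated variances (the distribution is a stand-in of our choosing); by the AM-GM inequality the gap is one-sided, and it stays small when the variances are not too spread out.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical per-datapoint variances, moderately concentrated
sigma2 = rng.uniform(0.4, 0.8, size=1000)

mean_of_logs = np.mean(np.log(sigma2))  # log of the geometric mean
log_of_mean = np.log(np.mean(sigma2))   # log of the arithmetic mean

print(mean_of_logs, log_of_mean)        # close when variances are concentrated
assert log_of_mean >= mean_of_logs      # AM-GM: the error is one-sided
```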

that has a minimum when

⟨σ²(X)⟩ = ρ̄/(1 + ρ̄) = ⟨σ²(X)⟩ / (⟨μ²(X)⟩ + ⟨σ²(X)⟩)

that implies

⟨μ²(X)⟩ + ⟨σ²(X)⟩ = 1

or simply,

Var(μ(X)) + ⟨σ²(X)⟩ = 1    (3)

where we replaced ⟨μ²(X)⟩ with the variance of the latent variable, in view of the consideration above.

Condition (3) can be experimentally verified. In spite of the rough approximation we made to derive it, it proves to be quite accurate, provided the Variational Autoencoder is sufficiently trained. You can check it in your own experiments, or compare it with the data provided in our previous posts.
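As a sketch of how the check can be carried out: suppose we have collected the encoder outputs μ(X) and σ²(X) for a given latent variable over the whole dataset (here replaced by synthetic stand-ins that satisfy the law by construction, purely to illustrate the measurement); condition (3) is then a one-liner.

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical encoder outputs over the dataset, one latent variable:
# mu[i], sigma2[i] are the mean and variance predicted for datapoint i
mu = rng.standard_normal(5000) * 0.8      # stand-in for mu(X)
sigma2 = np.full(5000, 1 - np.var(mu))    # stand-in satisfying the law

law = np.var(mu) + np.mean(sigma2)        # left-hand side of condition (3)
print(f"Var(mu) + <sigma2> = {law:.3f}  (should be close to 1)")
```

On a real model, `mu` and `sigma2` would simply be the stacked encoder outputs for the test set.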

Let us finally remark that condition (3) is supposed to hold for both active and inactive variables.