In our previous article, we discussed the regularization effect of the Kullback-Leibler divergence in the objective function of Variational Autoencoders, providing empirical evidence that it results in a better coverage of the latent space.

In this article, we shall discuss another important effect of it: when working in latent spaces of sufficiently high dimension, the latent representation becomes *sparse*. Many latent variables are *zeroed out* (independently of the input), the variance computed by the network for them is around 1 (while their actual variance is close to 0), and in any case those variables are *ignored* by the decoder.

This property is usually known under the name of over-pruning, since it induces the model to use only a small number of its stochastic units. It is, in fact, a form of sparsity, with all the benefits typically associated with this kind of regularization.

* * *

Sparsity is a well-known and desirable property of encodings: it forces the model to focus on the relevant features of the data, usually resulting in more robust representations that are less prone to overfitting.

Sparsity is typically achieved in neural networks by means of L1 weight-decay regularizers, which act directly on the weights. Remarkably, the same behaviour is induced in Variational Autoencoders by the Kullback-Leibler divergence, which acts merely on the variance of the encoding distribution Q(z|X).
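As a toy illustration of the traditional mechanism (not taken from the experiments in this article), the following numpy sketch applies an L1 penalty to a small least-squares problem via proximal gradient descent (ISTA); the problem, its sizes, and the penalty strength are all arbitrary choices of ours. The soft-thresholding step drives irrelevant weights exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny least-squares problem: y = X @ w_true + noise, with a sparse w_true.
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]            # only 3 relevant features
y = X @ w_true + 0.01 * rng.normal(size=100)

lam, lr = 0.1, 0.01                      # L1 strength and learning rate
w = rng.normal(size=20)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)    # gradient of the squared loss
    w -= lr * grad
    # Proximal step for the L1 penalty (soft-thresholding):
    # weights that do not help the reconstruction end up exactly at 0.
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

print(np.sum(w == 0.0))   # most of the 17 irrelevant weights are exactly zero
```

In a VAE no such explicit penalty on the weights is present; as discussed below, the KL term produces an analogous pruning effect on the latent variables themselves.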

The most interesting consequence is that, at least for a given architecture, there seems to exist an *intrinsic internal dimension of the data*. This property can be exploited in two ways: to check whether the network has sufficient internal capacity, augmenting it until sparsity appears, or conversely to reduce the dimension of the network by removing links to unused neurons. Sparsity also helps to explain the loss of variability in random generative sampling from the latent space that one may sometimes observe with variational autoencoders.

In the following sections we shall investigate sparsity for a couple of typical test cases.

#### MNIST

We start with the well known MNIST dataset of handwritten digits that we already used in our previous article. Our first architecture is a dense network with dimensions 784-256-64-32-16.

In Figure 1 we show the evolution during a typical training of the variance of the 16 latent variables.

Table 1 provides the relevant statistics for each latent variable at the end of training, computed over the full dataset: the variance of the latent variable (which we expect to be around 1, since it should be normally distributed) and the mean of the computed variance σ^{2}(X) (which we expect to be small, close to 0). The mean value is around 0, as expected, and we do not report it.

no | variance | mean(σ^{2}(X))
---|---|---
0 | 8.847272e-05 | 0.9802212
1 | 0.00011756 | 0.99551463
2 | 6.665453e-05 | 0.98517334
3 | 0.97417927 | 0.008741336
4 | 0.99131817 | 0.006186147
5 | 1.0012343 | 0.010142518
6 | 0.94563377 | 0.057169348
7 | 0.00015841 | 0.98205334
8 | 0.94694275 | 0.033207607
9 | 0.00014789 | 0.98505586
10 | 1.0040375 | 0.018151345
11 | 0.98543876 | 0.023995731
12 | 0.000107441 | 0.9829797
13 | 4.5068125e-05 | 0.998983
14 | 0.0001085 | 0.9604088
15 | 0.9886378 | 0.044405878

Half of the variables (nos. 0, 1, 2, 7, 9, 12, 13 and 14) have an anomalous behavior: their variance is very low (in practice, they *always* take the value 0), while the variance σ^{2}(X) computed by the network is around 1 for every X. In other words, the representation is getting *sparse*! Only 8 latent variables out of 16 are in use: the others are completely ignored by the generator.
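The two statistics in the table can be computed directly from the encoder outputs over the dataset. A minimal numpy sketch, where the encoder outputs `mu` and `sigma2` are simulated with random data just to make the snippet self-contained (the names and the 0.01 detection threshold are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10000, 16

# Simulated encoder outputs mu(X) and sigma^2(X) over n inputs:
# active units have an informative mean and a small computed variance;
# inactive units have a mean pinned near 0 and a computed variance near 1.
active = np.arange(d) % 2 == 0
mu = np.where(active, rng.normal(0, 1, (n, d)), rng.normal(0, 0.01, (n, d)))
sigma2 = np.where(active, 0.01, 0.98) * np.ones((n, d))

variance = mu.var(axis=0)          # the "variance" column of the table
mean_sigma2 = sigma2.mean(axis=0)  # the "mean(sigma^2(X))" column

# A variable is inactive when its mean barely moves with the input.
inactive = variance < 0.01
print(inactive.sum())   # 8 of the 16 simulated units
```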

As an additional confirmation of this fact, in Figure 2 we show a few digits randomly generated by Gaussian sampling in the latent space (upper row) and the result of generation when the inactive latent variables have been zeroed out (lower row): they are indistinguishable.

Let’s now try with a convolutional VAE. We consider a relatively sophisticated network with the following structure (see Figure 3).

In this case, sparsity is less evident, but still present: 3 variables out of 16 are inactive.

no | variance | mean(σ^{2}(X))
---|---|---
0 | 0.00015939 | 0.99321973
1 | 0.91644579 | 0.015826659
2 | 1.05478882 | 0.0062623904
3 | 0.98102569 | 0.012602937
4 | 1.07353293 | 0.0051363203
5 | 1.06932497 | 0.0066873398
6 | 6.477744e-05 | 0.96163213
7 | 1.11955034 | 0.0031915947
8 | 0.88755643 | 0.024708110
9 | 0.97943300 | 0.0094883628
10 | 0.9322856 | 0.016983853
11 | 1.40059826 | 0.0025208105
12 | 1.30227565 | 0.0033756110
13 | 0.00019337 | 0.99533605
14 | 1.13597583 | 0.0076088942
15 | 1.33482563 | 0.002084088

The reduced sparsity seems to suggest that convolutional networks exploit latent variables better, typically resulting in more precise reconstructions and improved generative sampling.

This is likely due to the fact that the latent variables encode information corresponding to different portions of the input space, and are hence less likely to become useless for the generator.

The sparsity phenomenon was first highlighted by Burda et al. in this work, and later confirmed by many authors on several different datasets and neural architectures. We discuss this debated topic and survey the recent literature in this article, where we also investigate a few more concrete examples.

The degree of sparsity may slightly vary from training to training, but not in a significant way (at least, for similar final values of the loss function). This seems to suggest that, given a certain neural architecture, there exists an intrinsic, “optimal” compression of the data in the latent space. If the network does not exhibit sparsity, it is probably a good idea to augment the dimension of the latent space; conversely, if the network is sparse, we may reduce its dimension by removing the inactive latent variables and their connections.

#### Kullback-Leibler divergence and sparsity

Let us consider the lower bound on the loglikelihood (the objective function of the VAE) for a data point X:

E_{z∼Q(z|X)}[log P(X|z)] − KL(Q(z|X)||P(z))
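Since Q(z|X) is the Gaussian N(μ(X), σ^{2}(X)) and the prior P(z) is N(0, 1), the divergence has the well-known closed form, for each latent variable:

KL(Q(z|X)||P(z)) = 1/2 (μ^{2}(X) + σ^{2}(X) − 1 − log σ^{2}(X))

which is non-negative and vanishes exactly at μ(X) = 0, σ^{2}(X) = 1.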

If we remove the Kullback-Leibler component from the objective function above, or keep just its quadratic penalty on the latent variables, the sparsity phenomenon disappears. So sparsity must be related to this part of the loss function.

It is also evident that if the generator ignores a latent variable, P(X|z) does not depend on it, and the loglikelihood is maximal when the distribution Q(z|X) equals the prior P(z), that is, a normal distribution with mean 0 and standard deviation 1. In other words, the encoder is induced to learn a trivial encoding z_{X} = 0 and a (fake) variance σ^{2}(X) = 1. Sampling has no effect, since the sampled value for z_{X} is simply ignored.
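A quick numeric check of this observation, using the standard closed form of the Gaussian KL divergence: for a variable ignored by the decoder the reconstruction term is constant, so the loss is minimized exactly where the per-variable KL term vanishes.

```python
import numpy as np

def kl(mu, sigma2):
    # KL(N(mu, sigma2) || N(0, 1)): per-variable closed form
    return 0.5 * (mu**2 + sigma2 - 1.0 - np.log(sigma2))

print(kl(0.0, 1.0))       # 0.0: the optimum mu = 0, sigma^2 = 1
print(kl(0.5, 1.0) > 0)   # True: any other encoding is penalized
print(kl(0.0, 0.1) > 0)   # True: a small "honest" variance also costs
```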

Intuitively, if during training a latent variable is of only moderate use for reconstructing the input (in comparison with the other variables), the network will learn to give it less importance; at that point, the Kullback-Leibler divergence may prevail, pushing the mean towards 0 and the standard deviation towards 1. This makes the latent variable even noisier, in a vicious circle that eventually induces the network to ignore it altogether.