Normalizing flow models
We continue our study with another type of likelihood-based generative model. As before, we assume we are given access to a dataset $\mathcal{D}$ of $n$-dimensional datapoints $\mathbf{x}$. So far we have learned two types of likelihood-based generative models:

- Autoregressive models: $p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid \mathbf{x}_{<i})$
- Variational autoencoders: $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}, \mathbf{z})\, \mathrm{d}\mathbf{z}$
The two methods have relative strengths and weaknesses. Autoregressive models provide tractable likelihoods but no direct mechanism for learning features, whereas variational autoencoders can learn feature representations but have intractable marginal likelihoods.
In this section, we introduce normalizing flows, a type of method that combines the best of both worlds, allowing both feature learning and tractable marginal likelihood estimation.
Change of Variables Formula
In normalizing flows, we wish to map simple distributions (easy to sample from and evaluate densities of) to complex ones (learned via data). The change of variables formula describes how to evaluate the density of a random variable that is a deterministic transformation of another variable.
Change of Variables: Let $Z$ and $X$ be random variables related by a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ such that $X = f(Z)$ and $Z = f^{-1}(X)$. Then

$$p_X(\mathbf{x}) = p_Z\big(f^{-1}(\mathbf{x})\big) \left|\det\left(\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right)\right|$$
There are several things to note here.
- $\mathbf{x}$ and $\mathbf{z}$ need to be continuous and have the same dimension.
- $\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}$ is a matrix of dimension $n \times n$, where each entry at location $(i, j)$ is defined as $\frac{\partial f^{-1}(\mathbf{x})_i}{\partial x_j}$. This matrix is also known as the Jacobian matrix.
- $\det(A)$ denotes the determinant of a square matrix $A$.
- For any invertible matrix $A$, $\det(A^{-1}) = \det(A)^{-1}$, so for $\mathbf{z} = f^{-1}(\mathbf{x})$ we have
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left|\det\left(\frac{\partial f(\mathbf{z})}{\partial \mathbf{z}}\right)\right|^{-1}$$
- If $\det\left(\frac{\partial f(\mathbf{z})}{\partial \mathbf{z}}\right) = 1$, then the mapping is volume preserving, which means that the transformed distribution $p_X$ will have the same “volume” compared to the original $p_Z$.
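As a quick sanity check of the formula, the following sketch (NumPy only; the transformation $f(z) = \exp(z)$ is chosen purely for illustration) pushes a standard normal through $f$ and verifies numerically that the transformed density still integrates to one:

```python
import numpy as np

# Sanity check of the change of variables formula (a sketch; the example
# transformation f(z) = exp(z) is chosen for illustration).
# With Z ~ N(0, 1) and X = exp(Z): f^{-1}(x) = log(x) and
# |det(df^{-1}/dx)| = 1/x, so p_X(x) = p_Z(log x) / x (the log-normal density).

def p_z(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def p_x(x):
    return p_z(np.log(x)) / x  # p_Z(f^{-1}(x)) * |det Jacobian of f^{-1}|

# "Normalizing": the transformed density should still integrate to 1.
xs = np.linspace(1e-6, 50.0, 200_000)
ys = p_x(xs)
total_mass = np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs))  # trapezoid rule
print(round(total_mass, 4))  # close to 1.0
```

Dropping the $1/x$ Jacobian factor and repeating the integral is a quick way to see why the correction term is necessary.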
Normalizing Flow Models
We are ready to introduce normalizing flow models. Let us consider a directed, latent-variable model over observed variables $X$ and latent variables $Z$. In a normalizing flow model, the mapping between $Z$ and $X$, given by $f_\theta: \mathbb{R}^n \to \mathbb{R}^n$, is deterministic and invertible such that $X = f_\theta(Z)$ and $Z = f_\theta^{-1}(X)$.¹
Using change of variables, the marginal likelihood is given by

$$p_X(\mathbf{x}; \theta) = p_Z\big(f_\theta^{-1}(\mathbf{x})\big) \left|\det\left(\frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right)\right|$$
The name “normalizing flow” can be interpreted as the following:
- “Normalizing” means that the change of variables gives a normalized density after applying an invertible transformation.
- “Flow” means that the invertible transformations can be composed with each other to create more complex invertible transformations.
Unlike autoregressive models and variational autoencoders, deep normalizing flow models require specific architectural structures:
- The input and output dimensions must be the same.
- The transformation must be invertible.
- Computing the determinant of the Jacobian needs to be efficient (and differentiable).
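To make these requirements concrete, here is a minimal sketch (all parameter values made up) of a flow built from two composed elementwise affine layers: same input and output dimension, trivially invertible, and with a Jacobian determinant that is cheap to compute and simply accumulates across layers.

```python
import numpy as np

# A minimal flow satisfying the three requirements (same dimension,
# invertible, cheap Jacobian determinant): two elementwise affine layers
# f_k(v) = a_k * v + b_k composed. All parameter values are made up.

n = 3
a1, b1 = np.array([2.0, 0.5, 1.5]), np.array([0.1, -0.2, 0.3])
a2, b2 = np.array([0.8, 1.2, 2.0]), np.array([1.0, 0.0, -1.0])

def log_p_z(z):  # standard normal prior
    return -0.5 * np.sum(z**2) - 0.5 * n * np.log(2.0 * np.pi)

def log_p_x(x):
    h = (x - b2) / a2  # invert the second layer
    z = (h - b1) / a1  # invert the first layer
    # change of variables: subtract log|det J| of each forward layer
    return log_p_z(z) - np.sum(np.log(np.abs(a1))) - np.sum(np.log(np.abs(a2)))

# Check: the composition is itself affine with scale a1*a2 and mean
# a2*b1 + b2, so the flow likelihood must match the pushforward Gaussian.
rng = np.random.default_rng(0)
z = rng.standard_normal(n)
x = a2 * (a1 * z + b1) + b2
scale, mean = a1 * a2, a2 * b1 + b2
direct = (-0.5 * np.sum(((x - mean) / scale) ** 2)
          - np.sum(np.log(np.abs(scale))) - 0.5 * n * np.log(2.0 * np.pi))
print(bool(np.isclose(log_p_x(x), direct)))  # True
```

Elementwise affine layers are too weak to be useful on their own; the models below replace them with expressive transformations that keep these same properties.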
Next, we introduce several popular forms of flow models that satisfy these properties.
The Planar Flow introduces the following invertible transformation

$$\mathbf{x} = f_\theta(\mathbf{z}) = \mathbf{z} + \mathbf{u}\, h(\mathbf{w}^\top \mathbf{z} + b)$$

where $\theta = (\mathbf{w}, \mathbf{u}, b)$ are parameters and $h$ is a nonlinearity.
The absolute value of the determinant of the Jacobian is given by

$$\left|\det\frac{\partial f_\theta(\mathbf{z})}{\partial \mathbf{z}}\right| = \left|1 + h'(\mathbf{w}^\top \mathbf{z} + b)\, \mathbf{u}^\top \mathbf{w}\right|$$
However, $\mathbf{u}, \mathbf{w}, b, h(\cdot)$ need to be restricted in order for $f_\theta$ to be invertible. For example, $h = \tanh$ and $h'(\mathbf{w}^\top \mathbf{z} + b)\, \mathbf{u}^\top \mathbf{w} \geq -1$. Note that while $f_\theta(\mathbf{z})$ is invertible, computing $f_\theta^{-1}(\mathbf{x})$ could be difficult analytically. The following models address this problem, where both $f_\theta$ and $f_\theta^{-1}$ have simple analytical forms.
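As a numerical check of the determinant formula, the sketch below (arbitrary parameters, with $h = \tanh$) compares the analytic rank-one expression against a finite-difference Jacobian:

```python
import numpy as np

# Numerical check of the planar-flow determinant (a sketch with arbitrary
# parameters and h = tanh): compare the analytic rank-one formula against
# a finite-difference Jacobian.

rng = np.random.default_rng(1)
n = 4
u, w, b = rng.normal(size=n), rng.normal(size=n), 0.3

def f(z):
    return z + u * np.tanh(w @ z + b)

z = rng.normal(size=n)

# Analytic det: matrix determinant lemma, det(I + c u w^T) = 1 + c u^T w,
# with c = tanh'(w^T z + b) = 1 - tanh(w^T z + b)^2.
t = np.tanh(w @ z + b)
analytic = 1.0 + (1.0 - t**2) * (u @ w)

# Finite-difference Jacobian, column j = df/dz_j
eps = 1e-6
J = np.stack([(f(z + eps * e) - f(z - eps * e)) / (2 * eps)
              for e in np.eye(n)], axis=1)
print(bool(np.isclose(np.linalg.det(J), analytic, atol=1e-5)))  # True
```

The rank-one structure of the Jacobian is exactly what makes the determinant $O(n)$ rather than $O(n^3)$ here.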
The Nonlinear Independent Components Estimation (NICE) model and Real Non-Volume Preserving (RealNVP) model compose two kinds of invertible transformations: additive coupling layers and rescaling layers. The coupling layer in NICE partitions a variable $\mathbf{z}$ into two disjoint subsets, say $\mathbf{z}_{1:d}$ and $\mathbf{z}_{d+1:n}$. Then it applies the following transformation:
Forward mapping $\mathbf{z} \to \mathbf{x}$:
- $\mathbf{x}_{1:d} = \mathbf{z}_{1:d}$, which is an identity mapping.
- $\mathbf{x}_{d+1:n} = \mathbf{z}_{d+1:n} + m_\theta(\mathbf{z}_{1:d})$, where $m_\theta$ is a neural network.

Inverse mapping $\mathbf{x} \to \mathbf{z}$:
- $\mathbf{z}_{1:d} = \mathbf{x}_{1:d}$, which is an identity mapping.
- $\mathbf{z}_{d+1:n} = \mathbf{x}_{d+1:n} - m_\theta(\mathbf{x}_{1:d})$, which is the inverse of the forward transformation.
Therefore, the Jacobian of the forward mapping is lower triangular, and its determinant is simply the product of the elements on the diagonal, which is 1. This defines a volume-preserving transformation. RealNVP adds scaling factors to the transformation:
$$\mathbf{x}_{d+1:n} = \mathbf{z}_{d+1:n} \odot \exp\big(s_\theta(\mathbf{z}_{1:d})\big) + m_\theta(\mathbf{z}_{1:d})$$

where $\odot$ denotes elementwise product. This results in a non-volume-preserving transformation.
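The coupling layers can be sketched as follows. The networks `m` and `s` below are toy stand-ins for $m_\theta$ and $s_\theta$ (fixed random linear maps with a tanh), made up here for illustration; setting the log-scale to zero recovers NICE's additive coupling with $\log|\det J| = 0$.

```python
import numpy as np

# Coupling-layer sketch. m and s are toy stand-ins for the networks
# m_theta and s_theta (fixed random linear maps plus tanh), made up for
# illustration. With the log-scale s set to zero this reduces to NICE's
# additive coupling, whose log|det J| is 0.

rng = np.random.default_rng(2)
n, d = 6, 3
W_m = rng.normal(size=(n - d, d))
W_s = 0.1 * rng.normal(size=(n - d, d))

def m(v):  # stand-in for m_theta
    return np.tanh(W_m @ v)

def s(v):  # stand-in for the log-scale network s_theta
    return np.tanh(W_s @ v)

def forward(z):
    log_scale = s(z[:d])
    x2 = np.exp(log_scale) * z[d:] + m(z[:d])
    # log|det J| is the sum of log-scales (the Jacobian is lower triangular)
    return np.concatenate([z[:d], x2]), np.sum(log_scale)

def inverse(x):
    z2 = (x[d:] - m(x[:d])) * np.exp(-s(x[:d]))
    return np.concatenate([x[:d], z2])

z = rng.normal(size=n)
x, logdet = forward(z)
print(bool(np.allclose(inverse(x), z)))  # True
```

Note that both directions only ever *evaluate* `m` and `s`, never invert them, which is why the coupling networks can be arbitrarily complex.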
Some autoregressive models can also be interpreted as flow models. For a Gaussian autoregressive model, one receives a Gaussian noise variable for each dimension of $\mathbf{x}$, and these noise variables can be treated as the latent variables $\mathbf{z}$. Such transformations are also invertible: given $\mathbf{x}$ and the model parameters, we can obtain $\mathbf{z}$ exactly.
Masked Autoregressive Flow (MAF) uses this interpretation, where the forward mapping is an autoregressive model. However, sampling is sequential and slow, taking $O(n)$ time where $n$ is the dimension of the samples.
To address the sampling problem, the Inverse Autoregressive Flow (IAF) simply inverts the generating process. In this case, generating $\mathbf{x}$ from the noise $\mathbf{z}$ can be parallelized, but computing the likelihood of new data points is slow. However, for generated points the likelihood can be computed efficiently, since the noise variables are already known.
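The trade-off can be sketched with a toy Gaussian autoregressive flow (the masked linear conditioners below are made up for illustration): sampling must fill in one dimension at a time, while recovering all of the noise from a complete $\mathbf{x}$ takes a single vectorized pass.

```python
import numpy as np

# Toy Gaussian autoregressive flow: x_i = mu_i(x_{<i}) + sigma_i(x_{<i}) * z_i.
# The strictly lower-triangular matrix A is a made-up stand-in for masked
# conditioners. Sampling is sequential (O(n) passes), but recovering the
# noise z from a complete x is a single vectorized pass.

rng = np.random.default_rng(3)
n = 5
A = 0.5 * np.tril(rng.normal(size=(n, n)), k=-1)  # row i uses x_{<i} only

def mu(x):
    return A @ x

def sigma(x):
    return np.exp(0.1 * np.tanh(A @ x))  # positive scales

def sample(z):  # sequential generation, one dimension at a time
    x = np.zeros(n)
    for i in range(n):
        x[i] = mu(x)[i] + sigma(x)[i] * z[i]
    return x

def invert(x):  # parallel inversion given the full x
    return (x - mu(x)) / sigma(x)

z = rng.standard_normal(n)
x = sample(z)
print(bool(np.allclose(invert(x), z)))  # True
```

In this framing, MAF makes `invert` the likelihood direction (fast training, slow sampling) and IAF makes `sample`'s parallel analogue the generation direction (fast sampling, slow likelihood for new data).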
Parallel WaveNet combines the best of both worlds of IAF and MAF: it uses an IAF student model to draw samples and a MAF teacher model to compute likelihoods. The teacher model can be trained efficiently via maximum likelihood, and the student model is trained by minimizing its KL divergence to the teacher model. Since computing the IAF likelihood of an IAF-generated sample is efficient, this training process is efficient.
¹ Recall the conditions for the change of variables formula. ↩