Wasserstein Generative Adversarial Networks (WGANs)

WGAN’s architecture uses deep neural networks for both the generator and the discriminator (called the critic in the WGAN setting). The key difference between GANs and WGANs lies in the loss function and in the constraint that keeps the critic Lipschitz: weight clipping in the original WGAN, or a gradient penalty in the WGAN-GP variant. WGANs were introduced as a solution to the training-instability and mode-collapse issues of standard GANs. The network uses the Wasserstein distance, which provides a meaningful and smoother measure of the distance between distributions.

WGAN architecture

WGANs use the Wasserstein (earth mover's, EM) distance, which provides a more meaningful and smoother measure of the distance between the real distribution Pr and the generated distribution Pg; a short numerical illustration follows the definitions below.

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x, y) ∼ γ} [ ‖x − y‖ ]

  • γ(x, y) denotes the mass transported from x to y in order to transform the distribution Pr into Pg.
  • Π(Pr, Pg) denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg.
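As a concrete illustration (not from the original article), the one-dimensional Wasserstein distance between two sampled distributions can be estimated directly with SciPy's wasserstein_distance; the two Gaussians and the sample size below are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two empirical 1-D distributions: Pr ~ N(0, 1) and Pg ~ N(2, 1).
# The optimal transport plan gamma moves mass from samples of Pr onto samples of Pg.
pr_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
pg_samples = rng.normal(loc=2.0, scale=1.0, size=10_000)

# For 1-D samples SciPy computes the optimal-transport cost in closed form.
w = wasserstein_distance(pr_samples, pg_samples)
print(f"Estimated W(Pr, Pg) ≈ {w:.3f}")  # close to the true value of 2.0
```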

The benefits of the Wasserstein distance over the Jensen-Shannon (JS) and Kullback-Leibler (KL) divergences are as follows:

  • W(Pr, Pg) is continuous in the generator's parameters.
  • W(Pr, Pg) is differentiable almost everywhere (under mild assumptions on the generator).
  • In contrast, the JS divergence and related divergences are not continuous in this sense: when the real and generated distributions have non-overlapping supports they saturate at a constant value and provide no useful gradient.
  • Hence gradient descent can be used to minimize the cost function; a small numerical comparison is given below.
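To see the contrast numerically, the sketch below (an illustrative example in the spirit of the parallel-lines example from the WGAN paper, not code from this article) compares the two quantities for a pair of point masses separated by a distance theta: the Wasserstein distance grows linearly with theta, while the JS divergence jumps to its maximum as soon as the supports stop overlapping. The grid and the values of theta are arbitrary.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

for theta in [0, 1, 2, 5, 10]:
    # Wasserstein distance between a point mass at 0 and a point mass at theta.
    w = wasserstein_distance([0.0], [float(theta)])

    # JS divergence between the same two point masses, written as probability
    # vectors over the common support {0, 1, ..., 10}.
    pr = np.zeros(11)
    pr[0] = 1.0
    pg = np.zeros(11)
    pg[theta] = 1.0
    js = jensenshannon(pr, pg, base=2) ** 2  # squared JS distance = JS divergence

    # W grows linearly with theta, but JS is 0 at theta = 0 and saturates at 1
    # otherwise, so it carries no information about how far apart the supports are.
    print(f"theta={theta:2d}  W={w:5.2f}  JS={js:4.2f}")
```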

Wasserstein GAN Algorithm

The algorithm is stated as follows (a minimal training-loop sketch is given after the list):

  • The critic f solves the maximization problem given by the Kantorovich-Rubinstein duality, W(Pr, Pg) = sup_{‖f‖_L ≤ 1} E_{x ∼ Pr}[f(x)] − E_{x ∼ Pg}[f(x)]. To approximate it, a neural network f_w is trained, parametrized with weights w lying in a compact space W, and updated by backpropagation as in a typical GAN.
  • To keep the parameters w in a compact space, the weights are clamped to a fixed box (for example W = [−0.01, 0.01]^l) after every critic update. Weight clipping is a crude way to enforce the Lipschitz constraint, but it is simple and yields good results in practice. Because the EM distance is continuous and differentiable almost everywhere, the critic can be trained to optimality.
  • Whereas the JS-based discriminator saturates and its gradients vanish as it becomes more accurate, the weight constraint limits the possible growth of the critic to be at most linear in most of the space, so the optimal critic does not saturate and keeps providing useful gradients.
  • Mode collapse in standard GANs arises because the optimal generator for a fixed discriminator is a sum of deltas on the points to which the discriminator assigns the highest values; training the critic to optimality before each generator update removes this incentive and prevents modes from collapsing.
  • Because the critic f is trained to near-optimality in the inner loop before each generator update, the critic loss at that point is an estimate of the EM distance. This estimate correlates with the visual quality of the generated samples, a property that the standard GAN loss lacks.
  • This makes it very convenient to identify failure modes and learn which models perform better than others without having to look at the generated samples.
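A minimal training-loop sketch of the algorithm above is shown below. The loop structure (n_critic critic updates with weight clipping, then one generator update, RMSProp with the paper's default hyperparameters) follows the WGAN algorithm, but the generator and critic definitions and the sample_real_batch data source are placeholder assumptions for illustration, not code from this article.

```python
import torch

# Placeholder networks: any generator/critic architectures could be used here.
generator = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 2))
critic = torch.nn.Sequential(
    torch.nn.Linear(2, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))

# The original paper uses RMSProp with a small learning rate.
g_opt = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
c_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

n_critic, clip_value, z_dim, batch_size = 5, 0.01, 64, 64

def sample_real_batch():
    # Stand-in for a real data loader: a 2-D Gaussian "dataset" centred at (4, 4).
    return torch.randn(batch_size, 2) + torch.tensor([4.0, 4.0])

for step in range(1000):
    # --- Train the critic towards optimality ---
    for _ in range(n_critic):
        real = sample_real_batch()
        fake = generator(torch.randn(batch_size, z_dim)).detach()

        # The critic maximizes E[f(real)] - E[f(fake)], i.e. minimizes the negative.
        c_loss = -(critic(real).mean() - critic(fake).mean())
        c_opt.zero_grad()
        c_loss.backward()
        c_opt.step()

        # Weight clipping keeps w in a compact box so f stays (roughly) Lipschitz.
        with torch.no_grad():
            for p in critic.parameters():
                p.clamp_(-clip_value, clip_value)

    # --- One generator update: push generated samples towards higher critic scores ---
    g_loss = -critic(generator(torch.randn(batch_size, z_dim))).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    if step % 100 == 0:
        # -c_loss is an estimate of the EM distance between real and generated data.
        print(f"step {step}: estimated W ≈ {-c_loss.item():.3f}")
```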

Benefits of the WGAN algorithm over standard GANs

  • WGAN training is more stable because the Wasserstein distance is continuous and differentiable almost everywhere, which allows gradient descent to be performed reliably.
  • It allows the critic to be trained to optimality.
  • In the reported experiments there is no evidence of mode collapse.
  • Gradient descent does not get stuck the way it does when a JS-based discriminator saturates.
  • WGANs provide more flexibility in the choice of network architecture: the weight-clipping threshold and the generator architecture can be changed as desired without destabilizing training.

Wasserstein Generative Adversarial Networks (WGANs) Convergence and Optimization

The Wasserstein Generative Adversarial Network (WGAN) is a modification of the standard deep-learning GAN with a few changes to the training algorithm. A GAN, or Generative Adversarial Network, is a framework for building an accurate generative model. WGAN was introduced by Martin Arjovsky, Soumith Chintala, and Léon Bottou in 2017 and is widely used to generate realistic images.
