Anomaly Detection Using Variational Autoencoder Reconstruction Probability

Let me preface this blog post by saying that the topic is somewhat complicated. Even a moderate explanation would take many pages, and so my summary will leave out a lot of detail.

A relatively new technique for anomaly detection is to use a variational autoencoder (VAE) and compute a metric called reconstruction probability. The idea is related to, but significantly different from, anomaly detection using a standard autoencoder, and is also significantly different from using a VAE with reconstruction error.

At a high level, a VAE computes an internal representation of the probability distribution of the source data in the form of a mean and a standard deviation (actually the log of the variance, which is one of many tricky details). A VAE outputs the distribution mean and log-variance, and a reconstructed version of the input. The idea of reconstruction probability anomaly detection is to compute a second probability distribution and then use it to calculate the likelihood that an input item came from that distribution. Data items with a low reconstruction probability are not likely to have come from the distribution, and so are anomalous in some way.
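As a concrete sketch (with hypothetical values for a latent dimension of 2), the encoder's mean and log-variance define the latent distribution, and a latent vector is drawn using the so-called reparameterization trick:

```python
import torch as T

# hypothetical encoder outputs for a latent dimension of 2
u = T.tensor([0.5, -1.2])        # mean
logvar = T.tensor([-0.7, 0.3])   # log of the variance

# reparameterization trick: z = mean + std * noise
std = T.exp(0.5 * logvar)        # std = sqrt(variance)
eps = T.randn_like(std)          # noise from N(0, 1)
z = u + std * eps                # one sample from N(u, std^2)
```

Working with log-variance rather than standard deviation keeps the value unconstrained during training; exponentiating guarantees a positive standard deviation.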

The source research paper is “Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability” (2015), written by Jinwon An and Sungzoon Cho. The paper is simultaneously excellent and . . . not so excellent. The idea of reconstruction probability is very clever, and the background and motivation are clearly explained. But the exact definition of “reconstruction probability” is never given, and the hints about how to implement it are somewhat contradictory.

I spent several days exploring the ideas of VAE reconstruction probability and finally got a possible demo running. I say “possible demo” because the ideas are so complex, there are dozens of places where I might have gone wrong.

The algorithm in the research paper is shown in the image below:

A close inspection of the algorithm reveals that it has ambiguities. For example, the g() function is the decoder component of a VAE, but g() in a standard VAE architecture does not return a mean and standard deviation as indicated by the algorithm. This could mean the researchers used a modified VAE architecture where the output is a second mean and standard deviation (probably), or it could mean something else. The bottom line is that VAE reconstruction probability is not rigidly defined.

For this experiment, I set up a VAE with the architecture shown in the diagram below. The source X determines a u1 mean and an s1 standard deviation, which define a Normal distribution. That distribution determines a reconstructed X. To compute a reconstruction probability, I use the reconstructed X as a stand-in for the mean of a second Normal distribution. I use a dummy standard deviation of all 1s for the second distribution.
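A minimal sketch of such a VAE in PyTorch might look like the class below. The layer sizes (6 inputs, 4 hidden, 2 latent) are hypothetical placeholders, not the sizes used in my actual demo:

```python
import torch as T

class VAE(T.nn.Module):
  # minimal sketch; layer sizes are hypothetical
  def __init__(self):
    super().__init__()
    self.enc_fc = T.nn.Linear(6, 4)
    self.enc_u = T.nn.Linear(4, 2)       # u1 mean
    self.enc_logvar = T.nn.Linear(4, 2)  # log(s1^2)
    self.dec_fc = T.nn.Linear(2, 4)
    self.dec_out = T.nn.Linear(4, 6)

  def encode(self, x):
    h = T.tanh(self.enc_fc(x))
    return self.enc_u(h), self.enc_logvar(h)

  def decode(self, z):
    h = T.tanh(self.dec_fc(z))
    return T.sigmoid(self.dec_out(h))  # assumes [0,1] data

  def forward(self, x):
    u, logvar = self.encode(x)
    std = T.exp(0.5 * logvar)
    z = u + std * T.randn_like(std)  # reparameterization
    return self.decode(z), u, logvar
```

The separate encode() and decode() methods are what allow the reconstruction probability function shown next to sample from the latent distribution explicitly rather than relying on forward().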



My preliminary PyTorch implementation of reconstruction probability is:

import torch as T
import numpy as np
import scipy.stats as sps

device = T.device("cpu")  # all computation on CPU

def recon_prob(model, xi, n_samples=10):
  # xi is one Tensor item, not a batch
  # assumes model.eval() has been set

  with T.no_grad():
    (u, logvar) = model.encode(xi)
  u = u.numpy()               # mean
  v = np.exp(logvar.numpy())  # log-variance to variance

  samples = sps.multivariate_normal.rvs(u, \
    np.diag(v), size=n_samples)
  samples = np.atleast_2d(samples)  # guard n_samples=1
  samples = T.tensor(samples, \
    dtype=T.float32).to(device)

  with T.no_grad():
    x_computeds = model.decode(samples)
  est_means = T.mean(x_computeds, dim=0)
  est_vars = T.ones(len(est_means), \
    dtype=T.float32).to(device)  # dummy variances of 1
  est_prob = sps.multivariate_normal.pdf( \
    xi.numpy(), est_means.numpy(), \
    np.diag(est_vars.numpy()))

  return est_prob

The recon_prob() function accepts a trained VAE model and a single data item. The encoder computes a mean and a log-variance. These are used to construct a multivariate Normal distribution, which in turn is used to generate a set of samples. These samples are fed to the decoder component of the VAE. The outputs are averaged to produce a mean. I make a simplifying assumption and use a diagonal matrix of 1-values to stand in for the covariance matrix. The mean and covariance matrix define a new distribution, and the distribution’s pdf() function (probability density function) emits the result probability. My implementation gives a single probability, unlike the research paper, which indicates it produces L probabilities and then averages them.
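For reference, here is a sketch of what I believe the paper's averaging variant looks like: one pdf value per decoded sample, averaged over the L samples. This is my interpretation, not a verified implementation of the paper, and it reuses my dummy unit-variance assumption:

```python
import torch as T
import numpy as np
import scipy.stats as sps

def recon_prob_avg(model, xi, n_samples=10):
  # variant: one pdf value per decoded sample, then average
  # assumes model.eval() has been set; CPU tensors
  with T.no_grad():
    (u, logvar) = model.encode(xi)
  u = u.numpy()
  v = np.exp(logvar.numpy())  # log-variance to variance

  samples = sps.multivariate_normal.rvs(u, \
    np.diag(v), size=n_samples)
  samples = np.atleast_2d(samples)  # guard n_samples=1
  samples = T.tensor(samples, dtype=T.float32)

  probs = []
  for s in samples:
    with T.no_grad():
      x_hat = model.decode(s)  # stand-in for the mean
    p = sps.multivariate_normal.pdf(xi.numpy(), \
      x_hat.numpy(), np.eye(len(x_hat)))  # dummy covariance
    probs.append(p)
  return np.mean(probs)
```

Averaging the pdf values rather than the decoded means keeps the per-sample variation visible to the final score, which may matter when the decoder outputs differ a lot from sample to sample.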

I coded up a demo where I use 240 dummy Employee data items. Each data item has a sex, age, city, income, and job-type. My VAE reconstruction probability implementation appears to be working, but there are many hours of exploration ahead before I’ll be ready to say I have an implementation that’s ready for posting. I’m looking forward to traveling the journey of investigation.



Three 1960s United Airlines travel posters by artist Stan Galli (1912-2009). I like his style a lot. These posters emit a sense of joy with probability = 1.

This entry was posted in Machine Learning, PyTorch.

5 Responses to Anomaly Detection Using Variational Autoencoder Reconstruction Probability

  1. diablo says:

    I think the authors of this paper intend for us to calculate the sigma2 (covariance) matrix directly from the reconstructed x.
    Here is my assumption:
    Draw L samples and reconstruct X, then get the m2 (mean) matrix by taking the average over the reconstructed X. Now that you have the m2 (mean) matrix, you can calculate the sigma2 (covariance) matrix in a few steps:
    1. Subtract the m2 (mean) matrix from the source input x to get the (x - m2) difference matrix.
    2. Get the sigma2 (covariance) matrix by multiplying the transpose of the (x - m2) difference matrix by itself (dot product).
    Now that you have the m2 (mean) matrix and the sigma2 (covariance) matrix, you can calculate the probability of the source input x using the average of the reconstructed output x and the calculated sigma2 (covariance) matrix.

    • I think you’re right; however, when I computed the covariance matrix from L samples, I kept getting errors saying the matrix wasn’t positive semi-definite.

      • diabloardo says:

        Sometimes you may need to interchange the order of the dot product, e.g., multiply the difference matrix by the transpose of the difference matrix, then divide by the number of samples, to make sure the dimension of the resulting covariance matrix is n x n (where n is the dimension of the source input x).

      • Tridib dutta says:

        That is not surprising. There is no guarantee that the matrix you calculate from x - m2 will always be positive-definite.

        I think there is a problem with the approach in the paper. I had the same goal in mind: to use the corresponding probability as a score rather than the reconstruction error. While reading the paper, one problem I noticed (along with what you already mentioned in your blog) is that it assumes the decoder p(x|z) sigma is 1. If you assume that, the reconstruction error becomes plain old MSE. But why should the sigma always be 1? In order to avoid that difficulty, you have to “learn” the sigma as well. The clue is in the encoder implementation, I think. Having mu2 and sigma2 allows construction of a (multivariate) Gaussian distribution. Then you have to calculate the probability of x given z_i, where z_i is a sample drawn in the latent space. This probability calculation is not as simple as it seems in the paper; you have to use a chi-square distribution to calculate it. I have glossed over some of the details that are needed, but you get the idea.

        I think the best way to implement this is to use the powerful TF-Probability module. It has all the functions and distributions you need to implement it very effectively. I have done that myself for a project I am working on. I did not find it easy in any way. I am still working out the details.

  2. diabloardo says:

    Yes, and btw, I forgot to point out that, as a last step, the covariance matrix should be scaled by dividing by the number of samples drawn.
