Training scVI — Summarizing posterior predictive distributions

How good is a trained scVI model? The objective when training a model is to minimize the negative evidence lower bound (ELBO). This objective consists of two parts: the KL divergence between the approximate posterior of \( Z \) and the prior of \( Z \), \( \text{KL}(Q(Z) || P(Z)) \), and the reconstruction error. The reconstruction error is the negative log likelihood of the observed data given the fitted model, \( -\log P(Y | Z) \).
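
In symbols, the per-cell objective described above can be written as

$$ \text{loss}(y_n) = \text{KL}(Q(z_n) \ || \ P(z_n)) - \log P(y_n \ | \ z_n), $$

where the second term is the reconstruction error for cell \( n \).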

The single reconstruction error number for a model is obtained by summing the log likelihoods over all genes in each cell, then averaging those per-cell sums across all cells (and negating the result):

$$ \texttt{reconstruction_error} = - \frac{1}{N} \sum_{n = 1}^N \sum_{g = 1}^G \log P(y_{n, g} \ | \ z_n). $$

This final reconstruction error value might be, e.g., 3,783 for a fitted model. It is hard to understand what this number implies in practice for interpreting and using the model.

In a previous post we looked at how monitoring the posterior predictive distribution of a tiny slice of data at different values of the ELBO (or reconstruction error) could help build intuition about the general performance of the model.

Can we build upon this idea to get an alternative performance quantification for the model that is more intuitive?

The likelihood reflects how often the observed value is sampled from the model when generating posterior samples. If the model is performing well, samples of UMI counts from the posterior will often coincide with the observed UMI count for a cell-gene pair. The posterior probability distribution of the molecule counts can be viewed as a mass, and we can consider the ranges where the majority of this mass is located. If, e.g., 90% of the mass falls within an interval from 200 to 100,000 molecules, that interval is referred to as the 90% confidence interval of the distribution. In the example below, the observed molecule count falls within this confidence interval.

On the other hand, in cases where the model is performing poorly, the observed value is rarely generated through sampling. In the example below, the observed count is outside the 90% confidence interval.

These two cases give us a way of summarizing model performance in an intuitive way: out of all the cell-gene pairs in the data, for what fraction does the observed count fall outside the 90% confidence interval?

This can be estimated by sampling counts from the posterior 200 times for each cell-gene pair, then checking whether the observed molecule count falls between the 5% quantile and the 95% quantile of those posterior samples.
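
A minimal sketch of this check with scvi-tools, assuming `model` is a trained `scvi.model.SCVI` and `adata` is the AnnData it was set up with (raw UMI counts in `adata.X`); the chunking over cells is only there to keep the dense array of samples at a manageable size:

```python
import numpy as np

def fraction_within_interval(model, adata, n_samples=200, lower=0.05, upper=0.95,
                             chunk_size=32):
    """Fraction of cell-gene pairs whose observed UMI count falls between the
    `lower` and `upper` quantiles of the posterior predictive samples."""
    n_inside, n_total = 0, 0
    for start in range(0, adata.n_obs, chunk_size):
        idx = np.arange(start, min(start + chunk_size, adata.n_obs))

        # Posterior predictive samples, shape (cells, genes, n_samples).
        # Depending on the scvi-tools version this is a dense or sparse array.
        samples = model.posterior_predictive_sample(adata, indices=idx, n_samples=n_samples)
        if hasattr(samples, "todense"):
            samples = samples.todense()
        samples = np.asarray(samples)

        lo = np.quantile(samples, lower, axis=-1)
        hi = np.quantile(samples, upper, axis=-1)

        observed = adata.X[idx]
        if hasattr(observed, "toarray"):
            observed = observed.toarray()
        observed = np.asarray(observed)

        inside = (observed >= lo) & (observed <= hi)
        n_inside += inside.sum()
        n_total += inside.size
    return n_inside / n_total
```

With 200 samples per cell-gene pair, the 5% and 95% quantiles correspond to roughly the 10th smallest and 10th largest sampled counts.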

Here we are using a dataset from (Garrido-Trigo et al. 2023), in which the authors used single-cell RNA-sequencing of colon tissue from multiple donors to investigate differences between healthy donors and donors with ulcerative colitis or Crohn's disease. We use a random subset of 10,000 of the total 60,952 cells from these donors, with measurements of 33,538 genes.

As an experiment to learn about the relation between the reconstruction error and the fraction of observed counts falling within the 90% confidence intervals, this check can be performed after each epoch of training a model.
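
One way to sketch that experiment, reusing the `fraction_within_interval` helper above and ignoring any batch or covariate setup the real analysis might use. Calling `model.train()` repeatedly keeps the learned weights but resets the optimizer between calls, so this only approximates one continuous training run (a training callback would be the cleaner route):

```python
import scvi

scvi.model.SCVI.setup_anndata(adata)  # raw UMI counts in adata.X
model = scvi.model.SCVI(adata)

history = []
for epoch in range(40):
    model.train(max_epochs=1)
    # Mean negative log likelihood over cells; the dict key may vary
    # across scvi-tools versions.
    recon = model.get_reconstruction_error()["reconstruction_loss"]
    coverage = fraction_within_interval(model, adata)
    history.append((epoch + 1, recon, coverage))
```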

With this quantification we can see that the fraction of observed counts within the 90% confidence intervals increases very slowly after around 20 epochs. The decrease in reconstruction error after 20 epochs appears more dramatic than the increase in the fraction within the 90% CI.
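
For completeness, a small sketch of plotting the two curves from the `history` list above on separate axes:

```python
import matplotlib.pyplot as plt

epochs, recon, coverage = zip(*history)

fig, ax1 = plt.subplots()
ax1.plot(epochs, recon, color="C0")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Reconstruction error", color="C0")

ax2 = ax1.twinx()  # second y-axis for the coverage fraction
ax2.plot(epochs, coverage, color="C1")
ax2.set_ylabel("Fraction within 90% CI", color="C1")

fig.tight_layout()
```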

In this case, after training for 40 epochs, 99.5% of observations from cell-gene pairs are contained within the 90% confidence interval. What kinds of practical implications would this have? As one example, if a cluster of 1,000 cells all have high expression of a certain gene according to scVI, the model might be wrong about this expression level for 5 of those 1,000 cells.
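
The arithmetic behind that example:

$$ 1{,}000 \times (1 - 0.995) = 5. $$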

One aspect of the posterior predictive distributions that is not captured by this quantification is the breadth of the confidence intervals. The likelihood value itself, on the other hand, does reflect this: if the confidence interval is narrow and the model is accurate in the sense that the observed count falls within the interval, the observed count will be sampled even more often than if the interval were wide. I can't think of an easy way to summarize the widths of the confidence intervals that isn't as abstract as the likelihood itself.

Notebooks to reproduce this analysis are available on GitHub.

References

Garrido-Trigo, Alba, Ana M. Corraliza, Marisol Veny, Isabella Dotti, Elisa Melón-Ardanaz, Aina Rill, Helena L. Crowell, et al. 2023. “Macrophage and Neutrophil Heterogeneity at Single-Cell Spatial Resolution in Human Inflammatory Bowel Disease.” Nature Communications 14 (1): 4506. https://doi.org/10.1038/s41467-023-40156-6.