Interpreting scVI - adjusted expression levels

With single cell RNA-seq measurements, you observe molecule counts spread over genes of origin from a sample of the transcripts in individual cells. The small number of sampled molecules in relation to the number of genes used by cells to produce transcripts leads to sparse and discontinuous data that requires statistical analysis to properly interpret.

One statistical tool to explore and interpret this count data is the scVI model. As a reminder, the model can be written as $$ \begin{align*} z_n &\sim \text{N}(\mathbf{0}, I), \\ \rho_n &= f(z_n, s_n), \\ y_n &\sim \text{NB}(\rho_n \cdot \ell_n, \phi). \end{align*}$$

In this model, \( \rho_{n, g} \) represents a relative expression level of gene \( g \) in cell \( n \). After fitting a model, we get access to an estimate of \( \mathbb{E}[\rho_{n, g} \ | \ Y] \), which we can refer to as \( \hat{\rho}_{n, g} \). These estimates of expression levels are inferred using all observed counts of all genes in all cells, where the function \( f \) encodes various co-expression relationships between genes across cells based on the available evidence (if the model has fitted successfully). In scvi-tools models, these estimates are accessible through the .get_normalized_expression() method.

As an example, we can apply the scVI model to a dataset generated by (Garrido-Trigo et al. 2023). In this study, the authors made scRNA-seq measurements of 33,538 genes in 60,952 cells from colon biopsies of 18 donors. Six of the donors were healthy (HC-1 through HC-6), six of the donors had Crohn’s disease (CD-1 through CD-6), and six of the donors had ulcerative colitis (UC-1 through UC-6).

We fit the model, supplying the 18 donor identities as ‘batches’ (\( s_n \) in the model).

With the fitted model, we can generate the estimates for any genes we are interested in. In this case, we will look at the genes MS4A1 and TGFB1.

The gene MS4A1 (membrane-spanning 4A 1) encodes the protein CD20. Like all CD (cluster of differentiation) proteins, CD20 is a surface protein. Proteins are designated CD numbers at the ‘Human Leukocyte Differentiation Antigen Workshops’ if antibodies against the proteins can be used to isolate functionally distinct white blood cell populations. The protein encoded by MS4A1 was given the number CD20 in the 1984 workshop because it can be used to isolate B cells (Bernard and Boumsell 1984).

TGFB1 encodes the protein TGF-β1 (transforming growth factor β 1) which is a cytokine; a protein that is released from cells in order to communicate with other nearby cells, dictating them to change their current functions. TGF-β1 is released by most immune cells, including B cells.

In the above plot, the x-axis displays \( \hat{\rho}_{n, \text{MS4A1}} \) and the y-axis displays \( \hat{\rho}_{n, \text{TGFB1}} \) for all the 60,952 cells, colored by health status of the biopsy donor. As is apparent from the plot, there is a group of cells with particularly high expression of MS4A1 compared to all the other cells. These are likely B cells. The expression levels of TGFB1 appear to vary between disease states of the donors. To dig into this, we can instead plot a small multiple plot of the expression levels, where cells from all 18 of the donors are split out into individual facets.

In this plot, the distribution of all cells (regardless of donor) are displayed in the background as light gray points. This is a technique to make it easier to notice which subsets of a dataset correspond to certain regions of the range of the full dataset without having to look too closely at the axis labels.

From the plot we can learn that the variation in TGFB1 expression levels is not clearly related to health status of the donors, but rather related to independent donor-to-donor variation.

Suppose we want to classify cells into two levels: whether a cell is an immune cell, and then whether an immune cell is a B cell. Since we know that most immune cells secrete TGF-β1, we can argue that immune cells will express higher levels of TGFB1 than other cells. The expression levels of TGFB1 vary between donors, but a consistent pattern emerges: a dense group of cells with pronounced TGFB1 expression exists within each donor. But to decide whether a cell is an immune cell, we don’t want to set a different threshold for TGFB1 expression for each donor individually.

To solve this problem we can use the statistical technique of ‘adjusting’ the expression levels for donor identity. Another popular term for this technique is ‘regressing out’ donor identity, or the more descriptive term ‘keeping donor identity constant’. Another popular term for the same idea is ‘accounting for’ donor identity.

This technique is extremely common when using linear regression models for analysis. As a simple example, say we want to analyze how the weight of people depends on a choice between two different dietary choices. If you have some data on this, you would write a linear regression model as $$ \text{weight} = a \cdot [\text{diet}] + \epsilon. $$

However, you might worry that your dataset is biased in some way, where, for some reason, taller people choose one diet over the other. In that case you want to adjust for height in the model (or in other terms, ‘accounting for height’, ‘regressing out height’, or ‘keeping height constant’). To do this, you change the linear regression model to $$ \text{weight} = a \cdot [\text{diet}] + b \cdot [\text{height}] + \epsilon. $$

This way you can interpret the model and data in the unit of ‘height adjusted weight’ by fixing ‘height’ to a specific value while varying the value of ‘diet’. A common choice is to set \( \text{height} = 0 \), though, in the particular case of height it is a better choice to set it to the average height in the dataset, ensuring that predicted weight values remain within a realistic range.

The same ‘adjustment’ concept can be used with an scVI model. The estimated expression level \( \hat{\rho}_{n, g} \) produced through \( \rho_{n} = f(z_n, s_n) \), where \( s_n \) is the identity of the donor of the cell. To adjust the expression levels, so they are comparable between batches, we can fix \( s_n \) to a reference batch, e.g., donor HC-1. The choice of reference batch is arbitrary, and here we are choosing HC-1 just because it is the first healthy donor in the dataset. So we can produce adjusted estimated expression through $$ \hat{\rho}_{n, g \, | \, \text{HC-1}} = \mathbb{E}[ f(z_n, \text{HC-1} ) \ | \ Y ]. $$

In scvi-tools models, these adjusted expression levels can be obtained using the transform_batch argument in the .get_normalized_expression() method. In this particular case you would write .get_normalized_expression(transform_batch = ‘HC-1’).

Unlike in a linear regression model, the scVI model will adjust expression levels for different individual cells at different amounts, since the function \( f \) is a neural network which includes interactions between the cell representations \( z_n \) and the donor identities \( s_n \).

After producing the estimated adjusted expression levels, we can again plot the small multiples plot.

In this plot, the gray cells in background use the original estimated expression levels (from all donors regardless of facet), while the colored cells in the foreground display the adjusted estimated expression levels \( \hat{\rho}_{n, \text{MS4A1} \, | \, \text{HC-1}} \) and \( \hat{\rho}_{n, \text{TGFB1} \, | \, \text{HC-1}} \). On top of the cells, a small sample of cells have arrows between the estimated expression levels and the adjusted estimated expression levels, to illustrate how the scVI model adjusts expression for different individual cells.

On the scale of these adjusted estimated expression levels, the ranges of TGFB1 expression levels are comparable between the donors. This would allow us to isolate the dense region of cells with high TGFB1 expression which likely indicate immune cells.

A notebook for producing the figures in this plot is available on GitHub.

References

Bernard, A., and L. Boumsell. 1984. “[Human leukocyte differentiation antigens].” Presse medicale 13 (38): 2311–16. https://www.ncbi.nlm.nih.gov/pubmed/6239187.

Garrido-Trigo, Alba, Ana M. Corraliza, Marisol Veny, Isabella Dotti, Elisa Melón-Ardanaz, Aina Rill, Helena L. Crowell, et al. 2023. “Macrophage and Neutrophil Heterogeneity at Single-Cell Spatial Resolution in Human Inflammatory Bowel Disease.” Nature Communications 14 (1): 4506. https://doi.org/10.1038/s41467-023-40156-6.