Cell2fate infers RNA velocity modules to improve cell fate prediction

Main

The concept of ‘RNA velocity’, which involves inferring transcriptional dynamics from spliced and unspliced counts in single-cell RNA sequencing (scRNA-seq), has displayed notable potential^1,2,3,4. The first implementations of RNA velocity models^1,2,3 have undergone an evolution of conceptual and technical refinements, including improved parameter inference^5,6,7 as well as the use of numerical approaches^6,8,9,10 to solve the underlying differential equations. However, these existing refinements are bound to tradeoffs between either introducing coarse biophysical approximations^{1,2,3,5,7,11,12} or relying on extensive numerical approximations^6,8,9,10. Hence, the fundamental challenge remains to define a mathematically sound framework that allows for capturing realistic transcriptional dynamics while retaining computational and numerical tractability.

To address the aforementioned limitations, we present cell2fate, a fully Bayesian model of RNA velocity based on a more realistic biophysical model of complex transcription dynamics. Cell2fate employs linearization to decompose differential equations describing complex transcriptional patterns, such as transcriptional boosts, into tractable components that can be solved analytically. By doing so, the model is at the same time expressive, interpretable and computationally efficient. The approach to decompose the velocity problem into components also provides a connection between RNA velocity and dimensionality reduction using a biophysical solution.

We assess and benchmark cell2fate in the context of different real-world settings, demonstrating its ability to capture complex dynamics and weak dynamical signals in rare and mature cell types. Finally, we show how cell2fate can be combined with spatial transcriptomics, thereby connecting transcriptional dynamics to their spatial tissue environment.

Results

The cell2fate model

Cell2fate builds on established concepts for RNA velocity^1,2, employing a dynamical model to explain variation in spliced (s) and unspliced (u) read counts for individual genes and cells (Fig. 1a), which can be defined in two coupled ODEs:

$$frac{d{u}_{g}}{{dt}}={alpha }_{g}(t)-{beta }_{g}{u}_{g}$$

(1)

$$frac{d{s}_{g}}{{dt}}={beta }_{g}{u}_{g}-{gamma }_{g}{s}_{g}.$$

(2)

Here, α, β, γ denote the transcription, splicing and degradation rates for different genes g. Solving the ODEs for u and s and fitting the equations to observed counts allows estimation of the unknown rate parameters, which can then be substituted into equation (2) to obtain the rate of change in spliced counts in each cell, (frac{{ds}}{{dt}}), which in turn gives rise to what is commonly referred to as ‘RNA velocity’ (ref. ¹). An important challenge for RNA velocity models is that transcription rates as a function of differentiation time (({alpha }_{g}(t))) can be complex and nonlinear, reflecting the implicit dependency on active transcription factor (TF) abundance in the nucleus¹³, yet the integral of the transcription rate function ({alpha }_{g}(t)) needs to remain tractable to allow the ODEs to be solved efficiently. As a consequence, existing methods either assume simplified stepwise functions for α_g(t) (refs. ^1,2,5,7) or they resort to numerical approximations to solve the ODE^6,8,9,10 (Supplementary Notes and Supplementary Fig. 3).

In cell2fate, we use an expansion of the derivative of the transcription rate in terms of individually integrable basis functions, which we refer to as linearization in the following:

$$frac{d{alpha }_{g}}{{dt}}=mathop{sum }limits_{m=1}^{M}frac{d{alpha }_{{mg}}}{{dt}}=mathop{sum }limits_{m=1}^{M}{lambda }_{{mi}}({hat{alpha }}_{{mgi}}-{alpha }_{{mg}}).$$

(3)

We term each basis function a module, denoted by a subscript m. Subscript i denotes the state of a module (ON or OFF). ({hat{alpha }}_{{mgi}}) is the target transcription rate of a module, which takes on nonzero values for all genes, when the module is in the ON state (Fig. 1b, bottom right). ({lambda }_{{mi}}) is the rate at which the target transcription rate is reached, and the state i depends on switch times T_m,ON/T_m,OFF on a cell-specific timescale T_c (Fig. 1b, top). This choice of basis functions allows for each individual ODE as well as their total sum to be solved analytically (Methods and Supplementary Notes). The parameters T_m,ON/T_m,OFF, ({lambda }_{{mi}}) and T_c are shared across all genes, which vastly reduces the number of parameters that need to be estimated compared to existing models yet still provides gene-specific transcription rates ({alpha }_{{mg}}).

In addition to being appealing for computational reasons, the linearization also provides a biophysical connection between RNA velocity and statistical dimensionality reduction. This link becomes apparent when casting the linearization as a mixed membership model, in which transcription rates, RNA velocity and spliced and unspliced counts of each gene are governed by a linear combination of M modules (Methods and Fig. 1b). The mixing coefficients can then be interpreted analogously to gene loadings of factor analysis or principal-component analysis. Mechanistically, modules can be interpreted as approximating the transcription rate changes induced by all active regulatory proteins as a small set of independent effects that are each valid within time windows defined by T_m,ON/T_m,OFF.

Cell2fate is a fully Bayesian model that is fit to raw cell-level counts as input, and it includes a series of refinements to account for technical sources of variation, including overdispersion, variation in detection sensitivity of spliced and unspliced RNA molecules, ambient RNA and known batches (Methods). A hierarchical prior structure achieves efficient regularization of model parameters while sharing evidence strength across genes (Methods). The model is implemented in the probabilistic programming language Pyro¹⁴ and fit using Pyro’s stochastic variational inference framework using a customized variational distribution structure to account for dependencies between model parameters (Methods). The software implementation comes with guidelines and heuristics to determine hyperparameters such as the number of modules (Methods), and it builds on scvi-tools¹⁵ to facilitate its integration in existing workflows.

Improved predictions of cell fates and transcription rates

To benchmark the performance of cell2fate, we compared the model to existing RNA velocity methods by assessing the consistency of estimated cell fate trajectories with prior knowledge. Briefly, we considered the cross-boundary directional correctness (CBDir) metric to benchmark alternative models, thereby scoring the consistency of transition probabilities at the boundary between cell clusters with known transitions³ (Methods).

We considered ten RNA velocity methods spanning different model classes and approaches for parameter inference (Table 1 and Methods). All methods were applied to five scRNA-seq datasets, including widely used benchmark datasets such as the developing mouse dentate gyrus¹⁶ and pancreas¹⁷ (Supplementary Fig. 1). To test the ability of different methods to resolve complex transcriptional dynamics, we also examined mouse erythroid maturation¹⁸ and human bone marrow¹⁹: two datasets that feature multiple transcriptional boosts across cellular trajectories¹². Finally, we considered a mouse bone marrow dataset with markedly low unique molecular identifier (UMI) counts¹, thereby assessing the extent to which these models cope with low-coverage data.

Table 1 Summary of RNA velocity model versions and parameter settings used in benchmark

Full size table

On average, across all five datasets, cell2fate achieved the best CBDir scores. More importantly, cell2fate inferred the correct directionality of cell fate transitions in all datasets, whereas all other methods, with the notable exception of pyroVelocity_model2, inferred reverse-order dynamics in at least one benchmark setting (corresponding to negative CBDir values). Inspecting the benchmarking results, we could attribute the performance of cell2fate to overcoming two major challenges as elaborated below.

First, cell2fate provided sufficient statistical power to identify correct velocity flows from subtle transcriptional dynamics. For example, in the mouse dentate gyrus dataset, other methods failed to resolve the late maturation trajectory of granule neurons, thus incorrectly inferring that mature cells transitioned into their immature counterparts (Fig. 2b, blue inset boxes and Supplementary Fig. 2). The same result was observed when applying CellRank²⁰ to the velocity estimates instead of uniform manifold approximation and projection (UMAP), indicating that only cell2fate could resolve the trajectory toward mature granule neurons (Supplementary Figs. 3 and 4).

**Fig. 2: Enhanced performance of cell2fate in benchmark of ten methods across five datasets.**

Second, cell2fate correctly reconstructs complex transcriptional dynamics. In the mouse erythroid maturation datasets, the model resolved the correct cell trajectories, whereas other models tended to perform poorly (Fig. 2c and Supplementary Fig. 5). Previous analysis¹⁸, based on the visual inspection of spliced and unspliced counts across manually annotated cell clusters, has provided evidence that mouse erythroid lineage formation features many ‘multi-rate kinetic’ genes such as Hba-x and Nudt4 that display coordinated changes in transcription rates across the cell maturation trajectory¹⁸. Consistently, cell2fate recapitulated the stepwise transcriptional rate boosts in these multi-rate kinetic genes¹⁸ (Fig. 2d, turquoise line). By contrast, other methods, such as pyroVelocity_model2, can only predict a single nonzero transcription rate, due to their simpler underlying dynamical model (Fig. 2d, green line), results that we again confirmed using CellRank (Supplementary Figs. 6 and 7).

As an alternative metric to CBDir, we also assessed concordance of the estimated time differences between clusters with prior knowledge (Supplementary Table 8), and we compared time outputs to known developmental ages of different mouse samples (Supplementary Figs. 8 and 9); these alternative metrics and benchmarks confirmed the performance benefits of cell2fate. Notably, granule neuron subclusters with higher cell2fate estimated time contained more cells from the later developmental ages (Supplementary Fig. 8b), validating cell2fate estimates. Alternative methods as well as standard diffusion pseudotime analysis did not resolve this granule neuron maturation trajectory (Supplementary Figs. 8c and 10). Furthermore, cell2fate was robust to changes in the assumptions on prior distributions (Supplementary Fig. 11), and we confirmed the ability of the model to estimate ground truth dynamical rates when applying the model to semisynthetic data (Supplementary Fig. 12). Finally, we note that, while cell2fate has higher computational requirements than some existing methods, the computational requirements of the model are aligned with alternatives so that cell2fate can be readily applied to larger datasets (Supplementary Tables 6 and 7 and Supplementary Fig. 13).

We note that cell2fate’s cell-specific timescale^3,6,7 (Fig. 1a,b) provides two additional use cases that can help to gain deeper insights. First, it aids the identification of cell lineage progression and distinct cell lineages. For example, in the mouse dentate gyrus dataset, granule neurons and astrocytes were assigned markedly disconnected time points, with oligodendrocytes occupying a mid-time point range (Fig. 2e, left), consistent with the distinct lineage origins of these three cell types¹⁶. By contrast, in the mouse erythroid maturation dataset, a single lineage with a single connected time range is identified (Fig. 2e, right). Second, inspecting the Bayesian posterior^5,6,7 of the cell-specific time provides a principled measure of confidence in the RNA velocity values across and within datasets. In both datasets mentioned above, the coefficient of variation (CV) of the posterior distribution of individual cell times was consistently estimated to be close to zero, indicating low uncertainty (Fig. 2e, bottom). By contrast, cell2fate applied to a steady-state dataset of peripheral blood mononuclear cells²¹, in which no consistent transcriptional dynamics are expected¹², results in confidence estimates with a CV close to 1, indicating high uncertainty (Fig. 2f). The posterior uncertainties can also be used to estimate confidence levels in individual transitions. To do so, cell2fate implements a heuristic confidence score based on the fraction of posterior samples from the cell-specific time in one cluster that are higher than the 90th percentile of samples from another cluster. This heuristic correlates well with the CBDir score of the corresponding transitions (ρ = 0.56), indicating that it is well calibrated (Supplementary Fig. 14). Hence, the posterior of cell-specific time can serve as quality control to assess whether cell2fate identifies meaningful dynamics in a given dataset.

In sum, our benchmark demonstrates cell2fate’s enhanced statistical power to estimate cell trajectories and resolve complex transcriptional dynamics and the ability to quantify uncertainty in velocity estimates.

RNA velocity modules map fine stages of late cell maturation

Cell2fate modules are sequentially activated gene expression programs over time. Given their biophysical foundations in transcriptional kinetics, we expect that RNA velocity modules can provide a more granular characterization of dynamic processes during cellular differentiation than conventional dimensionality reduction techniques that lack a mechanistic basis, such as matrix factorization or clustering. In addition, cell2fate comes with a suite of downstream analysis and visualization tools, enabling users to explore dynamic processes and derive biological insights.

To demonstrate the cell2fate toolkit, we considered the mouse brain single-cell dataset included as part of the benchmarking study (Fig. 2b), profiling the dentate gyrus region in the hippocampus across two developmental stages¹⁶. In addition to early differentiation of neurons and astrocytes from neural progenitors, this dataset covers the late maturation trajectory of granule neurons (that is, the late differentiation after the immature neuron stage), a critical process that is however not well understood, and, more generally, it is unknown whether this late maturation process unfolds across successive transcriptional stages. Previous RNA velocity methods applied to this dataset^2,3 were able to distinguish neuronal versus astrocyte lineage trajectories; however, the correct trajectory for the most mature granule neurons could not be resolved (Fig. 2b).

Cell2fate applied to this dataset revealed 16 distinct RNA velocity modules (Supplementary Fig. 15), capturing all the expected cell trajectories, with the dominant lineage corresponding to granule neuron differentiation and maturation stemming from neural intermediate progenitor cells (nIPCs), neuroblasts and immature neurons. Radial glial-like progenitor cells are largely committed to astrocytes (Fig. 3a), as evident both from cell2fate’s time estimates and CellRank fate probabilities (Supplementary Fig. 16). We also observed that mossy cells, another neuronal population in the dentate gyrus, were assigned to the middle stages of the granule neuron trajectory. While mossy cells are thought to have different lineage origins, their transcriptional development is highly similar to that of granule neurons²².

**Fig. 3: Module decomposition of cell2fate resolves final stages of granule neuron maturation.**

To explore the dynamics of neuronal differentiation in greater depth, we used the fitted cell2fate model to estimate the total spliced transcript abundance for each of the nine granule neuron lineage modules in individual cells across the inferred time (Fig. 3d, top). This analysis identified the successive induction of modules across the early differentiation of radial glia into nIPCs, neuroblasts and immature neurons (modules 1–3). Strikingly, cell2fate also recovered dynamics in mature granule neurons, explained by six modules (modules 4–9) that are sequentially activated and temporally overlap across mature granule neurons, thereby finely dissecting the late maturation of these cells into distinct transcriptional windows (Fig. 3d). The model also correctly identified a temporal gap between immature and mature granule neurons (Fig. 3d), which is consistent with prior expectations¹⁶. The cell2fate visualization tool complements t-distributed stochastic neighbor embedding or UMAP by providing dynamic insights anchored on estimated differentiation time, and it can also visualize additional metadata such as cell type annotations or developmental age (Fig. 3d, bottom two panels).

The total spliced count estimates can also be used to visualize the dynamics of RNA velocity modules across cells, for example, on a conventional UMAP plot (Fig. 3b). The activation of different modules per cell can be inspected, similar to the factor activity in conventional matrix factorization. We also compared these module activation estimates to conventional factor analysis and clustering methods. Briefly, multi-omics factor analysis (MOFA)²³ yielded factors that captured complementary sources of variation, with activity profiles that were temporally more diffuse across the differentiation trajectory. Specifically, these factors did not stratify late granule neuron maturation (Fig. 3b and Supplementary Fig. 17). We also observed overall low correlation between cell2fate module and MOFA factor gene loadings, particularly for the late neuronal maturation modules 4–9 (Supplementary Fig. 18). Similarly, Leiden clustering of the scRNA-seq dataset at different resolutions identified clusters that were not aligned with neuronal maturation (Fig. 3b and Supplementary Fig. 19). Collectively, these observations indicate that cell2fate captures complementary aspects of variation compared to existing decomposition methods and is well suited to conduct granular dissection of granule neuron maturation.

Beyond the activity of modules, the dynamics can be further classified into states within each cell, based on whether they are increasing or decreasing in expression (Fig. 3c and Methods). Both quantities are shown in Fig. 3e,f for granule neuron differentiation. The additional dynamic information in this visualization shows, for example, that module 9 has not reached a steady state, implying that granule neuron maturation continues beyond the time range captured in this dataset.

Finally, we examined to what extent RNA velocity modules can provide deeper insights into the late stages of granule neuron differentiation. We ranked genes by how much of their transcription rate is explained by each module and used top genes as ‘module markers’ (Fig. 3g and Supplementary Fig. 20). We identified Rbm24, Tafa1 (Fam19a1) and Sptb (module 4) and Pakap (Palm2), Pdzd2 and Usp19 (module 5) as markers switched on in immature granule neurons (Fig. 3g and Supplementary Fig. 20). By contrast, Fst, Rapgef5 and Moxd1 (module 6), Prr16, Kdm5d and Rpa3 (module 7) and Rgs2, Nudt13 and 1700048O20Rik (module 8) provide markers of late granule neuron maturation stages (Fig. 3g). Fam19a1 has been reported to suppress neural stem cell maintenance and promote differentiation^24,25, consistent with its expression pattern in maturing granule neurons. Rgs2 is dynamically expressed during neuronal activity²⁶ and involved in synaptic plasticity²⁷, consistent with its late induction in module 8. Apart from these two genes, the marker genes reported here have not been functionally investigated in granule neurons or brain development to our knowledge. We further intersected the top 30 markers of each module with a set of 30 Alzheimer’s risk genes²⁸ and 250 autism spectrum disorder risk genes^29,30, which highlighted late granule neuron maturation module 8, where the Alzheimer’s associated gene Trappc6a and the autism-associated genes Kdm6a and Nr4a2 were among the markers (Supplementary Table 9). These results highlight that cell2fate can lead to new biological insights, even in a widely characterized cell lineage like granule neurons.

Additionally, we can extract top module marker genes that are TFs as ‘module TFs’ (Fig. 3h). Moving on to such module TFs, Zmat4 (module 6) and Tfam (module 8) are enriched in late granule neuron maturation stages (Fig. 3h). Zmat4 has been reported as upregulated in the auditory cortex of young mice at postnatal day 7 compared to adults³¹, while Tfam knockout results in immature neuronal phenotypes³². Yet their roles in granule neuron differentiation have not been studied to date. We also find that genes with putative promoter sequences that are most likely to be bound by the top 20 module TFs, as predicted by the ProBound algorithm³³, are more frequently among the top 300 module genes than those least likely to be bound by those TFs (Supplementary Fig. 21). These TFs provide putative candidate regulators of late granule neuron differentiation.

In sum, our results demonstrate the great interpretability and statistical power of cell2fate’s module decomposition for scRNA-seq datasets to finely dissect cellular processes and suggest that late granule neuron maturation is composed of distinct stages.

Spatial mapping of RNA velocity modules

Temporal biological processes are often spatially organized in tissues. For example, cell differentiation and migration are often coupled, with cells associating with distinct spatial signaling microenvironments throughout their differentiation trajectories. Here, we sought to link the temporal information captured by cell2fate to spatial tissue organization by mapping RNA velocity modules in a newly generated spatial transcriptomics dataset of human brain development (Fig. 4a).

**Fig. 4: Cell2fate interfaces with cell2location to spatially map the cortical neurogenesis process in human brain development.**

We focused on the fetal human cerebral cortex in which excitatory neuron maturation follows a highly stereotyped trajectory through space and time³⁴. Neural progenitors termed radial glia and intermediate progenitors reside in cortical germinal zones, where they sequentially give rise to distinct neuronal subtypes that subsequently migrate out to the deep and upper layers of the cortical plate across their maturation (Fig. 4b). Deep layer-residing neurons (DLn) are born before upper-layer neurons (ULn) in early gestation; hence DLn are relatively more mature than ULn by midgestation (Fig. 4b). Thus, the maturation state and spatial location of cortical excitatory neurons are tightly linked.

To examine cellular differentiation trajectories in the human cortex, we initially performed single-nucleus RNA-sequencing (snRNA-seq) profiling (10x version 3.0) of one donor at midgestation. We then followed standard snRNA-seq processing workflows (Methods) to cluster cells and annotated cell types using markers from the literature³⁵. We annotated distinct neural progenitors (radial glial and intermediate progenitor cells) as well as excitatory neuron populations at different stages of maturation (Fig. 4c). As expected, mature neurons expressed DLn markers, whereas newborn and immature neurons showed enriched expression of ULn markers (Supplementary Fig. 22). We also annotated inhibitory neurons and glial cell types but excluded them from the subsequent excitatory neuron trajectory analysis.

We then applied cell2fate to this human brain snRNA-seq dataset and observed the expected excitatory neuronal differentiation trajectory from neural progenitors to newborn, immature and mature neurons (Fig. 4c). The RNA velocity modules dissected the neuronal trajectory into finer-grained maturation stages, identifying seven sequentially activated and temporally overlapping modules throughout immature and mature neurons (Fig. 4d and Supplementary Fig. 23). While these modules contained some DLn and ULn cell type markers, they also included many genes that are widely expressed across all excitatory neurons in the adult human cortex³⁶, such as PSMC3, KRR1 and BMPER (Supplementary Figs. 24 and 25). This suggests that the modules partially identify a neuronal maturation trajectory common to both DLn and ULn.

In contrast to cell2fate, other RNA velocity methods such as scVelo were not able to accurately identify velocity flow in mature neurons (Supplementary Fig. 26). Additionally, the integrated measurement model of cell2fate allowed us to factor in different detection probabilities for spliced and unspliced counts and correct batch effects in our human brain snRNA-seq dataset (Supplementary Fig. 27), which is crucial for estimating true transcriptional dynamics from observed counts (Methods).

To spatially map our RNA velocity modules in the developing human cortex, we performed Visium spatial RNA-seq profiling (10x CytAssist) of one cortical tissue section from an age-matched donor (Fig. 4b). As the Visium assay offers coarse spatial resolution and profiles multiple cells at each tissue location (that is, Visium spot), we used the cell2location algorithm³⁷ to deconvolve the abundance of RNA velocity modules across spatial data. We used the steady-state expression counts of each module as reference gene expression signatures and then applied the standard cell2location workflow to infer the abundance of each module signature across Visium spots (Fig. 4a and the Methods).

The RNA velocity modules showed expected patterns of spatial mapping across the human cortex (Fig. 4e–f and Supplementary Fig. 28). Progenitor modules spatially mapped to germinal zones (Fig. 4e), while neuronal modules primarily mapped to the cortical plate (Fig. 4e and Supplementary Fig. 28). The fine spatial locations of neuronal modules were consistent with their maturation state (Fig. 4f). The immature ULn module (0) mapped to the upper cortical layers as well as the subplate–intermediate zone that immature neurons pass through during their migration to the cortical plate³⁸. The early mature ULn module (2) was exclusively mapped to the upper cortical layers. By contrast, the late mature DLn module (5) specifically mapped to deep cortical layers.

In sum, our approach provides a workflow to spatially resolve complex cell trajectories through tissues.

Discussion

Here, we present cell2fate, a Bayesian model of RNA velocity that is capable of inferring transcriptional dynamics in settings of complex changes or weak signals in rare and mature cell types. A core innovation of cell2fate is a formulation of the velocity problem that builds on linearization, which allows for solving a biophysically more accurate model using analytically tractable linearized components. Another benefit of this formulation is that these linear components can be inspected as interpretable RNA velocity modules. This provides for a direct biophysical connection between cell2fate and statistical dimensionality reduction methods. We illustrated this feature by characterizing late maturation trajectories in granule neurons that have been elusive with other methods. Furthermore, RNA velocity modules can be used to locate differentiation trajectories in spatial transcriptomics data. We exemplified this in the developing human brain where the RNA velocity modules of neuronal differentiation showed a high degree of spatial organization.

Despite cell2fate’s improved biophysical accuracy, the model still makes simplifying assumptions, such as a constant degradation rate and no stochastic bifurcations. However, the concepts proposed in cell2fate are general and give rise to several extensions that can further increase the biophysical accuracy of the model in the future without resorting to numerical approximations. This includes RNA velocity models with cell-specific splicing and degradation rates, stochastic rates at lineage-branching points and causal connections between transcription rates at different time points, equivalent to dynamic gene regulatory networks (Supplementary Notes). In the long term, dynamical models should also include the effects of cell–cell interactions, based on signaling molecules measured with spatial transcriptomics. An immediate step toward this goal would be combining RNA velocity module mapping with spatial cell–cell interaction tools, such as NCEM³⁹, which could identify putative interactions that drive specific steps of a differentiation process.

Methods

A complete description of the cell2fate model and a comparison with other methods can be found in the Supplementary Notes.

Generative model for cell2fate

Equations (1) to (3) give rise to these new RNA velocity equations that cell2fate is based on:

$$frac{d{alpha }_{g}}{{dt}}=mathop{sum }limits_{m=1}^{M}frac{d{alpha }_{{mg}}}{{dt}}=mathop{sum }limits_{m=1}^{M}{lambda }_{{mi}}({hat{alpha }}_{{mgi}}-{alpha }_{{mg}})$$

(4)

$$frac{d{u}_{g}}{{dt}}=mathop{sum }limits_{m=1}^{M}frac{d{u}_{{mg}}}{{dt}}=mathop{sum }limits_{m=1}^{M}left({alpha }_{{mg}}-{beta }_{g}{u}_{{mg}}right)$$

(5)

$$frac{d{s}_{g}}{{dt}}=mathop{sum }limits_{m=1}^{M}frac{d{s}_{{mg}}}{{dt}}=mathop{sum }limits_{m=1}^{M}({beta }_{g}{u}_{{mg}}-{gamma }_{g}{s}_{{mg}}).$$

(6)

Subscripts g and m correspond to genes and modules, respectively. ({hat{alpha }}_{{mgi}}) is the target transcription rate of each module that is 0 in the OFF state (i = OFF) and nonzero in the ON state (i = ON). The target transcription rate is reached at a rate ({lambda }_{{mi}}). α, β, γ denote the transcription, splicing and degradation rates, similar to other RNA velocity models. These RNA velocity equations can be solved analytically (Supplementary Notes 1.3 and 3.1) to give equations for the expected spliced counts ({s}_{{cg}}) and unspliced counts ({u}_{{cg}}) in each cell (Supplementary Note 1.3, equations (15) and (16)). We call these count values ‘biological’ expectation values, denoted by ({x}_{{cg,j}}^{rm{B}}=({u}_{{cg}},s_{cg})), because they represent the expected counts in the absence of measurement effects. The index j denotes maturity of RNA transcripts (spliced/unspliced). To account for measurement effects and obtain ‘measurement’ expectation values, ({x}_{{cg,j}}^{rm{M}}), we transform ({x}_{{cg,j}}^{rm{B}}) as follows:

$${x}_{{cg,j}}^{rm{M}}={l}_{{cg,j}}({H}_{{ce}}{s}_{{eg,j}}+x_{{cg,j}}^{rm{B}}).$$

(7)

Here, ({H}_{{ce}}) denotes a one-hot categorical assignment of cells to experimental batches, (,{l}_{{cg,j}}) describes differences in detection efficiency (for example, sequencing depth, read alignment and quantification) of genes across cells and ({s}_{{eg,j}}) models ambient RNA (‘soup’) for each gene in each batch. These ‘measurement’ expectation values are then used to parameterize the mean of a negative binomial (NB) observation model of the observed raw count values X_cgj = (U_cg, S_cg):

$${X}_{{cg,j}} sim {rm{NB}}left(mu ={x}_{{cg,j}}^{rm{M}},alpha ={a}_{{g,j}}right).$$

(8)

Here, a_gj are NB overdispersion parameters for each gene, separate for spliced and unspliced counts. All parameters have hierarchical prior distributions (Supplementary Note 1.6) and are inferred using stochastic variational inference in the Pyro probabilistic programming framework⁴⁰ (Supplementary Note 1.8).

Downstream analysis of cell2fate

Computation of the RNA velocity graph

We followed the procedure proposed in the scVelo package for computing a cell–cell transition probability graph from RNA velocity estimates², with the modification of averaging the velocity graph over 100 posterior velocity samples, so that noisy gene velocities with high posterior uncertainty have less weight in estimating transition probabilities. Optionally, the original procedure can also be followed exactly by computing the velocity graph with mean velocity estimates from our method, using the original function in the scVelo Python package.

Computation of module activation

We calculate the module activation, which we defined as the total spliced counts produced by a module in a cell, by substituting posterior parameter values into the equations that capture the time evolution of spliced counts. Module activation can then be plotted over time by plotting the posterior time of each cell on the x axis and the calculated module activation of each module on the y axis.

Calculation of normalized module activation and state

Module normalized activation denotes the number of counts produced by a module divided by the steady-state counts of the module (Fig. 3d, gray line), which is calculated by setting time to infinity in the equations that capture the time evolution of spliced counts. The module state is defined as OFF if the cell time is smaller than the switch-on time of the module or the module’s normalized activation is below 0.05 (Fig. 3d, gray) and as ON if its normalized activation is above 0.95 (Fig. 3d, bright green). Otherwise, a module is in either the induction or repression state depending on whether the inferred cell time is below or above the switch-off time (Fig. 3d, light green or orange).

Benchmarking of RNA velocity methods

Processing of datasets

We used the 3,000 most variable genes with at least 20 detected counts for all results in the paper. In addition, we applied the standard preprocessing pipeline suggested in the main scVelo tutorial (on mouse pancreas development), except for Pyro-Velocity and cell2fate, which do not require preprocessing. This included total count normalization, log transformation and calculating mean expression among the 30 nearest neighbors in 30-component PCA space (‘kNN-smoothing’).

Application of velocity models

We followed online tutorials based on the mouse pancreas development dataset for all methods and then used the same analysis pipeline to produce the benchmarking results in this paper. We summarize parameter settings for all methods in Table 1 and also include the method version when available. For cell2fate, we kept all training and model parameters at their default values.

Definition of ground truth cell state transitions

For the mouse bone marrow dataset, we defined the following ground truth cluster transitions: ‘dividing’ to ‘progenitors’ to ‘activating’. For the remaining datasets, we used the ground truth transitions from the UniTVelo RNA velocity study that also used these datasets for benchmarking⁴¹. All ground truth transitions are included in Supplementary Tables 1–5.

Benchmarking metric

We calculated the CBDir, using the functions provided by the UniTVelo Python package⁴¹. The following is an explanation of the metric from the UniTVelo publication: ‘CBDir measures the correctness of transitions from a source cluster to target cluster using boundary cells given ground truth. Here boundary of source cluster refers to cells in that cluster that are neighbors of the target cluster and vice versa. Boundary cells are used because they reflects the biological development in a short period of time and CBDir is calculated via

$$begin{array}{ccc}{rm{CBDir}}(c) & = & frac{1}{{rm{Norm}}}sum _{c{prime} epsilon {C}_{A}cap N(c)}frac{{v}_{c}cdot left({x}_{c{prime} }-{x}_{c}right)}{left|{v}_{c}right|cdot left|{x}_{c{prime} }-{x}_{c}right|}\ & & {rm{Norm}}=left|c{prime} epsilon {C}_{A}cap N(c)right|,end{array}$$

(9)

where ({C}_{A}) is sets of cells in target cluster A, N(c) stands for the neighboring cells of specified cell c. ({v}_{c}) and ({x}_{c}) are the low-dimensional vectors representing computed velocity and positions of cell c. Therefore, ({x}_{c{prime} }-{x}_{c}) is the displacement in space during the short period of time’.

CellRank analysis

For the dentate gyrus dataset, we ran CellRank with default parameters and computed fate probabilities for the astrocyte, oligodendrocyte and granule mature states, if the method identified them as macrostates. When using the default parameter n_states = (4, 12), VeloVI and VeloVAE returned an error: ‘ValueError: Clustering into 12 clusters will split complex conjugate eigenvalues. Request one cluster more or less’; so we set n_states = (4, 11), which ran successfully. For cellDancer and VeloVAE, we also obtained the following error: ‘[…] value(s) do not sum to 1 (rtol=1e-3). Try decreasing the tolerance as ‘tol = …’, specifying a preconditioner as ‘preconditioner = …’ or use a direct solver as ‘solver = ‘direct’’ if the matrix is small’. This error could not be resolved, even after following the steps in the error message and the advice found regarding relevant GitHub issues. For the erythroid maturation dataset, we also ran CellRank with default parameters and computed fate probabilities both toward the progenitor 1 state and the erythroid 3 state.

Simulation benchmark

First, we fit cell2fate to the dentate gyrus dataset. Next, we generated data from the cell2fate generative model, keeping all parameters at their fitted values except one chosen parameter that we multiplied with 0.25, 0.5, 1, 2 or 4. We investigated four biological and technical parameters in this way: the splicing rate, the degradation rate, the detection probability of transcripts and the NB overdispersion parameter. We then fit cell2fate to this simulated data and calculated the correlation of inferred splicing and degradation rate values to the known ground truth (as RNA velocity is a function of splicing and degradation rates).

Time and memory requirements

We measured the time requirements for all methods using the ‘time’ Python package. The start point for time measurements was placed right before data preprocessing began, and the end point was placed right after velocity values were obtained. We measured memory requirements by running either the ‘htop’ (for CPU-based methods) or the ‘nvidia-smi’ command (for GPU-based methods) every 10 s for up to 2 min and recording the highest value.

Comparison of decomposition methods on the dentate gyrus dataset

Application of cell2fate

Following the benchmarking run on dentate gyrus data, we applied the downstream analysis methods, described above in Downstream analysis, to calculate and plot module activations and states.

Application of multi-omics factor analysis

We used the 3,000 most variable genes with at least 20 detected counts as input, identical to the cell2fate analysis. We limited the analysis to clusters involved in the neuron differentiation trajectory (radial glia like, nIPC, neuroblast, granule immature, granule mature). We added spliced and unspliced counts and then normalized, log transformed and scaled the data matrix using the respective scanpy functions. We ran MOFA using a Gaussian likelihood and ten factors (the same number of factors found by cell2fate in the relevant clusters). Further run options were spikeslab weights = True, ard factors = True, ard weights = True.

ProBound algorithm

To produce Supplementary Fig. 29, we applied the ProBound algorithm³³. ProBound can predict the binding affinity for most TFs to a user-supplied DNA sequence. We used the refdata-cellranger-arc-mm10-2020-A-2.0.0 genome as a reference. To obtain putative promoter sequences for each gene in the reference, we extracted the DNA sequence 450 bp upstream and 149 bp downstream of the transcription start site of each gene. For each TF, we then ranked genes based on the predicted ProBound binding affinity to its putative transcription start site. We then used the top ten and bottom ten binding targets of each TF to produce the results illustrated in Supplementary Fig. 29.

Spatial integration of RNA velocity

Human tissue

Formalin-fixed paraffin-embedded (FFPE) blocks of second-trimester human fetal brain were obtained from the MRC- and Wellcome-funded Human Developmental Biology Resource (HDBR, http://www.hdbr.org), with appropriate maternal written consent and approval from the Fulham Research Ethics Committee (REC reference 18/LO/0822) and the Newcastle & North Tyneside 1 Research Ethics Committee (REC reference 18/NE/0290). The HDBR is regulated by the UK Human Tissue Authority (http://www.hta.gov.uk) and operates in accordance with the relevant Human Tissue Authority Codes of Practice.

Developing human brain single-nucleus library preparation and sequencing

Single nuclei were isolated from frozen fetal brain tissue according to a published protocol⁴². Briefly, tissue was Dounce homogenized in homogenization buffer (250 mM sucrose, 25 mM potassium chloride, 5 mM magnesium chloride, 10 mM Tris buffer, pH 8.0, 1 M 1,4-dithiothreitol, 0.1% Triton X-100, 1× protease inhibitor, 0.4 U l⁻¹ RNasin Plus RNase inhibitor, 0.2 U l⁻¹ SUPERase·In RNase inhibitor) and filtered with a 40-µm cell strainer. Debris was removed from the filtrate via density centrifugation with 27% Percoll. All nuclei in a batch were mixed in equal concentrations before droplet encapsulation with the 10x Chromium Single Cell 3′ version 3.1 kit. Libraries were generated according to the manufacturer’s protocol (CG000204) and single-index sequenced with cycles 28-8-91 on a NovaSeq 6000 System (Illumina) using a NovaSeq S4 flow cell.

Developing human brain Visium library preparation and sequencing

FFPE tissue sections (5 μm) were stained and imaged following the 10x Genomics Visium CytAssist user guide (CG000520). The following times were used for the fetal brain tissue: hematoxylin, 3 min; bluing, 1 min; eosin, 1 min. After probe hybridization and ligation, a Visium CytAssist instrument was used to transfer analytes from the glass slide to a Visium CytAssist Spatial Gene Expression slide with a capture area of 42.25 mm². Probe extension and library construction were carried out following the standard Visium for FFPE workflow (CG000495) outside of the instrument. Libraries were sequenced with paired-end dual indexing (28 cycles, read 1; ten cycles, i7; ten cycles, i5; 90 cycles, read 2) on the Illumina-HTP NovaSeq 6000 System with paired-end sequencing in an SP flow cell. The Loupe Browser was used to generate the JSON file, and the Space Ranger pipeline version 2022.0705.1 (10x Genomics) and the GRCh38-2020-A reference were used to process FASTQ files.

Developing human brain Visium and single-nucleus sequencing count quantification, clustering and annotation

We quantified counts using Space Ranger 1.3.0 for the Visium data and STARsolo with Cell Ranger 3.02 for the single-nucleus data using GRCh38 version 1.2.0 as a reference. We applied CellBender to the single-nucleus total count matrix (but not spliced and unspliced count matrices). We followed the scanpy processing and clustering tutorial with default parameters (min_genes = 200, min_cells = 3, n_genes_by_counts < 2,500, pct_counts mt < 5), which involves removing cells and genes with low UMI counts, followed by removal of cells with very high total counts or mitochondrial gene ratio, total count normalization, log transformation, highly variable gene selection, data scaling, principal-component analysis and finally Louvain clustering. Expression of cell type marker genes taken from the single-cell atlas (Fig. 1f of Polioudakis et al.³⁵) was plotted for each cluster, based on which we annotated the cluster identity. Clusters outside the excitatory neuron lineage (oligodendrocytes, interneurons, microglia) were not considered for further analysis.

Cell2location

We used the steady-state counts of each module as a reference gene expression profile. This steady-state expression corresponds to the ({g}_{{mg}}) parameter of the generative model. We then ran the cell2location method with default parameter settings (including α = 20).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.