SpliceVI MuData Object
SpliceVI takes a MuData object as input, with two modalities keyed "rna" and "splicing". This page describes the required structure of each modality and how the MuData is assembled from raw inputs.
Overview
MuData
├── mdata["rna"] AnnData (cells × genes) — gene expression
└── mdata["splicing"] AnnData (cells × junctions) — splicing
Both modalities share the same obs (cell barcodes / metadata) index. All cells must be present in both modalities.
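The shared-index requirement can be checked before assembly. The sketch below is illustrative only: `check_shared_cells` is a hypothetical helper, not part of SpliceVI, and it assumes that "same index" means the same set of barcodes regardless of order.

```python
def check_shared_cells(rna_barcodes, splicing_barcodes):
    """Return True if both modalities contain exactly the same cell barcodes.

    Hypothetical helper: compares barcode sets (order is ignored) and raises
    ValueError naming the barcodes that are missing from either modality.
    """
    rna_set, splicing_set = set(rna_barcodes), set(splicing_barcodes)
    missing_in_splicing = sorted(rna_set - splicing_set)
    missing_in_rna = sorted(splicing_set - rna_set)
    if missing_in_splicing or missing_in_rna:
        raise ValueError(
            f"cells missing from splicing: {missing_in_splicing}; "
            f"cells missing from rna: {missing_in_rna}"
        )
    return True


# Toy barcodes; the order differs but the sets match
ok = check_shared_cells(["cell_a", "cell_b"], ["cell_b", "cell_a"])
```

With real data, the two arguments would typically be `mdata["rna"].obs_names` and `mdata["splicing"].obs_names`.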
RNA modality — mdata["rna"]
| Slot | Contents |
|---|---|
| .X | Raw gene expression counts (cells × genes), integer dtype |
| .layers["length_norm"] | Length-normalized expression: raw counts divided by transcript length and rescaled to the median transcript length across genes. Used as the model's expression input. |
| .var["gene_id"] | Ensembl gene ID |
| .var["modality"] | "Gene_Expression" |
| .obsm["X_library_size"] | Per-cell library size (sum of length-normalized counts), shape (cells, 1) |
Length normalization
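The length_norm layer is computed per gene as length_norm = raw_counts / transcript_length × median(transcript_length), where the median is taken across genes.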
This rescales expression to be in comparable units across genes while preserving raw-count scale for the negative binomial likelihood.
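As a rough illustration of the normalization described above, the following NumPy sketch divides each gene's counts by its transcript length, rescales by the median length across genes, and sums per cell to get the library size. The function names and the dense toy arrays are assumptions for illustration; how a single transcript length is chosen per gene, and whether values are rounded afterwards, is not specified here.

```python
import numpy as np


def length_normalize(counts, transcript_lengths):
    """Divide counts by per-gene transcript length, then rescale to the median length."""
    counts = np.asarray(counts, dtype=float)            # cells × genes
    lengths = np.asarray(transcript_lengths, dtype=float)  # one length per gene
    median_length = np.median(lengths)
    return counts / lengths[np.newaxis, :] * median_length


def library_size(length_norm):
    """Per-cell sum of length-normalized counts, shape (cells, 1)."""
    return np.asarray(length_norm).sum(axis=1, keepdims=True)


# Toy data: 2 cells × 3 genes, gene lengths in bases (median is 2000)
raw = np.array([[10, 4, 0],
                [2, 8, 6]])
lengths = np.array([1000, 2000, 4000])

norm = length_normalize(raw, lengths)  # [[20, 4, 0], [4, 8, 3]]
lib = library_size(norm)               # [[24], [15]]
```

A gene of median length keeps its raw counts unchanged, which is why the result stays on a raw-count-like scale.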
Splicing modality — mdata["splicing"]
The splicing AnnData follows the SplicingDataset format defined by LeafletFA. The required raw layers are:
| Layer | Shape | Contents |
|---|---|---|
| cell_by_junction_matrix | cells × junctions | Integer junction read counts (sparse) |
| cell_by_cluster_matrix | cells × junctions | Total read counts across all junctions in the same ATSE event (sparse); the denominator for PSI |
From these, additional layers are computed during MuData assembly, including:
| Layer | Shape | Contents |
|---|---|---|
| junc_ratio | cells × junctions | PSI values: junction_counts / atse_total_counts, or 0 where unobserved |
| psi_mask | cells × junctions | Binary mask: 1 where atse_total_counts > 0 (junction is observed in that cell), 0 otherwise |
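The two derived layers can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: `compute_psi_layers` is a hypothetical name, and dense arrays are used for readability even though the source layers are sparse.

```python
import numpy as np


def compute_psi_layers(junction_counts, atse_total_counts):
    """Return (junc_ratio, psi_mask) from junction counts and ATSE totals."""
    junc = np.asarray(junction_counts, dtype=float)
    total = np.asarray(atse_total_counts, dtype=float)
    observed = total > 0
    # 1 where the ATSE has at least one read in that cell, 0 otherwise
    psi_mask = observed.astype(np.uint8)
    # Divide only where observed; unobserved positions keep the initial 0
    junc_ratio = np.zeros_like(junc)
    np.divide(junc, total, out=junc_ratio, where=observed)
    return junc_ratio, psi_mask


# Toy data: 2 cells × 3 junctions
junction_counts = np.array([[3, 1, 0],
                            [0, 0, 5]])
atse_total_counts = np.array([[4, 4, 0],
                              [0, 0, 5]])

junc_ratio, psi_mask = compute_psi_layers(junction_counts, atse_total_counts)
```

Restricting the division to observed positions avoids division-by-zero warnings and leaves 0 as the placeholder for unobserved junctions.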
What is an ATSE?
An Alternative Transcript Structure Event (ATSE) is a group of junctions that compete within a single splicing choice (e.g. a cassette exon event has one skipping junction and two inclusion junctions). Within a given cell, all junctions belonging to the same ATSE share the same cell_by_cluster_matrix value, which is the PSI denominator.
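To make the shared denominator concrete, the sketch below sums junction counts within each ATSE and broadcasts the total back to every junction of that event. It only illustrates the relationship between the two matrices; `atse_totals` is a hypothetical helper, and in practice the cell_by_cluster_matrix layer comes from the upstream SplicingDataset rather than from this code.

```python
import numpy as np


def atse_totals(junction_counts, event_ids):
    """Return a cells × junctions matrix of per-ATSE total counts."""
    junc = np.asarray(junction_counts)
    n_junctions = junc.shape[1]
    # Map each junction to an integer index for its event
    _, event_index = np.unique(np.asarray(event_ids), return_inverse=True)
    n_events = event_index.max() + 1
    # Indicator matrix (junctions × events): 1 where the junction belongs to the event
    indicator = np.zeros((n_junctions, n_events), dtype=junc.dtype)
    indicator[np.arange(n_junctions), event_index] = 1
    per_event = junc @ indicator       # cells × events
    return per_event[:, event_index]   # broadcast back to cells × junctions


# Toy data: 2 cells × 4 junctions grouped into two events
junction_counts = np.array([[3, 1, 2, 0],
                            [0, 0, 5, 5]])
event_ids = ["e1", "e1", "e2", "e2"]

totals = atse_totals(junction_counts, event_ids)  # [[4, 4, 2, 2], [0, 0, 10, 10]]
```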
PSI mask
The psi_mask is critical to correct model behavior. SpliceVI's partial encoder processes only observed junctions (mask = 1) per cell, completely ignoring positions where psi_mask = 0. This is what allows the model to handle the 70–95% sparsity typical in single-cell splicing data without any imputation at the input level.
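The effect of the mask can be illustrated without the model itself. The sketch below is not SpliceVI's partial encoder; it only shows, with hypothetical helper names, how psi_mask separates an observed PSI of 0 from an unobserved position and how the sparsity of a dataset could be measured.

```python
import numpy as np


def observed_inputs(junc_ratio, psi_mask):
    """Yield (junction_indices, psi_values) per cell, keeping only mask == 1."""
    ratio = np.asarray(junc_ratio)
    mask = np.asarray(psi_mask).astype(bool)
    for cell in range(ratio.shape[0]):
        idx = np.flatnonzero(mask[cell])
        yield idx, ratio[cell, idx]


def sparsity(psi_mask):
    """Fraction of (cell, junction) positions that are unobserved."""
    mask = np.asarray(psi_mask)
    return 1.0 - mask.sum() / mask.size


# Toy data: 2 cells × 4 junctions
ratio = np.array([[0.75, 0.25, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
mask = np.array([[1, 1, 0, 0],
                 [0, 0, 1, 1]])

cells = list(observed_inputs(ratio, mask))
frac_unobserved = sparsity(mask)  # 4 of 8 positions are unobserved
```

In the second cell, junction 2 has PSI 0 but is observed (mask = 1), so it is kept, whereas junctions 0 and 1 are dropped even though their stored value is also 0.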
var metadata
| Column | Contents |
|---|---|
| junction_id | Unique junction identifier (e.g. chr:start:end:strand) |
| event_id | ATSE group identifier (shared across all junctions in the same event) |
| modality | "Splicing" |
Shared obs metadata
mdata.obs is the union of obs columns from both modalities. Typical columns include:
| Column | Description |
|---|---|
| cell_type / broad_cell_type | Cell type annotation |
| tissue | Tissue of origin |
| age / age_numeric | Age (string or parsed integer) |
| mouse.id / donor_id | Batch/donor identifier used for batch correction |
Passing the MuData to SpliceVI
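```python
import mudata
from splicevi import SPLICEVI

# Load the assembled MuData (modalities "rna" and "splicing")
mdata = mudata.read_h5mu("train_70_30_combined.h5mu")

# Register the expression layer and the batch column with the model
SPLICEVI.setup_mudata(
    mdata,
    rna_layer="length_norm",
    batch_key="mouse.id",
)

model = SPLICEVI(mdata, n_latent=30)
```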
See Getting Started for a full training example.