SpliceVI MuData Object

SpliceVI takes a MuData object as input, with two modalities keyed "rna" and "splicing". This page describes the required structure of each modality and how the MuData is assembled from raw inputs.

Overview

MuData
├── mdata["rna"]       AnnData   (cells × genes)   — gene expression
└── mdata["splicing"]  AnnData   (cells × junctions) — splicing

Both modalities share the same obs (cell barcodes / metadata) index. All cells must be present in both modalities.
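
The requirement above can be checked before assembly. The following is a minimal sketch, not part of SpliceVI: `check_shared_cells` is a hypothetical helper that compares the cell barcodes of the two modalities and reports any cell missing from either side.

```python
def check_shared_cells(rna_obs_names, splicing_obs_names):
    """Raise ValueError if any cell barcode is missing from either modality.

    Hypothetical helper for illustration; it is not part of SpliceVI.
    """
    rna = set(rna_obs_names)
    splicing = set(splicing_obs_names)
    missing_in_splicing = sorted(rna - splicing)
    missing_in_rna = sorted(splicing - rna)
    if missing_in_splicing or missing_in_rna:
        raise ValueError(
            f"cells missing from splicing: {missing_in_splicing}; "
            f"cells missing from rna: {missing_in_rna}"
        )
    return True


# Toy barcodes; with real data one would pass
# mdata["rna"].obs_names and mdata["splicing"].obs_names.
check_shared_cells(["cell_1", "cell_2"], ["cell_2", "cell_1"])
```

The sketch only compares membership; whether the two modalities must also list cells in the same order is not stated here, so it is not checked.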

RNA modality — `mdata["rna"]`

Slot	Contents
`.X`	Raw gene expression counts `(cells × genes)`, integer dtype
`.layers["length_norm"]`	Length-normalized expression: raw counts divided by transcript length, then multiplied by the median transcript length across genes. Used as the model's expression input.
`.var["gene_id"]`	Ensembl gene ID
`.var["modality"]`	`"Gene_Expression"`
`.obsm["X_library_size"]`	Per-cell library size (sum of length-normalized counts), shape `(cells, 1)`

Length normalization

The `length_norm` layer is computed as:

length_norm = (raw_counts / transcript_length) × median(transcript_length across all genes)

Dividing by transcript length makes expression comparable across genes, and multiplying by the median transcript length brings the values back to the scale of raw counts, which the negative binomial likelihood expects.
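
As an illustration of the formula above, here is a minimal NumPy sketch on a toy dense matrix. The function name `length_normalize` and the toy values are assumptions made for this example; real data would normally be sparse and the actual assembly code may differ.

```python
import numpy as np


def length_normalize(raw_counts, transcript_length):
    """Apply raw / length * median(length), gene by gene (columns)."""
    raw_counts = np.asarray(raw_counts, dtype=float)
    transcript_length = np.asarray(transcript_length, dtype=float)
    scale = np.median(transcript_length)
    # transcript_length has one entry per gene and broadcasts across cells.
    return raw_counts / transcript_length * scale


# Toy example: 2 cells x 3 genes.
raw = np.array([[10, 0, 4],
                [2, 6, 8]])
lengths = np.array([1000.0, 2000.0, 4000.0])

length_norm = length_normalize(raw, lengths)
# Per-cell library size as described for .obsm["X_library_size"]:
# the sum of length-normalized counts, kept as a (cells, 1) column.
library_size = length_norm.sum(axis=1, keepdims=True)
```

With a median length of 2000, a gene half that length has its counts doubled and a gene twice that length has them halved.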

Splicing modality — `mdata["splicing"]`

The splicing AnnData follows the SplicingDataset format defined by LeafletFA. The required raw layers are:

Layer	Shape	Contents
`cell_by_junction_matrix`	`cells × junctions`	Integer junction read counts (sparse)
`cell_by_cluster_matrix`	`cells × junctions`	Total read counts across all junctions in the same ATSE (sparse); the denominator for PSI

From these, the following additional layers are computed during MuData assembly:

Layer	Shape	Contents
`junc_ratio`	`cells × junctions`	PSI values: `junction_counts / atse_total_counts` (that is, `cell_by_junction_matrix` divided by `cell_by_cluster_matrix`), or 0 where unobserved
`psi_mask`	`cells × junctions`	Binary mask: 1 where `atse_total_counts > 0` (the junction's ATSE has at least one read in that cell), 0 otherwise
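
The two derived layers can be sketched as follows. This is a minimal dense NumPy version for illustration only: `compute_psi_layers` is a hypothetical name, and the real layers are sparse, so the actual assembly code will differ.

```python
import numpy as np


def compute_psi_layers(junction_counts, atse_total_counts):
    """Return (junc_ratio, psi_mask) from junction and ATSE total counts."""
    junction_counts = np.asarray(junction_counts, dtype=float)
    atse_total_counts = np.asarray(atse_total_counts, dtype=float)
    observed = atse_total_counts > 0
    psi_mask = observed.astype(np.uint8)
    # Divide only where the ATSE has reads; unobserved entries stay 0.
    junc_ratio = np.divide(
        junction_counts,
        atse_total_counts,
        out=np.zeros_like(junction_counts),
        where=observed,
    )
    return junc_ratio, psi_mask


# Toy example: 2 cells x 4 junctions, two ATSEs of two junctions each.
junction_counts = np.array([[3, 1, 0, 0],
                            [0, 0, 2, 6]])
atse_total_counts = np.array([[4, 4, 0, 0],
                              [0, 0, 8, 8]])

junc_ratio, psi_mask = compute_psi_layers(junction_counts, atse_total_counts)
```

Note that a stored 0 in `junc_ratio` is ambiguous on its own: it can mean a PSI of zero or an unobserved ATSE. `psi_mask` is what distinguishes the two cases.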

What is an ATSE?

An Alternative Transcript Structure Event (ATSE) is a group of junctions that compete within a single splicing choice. For example, a cassette exon event has one skipping junction and two inclusion junctions. All junctions belonging to the same ATSE share the same `cell_by_cluster_matrix` denominator value within a cell.
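
To make the shared denominator concrete, the sketch below sums junction counts within each ATSE and broadcasts the total back to every junction in that ATSE. It assumes a dense NumPy matrix and a per-junction list of `event_id` values; `atse_totals` is a hypothetical helper, not the LeafletFA implementation.

```python
import numpy as np


def atse_totals(junction_counts, event_ids):
    """Return a cells x junctions matrix of per-ATSE total counts."""
    junction_counts = np.asarray(junction_counts)
    n_junctions = junction_counts.shape[1]
    # Map each junction's event_id to an integer event index.
    _, event_index = np.unique(np.asarray(event_ids), return_inverse=True)
    n_events = event_index.max() + 1
    # Indicator matrix (junctions x events): 1 where a junction belongs to an event.
    membership = np.zeros((n_junctions, n_events), dtype=junction_counts.dtype)
    membership[np.arange(n_junctions), event_index] = 1
    per_event = junction_counts @ membership  # cells x events
    return per_event[:, event_index]          # cells x junctions


# Toy example: a cassette exon ATSE ("e1") with one skipping junction and
# two inclusion junctions, plus a single-junction ATSE ("e2").
junction_counts = np.array([[5, 3, 2, 7],
                            [0, 0, 0, 1]])
event_ids = ["e1", "e1", "e1", "e2"]

cluster_matrix = atse_totals(junction_counts, event_ids)
```

In the first cell, the three junctions of `e1` all receive the same denominator (5 + 3 + 2 = 10), which is the property described above.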

PSI mask

The `psi_mask` layer is essential for correct model behavior. SpliceVI's partial encoder processes only the observed junctions (mask = 1) in each cell and completely ignores positions where `psi_mask` = 0. This is what allows the model to handle the 70–95% sparsity typical of single-cell splicing data without any imputation at the input level.
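
The sketch below only illustrates what "observed" means at the data level: it measures sparsity from a toy `psi_mask` and pulls out the observed PSI values per cell. It is not SpliceVI's partial encoder, and all values are invented for the example.

```python
import numpy as np

# Toy layers: 3 cells x 5 junctions.
psi_mask = np.array([[1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1],
                     [0, 0, 0, 0, 0]], dtype=np.uint8)
junc_ratio = np.array([[0.75, 0.25, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.4, 0.6],
                       [0.0, 0.0, 0.0, 0.0, 0.0]])

# Fraction of (cell, junction) entries with no observation.
sparsity = 1.0 - psi_mask.mean()

# Indices of observed junctions for each cell, and the PSI values at those
# positions. Entries where psi_mask == 0 are never read.
observed = [np.flatnonzero(row) for row in psi_mask]
observed_values = [junc_ratio[i, idx] for i, idx in enumerate(observed)]
```

Each cell ends up with a variable-length set of observed junctions (possibly empty, as for the third cell), which is why a mask is carried alongside the PSI values.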

var metadata

Column	Contents
`junction_id`	Unique junction identifier (e.g. `chr:start:end:strand`)
`event_id`	ATSE group identifier (shared across all junctions in the same event)
`modality`	`"Splicing"`

Shared obs metadata

`mdata.obs` is the union of `obs` columns from both modalities. Typical columns include:

Column	Description
`cell_type` / `broad_cell_type`	Cell type annotation
`tissue`	Tissue of origin
`age` / `age_numeric`	Age (string or parsed integer)
`mouse.id` / `donor_id`	Batch/donor identifier used for batch correction

Passing the MuData to SpliceVI

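import mudata
from splicevi import SPLICEVI

# Load the assembled MuData (modalities "rna" and "splicing").
mdata = mudata.read_h5mu("train_70_30_combined.h5mu")

SPLICEVI.setup_mudata(
    mdata,
    rna_layer="length_norm",  # length-normalized expression layer
    batch_key="mouse.id",     # batch/donor column in obs
)

model = SPLICEVI(mdata, n_latent=30)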

See Getting Started for a full training example.