Multi-modal#

Warning

This is, for now, just a stub.

Here, we’ll showcase how to curate and register ECCITE-seq data from Papalexi21 in the form of MuData objects. ECCITE-seq is designed to enable interrogation of single-cell transcriptomes together with surface protein markers in the context of CRISPR screens.

Setup#

!lamin init --storage ./test-multimodal --schema bionty
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:44:26)
✅ saved: Storage(id='agupObf9', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-10-04 16:44:26, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/test-multimodal
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"
💡 loaded instance: testuser1/test-multimodal (lamindb 0.55.0)

ln.track()
💡 notebook imports: lamindb==0.55.0 lnschema_bionty==0.31.2
💡 Transform(id='yMWSFirS6qv2z8', name='Multi-modal', short_name='multimodal', version='0', type=notebook, updated_at=2023-10-04 16:44:30, created_by_id='DzTjkKse')
💡 Run(id='Br39x7vMzZ316M4pnRog', run_at=2023-10-04 16:44:30, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')

Papalexi21#

Let’s use a MuData object:

Transform #

mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  var:	'name'
  4 modalities
    rna:	200 x 173
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    adt:	200 x 4
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    hto:	200 x 12
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    gdo:	200 x 111
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'

MuData objects build on top of AnnData objects to store and serialize multimodal data. More information can be found in the MuData documentation.
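
To make the printed summary above easier to read, here is a minimal sketch of how the per-modality shapes relate to the overall shape. The `Modality` and `MultiModal` classes are hypothetical stand-ins written for this illustration, not part of MuData or AnnData; only the shapes are taken from the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Modality:
    """Toy stand-in for one AnnData-like modality (shape only)."""

    n_obs: int
    n_vars: int


@dataclass
class MultiModal:
    """Toy stand-in for a MuData-like container: modality name -> Modality."""

    mod: dict = field(default_factory=dict)

    @property
    def n_obs(self) -> int:
        # Assumes every modality covers the same cells, as in the subset above
        sizes = {m.n_obs for m in self.mod.values()}
        if len(sizes) != 1:
            raise ValueError("modalities disagree on the number of observations")
        return sizes.pop()

    @property
    def n_vars(self) -> int:
        # Variables of all modalities are concatenated
        return sum(m.n_vars for m in self.mod.values())


toy = MultiModal(
    mod={
        "rna": Modality(200, 173),
        "adt": Modality(200, 4),
        "hto": Modality(200, 12),
        "gdo": Modality(200, 111),
    }
)
print(toy.n_obs, toy.n_vars)
```

The four modalities share the same 200 cells, and their variable counts add up to the 300 reported for the whole object.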

First we register the file:

file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()

Now let’s validate and register the 3 feature sets this data contains:

  1. RNA (gene expression)

  2. ADT (antibody-derived tags reflecting surface proteins)

  3. obs (metadata)

For the two modalities rna and adt, we use bionty tables as the reference:

Validate #

mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
173 terms (100.00%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, SH2D6, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, CTC-467M3.1, ARHGAP26-AS1, GABRA1, HIST1H4K, HLA-DQB1-AS1, RP11-524H19.2, SPACA1, VNN1, AC006042.7, AC002066.1, AC073934.6, ...
genes = lb.Gene.from_values(mdata["rna"].var_names, lb.Gene.symbol)
ln.save(genes)
❗ ambiguous validation in Bionty for 6 records: 'HLA-DQB1-AS1', 'CTAGE15', 'CTRB2', 'LGALS9C', 'PCDHB11', 'TBC1D3G'
did not create Gene records for 84 non-validated symbols: 'AC002066.1', 'AC004019.13', 'AC005150.1', 'AC006042.7', 'AC011558.5', 'AC026471.6', 'AC073934.6', 'AC091132.1', 'AC092295.4', 'AC092687.5', 'AE000662.93', 'AL132989.1', 'AP000442.4', 'CTA-373H7.7', 'CTB-134F13.1', 'CTB-31O20.9', 'CTC-498J12.1', 'CTD-2562J17.2', 'CTD-3012A18.1', 'CTD-3065B20.2', ...
mdata["rna"].var_names = lb.Gene.standardize(mdata["rna"].var_names, lb.Gene.symbol)

validated = lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol)
84 terms (48.60%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, RP11-524H19.2, AC006042.7, AC002066.1, AC073934.6, RP11-268G12.1, U52111.14, RP11-235C23.5, RP11-12J10.3, RP11-324E6.9, RP11-187A9.3, RP11-365N19.2, RP11-346D14.1, ...
new_genes = [lb.Gene(symbol=symbol) for symbol in mdata["rna"].var_names[~validated]]
ln.save(new_genes)
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
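
The validate, standardize, and re-validate steps above can be pictured with a small, simplified sketch. It assumes a plain dictionary in place of the gene registry; `REFERENCE`, `validate`, and `standardize` below are hypothetical helpers for illustration and do not reproduce how `lb.Gene` works internally. In particular, the real workflow also imports records from Bionty via `from_values`, which this sketch leaves out.

```python
# Hypothetical reference table: official symbol -> known synonyms (not real Bionty data)
REFERENCE = {
    "GENE_A": {"GENEA_OLD"},
    "GENE_B": set(),
}


def validate(values, reference=REFERENCE):
    """Return one boolean per value: True if the value is an official symbol."""
    return [v in reference for v in values]


def standardize(values, reference=REFERENCE):
    """Map known synonyms to their official symbol; leave other values unchanged."""
    synonym_to_symbol = {
        syn: sym for sym, syns in reference.items() for syn in syns
    }
    return [synonym_to_symbol.get(v, v) for v in values]


names = ["GENE_A", "GENEA_OLD", "UNKNOWN.1"]
mask = validate(names)                # synonym and unknown name both fail
standardized = standardize(names)     # synonym is mapped to its official symbol
mask_after = validate(standardized)   # only the unknown name still fails
unvalidated = [n for n, ok in zip(standardized, mask_after) if not ok]
print(unvalidated)
```

Names that remain in `unvalidated` play the role of the symbols that are registered as new `Gene` records in the cell above.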

feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)

mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
lb.CellMarker.validate(mdata["adt"].var_names);
4 terms (100.00%) are not validated for name: CD86, PDL1, PDL2, CD366
markers = lb.CellMarker.from_values(mdata["adt"].var_names)
ln.save(markers)

lb.CellMarker.validate(mdata["adt"].var_names);

Register #

feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)

Link them to the file:

file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")
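
One way to think about a feature set is as an unordered collection of feature identifiers that gets attached to a named slot of the file. The sketch below is a toy illustration under that assumption; `feature_set_hash` is a hypothetical helper, and the hashing scheme is not the one lamindb uses for the `hash` field shown in the outputs further down.

```python
import hashlib


def feature_set_hash(feature_names):
    """Order-independent fingerprint of a collection of feature names (toy scheme)."""
    unique_sorted = sorted(set(feature_names))
    return hashlib.sha256("\n".join(unique_sorted).encode("utf-8")).hexdigest()


# Slot name -> fingerprint of the feature set stored in that slot
slots = {
    "adt": feature_set_hash(["CD86", "PDL1", "PDL2", "CD366"]),
}

# The same members in a different order yield the same fingerprint
same_members_other_order = feature_set_hash(["PDL1", "CD86", "PDL2", "CD366"])
print(slots["adt"] == same_members_other_order)
```

Keying feature sets by slot keeps the measurements of each modality separate while they all belong to the same file.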

The third feature set is obs, the cell-level metadata:

obs = mdata["rna"].obs

We’re only interested in a single metadata column, gene_target:

ln.Feature(name="gene_target", type="category").save()

features = ln.Feature.from_df(obs)
ln.save(features)
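
The feature listing further down reports each obs column as either `number` or `category`. A rough sketch of such a type inference is shown here; `infer_feature_type` is a hypothetical helper with made-up column values, and the rule it applies is an assumption rather than the logic of `ln.Feature.from_df`.

```python
from numbers import Number


def infer_feature_type(values):
    """Return 'number' if every value is numeric (bools excluded), else 'category'."""
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "number"
    return "category"


# Hypothetical column values, named after two obs columns of the dataset
columns = {
    "nCount_RNA": [1200.0, 854.0],
    "Phase": ["G1", "S"],
}
inferred = {name: infer_feature_type(vals) for name, vals in columns.items()}
print(inferred)
```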

feature_set_obs = ln.FeatureSet.from_df(obs)

file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], lb.Gene.symbol)
ln.save(gene_targets)
features = ln.Feature.lookup()
file.labels.add(gene_targets, feature=features.gene_target)
❗ ambiguous validation in Bionty for 4 records: 'MARCHF8', 'IRF7', 'IFNGR2', 'TNFRSF14'
did not create Gene record for 1 non-validated symbol: 'NT'

nt = ln.ULabel(name="NT", description="Non-targeting control of perturbations")
nt.save()

file.labels.add(nt, feature=features.gene_target)
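
The `gene_target` column therefore draws its labels from two registries: gene symbols and a generic label for the non-targeting control. The split can be sketched in plain Python as follows; `split_gene_targets` and the example values are hypothetical and only illustrate the idea.

```python
def split_gene_targets(values, known_symbols):
    """Separate unique values into known gene symbols and everything else."""
    genes = sorted({v for v in values if v in known_symbols})
    other = sorted({v for v in values if v not in known_symbols})
    return genes, other


# Hypothetical values; 'NT' marks non-targeting controls as in the notebook
targets = ["STAT2", "NT", "IRF7", "NT", "STAT2"]
genes, other = split_gene_targets(targets, known_symbols={"STAT2", "IRF7"})
print(genes, other)
```

Here `genes` corresponds to the values linked as `Gene` records and `other` to the value that needs its own `ULabel`.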

for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels = [ln.ULabel(name=name) for name in obs[col].unique()]
    ln.save(labels)
❗ record with similar name exist! did you mean to load it?
            id  __ratio__
name
G1    IU4lPcNC       90.0
❗ record with similar name exist! did you mean to load it?
            id  __ratio__
name
S     6jP1PaJu       90.0
❗ records with similar names exist! did you mean to load one of them?
            id  __ratio__
name
G1    IU4lPcNC       90.0
S     6jP1PaJu       90.0
❗ record with similar name exist! did you mean to load it?
            id  __ratio__
name
NT    5QX76Jnt       90.0
❗ records with similar names exist! did you mean to load one of them?
            id  __ratio__
name
G1    IU4lPcNC       90.0
NT    5QX76Jnt       90.0
...

Because none of these labels seem like something we’d want to track in the registry or validate, we don’t link them to the file.
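
The "similar name" warnings above come from comparing each new label name with names already in the registry. The following sketch shows the general idea using `difflib` from the Python standard library; it is an assumption-laden illustration with hypothetical names, and it does not reproduce the scoring behind the `__ratio__` values that lamindb reports.

```python
from difflib import get_close_matches


def find_similar(name, existing_names, cutoff=0.7):
    """Return existing names whose similarity to `name` is at least `cutoff`."""
    return get_close_matches(name, existing_names, n=3, cutoff=cutoff)


# Hypothetical registry content and new label
existing = ["rep2", "G1"]
similar = find_similar("rep1", existing)
print(similar)
```

A non-empty result would be the cue to load the existing record instead of creating a near-duplicate.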

file.features
Features:
  rna: FeatureSet(id='8OnnfBTt7zJOkMQY1PYE', n=184, type='number', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-10-04 16:44:38, created_by_id='DzTjkKse')
    'CDH8', 'TMPRSS3', 'CTD-3193O13.8', 'RP11-2H8.2', 'LGALS9C', 'RP11-138C9.1', 'RP11-835E18.5', 'PLGLB2', 'RP11-324E6.9', 'SLC46A2', 'ARHGAP26-AS1', 'MEF2C-AS2', 'AK8', 'LINC02914', 'CTB-31O20.9', 'HOXC-AS2', 'HPN', 'RP11-17J14.2', 'CSMD3', 'NBPF15', ...
  adt: FeatureSet(id='idnUhxt5G27OMfzaeQZ0', n=4, type='number', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-10-04 16:44:38, created_by_id='DzTjkKse')
    'PDL1', 'CD86', 'PDL2', 'CD366'
  obs: FeatureSet(id='Tv5Fbp02xzr030aW2ug9', n=19, registry='core.Feature', hash='mAPyVLti8m11pj46FSxa', updated_at=2023-10-04 16:44:39, created_by_id='DzTjkKse')
    nCount_HTO (number)
    nCount_GDO (number)
    NT (category)
    nFeature_HTO (number)
    orig.ident (category)
    nFeature_ADT (number)
    MULTI_ID (category)
    percent.mito (number)
    Phase (category)
    nCount_ADT (number)
    HTO_classification (category)
    perturbation (category)
    replicate (category)
    nFeature_RNA (number)
    G2M.Score (number)
    guide_ID (category)
    nCount_RNA (number)
    S.Score (number)
    🔗 gene_target (bionty.Gene|core.ULabel)
        🔗 gene_target (28, bionty.Gene): 'ATF2', 'PDCD1LG2', 'STAT2', 'STAT3', 'NFKBIA', 'TNFRSF14', 'CMTM6', 'CD86', 'IFNGR2', 'MARCHF8', ...
        🔗 gene_target (1, core.ULabel): 'NT'
file.describe()
File(id='DhmaYEyGu6PHhZovNypy', suffix='.h5mu', accessor='MuData', description='Sub-sampled MuData from Papalexi21', size=606320, hash='RaivS3NesDOP-6kNIuaC3g', hash_type='md5', updated_at=2023-10-04 16:44:31)

Provenance:
  🗃️ storage: Storage(id='agupObf9', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-10-04 16:44:26, created_by_id='DzTjkKse')
  💫 transform: Transform(id='yMWSFirS6qv2z8', name='Multi-modal', short_name='multimodal', version='0', type=notebook, updated_at=2023-10-04 16:44:30, created_by_id='DzTjkKse')
  👣 run: Run(id='Br39x7vMzZ316M4pnRog', run_at=2023-10-04 16:44:30, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:44:26)
Features:
  rna: FeatureSet(id='8OnnfBTt7zJOkMQY1PYE', n=184, type='number', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-10-04 16:44:38, created_by_id='DzTjkKse')
    'CDH8', 'TMPRSS3', 'CTD-3193O13.8', 'RP11-2H8.2', 'LGALS9C', 'RP11-138C9.1', 'RP11-835E18.5', 'PLGLB2', 'RP11-324E6.9', 'SLC46A2', 'ARHGAP26-AS1', 'MEF2C-AS2', 'AK8', 'LINC02914', 'CTB-31O20.9', 'HOXC-AS2', 'HPN', 'RP11-17J14.2', 'CSMD3', 'NBPF15', ...
  adt: FeatureSet(id='idnUhxt5G27OMfzaeQZ0', n=4, type='number', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-10-04 16:44:38, created_by_id='DzTjkKse')
    'PDL1', 'CD86', 'PDL2', 'CD366'
  obs: FeatureSet(id='Tv5Fbp02xzr030aW2ug9', n=19, registry='core.Feature', hash='mAPyVLti8m11pj46FSxa', updated_at=2023-10-04 16:44:39, created_by_id='DzTjkKse')
    nCount_HTO (number)
    nCount_GDO (number)
    NT (category)
    nFeature_HTO (number)
    orig.ident (category)
    nFeature_ADT (number)
    MULTI_ID (category)
    percent.mito (number)
    Phase (category)
    nCount_ADT (number)
    HTO_classification (category)
    perturbation (category)
    replicate (category)
    nFeature_RNA (number)
    G2M.Score (number)
    guide_ID (category)
    nCount_RNA (number)
    S.Score (number)
    🔗 gene_target (bionty.Gene|core.ULabel)
        🔗 gene_target (28, bionty.Gene): 'ATF2', 'PDCD1LG2', 'STAT2', 'STAT3', 'NFKBIA', 'TNFRSF14', 'CMTM6', 'CD86', 'IFNGR2', 'MARCHF8', ...
        🔗 gene_target (1, core.ULabel): 'NT'
Labels:
  🏷️ genes (28, bionty.Gene): 'ATF2', 'PDCD1LG2', 'STAT2', 'STAT3', 'NFKBIA', 'TNFRSF14', 'CMTM6', 'CD86', 'IFNGR2', 'MARCHF8', ...
  🏷️ ulabels (1, core.ULabel): 'NT'
file.view_flow()
https://d33wubrfki0l68.cloudfront.net/8a611bdca4907b09817b09b70d310213f171d742/13d61/_images/bd8458b62448288396bebc425e9f81a9cee70b416b36661650c77fe8b92383fb.svg
# clean up test instance
!lamin delete --force test-multimodal
!rm -r test-multimodal
💡 deleting instance testuser1/test-multimodal
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal