Analysis flow#

Here, we’ll track typical data transformations like subsetting that occur during analysis.

If exploring more generally, read this first: Project flow.

Setup#

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
✅ saved: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"  # globally set species
lb.settings.auto_save_parents = False
💡 loaded instance: testuser1/analysis-usecase (lamindb 0.55.0)

ln.track()
💡 notebook imports: lamindb==0.55.0 lnschema_bionty==0.31.2
💡 Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-10-04 16:38:57, created_by_id='DzTjkKse')
💡 Run(id='7I0db5EDIUKofs2MXUNS', run_at=2023-10-04 16:38:57, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')

Track cell types, tissues and diseases#

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

adata = ln.dev.datasets.anndata_with_obs()
adata
AnnData object with n_obs × n_vars = 40 × 100
    obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'
adata.var_names[:5]
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
adata.obs[["tissue", "cell_type", "disease"]].value_counts()
tissue  cell_type                disease                   
brain   my new cell type         Alzheimer disease             10
heart   hepatocyte               cardiac ventricle disorder    10
kidney  T cell                   chronic kidney disease        10
liver   hematopoietic stem cell  liver lymphoma                10
dtype: int64

Processing the dataset#

To track our data transformation we create a new Transform of type “pipeline”:

transform = ln.Transform(
    name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)

Set the current tracking to the new transform:

ln.track(transform)
💡 Transform(id='S0N0uJgMJCM0Yo', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-04 16:39:04, created_by_id='DzTjkKse')
💡 Run(id='eZ2xy6r8PBUFGbK7z5Ss', run_at=2023-10-04 16:39:04, transform_id='S0N0uJgMJCM0Yo', created_by_id='DzTjkKse')

Get a backed AnnData object#

Query the file by its key and open it in backed mode, so that it can be inspected and subset without first loading the whole dataset into memory:

file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()
adata = file.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object mini_anndata_with_obs.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']
adata.obs[["cell_type", "disease"]].value_counts()
cell_type                disease                   
T cell                   chronic kidney disease        10
hematopoietic stem cell  liver lymphoma                10
hepatocyte               cardiac ventricle disorder    10
my new cell type         Alzheimer disease             10
dtype: int64

Subset dataset to specific cell types and diseases#

Create the subset:

subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
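
The mask combines two independent membership tests with a logical AND. It therefore keeps every row whose cell type is in the first list and whose disease is in the second, which would include cross-combinations such as a T cell annotated with liver lymphoma if the dataset contained one. Below is a minimal plain-Python sketch of that logic, rebuilt from the value counts shown earlier. The names `rows` and `keep` are illustrative and not part of LaminDB or AnnData.

```python
from collections import Counter

# Rebuild the (cell_type, disease) pairs from the value counts above: 10 rows each.
pairs = [
    ("T cell", "chronic kidney disease"),
    ("hematopoietic stem cell", "liver lymphoma"),
    ("hepatocyte", "cardiac ventricle disorder"),
    ("my new cell type", "Alzheimer disease"),
]
rows = [pair for pair in pairs for _ in range(10)]

cell_types = {"T cell", "hematopoietic stem cell"}
diseases = {"liver lymphoma", "chronic kidney disease"}


def keep(cell_type: str, disease: str) -> bool:
    """Row-wise equivalent of `cell_type.isin(...) & disease.isin(...)`."""
    return cell_type in cell_types and disease in diseases


subset = [row for row in rows if keep(*row)]
counts = Counter(subset)
print(len(subset))  # 20
```

In this dataset each cell type co-occurs with exactly one disease, so the result contains only the two expected groups of 10 rows.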

This subset can now be registered:

file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    field=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
❗    received 99 unique terms, 1 empty/duplicated term is ignored
99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
❗    no validated features, skip creating feature set
1 term (25.00%) is not validated for name: cell_type_id

file_subset.save()
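
The validation messages above report, for each set of terms, how many unique values were received and what share of them is missing from a reference registry. The sketch below reproduces that bookkeeping in plain Python to help read the log. It is not LaminDB's implementation: the function `validation_report` is hypothetical, and the set of known feature names is an assumption chosen to match the logged result for the four `obs` columns.

```python
def validation_report(terms, reference):
    """Summarize which unique, non-empty terms are missing from a reference set."""
    # Drop empty terms and duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(t for t in terms if t))
    not_validated = [t for t in unique if t not in reference]
    pct = 100 * len(not_validated) / len(unique) if unique else 0.0
    return {
        "n_unique": len(unique),
        "n_ignored": len(terms) - len(unique),
        "not_validated": not_validated,
        "pct_not_validated": pct,
    }


# The four obs columns of the dataset; assume three of them are known feature names.
columns = ["cell_type", "cell_type_id", "tissue", "disease"]
known = {"cell_type", "tissue", "disease"}

report = validation_report(columns, known)
print(
    f"{len(report['not_validated'])} term ({report['pct_not_validated']:.2f}%) "
    f"is not validated for name: {', '.join(report['not_validated'])}"
)
```

The same function applied to a list with one repeated entry reports one ignored term, which mirrors the "1 empty/duplicated term is ignored" message for the variable names.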

Add labels to features. All values validate except “my new cell type”, for which no CellType record is created:

cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

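# look up Feature records by name so that labels can be linked to them
features = ln.Feature.lookup()
file_subset.labels.add(cell_types, feature=features.cell_type)
file_subset.labels.add(tissues, feature=features.tissue)
file_subset.labels.add(diseases, feature=features.disease)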
did not create CellType record for 1 non-validated name: 'my new cell type'

file_subset.describe()
File(id='NiYKyAOJzuEdIsbWn1ym', key='subset/mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', updated_at=2023-10-04 16:39:04)

Provenance:
  🗃️ storage: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
  🧩 transform: Transform(id='S0N0uJgMJCM0Yo', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-04 16:39:04, created_by_id='DzTjkKse')
  👣 run: Run(id='eZ2xy6r8PBUFGbK7z5Ss', run_at=2023-10-04 16:39:04, transform_id='S0N0uJgMJCM0Yo', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
Features:
  obs: FeatureSet(id='s7B5EKDcidM4VuJhCEOP', n=3, registry='core.Feature', hash='Ao1PZItnk13m5SR7EriH', updated_at=2023-10-04 16:39:03, modality_id='vMaV6V7s', created_by_id='DzTjkKse')
    🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
    🔗 disease (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
    🔗 tissue (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
Labels:
  🏷️ tissues (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
  🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
  🏷️ diseases (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'

Examine data flow#

Common questions that might arise are:

  • Which h5ad file is in the subset subfolder?

  • Which notebook ingested this file?

  • By whom?

  • And which file is its parent?

Let’s answer these using LaminDB:

To learn which .h5ad file is in the subset subfolder, query for a file under that key prefix that is labeled with “hematopoietic stem cell” and “T cell”:

cell_types_bt_lookup = lb.CellType.lookup()

my_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()
my_subset.view_flow()
https://d33wubrfki0l68.cloudfront.net/017f93bb2b96e444c66ad633426a096917829b26/55a8d/_images/744dfd9dec9a4be9ffff3f5cdc19f52ffbe2df01e4467fca06b81b5de00e8985.svg
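
The flow graph answers the remaining questions by following links from the file to its run, and from the run to its transform, its creator and its inputs. The toy model below sketches that chain with plain dataclasses. It is a conceptual illustration only, not LaminDB's data model: the class and field names are hypothetical, and it assumes that the file opened in backed mode was recorded as an input of the run.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class User:
    handle: str


@dataclass
class Transform:
    name: str
    type: str


@dataclass
class Run:
    transform: Transform
    created_by: User
    inputs: List["File"] = field(default_factory=list)


@dataclass
class File:
    key: str
    run: Optional[Run] = None  # the run that produced this file, if known


def parents(file: File) -> List[File]:
    """Return the files read by the run that produced `file`."""
    return list(file.run.inputs) if file.run else []


user = User(handle="testuser1")
source = File(key="mini_anndata_with_obs.h5ad")
run = Run(
    transform=Transform(name="Subset to T-cells and liver lymphoma", type="pipeline"),
    created_by=user,
    inputs=[source],
)
subset_file = File(key="subset/mini_anndata_with_obs.h5ad", run=run)

print(subset_file.run.transform.name)            # which transform produced the file
print(subset_file.run.created_by.handle)         # by whom
print([p.key for p in parents(subset_file)])     # which file is its parent
```

Storing the link on the run rather than on the file keeps one record per execution, so several output files from the same run share the same inputs and creator.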
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase