Analysis flow#

Here, we’ll track typical data transformations like subsetting that occur during analysis.

If exploring more generally, read this first: Project flow.

Setup#

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"  # globally set species
lb.settings.auto_save_parents = False

💡 loaded instance: testuser1/analysis-usecase (lamindb 0.55.0)

ln.track()

💡 notebook imports: lamindb==0.55.0 lnschema_bionty==0.31.2

💡 Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-10-04 16:38:57, created_by_id='DzTjkKse')

💡 Run(id='7I0db5EDIUKofs2MXUNS', run_at=2023-10-04 16:38:57, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')

Track cell types, tissues and diseases#

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

adata

AnnData object with n_obs × n_vars = 40 × 100
    obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'

adata.var_names[:5]

Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')

adata.obs[["tissue", "cell_type", "disease"]].value_counts()

tissue  cell_type                disease                   
brain   my new cell type         Alzheimer disease             10
heart   hepatocyte               cardiac ventricle disorder    10
kidney  T cell                   chronic kidney disease        10
liver   hematopoietic stem cell  liver lymphoma                10
dtype: int64

Register biological metadata and link to the dataset#

As a first step, we register the AnnData object with LaminDB using from_anndata():

file = ln.File.from_anndata(
    adata, key="mini_anndata_with_obs.h5ad", field=lb.Gene.ensembl_gene_id
)

... storing 'cell_type' as categorical

... storing 'cell_type_id' as categorical

... storing 'tissue' as categorical

... storing 'disease' as categorical

❗    received 99 unique terms, 1 empty/duplicated term is ignored

❗    99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...

❗    no validated features, skip creating feature set

❗    4 terms (100.00%) are not validated for name: cell_type, cell_type_id, tissue, disease

❗    no validated features, skip creating feature set

file.save()

cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

All of these look good and contain no typos, so let’s save them to their registries:

ln.save(cell_types)
ln.save(tissues)
ln.save(diseases)
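
The calls to from_values() above appear to create records only for terms that match the reference ontology, which would explain why the describe() output further down lists three cell types although the dataset contains four. The following is a minimal plain-Python sketch of that idea, not LaminDB's implementation: `REFERENCE_CELL_TYPES` and `split_validated` are hypothetical names, and the reference set is a tiny stand-in for a real ontology.

```python
# Hypothetical reference vocabulary; a real ontology holds thousands of terms.
REFERENCE_CELL_TYPES = {"T cell", "hepatocyte", "hematopoietic stem cell"}


def split_validated(values, reference):
    """Return (validated, not_validated) unique terms in first-seen order."""
    unique = list(dict.fromkeys(values))  # de-duplicate while keeping order
    validated = [v for v in unique if v in reference]
    not_validated = [v for v in unique if v not in reference]
    return validated, not_validated


# Mimics adata.obs.cell_type: four categories with ten observations each.
obs_cell_type = (
    ["my new cell type"] * 10
    + ["hepatocyte"] * 10
    + ["T cell"] * 10
    + ["hematopoietic stem cell"] * 10
)

validated, not_validated = split_validated(obs_cell_type, REFERENCE_CELL_TYPES)
print(validated)      # ['hepatocyte', 'T cell', 'hematopoietic stem cell']
print(not_validated)  # ['my new cell type']
```

Under this assumption, a custom term such as 'my new cell type' would need to be registered separately before it could be attached as a label.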

We also need some features to bucket these labels:

ln.Feature(name="cell_type", type="category").save()
ln.Feature(name="tissue", type="category").save()
ln.Feature(name="disease", type="category").save()
features = ln.Feature.lookup()

Link the labels to the file:

file.labels.add(cell_types, feature=features.cell_type)
file.labels.add(tissues, feature=features.tissue)
file.labels.add(diseases, feature=features.disease)

file.describe()

File(id='wqCsMTtPOLAzn09jPJXa', key='mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', updated_at=2023-10-04 16:38:58)

Provenance:
  🗃️ storage: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
  💫 transform: Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-10-04 16:38:57, created_by_id='DzTjkKse')
  👣 run: Run(id='7I0db5EDIUKofs2MXUNS', run_at=2023-10-04 16:38:57, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
Features:
  external: FeatureSet(id='s7B5EKDcidM4VuJhCEOP', n=3, registry='core.Feature', hash='Ao1PZItnk13m5SR7EriH', updated_at=2023-10-04 16:39:03, modality_id='vMaV6V7s', created_by_id='DzTjkKse')
    🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
    🔗 disease (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
    🔗 tissue (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
Labels:
  🏷️ tissues (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
  🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
  🏷️ diseases (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/f8f7a51b6ba98f17caa68a95b5e5fbaf165f13a1/6584d/_images/a96ca697249673032ab72b79cbf81f6a3b41aac908d8750833ab4ae7ab45353e.svg

Examine the currently available cell types and tissues:

lb.CellType.filter().df()

	name	ontology_id	abbr	synonyms	description	bionty_source_id	updated_at	created_by_id
id
BxNjby0x	T cell	CL:0000084	None	T-lymphocyte|T-cell|T lymphocyte	A Type Of Lymphocyte Whose Defining Characteri...	Pdmu	2023-10-04 16:39:03	DzTjkKse
J7hHC8SK	hepatocyte	CL:0000182	None	None	The Main Structural Component Of The Liver. Th...	Pdmu	2023-10-04 16:39:03	DzTjkKse
m91LZBDZ	hematopoietic stem cell	CL:0000037	None	blood forming stem cell|hemopoietic stem cell|HSC	A Stem Cell From Which All Cells Of The Lympho...	Pdmu	2023-10-04 16:39:03	DzTjkKse

lb.Tissue.filter().df()

	name	ontology_id	abbr	synonyms	description	bionty_source_id	updated_at	created_by_id
id
sm45H0wI	heart	UBERON:0000948	None	vertebrate heart|chambered heart	A Myogenic Muscular Circulatory Organ Found In...	pOEi	2023-10-04 16:39:03	DzTjkKse
j9lTWyWV	kidney	UBERON:0002113	None	None	A Paired Organ Of The Urinary Tract Which Has ...	pOEi	2023-10-04 16:39:03	DzTjkKse
7HcGzG0l	brain	UBERON:0000955	None	None	The Brain Is The Center Of The Nervous System ...	pOEi	2023-10-04 16:39:03	DzTjkKse
HHKnN309	liver	UBERON:0002107	None	None	An Exocrine Gland Which Secretes Bile And Func...	pOEi	2023-10-04 16:39:03	DzTjkKse

Processing the dataset#

To track our data transformation, we create a new Transform of type “pipeline”:

transform = ln.Transform(
    name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)

Set the current tracking to the new transform:

ln.track(transform)

💡 Transform(id='S0N0uJgMJCM0Yo', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-04 16:39:04, created_by_id='DzTjkKse')

💡 Run(id='eZ2xy6r8PBUFGbK7z5Ss', run_at=2023-10-04 16:39:04, transform_id='S0N0uJgMJCM0Yo', created_by_id='DzTjkKse')

Get a backed AnnData object#

file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()

adata = file.backed()
adata

AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object mini_anndata_with_obs.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

adata.obs[["cell_type", "disease"]].value_counts()

cell_type                disease                   
T cell                   chronic kidney disease        10
hematopoietic stem cell  liver lymphoma                10
hepatocyte               cardiac ventricle disorder    10
my new cell type         Alzheimer disease             10
dtype: int64

Subset dataset to specific cell types and diseases#

Create the subset:

subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)

adata_subset = adata[subset_obs]
adata_subset

AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']

adata_subset.obs[["cell_type", "disease"]].value_counts()

cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
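
As a rough illustration of what the boolean mask above does, here is a plain-Python sketch that avoids pandas and AnnData. The `obs` list is a hypothetical stand-in rebuilt from the value counts shown earlier; the real objects are a DataFrame-like `adata.obs` and a backed accessor.

```python
from collections import Counter

# Hypothetical stand-in for adata.obs: one dict per observation.
obs = (
    [{"cell_type": "my new cell type", "disease": "Alzheimer disease"}] * 10
    + [{"cell_type": "hepatocyte", "disease": "cardiac ventricle disorder"}] * 10
    + [{"cell_type": "T cell", "disease": "chronic kidney disease"}] * 10
    + [{"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"}] * 10
)

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per observation; both conditions must hold (logical AND).
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) combinations, similar to value_counts().
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(len(subset))  # 20
for combo, n in counts.items():
    print(combo, n)
```

Because the two conditions are combined with AND, an observation is kept only if both its cell type and its disease are in the respective lists, which is why 20 of the 40 observations remain.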

This subset can now be registered:

file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    field=lb.Gene.ensembl_gene_id,
)

/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

❗    received 99 unique terms, 1 empty/duplicated term is ignored

❗    99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...

❗    no validated features, skip creating feature set

❗    1 term (25.00%) is not validated for name: cell_type_id

file_subset.save()

Add labels to the features; all of them validate:

cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

file_subset.labels.add(cell_types, feature=features.cell_type)
file_subset.labels.add(tissues, feature=features.tissue)
file_subset.labels.add(diseases, feature=features.disease)

file_subset.describe()

File(id='NiYKyAOJzuEdIsbWn1ym', key='subset/mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', updated_at=2023-10-04 16:39:04)

Provenance:
  🗃️ storage: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
  🧩 transform: Transform(id='S0N0uJgMJCM0Yo', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-04 16:39:04, created_by_id='DzTjkKse')
  👣 run: Run(id='eZ2xy6r8PBUFGbK7z5Ss', run_at=2023-10-04 16:39:04, transform_id='S0N0uJgMJCM0Yo', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
Features:
  obs: FeatureSet(id='s7B5EKDcidM4VuJhCEOP', n=3, registry='core.Feature', hash='Ao1PZItnk13m5SR7EriH', updated_at=2023-10-04 16:39:03, modality_id='vMaV6V7s', created_by_id='DzTjkKse')
    🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
    🔗 disease (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
    🔗 tissue (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
Labels:
  🏷️ tissues (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
  🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
  🏷️ diseases (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'

Examine data flow#

Common questions that might arise are:

Which h5ad file is in the subset subfolder?
Which notebook ingested this file?
By whom?
And which file is its parent?

Let’s answer this using LaminDB:

Query a subsetted .h5ad file containing “hematopoietic stem cell” and “T cell” to learn which h5ad file is in the subset subfolder:

cell_types_bt_lookup = lb.CellType.lookup()

my_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()
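
The keyword arguments in the filter above follow Django-style lookups (`key__startswith`, `cell_types__in`). The following plain-Python sketch shows the matching logic such a query is assumed to express; `FileRecord`, `registry` and `matches` are hypothetical names, and the real query runs in the database rather than over a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical stand-in for a row in the File registry."""
    key: str
    suffix: str
    cell_types: set = field(default_factory=set)


labels = {"T cell", "hepatocyte", "hematopoietic stem cell"}
registry = [
    FileRecord("mini_anndata_with_obs.h5ad", ".h5ad", set(labels)),
    FileRecord("subset/mini_anndata_with_obs.h5ad", ".h5ad", set(labels)),
]


def matches(record, suffix, key_prefix, cell_types_in):
    """Combine all conditions with AND, as keyword filters usually are."""
    return (
        record.suffix == suffix                             # suffix=".h5ad"
        and record.key.startswith(key_prefix)               # key__startswith="subset"
        and bool(record.cell_types & set(cell_types_in))    # at least one linked label matches
    )


hits = [
    r for r in registry
    if matches(r, ".h5ad", "subset", ["hematopoietic stem cell", "T cell"])
]
first = hits[0] if hits else None  # analogous to .first()
print(first.key)  # subset/mini_anndata_with_obs.h5ad
```

Only the file under the subset prefix satisfies all three conditions, so it is the one returned.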

my_subset.view_flow()

https://d33wubrfki0l68.cloudfront.net/017f93bb2b96e444c66ad633426a096917829b26/55a8d/_images/744dfd9dec9a4be9ffff3f5cdc19f52ffbe2df01e4467fca06b81b5de00e8985.svg
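
The lineage drawn by view_flow() can be thought of as a small graph in which each file points to the run that created it and each run lists the files it read. Here is a minimal sketch under that assumption; `FileNode`, `RunNode` and `lineage` are hypothetical names and not part of LaminDB, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunNode:
    """Hypothetical run: which transform was executed, by whom, reading which files."""
    transform_name: str
    created_by: str
    inputs: list = field(default_factory=list)


@dataclass
class FileNode:
    """Hypothetical file: its key and the run that created it."""
    key: str
    run: Optional[RunNode] = None


notebook_run = RunNode("Analysis flow", "testuser1")
parent = FileNode("mini_anndata_with_obs.h5ad", run=notebook_run)

pipeline_run = RunNode(
    "Subset to T-cells and liver lymphoma", "testuser1", inputs=[parent]
)
child = FileNode("subset/mini_anndata_with_obs.h5ad", run=pipeline_run)


def lineage(file):
    """Walk upstream and collect (file key, transform name, user) tuples."""
    steps = []
    stack = [file]
    while stack:
        current = stack.pop()
        if current.run is None:
            continue  # no recorded provenance for this file
        steps.append(
            (current.key, current.run.transform_name, current.run.created_by)
        )
        stack.extend(current.run.inputs)  # continue with the parent files
    return steps


for step in lineage(child):
    print(step)
```

Walking upstream from the subset answers the questions above in one pass: which transform produced the file, who ran it, and which file served as its parent.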