Analysis flow#
Here, we’ll track typical data transformations like subsetting that occur during analysis.
If exploring more generally, read this first: Project flow.
Setup#
# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
✅ saved: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
lb.settings.species = "human" # globally set species
lb.settings.auto_save_parents = False
💡 loaded instance: testuser1/analysis-usecase (lamindb 0.55.0)
ln.track()
💡 notebook imports: lamindb==0.55.0 lnschema_bionty==0.31.2
💡 Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-10-04 16:38:57, created_by_id='DzTjkKse')
💡 Run(id='7I0db5EDIUKofs2MXUNS', run_at=2023-10-04 16:38:57, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')
Track cell types, tissues and diseases#
We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:
adata = ln.dev.datasets.anndata_with_obs()
adata
AnnData object with n_obs × n_vars = 40 × 100
obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'
adata.var_names[:5]
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
'ENSG00000000457', 'ENSG00000000460'],
dtype='object')
adata.obs[["tissue", "cell_type", "disease"]].value_counts()
tissue cell_type disease
brain my new cell type Alzheimer disease 10
heart hepatocyte cardiac ventricle disorder 10
kidney T cell chronic kidney disease 10
liver hematopoietic stem cell liver lymphoma 10
dtype: int64
Register biological metadata and link to the dataset#
As a first step, we register the AnnData object with LaminDB using from_anndata():
file = ln.File.from_anndata(
adata, key="mini_anndata_with_obs.h5ad", field=lb.Gene.ensembl_gene_id
)
... storing 'cell_type' as categorical
... storing 'cell_type_id' as categorical
... storing 'tissue' as categorical
... storing 'disease' as categorical
❗ received 99 unique terms, 1 empty/duplicated term is ignored
❗ 99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
❗ no validated features, skip creating feature set
❗ 4 terms (100.00%) are not validated for name: cell_type, cell_type_id, tissue, disease
❗ no validated features, skip creating feature set
file.save()
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)
❗ did not create CellType record for 1 non-validated name: 'my new cell type'
Apart from 'my new cell type', for which no CellType record was created, all values were validated and contain no typos. Let's save the records to their registries:
ln.save(cell_types)
ln.save(tissues)
ln.save(diseases)
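The validation step above separates incoming values into those that match a registry and those that do not. As a minimal, library-independent sketch of that idea (this is not LaminDB's implementation; `known_cell_types` and `split_validated` are hypothetical names introduced only for illustration):

```python
# Hypothetical reference vocabulary; in the notebook this role is played
# by the CellType registry backed by a public ontology.
known_cell_types = {"T cell", "hepatocyte", "hematopoietic stem cell"}


def split_validated(values, vocabulary):
    """Return (validated, non_validated) unique values in first-seen order."""
    unique = dict.fromkeys(values)  # de-duplicate while keeping order
    validated = [v for v in unique if v in vocabulary]
    non_validated = [v for v in unique if v not in vocabulary]
    return validated, non_validated


# Values mirroring the cell_type column of the example dataset
observed = (
    ["my new cell type"] * 10
    + ["hepatocyte"] * 10
    + ["T cell"] * 10
    + ["hematopoietic stem cell"] * 10
)

validated, non_validated = split_validated(observed, known_cell_types)
print(validated)
print(non_validated)
```

Only the validated values would be turned into records here; the non-validated value is reported and left for the user to inspect.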
We also need some features to bucket these labels:
ln.Feature(name="cell_type", type="category").save()
ln.Feature(name="tissue", type="category").save()
ln.Feature(name="disease", type="category").save()
features = ln.Feature.lookup()
Link the labels to the file, one feature at a time:
file.labels.add(cell_types, feature=features.cell_type)
file.labels.add(tissues, feature=features.tissue)
file.labels.add(diseases, feature=features.disease)
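Conceptually, each call groups a set of labels under the feature they belong to. A toy stand-in for that grouping in plain Python (the `labels_by_feature` dictionary and `add_labels` helper are hypothetical and far simpler than LaminDB's registries; the label values are those of the example dataset):

```python
# Hypothetical in-memory stand-in for "labels grouped under a feature"
labels_by_feature = {}


def add_labels(labels, feature):
    """Attach labels to a feature name, skipping duplicates."""
    bucket = labels_by_feature.setdefault(feature, [])
    for label in labels:
        if label not in bucket:
            bucket.append(label)


add_labels(["T cell", "hepatocyte", "hematopoietic stem cell"], feature="cell_type")
add_labels(["heart", "kidney", "brain", "liver"], feature="tissue")
add_labels(
    [
        "cardiac ventricle disorder",
        "chronic kidney disease",
        "Alzheimer disease",
        "liver lymphoma",
    ],
    feature="disease",
)

for feature, labels in labels_by_feature.items():
    print(feature, len(labels), labels)
```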
file.describe()
File(id='wqCsMTtPOLAzn09jPJXa', key='mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', updated_at=2023-10-04 16:38:58)
Provenance:
🗃️ storage: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
💫 transform: Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-10-04 16:38:57, created_by_id='DzTjkKse')
👣 run: Run(id='7I0db5EDIUKofs2MXUNS', run_at=2023-10-04 16:38:57, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
Features:
external: FeatureSet(id='s7B5EKDcidM4VuJhCEOP', n=3, registry='core.Feature', hash='Ao1PZItnk13m5SR7EriH', updated_at=2023-10-04 16:39:03, modality_id='vMaV6V7s', created_by_id='DzTjkKse')
🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🔗 disease (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
🔗 tissue (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
Labels:
🏷️ tissues (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🏷️ diseases (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
file.view_flow()
Examine the currently available cell types and tissues:
lb.CellType.filter().df()
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---
BxNjby0x | T cell | CL:0000084 | None | T-lymphocyte\|T-cell\|T lymphocyte | A Type Of Lymphocyte Whose Defining Characteri... | Pdmu | 2023-10-04 16:39:03 | DzTjkKse
J7hHC8SK | hepatocyte | CL:0000182 | None | None | The Main Structural Component Of The Liver. Th... | Pdmu | 2023-10-04 16:39:03 | DzTjkKse
m91LZBDZ | hematopoietic stem cell | CL:0000037 | None | blood forming stem cell\|hemopoietic stem cell\|HSC | A Stem Cell From Which All Cells Of The Lympho... | Pdmu | 2023-10-04 16:39:03 | DzTjkKse
lb.Tissue.filter().df()
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---
sm45H0wI | heart | UBERON:0000948 | None | vertebrate heart\|chambered heart | A Myogenic Muscular Circulatory Organ Found In... | pOEi | 2023-10-04 16:39:03 | DzTjkKse
j9lTWyWV | kidney | UBERON:0002113 | None | None | A Paired Organ Of The Urinary Tract Which Has ... | pOEi | 2023-10-04 16:39:03 | DzTjkKse
7HcGzG0l | brain | UBERON:0000955 | None | None | The Brain Is The Center Of The Nervous System ... | pOEi | 2023-10-04 16:39:03 | DzTjkKse
HHKnN309 | liver | UBERON:0002107 | None | None | An Exocrine Gland Which Secretes Bile And Func... | pOEi | 2023-10-04 16:39:03 | DzTjkKse
Processing the dataset#
To track our data transformation, we create a new Transform of type “pipeline”:
transform = ln.Transform(
name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)
Set the current tracking to the new transform:
ln.track(transform)
💡 Transform(id='S0N0uJgMJCM0Yo', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-04 16:39:04, created_by_id='DzTjkKse')
💡 Run(id='eZ2xy6r8PBUFGbK7z5Ss', run_at=2023-10-04 16:39:04, transform_id='S0N0uJgMJCM0Yo', created_by_id='DzTjkKse')
Get a backed AnnData object#
file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()
adata = file.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object mini_anndata_with_obs.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
hepatocyte cardiac ventricle disorder 10
my new cell type Alzheimer disease 10
dtype: int64
Subset dataset to specific cell types and diseases#
Create the subset:
subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
dtype: int64
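The mask combines two membership tests with a logical AND, so an observation is kept only if both its cell type and its disease are in the allowed lists. A plain-Python sketch of the same selection logic, using rows that mirror the value counts shown for the full dataset (the `rows` list and `keep` helper are illustrative and not part of the notebook):

```python
from collections import Counter

# (cell_type, disease) pairs mirroring the 4 groups of 10 observations
rows = (
    [("T cell", "chronic kidney disease")] * 10
    + [("hematopoietic stem cell", "liver lymphoma")] * 10
    + [("hepatocyte", "cardiac ventricle disorder")] * 10
    + [("my new cell type", "Alzheimer disease")] * 10
)

allowed_cell_types = {"T cell", "hematopoietic stem cell"}
allowed_diseases = {"liver lymphoma", "chronic kidney disease"}


def keep(row):
    """True only if both the cell type and the disease are allowed."""
    cell_type, disease = row
    return cell_type in allowed_cell_types and disease in allowed_diseases


subset = [row for row in rows if keep(row)]
counts = Counter(subset)
print(len(subset))
for pair, n in counts.items():
    print(pair, n)
```

Because the two conditions are ANDed, a row with an allowed cell type but a disease outside the list would be dropped; in this dataset no such row exists, so the subset contains exactly the two groups shown above.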
This subset can now be registered:
file_subset = ln.File.from_anndata(
adata_subset.to_memory(),
key="subset/mini_anndata_with_obs.h5ad",
field=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
❗ received 99 unique terms, 1 empty/duplicated term is ignored
❗ 99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
❗ no validated features, skip creating feature set
❗ 1 term (25.00%) is not validated for name: cell_type_id
file_subset.save()
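Add labels for each feature. Note that the values are read from the full adata.obs rather than from the subset, so 'my new cell type' is again not validated and all tissue and disease labels are linked: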
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)
file_subset.labels.add(cell_types, feature=features.cell_type)
file_subset.labels.add(tissues, feature=features.tissue)
file_subset.labels.add(diseases, feature=features.disease)
❗ did not create CellType record for 1 non-validated name: 'my new cell type'
file_subset.describe()
File(id='NiYKyAOJzuEdIsbWn1ym', key='subset/mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', updated_at=2023-10-04 16:39:04)
Provenance:
🗃️ storage: Storage(id='HLOSi8IF', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-10-04 16:38:55, created_by_id='DzTjkKse')
🧩 transform: Transform(id='S0N0uJgMJCM0Yo', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-10-04 16:39:04, created_by_id='DzTjkKse')
👣 run: Run(id='eZ2xy6r8PBUFGbK7z5Ss', run_at=2023-10-04 16:39:04, transform_id='S0N0uJgMJCM0Yo', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:55)
Features:
obs: FeatureSet(id='s7B5EKDcidM4VuJhCEOP', n=3, registry='core.Feature', hash='Ao1PZItnk13m5SR7EriH', updated_at=2023-10-04 16:39:03, modality_id='vMaV6V7s', created_by_id='DzTjkKse')
🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🔗 disease (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
🔗 tissue (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
Labels:
🏷️ tissues (4, bionty.Tissue): 'heart', 'kidney', 'brain', 'liver'
🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🏷️ diseases (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
Examine data flow#
Common questions that might arise are:
Which h5ad file is in the subset subfolder?
Which notebook ingested this file?
By whom?
And which file is its parent?
Let’s answer this using LaminDB:
Query a subsetted .h5ad file containing “hematopoietic stem cell” and “T cell” to learn which h5ad file is in the subset subfolder:
cell_types_bt_lookup = lb.CellType.lookup()
my_subset = ln.File.filter(
suffix=".h5ad",
key__startswith="subset",
cell_types__in=[
cell_types_bt_lookup.hematopoietic_stem_cell,
cell_types_bt_lookup.t_cell,
],
).first()
my_subset.view_flow()
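The questions posed in this section are lineage questions: a file points to the run that produced it, and a run records its transform, its creator, and its input files. A toy model of that bookkeeping in plain Python (the `Run` and `FileRecord` dataclasses and the `parents` helper are hypothetical and much simpler than LaminDB's registries; the sketch assumes the full file is recorded as an input of the subsetting run, and the names are taken from the outputs earlier on this page):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    transform_name: str
    created_by: str
    inputs: List["FileRecord"] = field(default_factory=list)


@dataclass
class FileRecord:
    key: str
    run: Optional[Run] = None


def parents(file):
    """Files that were inputs to the run that produced `file`."""
    return list(file.run.inputs) if file.run else []


# The notebook run ingested the full dataset
notebook_run = Run("Analysis flow", "testuser1")
full = FileRecord("mini_anndata_with_obs.h5ad", run=notebook_run)

# The pipeline run read the full dataset and wrote the subset
pipeline_run = Run("Subset to T-cells and liver lymphoma", "testuser1", inputs=[full])
subset = FileRecord("subset/mini_anndata_with_obs.h5ad", run=pipeline_run)

print(subset.run.transform_name)          # which transform produced the file
print(subset.run.created_by)              # by whom
print([p.key for p in parents(subset)])   # which file is its parent
```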
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase