Append a new batch of data#
Here, weβll learn
how to standardize a less well curated dataset
how to append it as a new batch of data to the growing versioned dataset
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
ln.track()
π‘ loaded instance: testuser1/test-scrna (lamindb 0.55.0)
π‘ notebook imports: lamindb==0.55.0 lnschema_bionty==0.31.2 pandas==1.5.3
π‘ Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna2', version='0', type=notebook, updated_at=2023-10-04 16:40:02, created_by_id='DzTjkKse')
π‘ Run(id='5BdVWAv5slv0iNhEPo6T', run_at=2023-10-04 16:40:02, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
hello
within hello
Standardize a less-well curated dataset#
Access #
Letβs now consider a dataset with less-well curated features:
adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs Γ n_vars = 70 Γ 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
We see that this dataset is indexed by gene symbols. Because we assume that in-house, we index all datasets by Ensembl IDs, weβll need to re-curate:
adata.var.head()
n_counts | highly_variable | |
---|---|---|
index | ||
HES4 | 1153.387451 | True |
TNFRSF4 | 304.358154 | True |
SSU72 | 2530.272705 | False |
PARK7 | 7451.664062 | False |
RBP7 | 272.811035 | True |
We are still working with human data, and can globally instruct bionty
to assume human:
lb.settings.species = "human"
hello
Validate #
Curate & validate genes#
lb.Gene.validate(adata.var.index, lb.Gene.symbol);
hello
β 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...
lb.Gene.inspect(adata.var.index, lb.Gene.symbol);
hello
β 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...
detected 54 terms with synonyms: ATPIF1, C1orf228, CCBL2, AC079767.4, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, SEPT7, WBSCR22, RSBN1L-AS1, CCDC132, ...
β standardize terms via .standardize()
detected 5 Gene terms in Bionty for symbol: 'RN7SL1', 'SNORD3B-2', 'GPX1', 'IGLL5', 'SOD2'
β add records from Bionty to your Gene registry via .from_values()
couldn't validate 11 terms: 'TMBIM4-1', 'RP11-291B21.2', 'RP11-620J15.3', 'RP11-277L2.3', 'RP11-489E7.4', 'RP11-156E8.1', 'RP11-390E23.6', 'CTD-3138B18.5', 'RP11-782C8.1', 'AC084018.1', 'RP3-467N11.1'
β if you are sure, create new records via ln.Gene() and save to your registry
Standardize symbols and register additional symbols from Bionty:
adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
gene_records = lb.Gene.from_values(adata.var.index, lb.Gene.symbol)
ln.save(gene_records)
hello
hello
hello
β did not create Gene records for 11 non-validated symbols: 'AC084018.1', 'CTD-3138B18.5', 'RP11-156E8.1', 'RP11-277L2.3', 'RP11-291B21.2', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-620J15.3', 'RP11-782C8.1', 'RP3-467N11.1', 'TMBIM4-1'
We only want to register data with validated genes: data related to other features wouldnβt be useful to us, anyway.
Hence, we submet the AnnData
object to the validated genes:
validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)
adata_validated = adata[:, validated].copy()
hello
β 11 terms (1.40%) are not validated for symbol: RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, RP3-467N11.1, RP11-390E23.6, RP11-489E7.4, RP11-291B21.2, RP11-620J15.3, TMBIM4-1, AC084018.1, CTD-3138B18.5
Now, we need to convert gene symbols into ensembl gene ids:
records = lb.Gene.filter(id__in=[record.id for record in gene_records])
mapper = pd.DataFrame(records.values_list("symbol", "ensembl_gene_id")).set_index(0)[1]
adata_validated.var.insert(0, "gene_symbol", adata_validated.var.index)
adata_validated.var.rename(index=mapper, inplace=True)
adata_validated.var.head()
gene_symbol | n_counts | highly_variable | |
---|---|---|---|
ENSG00000188290 | HES4 | 1153.387451 | True |
ENSG00000186827 | TNFRSF4 | 304.358154 | True |
ENSG00000160075 | SSU72 | 2530.272705 | False |
ENSG00000116288 | PARK7 | 7451.664062 | False |
ENSG00000162444 | RBP7 | 272.811035 | True |
Curate & validate cell types#
Inspection shows none of the terms are validated:
inspector = lb.CellType.inspect(adata_validated.obs.cell_type)
hello
β received 9 unique terms, 61 empty/duplicated terms are ignored
β 9 terms (100.00%) are not validated for name: Dendritic cells, CD19+ B, CD4+/CD45RO+ Memory, CD8+ Cytotoxic T, CD4+/CD25 T Reg, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD34+
couldn't validate 9 terms: 'CD4+/CD25 T Reg', 'CD56+ NK', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'Dendritic cells', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+', 'CD14+ Monocytes', 'CD8+ Cytotoxic T'
β if you are sure, create new records via ln.CellType() and save to your registry
Let us search the cell type names from the public ontology, and add the name value found in the AnnData
object as a synonym to the top match found in the public ontology.
bionty = lb.CellType.bionty() # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
ontology_id = (
bionty.search(name).iloc[0].ontology_id
) # search the public ontology and use the ontology id of the top match
record = lb.CellType.from_bionty(
ontology_id=ontology_id
) # create a record by loading the top match from bionty
name_mapper[name] = record.name # map the original name to standardized name
record.save() # save the record
record.add_synonym(
name
) # add the original name as a synonym, so that next time, we can just run .standardize()
hello
hello
β now recursing through parents: this only happens once, but is much slower than bulk saving
hello
hello
hello
hello
hello
hello
hello
hello
hello
β now recursing through parents: this only happens once, but is much slower than bulk saving
hello
hello
hello
hello
hello
hello
β now recursing through parents: this only happens once, but is much slower than bulk saving
hello
hello
hello
hello
hello
β now recursing through parents: this only happens once, but is much slower than bulk saving
hello
hello
hello
hello
β now recursing through parents: this only happens once, but is much slower than bulk saving
hello
hello
We can now standardize cell type names using the search-based mapper:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
Now, all cell types are validated:
validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)
hello
We donβt want to store any of the other metadata columns:
for column in ["n_genes", "percent_mito", "louvain"]:
adata.obs.drop(column, axis=1)
Register #
modalities = ln.Modality.lookup()
experimental_factors = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
features = ln.Feature.lookup()
hello
hello
hello
hello
file = ln.File.from_anndata(
adata_validated,
description="10x reference adata",
field=lb.Gene.ensembl_gene_id,
modality=modalities.rna,
)
hello
hello
hello
hello
β 3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
hello
As we do not want to manage the remaining unvalidated terms in registries, we can save the file.
file.save()
file.labels.add(adata_validated.obs.cell_type, features.cell_type)
file.labels.add(species.human, feature=features.species)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
file.describe()
hello
hello
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
File(id='UKDwd6U73Y9ZbFkPS4IC', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=660792, hash='GU-hbSJqGkENOxVKFLmvbA', hash_type='md5', updated_at=2023-10-04 16:40:30)
Provenance:
ποΈ storage: Storage(id='62dM2FBg', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-04 16:38:48, created_by_id='DzTjkKse')
π« transform: Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna2', version='0', type=notebook, updated_at=2023-10-04 16:40:02, created_by_id='DzTjkKse')
π£ run: Run(id='5BdVWAv5slv0iNhEPo6T', run_at=2023-10-04 16:40:02, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
π€ created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:48)
Features:
var: FeatureSet(id='MdeWQREeWEr692zXBk3m', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-10-04 16:40:30, modality_id='8B60C3n3', created_by_id='DzTjkKse')
'CCT6A', 'PLD4', 'ADD3', 'UBALD2', 'BCAS4', 'MPP1', 'SF3B2', 'PLP2', 'ADISSP', 'LY86', 'CDK6', 'ADSL', 'ARHGAP45', 'NKG7', 'UPP1', 'DYNLRB1', 'YWHAB', 'TINF2', 'CFD', 'GATA2', ...
obs: FeatureSet(id='NgdCrmgeLfsl8aREwy1x', n=1, registry='core.Feature', hash='unk4vdl7-1zZScbA6xft', updated_at=2023-10-04 16:40:30, modality_id='FnQ7xHJL', created_by_id='DzTjkKse')
π cell_type (9, bionty.CellType): 'dendritic cell', 'monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'central memory CD8-positive, alpha-beta T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD8-positive, alpha-beta memory T cell', 'B cell, CD19-positive'
external: FeatureSet(id='luodNh86nnjnElRIaj6t', n=2, registry='core.Feature', hash='Xlk8U1DLUjEsqhGGPltC', updated_at=2023-10-04 16:40:30, modality_id='FnQ7xHJL', created_by_id='DzTjkKse')
π assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
π species (1, bionty.Species): 'human'
Labels:
π·οΈ species (1, bionty.Species): 'human'
π·οΈ cell_types (9, bionty.CellType): 'dendritic cell', 'monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'central memory CD8-positive, alpha-beta T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD8-positive, alpha-beta memory T cell', 'B cell, CD19-positive'
π·οΈ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file.view_flow()
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
hello
hello
hello
hello
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
hello
hello
hello
hello
Append a file to the growing dataset#
import lamindb as ln
Query the previous dataset and the file we just created:
dataset_v1 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="1").one()
new_file = ln.File.filter().order_by("-created_at").first()
Create a new version of the dataset by sharding it across the new file and the file in the previous version of the dataset:
dataset_v2 = ln.Dataset(
[new_file, dataset_v1.file],
is_new_version_of=dataset_v1,
)
dataset_v2.save()
# annotate the dataset
dataset_v2.labels.add_from(new_file)
dataset_v2.labels.add_from(dataset_v1)
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
hello
hello
hello
hello
hello
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
Version 2 of the dataset covers significantly more conditions.
dataset_v2.describe()
hello
hello
hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
hello
within hello
Dataset(id='D1s4vH9rGh45eJ13uQVj', name='My versioned scRNA-seq dataset', version='2', hash='us9ZADbZkmjSnRqVFu2W', updated_at=2023-10-04 16:40:38)
Provenance:
π« transform: Transform(id='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna2', version='0', type=notebook, updated_at=2023-10-04 16:40:02, created_by_id='DzTjkKse')
π£ run: Run(id='5BdVWAv5slv0iNhEPo6T', run_at=2023-10-04 16:40:02, transform_id='ManDYgmftZ8Cz8', created_by_id='DzTjkKse')
π initial_version: Dataset(id='D1s4vH9rGh45eJ13uQiV', name='My versioned scRNA-seq dataset', version='1', hash='WEFcMZxJNmMiUOFrcSTaig', updated_at=2023-10-04 16:39:54, transform_id='Nv48yAceNSh8z8', run_id='zyBkh49M3qcVFVbvsvZ5', file_id='D1s4vH9rGh45eJ13uQiV', created_by_id='DzTjkKse')
π€ created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:48)
Features:
var: FeatureSet(id='XCLaYSObdUzJQGfop2pc', n=37257, type='number', registry='bionty.Gene', hash='bWfNZ3hy-yGnPj65T3kc', updated_at=2023-10-04 16:40:32, created_by_id='DzTjkKse')
'FAM50B', 'None', 'None', 'BLOC1S3', 'COA5', 'CCDC85B', 'LINC01214', 'TRDV1', 'IGHV3-23', 'ANKRD30A', 'LINC01906', 'TLNRD1', 'None', 'None', 'IGHD5-5', 'ERN1', 'PRKAR1A', 'AMBP', 'GRM7-AS2', 'DIAPH3-AS2', ...
obs: FeatureSet(id='nOcOg6PfZo1yonP1aKOL', n=4, registry='core.Feature', hash='4xEiqlhlgIHH9Nls3xEk', updated_at=2023-10-04 16:39:47, modality_id='FnQ7xHJL', created_by_id='DzTjkKse')
π donor (12, core.ULabel): 'A52', 'A29', '621B', 'A31', '582C', 'A37', 'A36', '637C', 'D496', 'A35', ...
π cell_type (39, bionty.CellType): 'dendritic cell', 'monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'central memory CD8-positive, alpha-beta T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD8-positive, alpha-beta memory T cell', 'B cell, CD19-positive', 'plasma cell', ...
π assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
π tissue (17, bionty.Tissue): 'thymus', 'ileum', 'thoracic lymph node', 'mesenteric lymph node', 'omentum', 'sigmoid colon', 'bone marrow', 'blood', 'caecum', 'liver', ...
external: FeatureSet(id='luodNh86nnjnElRIaj6t', n=2, registry='core.Feature', hash='Xlk8U1DLUjEsqhGGPltC', updated_at=2023-10-04 16:40:30, modality_id='FnQ7xHJL', created_by_id='DzTjkKse')
π assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
π species (1, bionty.Species): 'human'
Labels:
π·οΈ species (1, bionty.Species): 'human'
π·οΈ tissues (17, bionty.Tissue): 'thymus', 'ileum', 'thoracic lymph node', 'mesenteric lymph node', 'omentum', 'sigmoid colon', 'bone marrow', 'blood', 'caecum', 'liver', ...
π·οΈ cell_types (39, bionty.CellType): 'dendritic cell', 'monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'central memory CD8-positive, alpha-beta T cell', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD8-positive, alpha-beta memory T cell', 'B cell, CD19-positive', 'plasma cell', ...
π·οΈ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2', '10x 3' v3'
π·οΈ ulabels (12, core.ULabel): 'A52', 'A29', '621B', 'A31', '582C', 'A37', 'A36', '637C', 'D496', 'A35', ...
View the flow:
dataset_v2.view_flow()
hello
within hello
hello
within hello
hello
hello
hello
hello
within hello
hello
within hello
hello
hello
hello
hello
within hello
hello
within hello
hello
hello
hello
hello