Iteratively train an ML model on a dataset#

In the previous tutorial, we loaded an entire dataset into memory to perform a simple analysis.

Here, we’ll iterate over the files within the dataset to train an ML model, so that only one file needs to be in memory at a time.

import lamindb as ln
import anndata as ad
import numpy as np

💡 loaded instance: testuser1/test-scrna (lamindb 0.55.0)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.55.0 numpy==1.25.2 scgen==2.1.1

💡 Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-10-04 16:41:12, created_by_id='DzTjkKse')

💡 Run(id='ru99Hlg0EzWVqI4SJjkG', run_at=2023-10-04 16:41:12, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')

Setup#

dataset_v2 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

dataset_v2

Dataset(id='D1s4vH9rGh45eJ13uQVj', name='My versioned scRNA-seq dataset', version='2', hash='us9ZADbZkmjSnRqVFu2W', updated_at=2023-10-04 16:40:38, transform_id='ManDYgmftZ8Cz8', run_id='5BdVWAv5slv0iNhEPo6T', initial_version_id='D1s4vH9rGh45eJ13uQiV', created_by_id='DzTjkKse')

We import scGen, which is built on scvi-tools.

import scgen

Similar to what we did in the previous tutorial, we could load the entire dataset into memory and train a model in 4 lines of code.

Let us instead load all file records:

file1, file2 = dataset_v2.files.list()

We’d like some context on what the first file contains and where it’s from:

file1.describe()
file1.view_flow()

File(id='D1s4vH9rGh45eJ13uQiV', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-10-04 16:39:54)

Provenance:
  🗃️ storage: Storage(id='62dM2FBg', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-04 16:38:48, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-10-04 16:38:54, created_by_id='DzTjkKse')
  👣 run: Run(id='zyBkh49M3qcVFVbvsvZ5', run_at=2023-10-04 16:38:54, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:48)
  ⬇️ input_of (core.Run): ['2023-10-04 16:40:02', '2023-10-04 16:40:47']
Features:
  var: FeatureSet(id='epYAIDvjkMEktCtQmZlq', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-10-04 16:39:42, modality_id='8B60C3n3', created_by_id='DzTjkKse')
    'UBE2V1', 'ZNF407', 'APC2', 'None', 'None', 'LRP2BP-AS1', 'None', 'None', 'None', 'HSPA1B', 'None', 'None', 'PRKG2-AS1', 'SAR1A', 'PIPOX', 'None', 'RPS6KA4', 'None', 'MMRN1', 'ADIRF-AS1', ...
  obs: FeatureSet(id='nOcOg6PfZo1yonP1aKOL', n=4, registry='core.Feature', hash='4xEiqlhlgIHH9Nls3xEk', updated_at=2023-10-04 16:39:47, modality_id='FnQ7xHJL', created_by_id='DzTjkKse')
    🔗 donor (12, core.ULabel): 'A52', 'A29', '621B', 'A31', '582C', 'A37', 'A36', '637C', 'D496', 'A35', ...
    🔗 cell_type (32, bionty.CellType): 'CD4-positive helper T cell', 'lymphocyte', 'mucosal invariant T cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'germinal center B cell', 'progenitor cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'alveolar macrophage', 'CD16-negative, CD56-bright natural killer cell, human', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
    🔗 tissue (17, bionty.Tissue): 'mesenteric lymph node', 'skeletal muscle tissue', 'sigmoid colon', 'duodenum', 'lamina propria', 'lung', 'omentum', 'bone marrow', 'liver', 'spleen', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'mesenteric lymph node', 'skeletal muscle tissue', 'sigmoid colon', 'duodenum', 'lamina propria', 'lung', 'omentum', 'bone marrow', 'liver', 'spleen', ...
  🏷️ cell_types (32, bionty.CellType): 'CD4-positive helper T cell', 'lymphocyte', 'mucosal invariant T cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'germinal center B cell', 'progenitor cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'alveolar macrophage', 'CD16-negative, CD56-bright natural killer cell, human', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v1', '10x 3' v3', '10x 5' v2'
  🏷️ ulabels (12, core.ULabel): 'A52', 'A29', '621B', 'A31', '582C', 'A37', 'A36', '637C', 'D496', 'A35', ...

https://d33wubrfki0l68.cloudfront.net/dcca24a88b66aeb1ed290ed0f2263aeb45ee92c7/9642b/_images/08492809cdd73a6c2e1d1335a7a7e4e4ea4c10d547c58ae3a4ae56c7f6b1c18e.svg

We’ll need to make a decision on the features that we want to use for training the model.

Because each file is validated, all files are indexed by ensembl_gene_id in the var slot of AnnData, so we can intersect their gene sets directly.

shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
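The intersection above covers two files. As a minimal sketch of the same idea for any number of files, the following uses plain Python lists in place of LaminDB feature sets; the helper `shared_features` and the gene IDs are hypothetical stand-ins, not part of the lamindb API.

```python
from functools import reduce


def shared_features(feature_lists):
    """Return the features present in every list, ordered as in the first list."""
    if not feature_lists:
        return []
    common = reduce(set.intersection, (set(f) for f in feature_lists))
    # Keep the order of the first list so the column order is deterministic.
    seen = set()
    ordered = []
    for gene in feature_lists[0]:
        if gene in common and gene not in seen:
            seen.add(gene)
            ordered.append(gene)
    return ordered


# Hypothetical Ensembl gene IDs standing in for the var index of three files.
genes_a = ["ENSG00000139618", "ENSG00000141510", "ENSG00000157764"]
genes_b = ["ENSG00000141510", "ENSG00000157764", "ENSG00000012048"]
genes_c = ["ENSG00000157764", "ENSG00000141510"]

shared = shared_features([genes_a, genes_b, genes_c])
print(shared)
```

Keeping a stable order matters because the resulting list is used to subset the columns of every file, and the model expects the same column order each time.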

hello

within hello

hello

within hello

Train the model#

Let us load the first file into memory:

data_train1 = file1.load()[:, shared_genes_ensembl].copy()
data_train1

AnnData object with n_obs × n_vars = 1648 × 749
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

Train the model on this first file:

scgen.SCGEN.setup_anndata(data_train1)
vae = scgen.SCGEN(data_train1)
vae.train(max_epochs=1)  # we use max_epochs=1 to run it on CI
vae.save("saved_models/scgen1")

Load the second file and resume training the model:

data_train2 = file2.load()[:, shared_genes_ensembl].copy()
vae = scgen.SCGEN.load("saved_models/scgen1", data_train2)
vae.train(max_epochs=1)
vae.save("saved_models/scgen1", overwrite=True)
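The two cells above follow a load, train, save pattern that extends to any number of files. The sketch below shows that loop with a toy stand-in for the model: a running mean checkpointed to JSON. It does not use scGen; `train_on_shard`, `load_or_init`, and `train_iteratively` are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def train_on_shard(state, shard):
    """Toy 'training' step: update a running mean with one shard of values."""
    n_new = state["n_seen"] + len(shard)
    total = state["mean"] * state["n_seen"] + sum(shard)
    return {"mean": total / n_new, "n_seen": n_new}


def load_or_init(checkpoint_path):
    """Resume from the checkpoint if it exists, otherwise start fresh."""
    if checkpoint_path.exists():
        return json.loads(checkpoint_path.read_text())
    return {"mean": 0.0, "n_seen": 0}


def train_iteratively(shards, checkpoint_path):
    """Visit one shard at a time, so only that shard is held in memory."""
    for shard in shards:
        state = load_or_init(checkpoint_path)   # like SCGEN.load(...)
        state = train_on_shard(state, shard)    # like vae.train(...)
        checkpoint_path.write_text(json.dumps(state))  # like vae.save(..., overwrite=True)
    return load_or_init(checkpoint_path)


# Three small shards stand in for the files of a dataset.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
with tempfile.TemporaryDirectory() as tmp:
    final_state = train_iteratively(shards, Path(tmp) / "checkpoint.json")

print(final_state)
```

With a real model, the order in which files are visited can affect the result, which is one reason to record the file order alongside the saved weights.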

Save the model#

weights = ln.File("saved_models/scgen1/model.pt", description="My trained model")
weights.save()

Save latent representation as a new dataset#

latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)

adata_latent1 = ad.AnnData(X=latent1, obs=data_train1.obs)
adata_latent2 = ad.AnnData(X=latent2, obs=data_train2.obs)

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

Because the latent representation is low-dimensional, we can typically fit a very high number of observations into memory.

Hence, let’s store it as a concatenated adata.

adata_latent = ad.concat([adata_latent1, adata_latent2])
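To make the memory argument concrete, here is a rough back-of-the-envelope estimate. It assumes dense float32 storage, a hypothetical one million cells, the 36,503 genes reported by `file1.describe()`, and a latent dimension of 100 (an assumption; check the `n_latent` setting of your model). Expression matrices are usually stored sparse, so the first number is an upper bound rather than a measurement.

```python
def dense_size_gb(n_obs, n_features, bytes_per_value=4):
    """Size in gigabytes of a dense n_obs x n_features array (float32 by default)."""
    return n_obs * n_features * bytes_per_value / 1e9


n_obs = 1_000_000   # hypothetical number of cells
n_genes = 36_503    # gene count taken from the describe() output above
n_latent = 100      # assumed latent dimension

expression_gb = dense_size_gb(n_obs, n_genes)
latent_gb = dense_size_gb(n_obs, n_latent)

print(f"dense expression matrix: {expression_gb:.1f} GB")
print(f"latent representation:   {latent_gb:.1f} GB")
```

Under these assumptions the latent array is several hundred times smaller than the dense expression matrix, which is why concatenating the latent representations of all files is usually feasible.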

dataset_v2_latent = ln.Dataset(
    adata_latent,
    name="Latent representation of scRNA-seq dataset v2",
    description="For the original data, see dataset T5x0SkRJNviE0jYGbJKt",
)
dataset_v2_latent.save()

Let us look at the data flow:

dataset_v2_latent.view_flow()

https://d33wubrfki0l68.cloudfront.net/090fd2fdd83bcd7a8e49243655caf546d47bf549/87878/_images/483c5f0b488bab1bcbb2a99106af84ab43716d7cb9b7ea572166c67da6d00564.svg

Compare this with the model:

weights.view_flow()

https://d33wubrfki0l68.cloudfront.net/526e4cc066f2e51d822295148b4ca4f5aac7f863/014b0/_images/ea3736661345ff6976e1a4fbac24227a0aeb3d4308b8e8e3febc993381c43a24.svg

Annotate with labels:

dataset_v2_latent.labels.add_from(dataset_v2)

dataset_v2_latent.describe()

Dataset(id='ZKuYirH7CgP00nbWooY2', name='Latent representation of scRNA-seq dataset v2', description='For the original data, see dataset T5x0SkRJNviE0jYGbJKt', hash='m3gtjBb1KkjvaD7zL8ayTg', updated_at=2023-10-04 16:41:20)

Provenance:
  💫 transform: Transform(id='Qr1kIHvK506rz8', name='Iteratively train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-10-04 16:41:12, created_by_id='DzTjkKse')
  👣 run: Run(id='ru99Hlg0EzWVqI4SJjkG', run_at=2023-10-04 16:41:12, transform_id='Qr1kIHvK506rz8', created_by_id='DzTjkKse')
  📄 file: File(id='ZKuYirH7CgP00nbWooY2', suffix='.h5ad', accessor='AnnData', description='See dataset ZKuYirH7CgP00nbWooY2', size=838706, hash='m3gtjBb1KkjvaD7zL8ayTg', hash_type='md5', updated_at=2023-10-04 16:41:20, storage_id='62dM2FBg', transform_id='Qr1kIHvK506rz8', run_id='ru99Hlg0EzWVqI4SJjkG', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:38:48)
Features:
  external: FeatureSet(id='ijQCOdoQq4N5UaZE2KfO', n=5, registry='core.Feature', hash='0QqJ2zisRmC_q4Pnd1TD', updated_at=2023-10-04 16:41:21, modality_id='FnQ7xHJL', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2'
    🔗 cell_type (39, bionty.CellType): 'group 3 innate lymphoid cell', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive helper T cell', 'plasma cell', 'mucosal invariant T cell', 'classical monocyte', 'gamma-delta T cell', 'animal cell', 'lymphocyte', 'regulatory T cell', ...
    🔗 tissue (17, bionty.Tissue): 'lamina propria', 'skeletal muscle tissue', 'spleen', 'mesenteric lymph node', 'caecum', 'duodenum', 'lung', 'ileum', 'thoracic lymph node', 'transverse colon', ...
    🔗 donor (12, core.ULabel): 'A52', 'A29', '621B', 'A31', '582C', 'A37', 'A36', '637C', 'D496', 'A35', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'lamina propria', 'skeletal muscle tissue', 'spleen', 'mesenteric lymph node', 'caecum', 'duodenum', 'lung', 'ileum', 'thoracic lymph node', 'transverse colon', ...
  🏷️ cell_types (39, bionty.CellType): 'group 3 innate lymphoid cell', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive helper T cell', 'plasma cell', 'mucosal invariant T cell', 'classical monocyte', 'gamma-delta T cell', 'animal cell', 'lymphocyte', 'regulatory T cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', 'single-cell RNA sequencing', '10x 5' v1', '10x 5' v2'
  🏷️ ulabels (12, core.ULabel): 'A52', 'A29', '621B', 'A31', '582C', 'A37', 'A36', '637C', 'D496', 'A35', ...

# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna

💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env

✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna