Project flow#

LaminDB lets you track data flow across an entire project.

Here, we walk through example app uploads, pipelines & notebooks, following Schmidt et al., 2022.

A CRISPR screen reading out a phenotypic endpoint on T cells is paired with scRNA-seq to generate insights into IFN-γ production.

These insights get linked back to the original data through the steps taken in the project to provide context for interpretation & future decision making.

Setup#

Init a test instance:

!lamin init --storage ./mydata

Import lamindb:

import lamindb as ln
from IPython.display import Image, display

💡 loaded instance: testuser1/mydata (lamindb 0.55.0)

Steps#

In the following, we walk through example steps that cover different types of transforms (Transform).

Note

The full notebooks are in this repository.

App upload of phenotypic data #

ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)
output_path = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
output_file = ln.File(output_path, description="Raw data of schmidt22 crispra GWS")
output_file.save()

Hit identification in notebook #

In the dry lab, testuser2 accesses, transforms & registers data:

ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)
# access
input_file = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
# identify hits
input_df = input_file.load().set_index("id")
output_df = input_df[input_df["pos|fdr"] < 0.01].copy()
# register hits in output file
ln.File(output_df, description="hits from schmidt22 crispra GWS").save()
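
The hit-identification step above is a plain threshold on the `pos|fdr` column. As a minimal sketch of that filter without pandas or LaminDB, the snippet below applies the same cutoff to a few made-up rows. The gene ids and FDR values are illustrative placeholders, not values from the screen, and `select_hits` is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for the screen results; ids and values are invented for illustration.
TOY_CSV = """id,pos|fdr
GENE_A,0.001
GENE_B,0.2
GENE_C,0.0099
GENE_D,0.01
"""


def select_hits(csv_text, fdr_column="pos|fdr", threshold=0.01):
    """Return the ids of rows whose FDR is strictly below the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["id"] for row in reader if float(row[fdr_column]) < threshold]


hits = select_hits(TOY_CSV)
print(hits)
```

Note that the comparison is strict, so a row with an FDR of exactly 0.01 is not counted as a hit.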

Inspect data flow:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_flow()

https://d33wubrfki0l68.cloudfront.net/0d4d430358bf90c9627f47d5f8eb4e9b79536523/41d0d/_images/4b931915a22b458ac50a7db40bde6ef18545e40f1e31e14aac3a9c74b0e9b588.svg

Sequencer upload #

Upload files from sequencer:

ln.setup.login("testuser1")
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
# register output files of upload
upload_dir = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.File(upload_dir.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(upload_dir.parent / "fastq/perturbseq_R2_001.fastq.gz").save()
ln.setup.login("testuser2")

scRNA-seq bioinformatics pipeline #

Process the uploaded files using a script or workflow manager (see Pipelines) and obtain 3 output files in the directory filtered_feature_bc_matrix/:

transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")
ln.track(transform)
# access uploaded files as inputs for the pipeline
input_files = ln.File.filter(key__startswith="fastq/perturbseq").all()
input_paths = [file.stage() for file in input_files]
# register output files
output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)

Post-process these 3 files:

transform = ln.Transform(name="Postprocess Cell Ranger", version="2.0", type="pipeline")
ln.track(transform)
input_files = [f.stage() for f in output_files]
output_path = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
output_file = ln.File(output_path, description="perturbseq counts")
output_file.save()

❗ record with similar name exist! did you mean to load it?

	id	__ratio__
name
Cell Ranger	sH3y1OfXH46Qkq	90.0

💡 Transform(id='xmujzZnXu2lvTS', name='Postprocess Cell Ranger', version='2.0', type='pipeline', updated_at=2023-10-04 16:39:25, created_by_id='bKeW4T6E')

💡 Run(id='cguYyM1vjsAHbuRnzJLP', run_at=2023-10-04 16:39:25, transform_id='xmujzZnXu2lvTS', created_by_id='bKeW4T6E')
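
The warning in the output above results from a similarity check of the new transform name against existing names. As a rough sketch of the idea, the snippet below uses `difflib` from the Python standard library as a stand-in. This is an assumption for illustration only: LaminDB's actual scoring is different, so the numbers will not match the 90.0 shown above, and `similar_names` and its threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similar_names(new_name, existing_names, threshold=0.6):
    """Return (name, score) pairs for existing names that resemble new_name."""
    matches = []
    for name in existing_names:
        # Case-insensitive similarity in [0, 1]; a stand-in for the real scorer.
        score = SequenceMatcher(None, new_name.lower(), name.lower()).ratio()
        if score >= threshold:
            matches.append((name, round(score, 2)))
    return matches


existing = ["Cell Ranger", "Chromium 10x upload"]
print(similar_names("Postprocess Cell Ranger", existing))
```

A check like this only hints at a possible duplicate; in the walkthrough, the new transform is still created as a separate record.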

Inspect data flow:

output_files[0].view_flow()

https://d33wubrfki0l68.cloudfront.net/4a8f866525d50a19b5a2e750cd572ca67e536972/d2b6b/_images/6e76977b30bfc08f1541f4d418cdc3e89c489bbdbd803ca6b254795ad115ae11.svg

Integrate scRNA-seq & phenotypic data #

Integrate data in a notebook:

transform = ln.Transform(
    name="Perform single cell analysis, integrate with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
file_hits = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
screen_hits = file_hits.load()

import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
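
The two `filepath` values above depend on the file names the plotting calls are expected to write: the code assumes that figures land in `figures/` and that the `save` suffix is appended to a plot-type prefix (`umap` and `matrixplot_`). A small sketch of that naming logic, with the hypothetical helper `figure_path` standing in for it:

```python
def figure_path(prefix, filesuffix, figdir="figures"):
    """Compose the path that the notebook registers as the file key."""
    return f"{figdir}/{prefix}{filesuffix}"


# The first suffix starts with an underscore, the second does not,
# because the assumed prefixes differ ("umap" vs "matrixplot_").
umap_path = figure_path("umap", "_fig1_score-wgs-hits.png")
matrixplot_path = figure_path("matrixplot_", "fig2_score-wgs-hits-per-cluster.png")
print(umap_path)
print(matrixplot_path)
```

Both keys end up with exactly one underscore after the plot type, which is why the two `filesuffix` strings are written differently.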

Review results#

Let’s load one of the plots:

ln.track()
file = ln.File.filter(key__contains="figures/matrixplot").one()
file.stage()

display(Image(filename=file.path))

https://d33wubrfki0l68.cloudfront.net/b3b5eb3f53a7759762d1dca2d67bd76974729731/e5dd6/_images/f096e9d4768812e880e81babbd6eeae4f64efc120154dc379ad9c346ea2ebe9d.png

We see that the image file is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

file.view_flow()

https://d33wubrfki0l68.cloudfront.net/011f280d07ad6984452b4806f5a4efe56a078e17/86840/_images/37a1106cac82de00320e1a03fdefafa99dec19b7adceccdf4f897752b52ff844.svg

Alternatively, we can look at the sequence of transforms:

transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()
transform.parents.df()

	name	short_name	version	type	latest_report_id	source_file_id	reference	reference_type	initial_version_id	updated_at	created_by_id
id
DocWlf50vCGdSG	GWS CRIPSRa analysis	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:22	bKeW4T6E
xmujzZnXu2lvTS	Postprocess Cell Ranger	None	2.0	pipeline	None	None	None	None	None	2023-10-04 16:39:25	bKeW4T6E

transform.view_parents()

https://d33wubrfki0l68.cloudfront.net/eb06b3614369ec2426579d100bd7b229998de860/41b20/_images/d894276ae8705e8cdfc0f3faa2c44202fbb3b983b268a49ea69654443bcbee06.svg
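
Conceptually, the parents of a transform are the transforms that produced its input files. The sketch below models this with plain dictionaries. The file labels and the structures `produced_by` and `inputs_of` are hypothetical stand-ins, not LaminDB's schema; only the transform names are taken from this walkthrough.

```python
# Hypothetical, simplified model: which transform produced which file ...
produced_by = {
    "crispra_raw": "Upload GWS CRISPRa result",
    "hits": "GWS CRIPSRa analysis",
    "fastq": "Chromium 10x upload",
    "cellranger_out": "Cell Ranger",
    "perturbseq_counts": "Postprocess Cell Ranger",
}
# ... and which files each transform consumed.
inputs_of = {
    "GWS CRIPSRa analysis": ["crispra_raw"],
    "Cell Ranger": ["fastq"],
    "Postprocess Cell Ranger": ["cellranger_out"],
    "Perform single cell analysis, integrate with CRISPRa screen": [
        "hits",
        "perturbseq_counts",
    ],
}


def parents(transform):
    """Direct parents: transforms that produced any input of `transform`."""
    return sorted({produced_by[f] for f in inputs_of.get(transform, [])})


def ancestors(transform):
    """All upstream transforms, found by walking parents repeatedly."""
    seen = set()
    stack = [transform]
    while stack:
        for parent in parents(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


analysis = "Perform single cell analysis, integrate with CRISPRa screen"
print(parents(analysis))
print(ancestors(analysis))
```

In this toy model, the direct parents of the integration notebook are the two transforms listed in the table above, while `ancestors` also reaches the upload steps further upstream.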

Understand runs#

We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.

File objects are the inputs and outputs of runs.
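
To illustrate the idea of a global run context, here is a minimal sketch in plain Python. The classes `ToyTransform`, `ToyRun` and `ToyFile` and the functions `track`, `save_file` and `load_file` are hypothetical and only mimic the pattern; they are not LaminDB's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyTransform:
    name: str
    type: str


@dataclass
class ToyRun:
    transform: ToyTransform
    inputs: List["ToyFile"] = field(default_factory=list)
    outputs: List["ToyFile"] = field(default_factory=list)


@dataclass
class ToyFile:
    description: str
    run: Optional[ToyRun] = None  # the run that created this file


# Global context: set once per script or notebook, read by later calls.
current_run: Optional[ToyRun] = None


def track(transform: ToyTransform) -> ToyRun:
    """Start a new run for `transform` and make it the global context."""
    global current_run
    current_run = ToyRun(transform)
    return current_run


def save_file(description: str) -> ToyFile:
    """Register a new file as an output of the current run."""
    file = ToyFile(description, run=current_run)
    current_run.outputs.append(file)
    return file


def load_file(file: ToyFile) -> ToyFile:
    """Record an existing file as an input of the current run."""
    current_run.inputs.append(file)
    return file


track(ToyTransform("Upload GWS CRISPRa result", "app"))
raw = save_file("Raw data of schmidt22 crispra GWS")

track(ToyTransform("GWS CRIPSRa analysis", "notebook"))
load_file(raw)
hits = save_file("hits from schmidt22 crispra GWS")

print(hits.run.transform.name)
print([f.description for f in hits.run.inputs])
```

Because every save and load goes through the current run, each file can later be traced back to the transform that created it and to the files that transform read.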

Query by provenance#

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()

	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	transform_id	run_id	initial_version_id	updated_at	created_by_id
id
8m3q6CtdkGbhkCoY4F6z	LaHMxEPv	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	18368	TufBUAIQVzLPDJ4sCV_kTg	md5	DocWlf50vCGdSG	o8mBOGr1iWZQwZhLnHaQ	None	2023-10-04 16:39:22	bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform

Transform(id='VUvyNlSP2gGxeu', name='Upload GWS CRISPRa result', type='app', updated_at=2023-10-04 16:39:17, created_by_id='DzTjkKse')

And which user?

file.created_by

User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:39:24)

Which transforms were created by a given user?

users = ln.User.lookup()

ln.Transform.filter(created_by=users.testuser2).df()

	name	short_name	version	type	latest_report_id	source_file_id	reference	reference_type	initial_version_id	updated_at	created_by_id
id
DocWlf50vCGdSG	GWS CRIPSRa analysis	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:22	bKeW4T6E
sH3y1OfXH46Qkq	Cell Ranger	None	7.2.0	pipeline	None	None	None	None	None	2023-10-04 16:39:25	bKeW4T6E
xmujzZnXu2lvTS	Postprocess Cell Ranger	None	2.0	pipeline	None	None	None	None	None	2023-10-04 16:39:25	bKeW4T6E
AKgqswBfLxB9hi	Perform single cell analysis, integrate with C...	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:27	bKeW4T6E
1LCd8kco9lZUz8	Project flow	project-flow	0	notebook	None	None	None	None	None	2023-10-04 16:39:30	bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()

	name	short_name	version	type	latest_report_id	source_file_id	reference	reference_type	initial_version_id	updated_at	created_by_id
id
DocWlf50vCGdSG	GWS CRIPSRa analysis	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:22	bKeW4T6E
AKgqswBfLxB9hi	Perform single cell analysis, integrate with C...	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:27	bKeW4T6E
1LCd8kco9lZUz8	Project flow	project-flow	0	notebook	None	None	None	None	None	2023-10-04 16:39:30	bKeW4T6E
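
The queries above combine keyword conditions such as `created_by` and `type`. As a sketch of that filtering semantics in plain Python, the snippet below keeps the records that match every condition. `TransformRecord` and `filter_records` are hypothetical names for illustration; the records mirror a subset of the transforms registered in this walkthrough, with user handles in place of ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformRecord:
    name: str
    type: str
    created_by: str


# A subset of the transforms from this walkthrough, keyed by user handle.
RECORDS = [
    TransformRecord("Upload GWS CRISPRa result", "app", "testuser1"),
    TransformRecord("GWS CRIPSRa analysis", "notebook", "testuser2"),
    TransformRecord("Chromium 10x upload", "pipeline", "testuser1"),
    TransformRecord("Cell Ranger", "pipeline", "testuser2"),
    TransformRecord("Postprocess Cell Ranger", "pipeline", "testuser2"),
]


def filter_records(records, **conditions):
    """Keep records whose attributes equal all given keyword conditions."""
    return [
        r
        for r in records
        if all(getattr(r, key) == value for key, value in conditions.items())
    ]


notebooks_by_user2 = filter_records(RECORDS, created_by="testuser2", type="notebook")
print([r.name for r in notebooks_by_user2])
```

Adding a condition can only narrow the result, which is why the notebook query returns a subset of the rows from the per-user query.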

We can also view all recent additions to the entire database:

ln.view()

File

	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	transform_id	run_id	initial_version_id	updated_at	created_by_id
id
PDdTwykM7mr4lggF7byV	LaHMxEPv	figures/matrixplot_fig2_score-wgs-hits-per-clu...	.png	None	None	None	28814	H0Pxpa-fZOvigo74eXHZsQ	md5	AKgqswBfLxB9hi	lPhlAwPuFLWp9U33s0Hr	None	2023-10-04 16:39:29	bKeW4T6E
IOLRy2s78M3NQ0ezVqRA	LaHMxEPv	figures/umap_fig1_score-wgs-hits.png	.png	None	None	None	118999	1-WtAvRL1d_SSjZvMMOMkg	md5	AKgqswBfLxB9hi	lPhlAwPuFLWp9U33s0Hr	None	2023-10-04 16:39:29	bKeW4T6E
uTdDb9R2RvGmJibwuv6V	LaHMxEPv	schmidt22_perturbseq.h5ad	.h5ad	AnnData	perturbseq counts	None	20659936	la7EvqEUMDlug9-rpw-udA	md5	xmujzZnXu2lvTS	cguYyM1vjsAHbuRnzJLP	None	2023-10-04 16:39:27	bKeW4T6E
kwgnRHLbwok0FYnpAZ0f	LaHMxEPv	perturbseq/filtered_feature_bc_matrix/matrix.m...	.mtx.gz	None	None	None	6	PtjMi2heO_8hpvIga-slLw	md5	sH3y1OfXH46Qkq	u1gUjLsnS2En4lqiZJEq	None	2023-10-04 16:39:25	bKeW4T6E
KHFw3UxNcey0DQjaFC9i	LaHMxEPv	perturbseq/filtered_feature_bc_matrix/barcodes...	.tsv.gz	None	None	None	6	26C4BEGZStYCFyw2sdtejA	md5	sH3y1OfXH46Qkq	u1gUjLsnS2En4lqiZJEq	None	2023-10-04 16:39:25	bKeW4T6E
3t9knMSF3Hi8hCfcOvHZ	LaHMxEPv	perturbseq/filtered_feature_bc_matrix/features...	.tsv.gz	None	None	None	6	n-rZf_F77g-XKDGjfdfFfw	md5	sH3y1OfXH46Qkq	u1gUjLsnS2En4lqiZJEq	None	2023-10-04 16:39:25	bKeW4T6E
GgLnkM74hYBmUqesNlM6	LaHMxEPv	fastq/perturbseq_R2_001.fastq.gz	.fastq.gz	None	None	None	6	FvpUaB1m1DQ2cI7KABzjmQ	md5	IaDMxS6Lwc5Hj8	HPT7LKlZ8UWMTkY3MrC3	None	2023-10-04 16:39:24	DzTjkKse

Run

	transform_id	run_at	created_by_id	report_id	is_consecutive	reference	reference_type
id
mBE6M42X84CpXBhpNGK6	VUvyNlSP2gGxeu	2023-10-04 16:39:17	DzTjkKse	None	None	None	None
o8mBOGr1iWZQwZhLnHaQ	DocWlf50vCGdSG	2023-10-04 16:39:22	bKeW4T6E	None	None	None	None
HPT7LKlZ8UWMTkY3MrC3	IaDMxS6Lwc5Hj8	2023-10-04 16:39:24	DzTjkKse	None	None	None	None
u1gUjLsnS2En4lqiZJEq	sH3y1OfXH46Qkq	2023-10-04 16:39:25	bKeW4T6E	None	None	None	None
cguYyM1vjsAHbuRnzJLP	xmujzZnXu2lvTS	2023-10-04 16:39:25	bKeW4T6E	None	None	None	None
lPhlAwPuFLWp9U33s0Hr	AKgqswBfLxB9hi	2023-10-04 16:39:27	bKeW4T6E	None	None	None	None
CHsRIb1j9VkCrXVmiY9J	1LCd8kco9lZUz8	2023-10-04 16:39:30	bKeW4T6E	None	None	None	None

Storage

	root	type	region	updated_at	created_by_id
id
LaHMxEPv	/home/runner/work/lamin-usecases/lamin-usecase...	local	None	2023-10-04 16:39:14	DzTjkKse

Transform

	name	short_name	version	type	latest_report_id	source_file_id	reference	reference_type	initial_version_id	updated_at	created_by_id
id
1LCd8kco9lZUz8	Project flow	project-flow	0	notebook	None	None	None	None	None	2023-10-04 16:39:30	bKeW4T6E
AKgqswBfLxB9hi	Perform single cell analysis, integrate with C...	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:27	bKeW4T6E
xmujzZnXu2lvTS	Postprocess Cell Ranger	None	2.0	pipeline	None	None	None	None	None	2023-10-04 16:39:25	bKeW4T6E
sH3y1OfXH46Qkq	Cell Ranger	None	7.2.0	pipeline	None	None	None	None	None	2023-10-04 16:39:25	bKeW4T6E
IaDMxS6Lwc5Hj8	Chromium 10x upload	None	None	pipeline	None	None	None	None	None	2023-10-04 16:39:24	DzTjkKse
DocWlf50vCGdSG	GWS CRIPSRa analysis	None	None	notebook	None	None	None	None	None	2023-10-04 16:39:22	bKeW4T6E
VUvyNlSP2gGxeu	Upload GWS CRISPRa result	None	None	app	None	None	None	None	None	2023-10-04 16:39:17	DzTjkKse

User

	handle	email	name	updated_at
id
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-10-04 16:39:25
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-10-04 16:39:24