CellTypist#
Cell types classify cells based on public and private knowledge from studying transcription, morphology, function & other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined and better understood.
In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.
Setup#
!lamin init --storage ./celltypist --schema bionty
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-04 16:36:09)
✅ saved: Storage(id='7t6iUcck', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/celltypist', type='local', updated_at=2023-10-04 16:36:09, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/celltypist
💡 did not register local instance on hub (if you want, call `lamin register`)
# filter warnings from celltypist
import warnings
warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import lnschema_bionty as lb
import celltypist
import pandas as pd
lb.settings.species = "human" # globally set species
💡 loaded instance: testuser1/celltypist (lamindb 0.55.0)
2023-10-04 16:36:12,693:INFO - Failed to extract font properties from /usr/share/fonts/truetype/noto/NotoColorEmoji.ttf: In FT2Font: Can not load face (unknown file format; error code 0x2)
2023-10-04 16:36:12,871:INFO - generated new fontManager
ln.track()
💡 notebook imports: celltypist==1.6.1 lamindb==0.55.0 lnschema_bionty==0.31.2 pandas==1.5.3
💡 Transform(id='s5mkN5NQ1ttIz8', name='CellTypist', short_name='celltypist', version='0', type=notebook, updated_at=2023-10-04 16:36:15, created_by_id='DzTjkKse')
💡 Run(id='IbNQqSJxbLsL5Da013p1', run_at=2023-10-04 16:36:15, transform_id='s5mkN5NQ1ttIz8', created_by_id='DzTjkKse')
Access CellTypist records #
As a first step, we read in CellTypist’s immune cell encyclopedia:
description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"
# our source data
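celltypist_file = ln.File.filter(description=description).one_or_none()
if celltypist_file is None:
    celltypist_df = pd.read_excel(celltypist_source_v2_url)
    # register the table under `description` so the filter above finds it on re-runs
    celltypist_file = ln.File(celltypist_df, description=description)
    celltypist_file.save()
else:
    # load the full table (not only its first rows) from the registered file
    celltypist_df = celltypist_file.load()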
It provides an ontology_id of the public Cell Ontology for the majority of records.
celltypist_df.head()
| | High-hierarchy cell types | Low-hierarchy cell types | Description | Cell Ontology ID | Curated markers |
---|---|---|---|---|---|
0 | B cells | B cells | B lymphocytes with diverse cell surface immuno... | CL:0000236 | CD79A, MS4A1, CD19 |
1 | B cells | Follicular B cells | resting mature B lymphocytes found in the prim... | CL:0000843 | CXCR5, TNFRSF13B, CD22 |
2 | B cells | Proliferative germinal center B cells | proliferating germinal center B cells | CL:0000844 | MKI67, SUGCT, AICDA |
3 | B cells | Germinal center B cells | proliferating mature B cells that undergo soma... | CL:0000844 | POU2AF1, CD40, SUGCT |
4 | B cells | Memory B cells | long-lived mature B lymphocytes which are form... | CL:0000787 | CR2, CD27, MS4A1 |
A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:
celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
Cell Ontology ID | Low-hierarchy cell types | High-hierarchy cell types | Description | Curated markers |
---|---|---|---|---|
CL:0000236 | B cells | B cells | B lymphocytes with diverse cell surface immuno... | CD79A, MS4A1, CD19 |
CL:0000843 | Follicular B cells | B cells | resting mature B lymphocytes found in the prim... | CXCR5, TNFRSF13B, CD22 |
CL:0000844 | Proliferative germinal center B cells | B cells | proliferating germinal center B cells | MKI67, SUGCT, AICDA |
CL:0000844 | Germinal center B cells | B cells | proliferating mature B cells that undergo soma... | POU2AF1, CD40, SUGCT |
CL:0000787 | Memory B cells | B cells | long-lived mature B lymphocytes which are form... | CR2, CD27, MS4A1 |
CL:0000787 | Age-associated B cells | B cells | CD11c+ T-bet+ memory B cells associated with a... | FCRL2, ITGAX, TBX21 |
CL:0000788 | Naive B cells | B cells | mature B lymphocytes which express cell-surfac... | IGHM, IGHD, TCL1A |
CL:0000818 | Transitional B cells | B cells | immature B cell precursors in the bone marrow ... | CD24, MYO1C, MS4A1 |
CL:0000817 | Large pre-B cells | B-cell lineage | proliferative B lymphocyte precursors derived ... | MME, CD24, MKI67 |
CL:0000817 | Small pre-B cells | B-cell lineage | non-proliferative B lymphocyte precursors deri... | MME, CD24, IGLL5 |
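The one-to-many relationship visible in the table above can be made explicit with a small grouping step. The following is a minimal sketch in plain Python; the pairs are copied from the ten rows shown above, and the variable names (`pairs`, `names_by_id`, `shared_ids`) are illustrative and not part of the notebook.

```python
from collections import defaultdict

# (Cell Ontology ID, low-hierarchy cell type) pairs copied from the table above
pairs = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
    ("CL:0000788", "Naive B cells"),
    ("CL:0000818", "Transitional B cells"),
    ("CL:0000817", "Large pre-B cells"),
    ("CL:0000817", "Small pre-B cells"),
]

# collect all CellTypist names registered under each ontology ID
names_by_id = defaultdict(list)
for ontology_id, name in pairs:
    names_by_id[ontology_id].append(name)

# keep only the IDs that map to more than one CellTypist name
shared_ids = {k: v for k, v in names_by_id.items() if len(v) > 1}
for ontology_id, names in shared_ids.items():
    print(ontology_id, "->", names)
```

On the full table, a pandas `groupby` on "Cell Ontology ID" followed by `.apply(list)` on the "Low-hierarchy cell types" column should give the same grouping.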
Validate CellTypist records #
For any cell type record that can be validated against the public Cell Ontology, we’d like to ensure that it’s actually validated.
This prevents us from referring to the same cell type with different identifiers.
We need a Bionty object for this:
bionty = lb.CellType.bionty()
bionty
CellType
Species: all
Source: cl, 2023-04-20
#terms: 2862
📖 CellType.df(): ontology reference table
🔎 CellType.lookup(): autocompletion of terms
🎯 CellType.search(): free text search of terms
✅ CellType.validate(): strictly validate values
🧐 CellType.inspect(): full inspection of values
👽 CellType.standardize(): convert to standardized names
🪜 CellType.diff(): difference between two versions
🔗 CellType.ontology: Pronto.Ontology object
We can now validate the "Cell Ontology ID" column.
When should I use inspect() and when validate()?
inspect() gives us more logging than validate() but runs a bit slower.
Hence, we use inspect() if we suspect that validation won’t pass and we want to debug why in order to curate the data.
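To illustrate the difference described in the note above, here is a toy sketch in plain Python. It is not Bionty's implementation: `toy_validate`, `toy_inspect`, and the three-term `reference_names` set are hypothetical stand-ins that only mimic the idea of "boolean check" versus "check plus a report".

```python
# Toy reference vocabulary (assumption); a real ontology holds thousands of terms.
reference_names = {"B cell", "T cell", "cell"}


def toy_validate(values, reference):
    """Return one boolean per value: True if the value is in the reference."""
    return [v in reference for v in values]


def toy_inspect(values, reference):
    """Like toy_validate, but also report which values failed."""
    flags = toy_validate(values, reference)
    validated = [v for v, ok in zip(values, flags) if ok]
    not_validated = [v for v, ok in zip(values, flags) if not ok]
    if not_validated:
        pct = 100 * len(not_validated) / len(values)
        print(f"{len(not_validated)} terms ({pct:.2f}%) are not validated: {not_validated}")
    return {"validated": validated, "not_validated": not_validated}


result = toy_inspect(["B cell", "B cells", "T cells"], reference_names)
```

The extra bookkeeping in the inspecting variant is what makes it more useful for debugging and somewhat slower than a bare membership check.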
bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);
This looks good!
But when inspecting the names, most of them don’t validate:
bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
❗ 97 terms (99.00%) are not validated for name: B cells, Follicular B cells, Proliferative germinal center B cells, Germinal center B cells, Memory B cells, Age-associated B cells, Naive B cells, Transitional B cells, Large pre-B cells, Small pre-B cells, Pre-pro-B cells, Pro-B cells, Cycling B cells, Cycling DCs, Cycling gamma-delta T cells, Cycling monocytes, Cycling NK cells, Cycling T cells, DC, DC1, ...
detected 9 terms with synonyms: DC1, DC2, ETP, CMP, ELP, GMP, ILC2, ILC3, pDC
→ standardize terms via .standardize()
A search tells us that terms named in the plural in CellTypist appear in the singular in the Cell Ontology:
celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
name | ontology_id | definition | synonyms | parents | __agg__ | __ratio__ |
---|---|---|---|---|---|---|
B cell | CL:0000236 | A Lymphocyte Of B Lineage That Is Capable Of B... | B-cell|B lymphocyte|B-lymphocyte | [CL:0000945] | b cell | 92.307692 |
cell | CL:0000000 | A Material Entity Of Anatomical Origin (Part O... | None | [] | cell | 90.000000 |
Let’s try to strip the trailing "s" and inspect whether more names are now validated. Yes, there are!
bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
❗ 93 terms (94.90%) are not validated for name: Follicular B cell, Proliferative germinal center B cell, Germinal center B cell, Memory B cell, Age-associated B cell, Naive B cell, Transitional B cell, Large pre-B cell, Small pre-B cell, Pre-pro-B cell, Pro-B cell, Cycling B cell, Cycling DC, Cycling gamma-delta T cell, Cycling monocyte, Cycling NK cell, Cycling T cell, DC, DC1, DC2, ...
detected 34 terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, DC1, DC2, Endothelial cell, Epithelial cell, Erythrocyte, ETP, Fibroblast, Granulocyte, Neutrophil, CMP, ELP, GMP, ILC2, ...
→ standardize terms via .standardize()
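One detail of the stripping step is worth keeping in mind: `str.rstrip("s")` removes every trailing "s", not just one. For names like those shown above this makes no difference, but a stricter variant is easy to write. The helper `singularize` below is a hypothetical illustration and is not used in the notebook.

```python
def singularize(name: str) -> str:
    """Drop a single trailing "s", if present (a naive English singular)."""
    return name[:-1] if name.endswith("s") else name


print("B cells".rstrip("s"))        # B cell
print("class".rstrip("s"))          # cla  (both trailing "s" characters are removed)
print(singularize("class"))         # clas (only one "s" is removed)
print(singularize("Cycling DCs"))   # Cycling DC
```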
Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, don’t, and therefore don’t have an ontology ID.
high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()
high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
Register CellTypist records #
Let’s first add the “High-hierarchy cell types” as a column "parent".
This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.
celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")
# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None
# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()
celltypist_df.head(2)
| | name | description | ontology_id | parent |
---|---|---|---|---|
0 | B cells | B lymphocytes with diverse cell surface immuno... | CL:0000236 | None |
1 | Follicular B cells | resting mature B lymphocytes found in the prim... | CL:0000843 | B cells |
Now, let’s create records from the public ontology:
public_records = lb.CellType.from_values(
    celltypist_df.ontology_id, lb.CellType.ontology_id
)
Let’s now amend the public ontology records so that they retain any additional annotations that CellTypist provides.
records_names = {}
public_records_dict = {r.ontology_id: r for r in public_records}
for _, row in celltypist_df.iterrows():
    name = row["name"]
    ontology_id = row["ontology_id"]
    public_record = public_records_dict[ontology_id]
    # if both name and ontology_id match the public record, use the public record
    if name.lower() == public_record.name.lower():
        records_names[name] = public_record
        continue
    else:  # the ontology_id matches the public record but the name doesn't
        # if the singular form of the CellTypist name matches the public name
        if name.lower().rstrip("s") == public_record.name.lower():
            # add the CellTypist name to the synonyms of the public ontology record
            public_record.add_synonym(name)
            records_names[name] = public_record
            continue
        if public_record.synonyms is not None:
            synonyms = [s.lower() for s in public_record.synonyms.split("|")]
            # if any public synonym matches the CellTypist name (plural or singular)
            if any(s in {name.lower(), name.lower().rstrip("s")} for s in synonyms):
                # add the CellTypist name to the synonyms of the public ontology record
                public_record.add_synonym(name)
                records_names[name] = public_record
                continue
        # otherwise, create a record based only on CellTypist metadata
        records_names[name] = lb.CellType(
            name=name, ontology_id=ontology_id, description=row.description
        )
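The branching in the loop above can be summarized as a four-way decision. The sketch below restates it as a pure function without any database access, so it can be tried in isolation. `classify_match` is a hypothetical helper, and the synonym strings in the usage lines are shortened versions of the ones shown elsewhere in this notebook.

```python
def classify_match(name, public_name, public_synonyms):
    """Return how a CellTypist name relates to a public ontology record.

    public_synonyms is a "|"-separated string, or None if there are no synonyms.
    """
    lowered = name.lower()
    singular = lowered.rstrip("s")
    if lowered == public_name.lower():
        return "exact"      # reuse the public record as is
    if singular == public_name.lower():
        return "singular"   # reuse the public record, add the name as a synonym
    if public_synonyms is not None:
        synonyms = {s.lower() for s in public_synonyms.split("|")}
        if lowered in synonyms or singular in synonyms:
            return "synonym"  # reuse the public record, add the name as a synonym
    return "new"            # create a record from CellTypist metadata only


print(classify_match("B cells", "B cell", "B-cell|B lymphocyte|B-lymphocyte"))
print(classify_match("GMP", "granulocyte monocyte progenitor cell", "GMP|CFU-GM"))
print(classify_match("Age-associated B cells", "memory B cell", "memory B-cell"))
```

Only the last outcome leads to a record that shares an ontology ID with a public term but carries its own name and description.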
You can see that certain records are created by adding the CellTypist name to the synonyms of the public record:
records_names["GMP"]
CellType(id='f5eAsw0p', name='granulocyte monocyte progenitor cell', ontology_id='CL:0000557', synonyms='GMP|granulocyte/monocyte progenitor|colony forming unit granulocyte macrophage|CFU-GM|granulocyte/monocyte precursor|granulocyte-macrophage progenitor', description='A Hematopoietic Progenitor Cell That Is Committed To The Granulocyte And Monocyte Lineages. These Cells Are Cd123-Positive, And Do Not Express Gata1 Or Gata2 But Do Express C/Ebpa, And Pu.1.', bionty_source_id='QiWE', created_by_id='DzTjkKse')
Other records are created based on CellTypist metadata:
records_names["Age-associated B cells"]
CellType(id='00ieV0IG', name='Age-associated B cells', ontology_id='CL:0000787', description='CD11c+ T-bet+ memory B cells associated with autoimmunity and aging', created_by_id='DzTjkKse')
Let’s save them to our database:
records = set(records_names.values())
ln.save(records)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
Add parent-child relationships of the records from CellTypist#
We still need to add the remaining 4 high-hierarchy terms:
list(high_terms_nonval)
['Cycling cells', 'T cells', 'B-cell lineage', 'Erythroid']
Let’s get the top hits from a search:
for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(1))
Term: Cycling cells
name | ontology_id | definition | synonyms | parents | __agg__ | __ratio__ |
---|---|---|---|---|---|---|
type G enteroendocrine cell | CL:0000508 | An Endocrine Cell Found In The Pyloric Gland M... | G cell | [CL:0000509, CL:0000164, CL:0000506] | type g enteroendocrine cell | 90.0 |
Term: T cells
name | ontology_id | definition | synonyms | parents | __agg__ | __ratio__ |
---|---|---|---|---|---|---|
T cell | CL:0000084 | A Type Of Lymphocyte Whose Defining Characteri... | T-lymphocyte|T-cell|T lymphocyte | [CL:0000542] | t cell | 92.307692 |
Term: B-cell lineage
name | ontology_id | definition | synonyms | parents | __agg__ | __ratio__ |
---|---|---|---|---|---|---|
cell | CL:0000000 | A Material Entity Of Anatomical Origin (Part O... | None | [] | cell | 90.0 |
Term: Erythroid
name | ontology_id | definition | synonyms | parents | __agg__ | __ratio__ |
---|---|---|---|---|---|---|
erythroid progenitor cell | CL:0000038 | A Progenitor Cell Committed To The Erythroid L... | None | [CL:0000839, CL:0000764] | erythroid progenitor cell | 90.0 |
So we decide to:
Add “T cells” to the synonyms of the public “T cell” record
Create the remaining 3 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)
for name in high_terms_nonval:
    if name == "T cells":
        record = lb.CellType.from_bionty(name="T cell")
        record.add_synonym(name)
        record.save()
    else:
        record = lb.CellType(name=name)
        record.save()
    records_names[name] = record
❗ records with similar names exist! did you mean to load one of them?
name | id | synonyms | __ratio__ |
---|---|---|---|
Cycling B cells | ibzfn1zQ | | 95.0 |
Cycling T cells | TTziQpub | | 95.0 |
Cycling NK cells | rC47wc9h | | 95.0 |
cell | Ry0JGwSD | | 90.0 |
❗ records with similar names exist! did you mean to load one of them?
name | id | synonyms | __ratio__ |
---|---|---|---|
B cell | cx8VcggA | B-lymphocyte|B cells|B-cell|B lymphocyte | 90.0 |
cell | Ry0JGwSD | | 90.0 |
❗ records with similar names exist! did you mean to load one of them?
name | id | synonyms | __ratio__ |
---|---|---|---|
Mid erythroid | lveE8XKg | | 95.0 |
Early erythroid | MiIxaBcE | | 90.0 |
Late erythroid | NY6Iq1SQ | | 90.0 |
Megakaryocyte-erythroid-mast cell progenitor | rDuO4MVx | | 90.0 |
Now let’s add the parent records:
for _, row in celltypist_df.iterrows():
    record = records_names[row["name"]]
    if row["parent"] is not None:
        parent_record = records_names[row["parent"]]
        record.parents.add(parent_record)
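Once every record knows its parent, the reverse direction (children of a term) follows by inverting the mapping. The sketch below shows that idea with plain dictionaries. It is a toy model, not how LaminDB stores the relationship, and the child/parent pairs are taken from the CellTypist rows shown earlier.

```python
from collections import defaultdict

# child -> parent pairs taken from the CellTypist table shown earlier
parent_of = {
    "Follicular B cells": "B cells",
    "Memory B cells": "B cells",
    "Naive B cells": "B cells",
    "Large pre-B cells": "B-cell lineage",
    "Small pre-B cells": "B-cell lineage",
}

# invert the mapping to obtain parent -> list of children
children_of = defaultdict(list)
for child, parent in parent_of.items():
    children_of[parent].append(child)

print(children_of["B-cell lineage"])
print(children_of["B cells"])
```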
Access the registry#
The previously added CellTypist ontology registry is now available in LaminDB.
To retrieve the full ontology table as a pandas DataFrame, we can call .filter() without arguments and convert the result with .df():
lb.CellType.filter().df()
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
i20ionW5 | mast cell | CL:0000097 | None | Mast cells|mastocyte|labrocyte|histaminocyte | A Cell That Is Found In Almost All Tissues Con... | QiWE | 2023-10-04 16:36:19 | DzTjkKse |
PMAesjf3 | Early lymphoid/T lymphoid | CL:0000936 | None | None | early lymphoid/T lymphocytes with lymphocyte p... | None | 2023-10-04 16:36:19 | DzTjkKse |
00ieV0IG | Age-associated B cells | CL:0000787 | None | None | CD11c+ T-bet+ memory B cells associated with a... | None | 2023-10-04 16:36:19 | DzTjkKse |
TENASE93 | alveolar macrophage | CL:0000583 | None | Alveolar macrophages|dust cell | A Tissue-Resident Macrophage Found In The Alve... | QiWE | 2023-10-04 16:36:19 | DzTjkKse |
3rJgLble | conventional dendritic cell | CL:0000990 | None | dendritic reticular cell|cDC|type 1 DC|DC1 | Conventional Dendritic Cell Is A Dendritic Cel... | QiWE | 2023-10-04 16:36:19 | DzTjkKse |
... | ... | ... | ... | ... | ... | ... | ... | ... |
gON03kRx | barrier cell | CL:0000215 | None | None | A Cell Whose Primary Function Is To Prevent Th... | QiWE | 2023-10-04 16:36:53 | DzTjkKse |
Ftvcq6k8 | Cycling cells | None | None | None | None | None | 2023-10-04 16:36:53 | DzTjkKse |
BxNjby0x | T cell | CL:0000084 | None | T cells|T-cell|T-lymphocyte|T lymphocyte | A Type Of Lymphocyte Whose Defining Characteri... | QiWE | 2023-10-04 16:36:55 | DzTjkKse |
l5NQQjl3 | B-cell lineage | None | None | None | None | None | 2023-10-04 16:36:55 | DzTjkKse |
AT1pQhJX | Erythroid | None | None | None | None | None | 2023-10-04 16:36:55 | DzTjkKse |
132 rows × 8 columns
This enables us to look for cell types by creating a lookup object from our new CellType registry.
db_lookup = lb.CellType.lookup()
db_lookup.memory_b_cell
CellType(id='67zMsufW', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B-lymphocyte|memory B-cell|memory B lymphocyte|Memory B cells', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', updated_at=2023-10-04 16:36:19, bionty_source_id='QiWE', created_by_id='DzTjkKse')
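The attribute name `memory_b_cell` suggests that the lookup object derives its keys from record names. The exact rules are not shown in this notebook, so the sketch below is only an assumption about what such a normalization could look like: `to_lookup_key` is a hypothetical helper, not a LaminDB function.

```python
import re


def to_lookup_key(name: str) -> str:
    """Hypothetical helper: turn a display name into an attribute-style key."""
    # replace every run of non-alphanumeric characters with "_", then lowercase
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


print(to_lookup_key("memory B cell"))           # memory_b_cell
print(to_lookup_key("Age-associated B cells"))  # age_associated_b_cells
```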
See cell type hierarchy:
db_lookup.memory_b_cell.view_parents()
Access parents of a record:
db_lookup.memory_b_cell.parents.list()
[CellType(id='0I51jgPp', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B lymphocyte|mature B-cell|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', updated_at=2023-10-04 16:36:41, bionty_source_id='QiWE', created_by_id='DzTjkKse'),
CellType(id='cx8VcggA', name='B cell', ontology_id='CL:0000236', synonyms='B-lymphocyte|B cells|B-cell|B lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', updated_at=2023-10-04 16:36:19, bionty_source_id='QiWE', created_by_id='DzTjkKse')]
# clean up test instance
!lamin delete --force celltypist
!rm -r ./celltypist
💡 deleting instance testuser1/celltypist
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--celltypist.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/celltypist