notebook 2 - celltype annotation and beyond¶

This notebook will introduce you to the process of celltype annotation and give you a brief outlook of some of the analysis you can do with single-cell data in scanpy & besca.

Initial visualization and hand annotation is performed here to show the possible hand-annotation and how Besca is now proposing an automated annotation process, including for mix-cluster. This procedure is done in a more straighforward manner on the same dataset in the notebook available there:

https://github.com/bedapub/besca_publication_results/blob/master/hematopoietic/pbmc3k/celltype_annotation_besca_Rc.ipynb

[18]:

import warnings
warnings.filterwarnings("ignore")

[19]:

#import necessary python packages
import scanpy as sc #software suite of tools for single-cell analysis in python
import besca as bc #internal BEDA package for single cell analysis
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy

sc.logging.print_header() #output an overview of the software versions that are loaded

scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.20.0 scipy==1.9.0 pandas==1.4.3 scikit-learn==1.1.1 statsmodels==0.13.2 python-igraph==0.9.11 pynndescent==0.5.4

[20]:

# Will only print error.
# Warning (eg genes not found) are ignored.
sc.settings.verbosity=0

[21]:

from IPython.display import HTML
task = "<style>div.task { background-color: #ffc299; border-color: #ff944d; border-left: 7px solid #ff944d; padding: 1em;} </style>"
HTML(task)

[21]:

[22]:

tag = "<style>div.tag { background-color: #99ccff; border-color: #1a8cff; border-left: 7px solid #1a8cff; padding: 1em;} </style>"
HTML(tag)

[22]:

[23]:

FAIR = "<style>div.fair { background-color: #d2f7ec; border-color: #d2f7ec; border-left: 7px solid #2fbc94; padding: 1em;} </style>"
HTML(FAIR)

[23]:

Dataset¶

Here we will reload our previously completely processed dataset.

[24]:

#adata = sc.read('/tmp/adata_pbmc_FRESH_processed.h5ad')
adata = bc.datasets.pbmc3k_filtered()

[25]:

adata

[25]:

AnnData object with n_obs × n_vars = 2504 × 1719
    obs: 'CELL', 'percent_mito', 'experiment', 'n_counts', 'n_genes', 'leiden'
    var: 'ENSEMBL', 'SYMBOL', 'n_cells', 'total_counts', 'frac_reads'
    uns: 'experiment_colors', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Celltype annotation¶

Manual cell annotation based on the expression of known marker genes¶

[26]:

#define genes (in future you will be able to access genesets from a central database GEMS)
b_cells = ['CD19', 'CD79A', 'MS4A1']
t_cells = ['CD3E', 'CD3G', 'CD3D']
t_cell_subsets = ['CD4', 'CD8A', 'CD8B']
naive_t_cell = ['SELL', 'CCR7', 'IL7R']
myeloid_cells = ['S100A8', 'S100A9', 'CST3']
monocytes = ['FCGR3A', 'FCGR3B', 'CD14'] #FCGR3A/B = CD16
dendritic_cells = ['FCER1A', 'ITGAM', 'ITGAX'] #ITGAM = CD11b #ITGAX= CD11c
NK_cells = ['NCAM1', 'NKG7', 'CD3G']

We always use the HGNC symbols of genes as the main annotation type in the AnnData object since they are the most practical to work with. In addition, the ensemble IDs of all genes is saved in adata.var according to our FAIR standard and always written out since Ensembl IDs stay very robust over time.

Based on the dataset the marker genes that are expressed may vary strongly.

[27]:

sc.settings.set_figure_params(dpi=90)
#visualize the gene expression as an overlay of the umap
#(this way you can visually identify the clusters with a high expression))
sc.pl.umap(adata, color = b_cells, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = t_cells, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = t_cell_subsets, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = naive_t_cell, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = NK_cells,  color_map = 'viridis',ncols = 3)
sc.pl.umap(adata, color = myeloid_cells,  color_map = 'viridis',ncols = 3)
sc.pl.umap(adata, color = monocytes, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = dendritic_cells, color_map = 'viridis', ncols = 3)

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_0.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_1.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_2.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_3.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_4.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_5.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_6.png

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_14_7.png

Per default scanpy plots the gene expression values saved in adata.raw (this means log1p(cp10k)). This is the reason that we can visualize all of these genes even if they are not contained in the highly variable genes.

[28]:

sc.pl.umap(adata, color = 'leiden', palette = 'tab20')

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_16_0.png

[29]:

sc.pl.violin(adata, keys= t_cell_subsets, groupby= 'leiden', scale = 'count' )

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_17_0.png

[30]:

#write down new cluster names (important order needs to be equivalent to above)
#will label the cell types:

new_cluster_names = ['CD14+ monocyte', #0
                     'B-cell', #1
                     'Mix T-cell', #2
                     'CD8 T-cell', #3
                     'Mix T-cell', #4
                     'Mix T-cell', #5
                     'CD8 T-cell', #6
                     'FCGR3A+ monocyte', #7
                     'CD8 T-cell', #8
                     'NK cells', #9
                     'Mix T-cell', #10
                     'pDC'] #1

bc.tl.annotate_cells_clustering(adata=adata, clustering_label='leiden', new_annotation_label='manual_annotation', new_cluster_labels=new_cluster_names)

There are also more advanced methods of cell annotation where mixed clusters (i.e. clusters that contain more than one celltype) can be reclustered seperately and then annoated. Please refer to besca’s documentation for more information.

[31]:

sc.pl.umap(adata, color = ['leiden', 'manual_annotation'], wspace = 0.2)

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_20_0.png

Several clusters have been combined into one celltype. Here only a very rudimentary celltype annotation was performed. Depending on the dataset and the type of analysis this can become far more complex and detailled.

We propose to use a an automated annotation methods to decipher the cluster in an un-bias way

Coffee
Tea
Milk

This steps requires to expore initial labelling. This steps within besca will export:

+ frac_pos, computing for each gene and cluster the fraction of cells that show a gene expression.
+ average.gct, computing for each gene the average expression per cluster

write out required annotation

bc.export.labeling(adata=adata, outpath=FILEPATH_TO_FOLDER, column  = 'leiden')

[32]:

bc.export.write_labeling_to_files(adata=adata, outpath='dataset_export/', column  = 'leiden')

mapping of cells to  leiden exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file

Hierarchical signature-enrichment based approach for cell type annotation for Hematopoeitic cells¶

Besca offers a hierarchical signature-enrichment based approach for cell type annotation. The initial hierarchy can be visualized below

[33]:

import pkg_resources

config_file = pkg_resources.resource_filename('besca', 'datasets/genesets/CellNames_scseqCMs6_config.tsv')
plt = bc.pl.nomenclature_network(config_file)
plt.show()

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_25_0.png

Signature loading and scoring¶

We load signature directly importing a gmt file. In case of connection to a GemS database, it is possible to use the GeMs signature loader.

[34]:

sigconfig,levsk=bc.tl.sig.read_annotconfig(config_file)

[35]:

gmt_file_anno = pkg_resources.resource_filename('besca', 'datasets/genesets/CellNames_scseqCMs6_sigs.gmt')

mymarkers = bc.tl.sig.read_GMT_sign(gmt_file_anno,directed=False)
mymarkers = bc.tl.sig.filter_siggenes(adata, mymarkers)

bc.tl.sig.combined_signature_score(adata, gmt_file_anno, verbose=False)
#

[36]:

sc.settings.set_figure_params(dpi=50)
scores = [x for x in adata.obs.columns if 'scanpy' in x]
sc.pl.umap(adata, color= scores, color_map = 'viridis')

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_29_0.png

Signature can be observed in detailled to assess manually the cell types to which assign the different clusters. However Besca offers to use a signature-based hierarchical annotation to build on know relations (see network above). This annotation is based on the mann whitney scoring of the positive fraction of cell and signatures provided. Additionally, this signature-based annotation procedures will then use a consistent labelling of the dataset and cell-types, facilitating cross-study comparisons in the futur.

[37]:

mymarkers['Ubi'] = ['B2M','ACTB', 'GAPDH'] ### used for cutoff adjustment to individual dataset, can be modified

f=pd.read_csv("dataset_export/fract_pos.gct",sep="\t",skiprows=2)
df=bc.tl.sig.score_mw(f,mymarkers)

myc=np.median(df.loc['Ubi',:]*0.5) ### Set a cutoff based on Ubi and scale with values from config file

One can observe the resulting log10(pvalues) obtained

[38]:

df.iloc[0:3,0:7]

[38]:

	2	11	10	5	3	7	6
Bcell	0.648278	2.492097	0.34176	2.251536	1.089847	0.893858	3.92304
Fibroblast	1.192303	2.057905	0.205498	2.632188	1.419356	1.432105	1.132777
Endothelial	0.920688	1.62331	1.262809	1.029862	0.966018	2.927318	2.748618

[39]:

#Cluster attribution based on cutoff
df=df.drop('Ubi')
sigscores={}
for mysig in list(df.index):
    sigscores[mysig]=bc.tl.sig.getset(df,mysig,sigconfig.loc[mysig,'Cutoff']*myc)

[40]:

sigconfig.loc["CD8Tcell","Cutoff"]=1.2
# RECOMPUTING SIG SCORE WITH NEW CUTOFF
sigscores={}
for mysig in list(df.index):
    sigscores[mysig]=bc.tl.sig.getset(df,mysig,sigconfig.loc[mysig,'Cutoff']*myc)


cnames=bc.tl.sig.make_anno(df,sigscores,sigconfig,levsk)
cnames

[40]:

	celltype0	celltype1	celltype2	celltype3
2	Hematopoietic	Tcell	CD4Tcell	NaiCD4Tcell
11	Hematopoietic	pDC	pDC	pDC
10	Hematopoietic	Tcell	CD4Tcell	NaiCD4Tcell
5	Hematopoietic	Tcell	CD4Tcell	NaiCD4Tcell
3	Hematopoietic	Tcell	CD8Tcell	CytotoxCD8Tcell
7	Hematopoietic	Myeloid	Monocyte	NClassMonocyte
6	Hematopoietic	Tcell	CD8Tcell	NaiCD8Tcell
8	Hematopoietic	Tcell	CD8Tcell	NaiCD8Tcell
4	Hematopoietic	Tcell	CD4Tcell	NaiCD4Tcell
9	Hematopoietic	NKcell	CD56dimNK	CD56dimNK
1	Hematopoietic	Blymphocyte	Bcell	NaiBcell
0	Hematopoietic	Myeloid	Monocyte	ClassMonocyte

[41]:

label_files = pkg_resources.resource_filename('besca', '/datasets/nomenclature/CellTypes_v1.tsv')

cnamesDBlabel = bc.tl.sig.obtain_dblabel( label_files, cnames )
cnamesDBlabel

[41]:

	celltype0	celltype1	celltype2	celltype3
2	hematopoietic cell	T cell	CD4-positive, alpha-beta T cell	naive thymus-derived CD4-positive, alpha-beta ...
11	hematopoietic cell	plasmacytoid dendritic cell	plasmacytoid dendritic cell	plasmacytoid dendritic cell
10	hematopoietic cell	T cell	CD4-positive, alpha-beta T cell	naive thymus-derived CD4-positive, alpha-beta ...
5	hematopoietic cell	T cell	CD4-positive, alpha-beta T cell	naive thymus-derived CD4-positive, alpha-beta ...
3	hematopoietic cell	T cell	CD8-positive, alpha-beta T cell	CD8-positive, alpha-beta cytotoxic T cell
7	hematopoietic cell	myeloid leukocyte	monocyte	non-classical monocyte
6	hematopoietic cell	T cell	CD8-positive, alpha-beta T cell	naive thymus-derived CD8-positive, alpha-beta ...
8	hematopoietic cell	T cell	CD8-positive, alpha-beta T cell	naive thymus-derived CD8-positive, alpha-beta ...
4	hematopoietic cell	T cell	CD4-positive, alpha-beta T cell	naive thymus-derived CD4-positive, alpha-beta ...
9	hematopoietic cell	natural killer cell	cytotoxic CD56-dim natural killer cell	cytotoxic CD56-dim natural killer cell
1	hematopoietic cell	lymphocyte of B lineage	B cell	naive B cell
0	hematopoietic cell	myeloid leukocyte	monocyte	classical monocyte

[42]:

adata.obs['celltype0']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype0','leiden')
adata.obs['celltype1']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype1','leiden')
adata.obs['celltype2']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype2','leiden')
adata.obs['celltype3']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype3','leiden')

[43]:

sc.pl.umap(adata,color=['leiden','celltype2', 'celltype1',
                        'CD8A', 'CD8B', # CD8 T cell markers
                        'CD4' ,    'GPR183', 'CMTM8' # CD4 t cell markers
                       ], ncols=2) #,'celltype3'

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_38_0.png

Here we believe some clusters of T cells are still mixed and that a more refined annotation can be done if recluster the lymphocytes. This procedure is applied below

Reclustering Lymphocytes¶

[44]:

adata.obs['leiden_original'] = adata.obs['leiden'].copy()

[45]:

adata_rc = bc.tl.rc.recluster ( adata, celltype_label = 'celltype1',
                               celltype= ('T cell',  'natural killer cell'), resolution=1.3)

[46]:

sc.pl.umap( adata_rc, color = ['celltype2','leiden',
                              # Some markers
                              'CD3G', 'CD8A', 'CD4', 'IL7R', 'NKG7', 'GNLY'])

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_43_0.png

Exporting the additional labelling allows to regenerate frac_pos required file for the annotation function.

[47]:

adata_rc = bc.st.additional_labeling(adata_rc, 'leiden', 'Leiden_Reclustering', 'Leiden Reclustering on Lymphocytes', 'BescaTutorial', 'dataset_export')

rank genes per label calculated using method wilcoxon.
dataset_export/labelings/Leiden_Reclustering/WilxRank.gct written out
dataset_export/labelings/Leiden_Reclustering/WilxRank.pvalues.gct written out
dataset_export/labelings/Leiden_Reclustering/WilxRank.logFC.gct written out
mapping of cells to  leiden exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file
labelinfo.tsv successfully written out

[48]:

mymarkers['Ubi'] = ['B2M','ACTB', 'GAPDH']
clusters = 'Leiden_Reclustering'
f=pd.read_csv('dataset_export' + "/labelings/"+clusters+"/fract_pos.gct",sep="\t",skiprows=2)
df=bc.tl.sig.score_mw(f,mymarkers)
myc=np.median(df.loc['Ubi',:]*0.5) ### Set a cutoff based on Ubi and scale with values from config file

[49]:

# RECOMPUTING SIG SCORE WITH NEW CUTOFF
df=df.drop('Ubi')
sigscores={}
for mysig in list(df.index):
    sigscores[mysig]=bc.tl.sig.getset(df,mysig,sigconfig.loc[mysig,'Cutoff']*myc)
    #sigscores[mysig]=bc.tl.sig.getset(df,mysig,10)

[50]:

toExclude = [x  for x in levsk[1] if not x == 'Tcell' and not x == 'NKcell']

cnames=bc.tl.sig.make_anno(df,sigscores,sigconfig,levsk,  toexclude= toExclude)

cnamesDBlabel = bc.tl.sig.obtain_dblabel(label_files, cnames )


adata_rc.obs['celltype0']=bc.tl.sig.add_anno(adata_rc,cnamesDBlabel,'celltype0','leiden')
adata_rc.obs['celltype2']=bc.tl.sig.add_anno(adata_rc,cnamesDBlabel,'celltype2','leiden')
adata_rc.obs['celltype3']=bc.tl.sig.add_anno(adata_rc,cnamesDBlabel,'celltype3','leiden')

[51]:

sc.pl.umap(adata_rc,color=['celltype3', 'celltype2',
                           'CD3G', 'CD8A', 'CD4', 'IL7R', 'NKG7', 'GNLY', #, 'S100A'
                           'IL7R', 'SELL'
                          ],
           ncols=1)

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_49_0.png

[52]:

names_2 = []
names_3 = []
for i in range( cnames.shape[0]) :
    # Orderigng lexo. order
    names_2 += [cnamesDBlabel['celltype2'][str(i)]]
    names_3 += [cnamesDBlabel['celltype3'][str(i)]]

[53]:

bc.tl.rc.annotate_new_cellnames( adata, adata_rc, names = names_2, new_label='celltype2', method = 'leiden')

bc.tl.rc.annotate_new_cellnames( adata, adata_rc, names = names_3, new_label='celltype3', method = 'leiden')

[54]:

sc.pl.umap(adata,color=['celltype2',
                       'celltype3'], ncols=1, alpha= 0.9, size= 30)

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_52_0.png

Here we write out the celltype annotation so that we can use reload it in later analysis steps. There is a corresponding besca function for reloading the annotation.

[55]:

#save AnnData Object for comparision purposes
adata.write('dataset_export/data_pbmc_processed_annotated.h5ad')

unbiased marker genes for each cluster¶

[56]:

#perform differential gene expression between each cluster and all other cells using scanpy
sc.tl.rank_genes_groups(adata, groupby='celltype3')

[57]:

#visualize the top 3 marker genes as a dot plot
sc.pl.rank_genes_groups_dotplot(adata, n_genes=3, groupby = 'leiden')

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_57_0.png

[58]:

sc.pl.rank_genes_groups_matrixplot(adata, n_genes=3, use_raw=True, groupby = 'leiden')

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_58_0.png

[59]:

sc.pl.rank_genes_groups_stacked_violin(adata, n_genes=3, groupby='leiden')

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_59_0.png

Celltype Quantification¶

[60]:

#counts per celltype
data = bc.tl.count_occurrence(adata, add_percentage=True, count_variable='celltype3')
display(data)

	Counts	Percentage
naive thymus-derived CD4-positive, alpha-beta T cell	624	24.92
classical monocyte	438	17.49
naive thymus-derived CD8-positive, alpha-beta T cell	372	14.86
naive B cell	317	12.66
CD8-positive, alpha-beta cytotoxic T cell	267	10.66
non-classical monocyte	159	6.35
central memory CD4-positive, alpha-beta T cell	138	5.51
effector memory CD8-positive, alpha-beta T cell	122	4.87
CD4-positive, alpha-beta T cell	38	1.52
plasmacytoid dendritic cell	29	1.16

[61]:

#generate a basic bar plot of the table above
fig = plt.figure(figsize=(15,6))

#visualize distribution before logarithmization
ax1 = fig.add_subplot(1, 2, 1)
ax1 = sns.barplot(data = data, x = data.index.tolist(), y = 'Counts' )
plt.xticks(rotation=90)

#visualize distribution after logarithmization
ax2 = fig.add_subplot(1, 2, 2)
ax2 = sns.barplot(data = data, x = data.index.tolist(), y = 'Percentage' )
plt.xticks(rotation=90)

[61]:

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 [Text(0, 0, 'naive thymus-derived CD4-positive, alpha-beta T cell'),
  Text(1, 0, 'classical monocyte'),
  Text(2, 0, 'naive thymus-derived CD8-positive, alpha-beta T cell'),
  Text(3, 0, 'naive B cell'),
  Text(4, 0, 'CD8-positive, alpha-beta cytotoxic T cell'),
  Text(5, 0, 'non-classical monocyte'),
  Text(6, 0, 'central memory CD4-positive, alpha-beta T cell'),
  Text(7, 0, 'effector memory CD8-positive, alpha-beta T cell'),
  Text(8, 0, 'CD4-positive, alpha-beta T cell'),
  Text(9, 0, 'plasmacytoid dendritic cell')])

../_images/tutorials_notebook2_celltype_annotation_pbmc3k_62_1.png