notebook 2 - celltype annotation and beyond¶

This notebook will introduce you to the process of celltype annotation and give you a brief outlook of some of the analysis you can do with single-cell data in scanpy & besca.

[1]:

#import necessary python packages
import scanpy.api as sc #software suite of tools for single-cell analysis in python
import besca as bc #internal BEDA package for single cell analysis
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy

sc.logging.print_versions() #output an overview of the software versions that are loaded

scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.16.2 scipy==1.4.1 pandas==0.24.1 scikit-learn==0.22.1 statsmodels==0.11.1 python-igraph==0.7.1 louvain==0.6.1

[2]:

from IPython.display import HTML
task = "<style>div.task { background-color: #ffc299; border-color: #ff944d; border-left: 7px solid #ff944d; padding: 1em;} </style>"
HTML(task)

[2]:

[3]:

tag = "<style>div.tag { background-color: #99ccff; border-color: #1a8cff; border-left: 7px solid #1a8cff; padding: 1em;} </style>"
HTML(tag)

[3]:

[4]:

FAIR = "<style>div.fair { background-color: #d2f7ec; border-color: #d2f7ec; border-left: 7px solid #2fbc94; padding: 1em;} </style>"
HTML(FAIR)

[4]:

Dataset¶

Here we will reload our previously completely processed dataset.

[5]:

#adata = sc.read('/tmp/adata_pbmc_FRESH_processed.h5ad')
adata = bc.datasets.pbmc_storage_processed()
adata.obs.drop(columns=['celltype'], inplace=True)

[6]:

adata

[6]:

AnnData object with n_obs × n_vars = 27901 × 1433
    obs: 'CELL', 'CONDITION', 'experiment', 'sample_type', 'storage_condition', 'donor', 'batch', 'n_counts', 'n_genes', 'percent_mito', 'louvain'
    var: 'ENSEMBL', 'SYMBOL', 'n_cells', 'total_counts', 'frac_reads', 'mean', 'mean_log1p', 'coeffvar', 'coeffvar_log1p'
    uns: 'donor_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'storage_condition_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

Celltype annotation¶

manual cell annotation based on the expression of known marker genes¶

[7]:

#define genes (in future you will be able to access genesets from a central database GEMS)
b_cells = ['CD19', 'CD79A', 'MS4A1']
t_cells = ['CD3E', 'CD3G', 'CD3D']
t_cell_subsets = ['CD4', 'CD8A', 'CD8B']
naive_t_cell = ['SELL', 'CCR7', 'IL7R']
myeloid_cells = ['S100A8', 'S100A9', 'CST3']
monocytes = ['FCGR3A', 'FCGR3B', 'CD14'] #FCGR3A/B = CD16
dendritic_cells = ['FCER1A', 'ITGAM', 'ITGAX'] #ITGAM = CD11b #ITGAX= CD11c
NK_cells = ['NCAM1', 'NKG7', 'CD3G']

We always use the HGNC symbols of genes as the main annotation type in the AnnData object since they are the most practical to work with. In addition, the ensemble IDs of all genes is saved in adata.var according to our FAIR standard and always written out since Ensembl IDs stay very robust over time.

Based on the dataset the marker genes that are expressed may vary strongly.

[8]:

sc.settings.set_figure_params(dpi=90)
#visualize the gene expression as an overlay of the umap
#(this way you can visually identify the clusters with a high expression))
sc.pl.umap(adata, color = b_cells, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = t_cells, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = t_cell_subsets, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = naive_t_cell, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = NK_cells,  color_map = 'viridis',ncols = 3)
sc.pl.umap(adata, color = myeloid_cells,  color_map = 'viridis',ncols = 3)
sc.pl.umap(adata, color = monocytes, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = dendritic_cells, color_map = 'viridis', ncols = 3)

../_images/tutorials_notebook2_celltype_annotation_12_0.png

../_images/tutorials_notebook2_celltype_annotation_12_1.png

../_images/tutorials_notebook2_celltype_annotation_12_2.png

../_images/tutorials_notebook2_celltype_annotation_12_3.png

../_images/tutorials_notebook2_celltype_annotation_12_4.png

../_images/tutorials_notebook2_celltype_annotation_12_5.png

../_images/tutorials_notebook2_celltype_annotation_12_6.png

../_images/tutorials_notebook2_celltype_annotation_12_7.png

Per default scanpy plots the gene expression values saved in adata.raw (this means log1p(cp10k)). This is the reason that we can visualize all of these genes even if they are not contained in the highly variable genes.

[9]:

sc.pl.umap(adata, color = 'louvain', palette = 'tab20')

../_images/tutorials_notebook2_celltype_annotation_14_0.png

[10]:

#write down new cluster names (important order needs to be equivalent to above)
#will label the cell types:

new_cluster_names = ['CD14+ monocyte', #0
                     'CD4 T-cell', #1
                     'CD8 T-cell', #2
                     'CD8 T-cell', #3
                     'NK cell', #4
                     'B-cell', #5
                     'CD4 T-cell', #6
                     'CD4 T-cell', #7
                     'CD8 T-cell', #8
                     'FCGR3A+ monocyte', #9
                     'pDC', #10
                     'unknown', #11
                     'CD14+ monocyte'] #12

bc.tl.annotate_cells_clustering(adata=adata, clustering_label='louvain', new_annotation_label='celltype', new_cluster_labels=new_cluster_names)

There are also more advanced methods of cell annotation where mixed clusters (i.e. clusters that contain more than one celltype) can be reclustered seperately and then annoated. Please refer to besca’s documentation for more information.

[11]:

sc.pl.umap(adata, color = ['louvain', 'celltype'], wspace = 0.2)

../_images/tutorials_notebook2_celltype_annotation_17_0.png

Several clusters have been combined into one celltype. Here only a very rudimentary celltype annotation was performed. Depending on the dataset and the type of analysis this can become far more complex and detailled.

write out celltype annotation bc.st.export_celltype(adata=adata, basepath='FILEPATH_TO_FOLDER_CONTAINING_RAW_SUBFOLDER')

[12]:

bc.st.export_celltype(adata=adata, basepath='/tmp/dataset_introduction/')

mapping of cells to  celltype exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file
labelinfo.tsv successfully written out

Here we write out the celltype annotation so that we can use reload it in later analysis steps. There is a correspondng besca function for reloading the annotation.

[13]:

#save AnnData Object for comparision purposes
adata.write('/tmp/dataset_introduction/adata_pbmc_FRESH_processed_annotated.h5ad')

unbiased marker genes for each cluster¶

[14]:

#perform differential gene expression between each louvain cluster and all other cells using scanpy
sc.tl.rank_genes_groups(adata, groupby='celltype')

[15]:

#visualize the top 3 marker genes as a dot plot
sc.pl.rank_genes_groups_dotplot(adata, n_genes=3, groupby = 'louvain')

WARNING: dendrogram data not found (using key=dendrogram_louvain). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B-cell, CD4 T-cell, CD8 T-cell, etc.

../_images/tutorials_notebook2_celltype_annotation_25_1.png

[16]:

sc.pl.rank_genes_groups_matrixplot(adata, n_genes=3, use_raw=True, groupby = 'louvain')

WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B-cell, CD4 T-cell, CD8 T-cell, etc.

../_images/tutorials_notebook2_celltype_annotation_26_1.png

[17]:

sc.pl.rank_genes_groups_stacked_violin(adata, n_genes=3, groupby='louvain')

WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B-cell, CD4 T-cell, CD8 T-cell, etc.

../_images/tutorials_notebook2_celltype_annotation_27_1.png

Celltype Quantification¶

[18]:

#counts per celltype
data = bc.tl.count_occurance(adata, add_percentage=True, count_variable='celltype')
display(data)

	Counts	Percentage
CD8 T-cell	8688	31.14
CD4 T-cell	7326	26.26
CD14+ monocyte	4386	15.72
NK cell	3295	11.81
B-cell	3028	10.85
FCGR3A+ monocyte	565	2.03
pDC	377	1.35
unknown	236	0.85

[19]:

#generate a basic bar plot of the table above
fig = plt.figure(figsize=(15,6))

#visualize distribution before logarithmization
ax1 = fig.add_subplot(1, 2, 1)
ax1 = sns.barplot(data = data, x = data.index.tolist(), y = 'Counts' )
plt.xticks(rotation=90)

#visualize distribution after logarithmization
ax2 = fig.add_subplot(1, 2, 2)
ax2 = sns.barplot(data = data, x = data.index.tolist(), y = 'Percentage' )
plt.xticks(rotation=90)

[19]:

(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)

../_images/tutorials_notebook2_celltype_annotation_30_1.png

[20]:

#counts on a donor level
data = bc.tl.count_occurance_subset(adata, count_variable='celltype', subset_variable='donor')
display(data)

	Donor_3A	Donor_1	Donor_2
B-cell	1539	697	792
CD14+ monocyte	1903	1443	1040
CD4 T-cell	2901	2410	2015
CD8 T-cell	3046	2488	3154
FCGR3A+ monocyte	120	254	191
NK cell	1613	989	693
pDC	119	121	137
unknown	70	120	46

[21]:

#make tidy for plotting
data['celltype'] = data.index.tolist()
data_plot = data.melt(value_name='Counts', id_vars='celltype', var_name='donor')

#generate a plot
fig = plt.figure(figsize=(15,6))
sns.barplot(data=data_plot, x = 'celltype', y = 'Counts', hue='donor', dodge=True)
plt.xticks(rotation=90)

[21]:

(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)

../_images/tutorials_notebook2_celltype_annotation_32_1.png

[22]:

#stacked bar plot incorporated in besca
fig = bc.pl.celllabel_quant_stackedbar(adata, count_variable='celltype', subset_variable = 'donor');

../_images/tutorials_notebook2_celltype_annotation_33_0.png