notebook 2 - celltype annotation and beyond

This notebook will introduce you to the process of celltype annotation and give you a brief outlook of some of the analysis you can do with single-cell data in scanpy & besca.

[1]:
#import necessary python packages
import scanpy.api as sc #software suite of tools for single-cell analysis in python
import besca as bc #internal BEDA package for single cell analysis
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy

sc.logging.print_versions() #output an overview of the software versions that are loaded
scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.16.2 scipy==1.4.1 pandas==0.24.1 scikit-learn==0.22.1 statsmodels==0.11.1 python-igraph==0.7.1 louvain==0.6.1
[2]:
from IPython.display import HTML
task = "<style>div.task { background-color: #ffc299; border-color: #ff944d; border-left: 7px solid #ff944d; padding: 1em;} </style>"
HTML(task)
[2]:
[3]:
tag = "<style>div.tag { background-color: #99ccff; border-color: #1a8cff; border-left: 7px solid #1a8cff; padding: 1em;} </style>"
HTML(tag)
[3]:
[4]:
FAIR = "<style>div.fair { background-color: #d2f7ec; border-color: #d2f7ec; border-left: 7px solid #2fbc94; padding: 1em;} </style>"
HTML(FAIR)
[4]:

Dataset

Here we will reload our previously completely processed dataset.

[5]:
#adata = sc.read('/tmp/adata_pbmc_FRESH_processed.h5ad')
adata = bc.datasets.pbmc_storage_processed()
adata.obs.drop(columns=['celltype'], inplace=True)

[6]:
adata
[6]:
AnnData object with n_obs × n_vars = 27901 × 1433
    obs: 'CELL', 'CONDITION', 'experiment', 'sample_type', 'storage_condition', 'donor', 'batch', 'n_counts', 'n_genes', 'percent_mito', 'louvain'
    var: 'ENSEMBL', 'SYMBOL', 'n_cells', 'total_counts', 'frac_reads', 'mean', 'mean_log1p', 'coeffvar', 'coeffvar_log1p'
    uns: 'donor_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'storage_condition_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

Celltype annotation

manual cell annotation based on the expression of known marker genes

[7]:
#define genes (in future you will be able to access genesets from a central database GEMS)
b_cells = ['CD19', 'CD79A', 'MS4A1']
t_cells = ['CD3E', 'CD3G', 'CD3D']
t_cell_subsets = ['CD4', 'CD8A', 'CD8B']
naive_t_cell = ['SELL', 'CCR7', 'IL7R']
myeloid_cells = ['S100A8', 'S100A9', 'CST3']
monocytes = ['FCGR3A', 'FCGR3B', 'CD14'] #FCGR3A/B = CD16
dendritic_cells = ['FCER1A', 'ITGAM', 'ITGAX'] #ITGAM = CD11b #ITGAX= CD11c
NK_cells = ['NCAM1', 'NKG7', 'CD3G']

We always use the HGNC symbols of genes as the main annotation type in the AnnData object since they are the most practical to work with. In addition, the ensemble IDs of all genes is saved in adata.var according to our FAIR standard and always written out since Ensembl IDs stay very robust over time.

Based on the dataset the marker genes that are expressed may vary strongly.

[8]:
sc.settings.set_figure_params(dpi=90)
#visualize the gene expression as an overlay of the umap
#(this way you can visually identify the clusters with a high expression))
sc.pl.umap(adata, color = b_cells, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = t_cells, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = t_cell_subsets, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = naive_t_cell, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = NK_cells,  color_map = 'viridis',ncols = 3)
sc.pl.umap(adata, color = myeloid_cells,  color_map = 'viridis',ncols = 3)
sc.pl.umap(adata, color = monocytes, color_map = 'viridis', ncols = 3)
sc.pl.umap(adata, color = dendritic_cells, color_map = 'viridis', ncols = 3)
../_images/tutorials_notebook2_celltype_annotation_12_0.png
../_images/tutorials_notebook2_celltype_annotation_12_1.png
../_images/tutorials_notebook2_celltype_annotation_12_2.png
../_images/tutorials_notebook2_celltype_annotation_12_3.png
../_images/tutorials_notebook2_celltype_annotation_12_4.png
../_images/tutorials_notebook2_celltype_annotation_12_5.png
../_images/tutorials_notebook2_celltype_annotation_12_6.png
../_images/tutorials_notebook2_celltype_annotation_12_7.png

Per default scanpy plots the gene expression values saved in adata.raw (this means log1p(cp10k)). This is the reason that we can visualize all of these genes even if they are not contained in the highly variable genes.

[9]:
sc.pl.umap(adata, color = 'louvain', palette = 'tab20')
../_images/tutorials_notebook2_celltype_annotation_14_0.png
[10]:
#write down new cluster names (important order needs to be equivalent to above)
#will label the cell types:

new_cluster_names = ['CD14+ monocyte', #0
                     'CD4 T-cell', #1
                     'CD8 T-cell', #2
                     'CD8 T-cell', #3
                     'NK cell', #4
                     'B-cell', #5
                     'CD4 T-cell', #6
                     'CD4 T-cell', #7
                     'CD8 T-cell', #8
                     'FCGR3A+ monocyte', #9
                     'pDC', #10
                     'unknown', #11
                     'CD14+ monocyte'] #12

bc.tl.annotate_cells_clustering(adata=adata, clustering_label='louvain', new_annotation_label='celltype', new_cluster_labels=new_cluster_names)

There are also more advanced methods of cell annotation where mixed clusters (i.e. clusters that contain more than one celltype) can be reclustered seperately and then annoated. Please refer to besca’s documentation for more information.

[11]:
sc.pl.umap(adata, color = ['louvain', 'celltype'], wspace = 0.2)
../_images/tutorials_notebook2_celltype_annotation_17_0.png

Several clusters have been combined into one celltype. Here only a very rudimentary celltype annotation was performed. Depending on the dataset and the type of analysis this can become far more complex and detailled.

write out celltype annotation bc.st.export_celltype(adata=adata, basepath='FILEPATH_TO_FOLDER_CONTAINING_RAW_SUBFOLDER')

[12]:
bc.st.export_celltype(adata=adata, basepath='/tmp/dataset_introduction/')
mapping of cells to  celltype exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file
labelinfo.tsv successfully written out

Here we write out the celltype annotation so that we can use reload it in later analysis steps. There is a correspondng besca function for reloading the annotation.

[13]:
#save AnnData Object for comparision purposes
adata.write('/tmp/dataset_introduction/adata_pbmc_FRESH_processed_annotated.h5ad')

unbiased marker genes for each cluster

[14]:
#perform differential gene expression between each louvain cluster and all other cells using scanpy
sc.tl.rank_genes_groups(adata, groupby='celltype')
[15]:
#visualize the top 3 marker genes as a dot plot
sc.pl.rank_genes_groups_dotplot(adata, n_genes=3, groupby = 'louvain')
WARNING: dendrogram data not found (using key=dendrogram_louvain). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B-cell, CD4 T-cell, CD8 T-cell, etc.
../_images/tutorials_notebook2_celltype_annotation_25_1.png
[16]:
sc.pl.rank_genes_groups_matrixplot(adata, n_genes=3, use_raw=True, groupby = 'louvain')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B-cell, CD4 T-cell, CD8 T-cell, etc.
../_images/tutorials_notebook2_celltype_annotation_26_1.png
[17]:
sc.pl.rank_genes_groups_stacked_violin(adata, n_genes=3, groupby='louvain')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B-cell, CD4 T-cell, CD8 T-cell, etc.
../_images/tutorials_notebook2_celltype_annotation_27_1.png

Celltype Quantification

[18]:
#counts per celltype
data = bc.tl.count_occurance(adata, add_percentage=True, count_variable='celltype')
display(data)
Counts Percentage
CD8 T-cell 8688 31.14
CD4 T-cell 7326 26.26
CD14+ monocyte 4386 15.72
NK cell 3295 11.81
B-cell 3028 10.85
FCGR3A+ monocyte 565 2.03
pDC 377 1.35
unknown 236 0.85
[19]:
#generate a basic bar plot of the table above
fig = plt.figure(figsize=(15,6))

#visualize distribution before logarithmization
ax1 = fig.add_subplot(1, 2, 1)
ax1 = sns.barplot(data = data, x = data.index.tolist(), y = 'Counts' )
plt.xticks(rotation=90)

#visualize distribution after logarithmization
ax2 = fig.add_subplot(1, 2, 2)
ax2 = sns.barplot(data = data, x = data.index.tolist(), y = 'Percentage' )
plt.xticks(rotation=90)
[19]:
(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)
../_images/tutorials_notebook2_celltype_annotation_30_1.png
[20]:
#counts on a donor level
data = bc.tl.count_occurance_subset(adata, count_variable='celltype', subset_variable='donor')
display(data)
Donor_3A Donor_1 Donor_2
B-cell 1539 697 792
CD14+ monocyte 1903 1443 1040
CD4 T-cell 2901 2410 2015
CD8 T-cell 3046 2488 3154
FCGR3A+ monocyte 120 254 191
NK cell 1613 989 693
pDC 119 121 137
unknown 70 120 46
[21]:
#make tidy for plotting
data['celltype'] = data.index.tolist()
data_plot = data.melt(value_name='Counts', id_vars='celltype', var_name='donor')

#generate a plot
fig = plt.figure(figsize=(15,6))
sns.barplot(data=data_plot, x = 'celltype', y = 'Counts', hue='donor', dodge=True)
plt.xticks(rotation=90)
[21]:
(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)
../_images/tutorials_notebook2_celltype_annotation_32_1.png
[22]:
#stacked bar plot incorporated in besca
fig = bc.pl.celllabel_quant_stackedbar(adata, count_variable='celltype', subset_variable = 'donor');
../_images/tutorials_notebook2_celltype_annotation_33_0.png