Hands-on tutorial¶
In this tutorial, we will go through the analysis of a single-cell RNA-seq dataset.
The analysis is broken down into more detailed steps than in the standard workflow, where several of these steps are grouped into wrapper functions in besca.
[2]:
#import necessary python packages
import scanpy as sc #software suite of tools for single-cell analysis in python
import besca as bc #internal BEDA package for single cell analysis
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy
sc.logging.print_versions() #output an overview of the software versions that are loaded
scanpy==1.5.1 anndata==0.7.3 umap==0.4.3 numpy==1.17.5 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.23.1 statsmodels==0.11.1 python-igraph==0.8.2 leidenalg==0.8.1
[3]:
from IPython.display import HTML
task = "<style>div.task { background-color: #ffc299; border-color: #ff944d; border-left: 7px solid #ff944d; padding: 1em;} </style>"
HTML(task)
[4]:
tag = "<style>div.tag { background-color: #99ccff; border-color: #1a8cff; border-left: 7px solid #1a8cff; padding: 1em;} </style>"
HTML(tag)
[5]:
FAIR = "<style>div.fair { background-color: #d2f7ec; border-color: #d2f7ec; border-left: 7px solid #2fbc94; padding: 1em;} </style>"
HTML(FAIR)
Dataset Loading¶
[6]:
#adata = sc.read(filepath)
adata = bc.datasets.Kotliarov2020_raw()
How many cells does this dataset contain?
How many genes does this dataset contain?
Which categories of sample annotation are present?
Do we have different batches? Are they balanced?
How many cells are from each donor? Is it an equal distribution?
[7]:
## solution 1 and 2
adata
[7]:
AnnData object with n_obs × n_vars = 58654 × 32738
obs: 'CELL', 'CONDITION', 'sample_type', 'donor', 'tenx_lane', 'cohort', 'batch', 'sampleid', 'timepoint', 'percent_mito'
var: 'ENSEMBL', 'SYMBOL', 'feature_type'
[8]:
## solution 3
adata.obs.head(5)
[8]:
| | CELL | CONDITION | sample_type | donor | tenx_lane | cohort | batch | sampleid | timepoint | percent_mito |
|---|---|---|---|---|---|---|---|---|---|---|
| 10X_CiteSeq_donor256.AAACCTGAGAGCCCAA-1 | 10X_CiteSeq_donor256.AAACCTGAGAGCCCAA-1 | PBMC_healthy | PBMC | donor256 | H1B1ln1 | H1N1 | 1 | 256 | d0 | 0.041481 |
| 10X_CiteSeq_donor273.AAACCTGAGGCGTACA-1 | 10X_CiteSeq_donor273.AAACCTGAGGCGTACA-1 | PBMC_healthy | PBMC | donor273 | H1B1ln1 | H1N1 | 1 | 273 | d0 | 0.049020 |
| 10X_CiteSeq_donor256.AAACCTGCAGGTGGAT-1 | 10X_CiteSeq_donor256.AAACCTGCAGGTGGAT-1 | PBMC_healthy | PBMC | donor256 | H1B1ln1 | H1N1 | 1 | 256 | d0 | 0.017332 |
| 10X_CiteSeq_donor200.AAACCTGCAGTATCTG-1 | 10X_CiteSeq_donor200.AAACCTGCAGTATCTG-1 | PBMC_healthy | PBMC | donor200 | H1B1ln1 | H1N1 | 1 | 200 | d0 | 0.017222 |
| 10X_CiteSeq_donor233.AAACCTGCATCACAAC-1 | 10X_CiteSeq_donor233.AAACCTGCATCACAAC-1 | PBMC_healthy | PBMC | donor233 | H1B1ln1 | H1N1 | 1 | 233 | d0 | 0.033969 |
[52]:
#ensure the batch annotation is categorical for counting and plotting
adata.obs = adata.obs.astype({'batch': 'category'})
[9]:
## solution 4
bc.tl.count_occurrence(adata,'batch')
[9]:
| batch | Counts |
|---|---|
| 1 | 30668 |
| 2 | 27986 |
[10]:
## solution 5
temp=bc.tl.count_occurrence(adata,'donor')
sns.barplot(y=temp.index,x=temp.Counts,color='gray',orient='h')
#temp
[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aaafd823310>
Quality Control¶
[11]:
fig, ax = plt.subplots(1)
fig.set_figwidth(8)
fig.set_figheight(5)
fig.tight_layout()
bc.pl.transcript_capture_efficiency(adata,ax=ax)
[12]:
fig = bc.pl.librarysize_overview(adata, bins=100)
Preprocessing¶
What do you think are reasonable cutoffs for this dataset? Why do we filter at these particular cutoffs?
[13]:
## solution 6
min_genes = 500
min_counts= 1000
min_cells = 30
max_genes = 3000
max_counts = 20000
max_mito = 0.05
[14]:
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6))= plt.subplots(ncols=3, nrows=2)
fig.set_figwidth(16)
fig.set_figheight(8)
fig.tight_layout(pad=4.5)
bc.pl.kp_genes(adata, min_genes=min_genes, ax = ax1)
bc.pl.kp_counts(adata, min_counts = min_counts, ax = ax2)
bc.pl.kp_cells(adata, min_cells=min_cells, ax = ax3)
bc.pl.max_counts(adata, max_counts=max_counts, ax = ax4)
bc.pl.max_genes(adata, max_genes = max_genes, ax = ax5)
bc.pl.max_mito(adata, max_mito=max_mito, annotation_type='SYMBOL', species='human', ax = ax6)
[15]:
sc.pl.violin(adata, ['percent_mito','n_genes', 'n_counts'], groupby='donor',jitter=0.1,rotation=90)
[16]:
adata = bc.pp.filter(adata, max_counts=max_counts, max_genes = max_genes, max_mito = max_mito, min_genes = min_genes, min_cells = min_cells, min_counts = min_counts)
started with 58654 total cells and 32738 total genes
removed 25 cells that expressed more than 3000 genes
removed 8715 cells that did not express at least 500 genes
removed 0 cells that had more than 20000 counts
removed 143 cells that did not have at least 1000 counts
removed 17856 genes that were not expressed in at least 30 cells
removed 2260 cells that expressed 5.0 percent mitochondrial genes or more
finished with 47511 total cells and 14882 total genes
Were there differences in viability between donors?
[17]:
## Solution 7
sc.pl.violin(adata, ['percent_mito','n_genes', 'n_counts'], groupby='donor',jitter=0.1,rotation=90)
[18]:
adata.obs.donor.value_counts()
temp=bc.tl.count_occurrence(adata,'donor')
sns.barplot(y=temp.index,x=temp.Counts,color='gray',orient='h')
#temp
[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aaaffde7cd0>
[19]:
bc.pl.top_genes_counts(adata=adata, top_n=25, ax=None)
[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aacc998d950>
Which are the most commonly expressed genes? Is this what you would expect?
Normalization¶
[20]:
#normalize per cell
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
#save a copy (logarithmized for better understanding of values) to adata.raw
adata.raw = sc.pp.log1p(adata, copy= True)
Write out cp10k values for loading into a database.
bc.st.export_cp10k(adata=adata, basepath='FILEPATH_TO_FOLDER_CONTAINING_RAW_SUBFOLDER')
What does a cp10k of 10 mean? Why do we normalize to cp10k?
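To make this concrete, here is a toy calculation with hypothetical numbers: a cp10k of 10 means the gene accounts for 10 of every 10,000 counts in a cell, i.e. 0.1% of that cell's library, regardless of sequencing depth.
[ ]:
#toy example (hypothetical numbers): what a cp10k of 10 means
counts_gene = 2        #raw counts of gene G in one cell
counts_cell = 2000     #total counts in that cell
cp10k = counts_gene / counts_cell * 1e4
print(cp10k)           #10.0 -> gene G makes up 0.1% of this cell's counts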
Variance in the dataset¶
[21]:
# calculate mean and coefficient of variation for each gene and add to adata.var
adata.var['mean'] = adata.X.todense().mean(axis = 0).tolist()[0]
adata.var['mean_log1p'] = np.log2(adata.var.get('mean').values)
adata.var['coeffvar'] = scipy.stats.variation(adata.X.todense(), axis = 0)
adata.var['coeffvar_log1p'] = np.log2(adata.var.get('coeffvar').values)
#generate a plot of our data to visualize
x = adata.var.mean_log1p.to_frame()
y = adata.var.coeffvar_log1p.to_frame()
#calculate linear fit between mean and cv
from sklearn import linear_model
#generate linear model
lm = linear_model.LinearRegression()
model = lm.fit(x,y)
#make predictions
X = range(int(min(x['mean_log1p'])), int(max(x['mean_log1p'])))
predictions = lm.predict(pd.DataFrame(data = list(X)))
#generate plot
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x = x, y = y, alpha = 0.5, s=1, color = 'black')
ax.set_xlabel('log2(mean gene expression)')
ax.set_ylabel('log2(coefficient of variation)')
ax.set_title('mean variance trend')
ax.plot(X, predictions, color = 'red', linestyle = 'dashed', label="linear regression")
ax.legend()
[21]:
<matplotlib.legend.Legend at 0x2aab0080a350>
What trend can we see in the dataset? Why does it make sense that we observe this trend? Which genes are the ones that interest us?
Selection of highly variable genes¶
[22]:
variable_genes_min_mean = 0.01
variable_genes_max_mean = 5
variable_genes_min_disp = 0.4
#identify genes with variable expression
filter_result = sc.pp.filter_genes_dispersion(adata.X, min_mean=variable_genes_min_mean, max_mean=variable_genes_max_mean, min_disp=variable_genes_min_disp)
sc.pl.filter_genes_dispersion(filter_result)
nbr_variable_genes = sum(filter_result.gene_subset)
print('number of variable genes selected ', nbr_variable_genes )
number of variable genes selected 1567
[23]:
#perform the actual filtering
adata = adata[:, filter_result.gene_subset]
Why do we reduce the number of genes that we look at? What do the genes we select have in common?
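As a quick sanity check after the filtering (not part of the standard workflow), you can peek at how many and which genes remain:
[ ]:
#inspect the genes retained by the dispersion filter
print('genes remaining:', adata.n_vars)
print(adata.var_names[:10].tolist())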
[24]:
#log transform our data (easier to work with values on this scale)
sc.pp.log1p(adata)
Unwanted sources of variance¶
[25]:
#define several input factors
random_seed = 0 #define a random seed so that for stochastic processes you get reproducible results
max_value = 10
What is a potential issue when removing unwanted sources of variance?
[26]:
#remove variance due to difference in counts and percent mitochondrial gene content
adata = sc.pp.regress_out(adata, ['n_counts', 'percent_mito'], copy=True)
[27]:
sc.pp.scale(adata, max_value=10) # Scale data to unit variance and zero mean, and cut-off at max value 10
Write out regressedOut values for loading into a database.
bc.st.export_regressedOut(adata = adata, basepath='FILEPATH_TO_FOLDER_CONTAINING_RAW_SUBFOLDER')
Besides regressing out these covariates, what else might we consider correcting for? Why is that possible in this dataset?
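One obvious candidate is the batch covariate, which the following sections correct for on the neighborhood graph using bbknn. As an alternative sketch (not run in this tutorial), scanpy's ComBat implementation corrects the expression matrix itself; applied to a copy of the data it could look like this:
[ ]:
#not executed here: ComBat corrects the expression matrix for batch,
#whereas bbknn (used below) only corrects the neighborhood graph
adata_combat = adata.copy()
sc.pp.combat(adata_combat, key='batch')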
Principal component analysis¶
[28]:
sc.settings.set_figure_params(dpi=90)
#calculate 50 principal components of the dataset
sc.tl.pca(adata, random_state = random_seed, svd_solver = 'arpack')
#visualize the amount of variance explained by each PC
sc.pl.pca_variance_ratio(adata)
#visualize the loadings onto the first 3 PCs
sc.pl.pca_loadings(adata)
Do we observe differences in the amount of variance that the PCs explain in the corrected vs the uncorrected data? Why might this be?
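If you prefer numbers over the plot, scanpy stores the explained variance ratios in adata.uns after sc.tl.pca; a small sketch:
[ ]:
#variance explained per PC, as stored by sc.tl.pca
vr = adata.uns['pca']['variance_ratio']
print('first 10 PCs:', np.round(vr[:10], 3))
print('cumulative over all computed PCs:', round(float(np.sum(vr)), 3))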
Batch correction¶
Batch correction is performed here using the bbknn graph-based method.
[29]:
adata_uncorrected = adata.copy()
We will continue doing each step both for the corrected and the uncorrected dataset so that we can compare our results.
Nearest neighbor calculation¶
Compute a neighborhood graph of the cells based on the first 50 principal components.
[30]:
%%time
sc.pp.neighbors(adata_uncorrected, n_neighbors=15, random_state = random_seed, n_pcs=50)
CPU times: user 21.4 s, sys: 3.74 s, total: 25.1 s
Wall time: 27 s
The batch correction method we use here is called bbknn. It introduces the batch correction during the calculation of the nearest neighbors: the dataset is split by batch, and each cell is connected to at least a small number of cells from every batch, which brings the more closely related cells from each batch closer together. A big benefit is that it runs extremely quickly.
[31]:
%%time
#perform batch correction using bbknn (graph based batch correction method)
import bbknn
bbknn.bbknn(adata, batch_key='batch') #perform alternative batch correction method
CPU times: user 4.08 s, sys: 16.9 ms, total: 4.1 s
Wall time: 4.1 s
Leiden clustering¶
[32]:
sc.tl.leiden(adata_uncorrected, random_state = random_seed)
#sc.tl.louvain(adata_mnn, random_state=random_seed) # Please uncomment if you calculated adata_mnn (see above)
sc.tl.leiden(adata, random_state = random_seed)
UMAP¶
[33]:
%%time
##Warning this code block will take 1-2mins to execute
sc.tl.umap(adata_uncorrected, random_state = random_seed)
sc.tl.umap(adata, random_state = random_seed)
CPU times: user 1min 43s, sys: 1.81 s, total: 1min 45s
Wall time: 1min 40s
Because this dataset is fairly large, calculating a t-SNE would already take several minutes, which is why we leave it out here.
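If you nevertheless want a t-SNE for comparison, a minimal sketch (expect a runtime of several minutes on roughly 47k cells):
[ ]:
##Warning: optional and slow on a dataset of this size
sc.tl.tsne(adata, random_state = random_seed)
sc.pl.tsne(adata, color = ['leiden', 'batch'])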
Visualize our Results¶
[34]:
sc.settings.set_figure_params(dpi=90)
print('uncorrected')
sc.pl.umap(adata_uncorrected, color = ['donor', 'leiden', 'batch'], wspace = 0.4)
print('bbknn')
sc.pl.umap(adata, color = ['donor', 'leiden', 'batch'], wspace = 0.4)
uncorrected
bbknn
What do you think about the batch correction method? Are there large differences?
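One quick way to put a number on the mixing is to cross-tabulate clusters against batches: in a well-mixed dataset, each cluster's batch proportions should roughly match the global split seen earlier (a plain pandas sketch, not a besca function):
[ ]:
#fraction of each batch within every leiden cluster
pd.crosstab(adata.obs['leiden'], adata.obs['batch'], normalize='index')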
Observe cell type markers¶
This will aid us in labeling the cell types found in the complete dataset.
From this point on we will only proceed with the batch-corrected dataset.
[35]:
#define genesets for cell type identification
b_cells = ['CD19', 'CD79A', 'MS4A1']
t_cells = ['CD3E', 'CD3G', 'CD3D']
t_cell_subsets = ['CD4', 'CD8A', 'CD8B']
naive_t_cell = ['SELL', 'CCR7', 'IL7R']
myeloid_cells = ['S100A8', 'S100A9', 'CST3']
monocytes = ['FCGR3A', 'CD14'] #FCGR3A/B = CD16
dendritic_cells = ['FCER1A', 'ITGAM', 'ITGAX'] #ITGAM = CD11b #ITGAX= CD11c
NK_cells = ['NCAM1', 'NKG7', 'CD3G']
[36]:
sc.settings.set_figure_params(dpi=90)
#visualize the gene expression as an overlay of the umap
#(this way you can visually identify the clusters with a high expression))
print('b_cells')
sc.pl.umap(adata, color = b_cells, color_map = 'viridis', ncols = 3)
print('t_cells')
sc.pl.umap(adata, color = t_cells, color_map = 'viridis', ncols = 3)
print('t_cell_subsets')
sc.pl.umap(adata, color = t_cell_subsets, color_map = 'viridis', ncols = 3)
print('NK_cells')
sc.pl.umap(adata, color = NK_cells, color_map = 'viridis',ncols = 3)
print('myeloid_cells')
sc.pl.umap(adata, color = myeloid_cells, color_map = 'viridis',ncols = 3)
print('monocytes')
sc.pl.umap(adata, color = monocytes, color_map = 'viridis', ncols = 3)
print('dendritic_cells')
sc.pl.umap(adata, color = dendritic_cells, color_map = 'viridis', ncols = 3)
b_cells
t_cells
t_cell_subsets
NK_cells
myeloid_cells
monocytes
dendritic_cells
[37]:
print('t_cell_subsets')
sc.pl.dotplot(adata, groupby='leiden', var_names = t_cells + t_cell_subsets)
t_cell_subsets
[ ]:
sc.pl.umap(adata, color = ['leiden'], palette = 'tab20')
One could annotate the cell types using the observed markers. With besca, we advise using the auto-annot modules if you have a reference dataset, or signature-based annotation otherwise.
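As a minimal illustration of the signature-based idea (using scanpy's generic sc.tl.score_genes rather than besca's dedicated annotation modules), one could score the B-cell marker set defined above and inspect the score on the UMAP:
[ ]:
#score the B-cell markers on the log-normalized values kept in adata.raw
sc.tl.score_genes(adata, gene_list=b_cells, score_name='b_cell_score', use_raw=True)
sc.pl.umap(adata, color='b_cell_score', color_map='viridis')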
[ ]:
# The annotated and batch-corrected dataset can be loaded as follows (this downloads the dataset from Zenodo and can take some time initially)
adata_annotation = bc.datasets.Kotliarov2020_processed()
adata_annotation
Quantify cell types per donor¶
[ ]:
sc.pl.umap(adata_annotation, color = ['donor', 'celltype0', 'celltype1', 'celltype2', 'celltype3'], ncols = 1)
[ ]:
fig = bc.pl.celllabel_quant_stackedbar(adata_annotation, count_variable='donor', subset_variable = 'celltype1')
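If you want the raw numbers behind the stacked bars, a plain pandas cross-tabulation gives the same information as a table (a sketch, not a besca function):
[ ]:
#cell counts per donor and cell type, same information as the stacked bar plot
pd.crosstab(adata_annotation.obs['donor'], adata_annotation.obs['celltype1'])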
A short excursion into differential gene expression analysis¶
This dataset does not contain distinct conditions to compare. However, we can check whether there is large variability between donors; this is purely for illustration purposes.
[ ]:
sc.tl.rank_genes_groups(adata_annotation, use_raw = True, groupby = 'donor')
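To look at the underlying result table instead of plots, the ranked genes can be extracted into a DataFrame, assuming your scanpy version provides sc.get.rank_genes_groups_df; 'donor256' is one of the donor labels seen earlier:
[ ]:
#extract the ranked-genes table for a single donor into a pandas DataFrame
sc.get.rank_genes_groups_df(adata_annotation, group='donor256').head(10)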
[ ]:
sc.pl.rank_genes_groups_dotplot(adata_annotation, n_genes=5, groupby = 'donor')
[ ]:
sc.pl.rank_genes_groups_matrixplot(adata_annotation, n_genes=5, use_raw=True, groupby = 'donor')
[ ]:
sc.pl.rank_genes_groups_stacked_violin(adata_annotation, n_genes=2, groupby='donor')