
helper functions


Extract the AnnData object saved in adata.raw

besca.subset_adata(adata, filter_criteria[, ...])

Subset AnnData object into new object

besca.convert_ensembl_to_symbol(gene_list[, ...])

Convert ENSEMBL gene ids to SYMBOLS. Uses the Python package mygene to look up the supplied list of ENSEMBL ids and return the equivalent list of symbols.

besca.convert_symbol_to_ensembl(gene_list[, ...])

Convert SYMBOLS to ENSEMBL gene ids. Uses the Python package mygene to look up the supplied list of symbols and return the equivalent list of ENSEMBL gene ids.



besca.get_means(adata, mycat[, condition])

Calculates average and fraction expression per category in adata.obs

besca.get_ameans(adata, mycat[, condition])

Calculates average and fraction expression per category in adata.obs. Based on arithmetic mean expression and the fraction of cells expressing a gene per category (works on linear scale).
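To make the per-category summary concrete, the following is a minimal pure-Python sketch of the arithmetic mean and the fraction of expressing cells per category for a single gene. It does not call besca; the function name `mean_and_fraction_per_category`, the "value > 0" definition of expressing, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def mean_and_fraction_per_category(values, categories):
    """Return {category: (arithmetic mean, fraction of cells with value > 0)}.

    values: linear-scale expression of one gene, one entry per cell.
    categories: category label of each cell (same length as values).
    """
    grouped = defaultdict(list)
    for value, category in zip(values, categories):
        grouped[category].append(value)

    summary = {}
    for category, vals in grouped.items():
        mean = sum(vals) / len(vals)
        # Assumption: a cell "expresses" the gene when its value is above zero.
        fraction = sum(1 for v in vals if v > 0) / len(vals)
        summary[category] = (mean, fraction)
    return summary


# Toy data: four cells in two clusters.
expression = [0.0, 2.0, 4.0, 4.0]
clusters = ["A", "A", "B", "B"]
summary = mean_and_fraction_per_category(expression, clusters)
```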

besca.concate_adata(adata1, adata2)

Concatenate two adata objects based on the observations


filter(adata[, max_genes, min_genes, ...])

Filter cell outliers based on counts, numbers of genes expressed, number of cells expressing a gene and mitochondrial gene content.

filter_gene_list(adata, filepath[, use_raw, ...])

Function to remove all genes specified in a gene list read from file.

frac_pos(adata[, threshold])

Calculate the fraction of cells positive for expression of a gene.
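As an illustration of what a fraction-positive value means, here is a small sketch on a plain list-of-lists matrix (rows are cells, columns are genes). The helper `fraction_positive` is hypothetical and is not besca's implementation; in particular, treating "positive" as strictly greater than the threshold is an assumption.

```python
def fraction_positive(matrix, threshold=0):
    """Return, for each gene (column), the fraction of cells (rows) above threshold."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    return [
        sum(1 for row in matrix if row[gene] > threshold) / n_cells
        for gene in range(n_genes)
    ]


# Toy count matrix: 4 cells x 3 genes.
counts = [
    [0, 3, 1],
    [2, 0, 1],
    [0, 0, 5],
    [1, 0, 2],
]
fractions = fraction_positive(counts)
```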


Calculate the fraction of reads attributed to a specific gene.


Calculate the mean expression of a gene.

top_expressed_genes(adata[, top_n])

Give out the genes most frequently expressed in cells.

fraction_counts(adata[, species, name, ...])

Function to calculate fraction of counts per cell from a gene list.

top_counts_genes(adata[, top_n])

Give out the genes that contribute the largest fraction to the total UMI counts.
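The idea of ranking genes by their share of the total counts can be sketched as follows. This is only an illustration on a toy list-of-lists matrix; the helper `top_count_fractions` and the gene names are hypothetical, and the sketch does not reproduce besca's return format.

```python
def top_count_fractions(matrix, gene_names, top_n=2):
    """Rank genes by their share of all counts in the matrix (rows = cells)."""
    total = sum(sum(row) for row in matrix)
    per_gene = [sum(row[g] for row in matrix) for g in range(len(gene_names))]
    ranked = sorted(zip(gene_names, per_gene), key=lambda item: item[1], reverse=True)
    return [(name, count / total) for name, count in ranked[:top_n]]


# Toy count matrix: 4 cells x 3 genes.
counts = [
    [0, 4, 1],
    [2, 0, 1],
    [0, 0, 5],
    [1, 0, 2],
]
genes = ["geneA", "geneB", "geneC"]
top = top_count_fractions(counts, genes, top_n=2)
```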


Perform geometric normalization on CITEseq data.

valOutlier(adata[, nmads, rlib_loc])

Estimates and returns the thresholds to use for gene/cell filtering based on outliers calculated from the deviation to the median QCs.
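The notion of thresholds derived from the deviation to the median can be sketched with the median absolute deviation (MAD). The following is an assumption-laden illustration in pure Python, not the routine that valOutlier calls: the helper `mad_thresholds` is hypothetical, and no consistency constant is applied (R's `mad` applies one by default).

```python
from statistics import median


def mad_thresholds(values, nmads=3):
    """Return (lower, upper) bounds at nmads median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - nmads * mad, med + nmads * mad


# Toy QC metric: number of detected genes per cell, with one obvious outlier.
genes_per_cell = [900, 1000, 1100, 1000, 5000]
lower, upper = mad_thresholds(genes_per_cell, nmads=3)
outliers = [v for v in genes_per_cell if v < lower or v > upper]
```

Cells outside the returned interval would then be candidates for filtering.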

scTransform(adata[, hvg, n_genes, rlib_loc])

Function to call scTransform normalization or HVG selection from Python.


kp_genes(adata[, threshold, min_genes, ax, ...])

visualize the minimum genes per cell threshold.

kp_counts(adata[, min_counts, ax, figsize])

visualize the minimum UMI counts per cell threshold.

kp_cells(adata[, threshold, min_cells, ax, ...])

visualize the minimum number of cells expressing a gene threshold.

max_counts(adata[, max_counts, ax, figsize])

visualize maximum UMI counts per cell threshold.

max_genes(adata[, max_genes, ax, figsize])

visualize maximum number of genes per cell threshold.

max_mito(adata[, max_mito, annotation_type, ...])

visualize maximum mitochondrial gene percentage threshold.

dropouts(adata[, ax, bins, figsize])

Plot number of dropouts.

detected_genes(adata[, ax, bins, figsize])

Plot number of detected genes.

library_size(adata[, ax, bins, figsize])

Plot library size.

librarysize_overview(adata[, bins, figsize])

Generates an overview figure of library size, dropouts and detected genes.

transcript_capture_efficiency(adata[, ax, ...])

Plot total gene counts vs detection probability.

top_genes_counts(adata[, top_n, ax, figsize])

plot top n genes that contribute to fraction of counts per cell

gene_expr_split(adata, genes[, ...])

visualize gene expression of two groups as a split violin plot

gene_expr_split_stacked(adata, genes, ...[, ...])

Stacked violin plot for visualization of genes expression.

box_per_ind(plotdata, y_axis, x_axis[, ...])

plot boxplot with values per individual.

stacked_split_violin(tidy_data, x_axis, ...)

plot stacked split violin plots.

celllabel_quant_boxplot(adata, ...[, ...])

generate a box and whisker plot with overlaid swarm plot of celltype abundances

celllabel_quant_stackedbar(adata, ...[, ...])

Generate a stacked bar plot of the percentage of labelcounts within each AnnData subset

dot_heatmap(adata, genes[, group_by, ...])

Generate a dot plot, filled with a heatmap of individual cells' gene expression.

dot_heatmap_split(adata, genes, split_by[, ...])

Generate a dot plot, filled with a heatmap of individual cells' gene expression, to compare two conditions.

dot_heatmap_split_greyscale(adata, genes, ...)

Generate a dot plot, filled with a heatmap of individual cells' gene expression, to compare two conditions (greyscale).

update_qualitative_palette(adata, palette[, ...])

Update adata object such that the umap will adhere to the palette provided.

nomenclature_network(config_file[, ...])

Plot a nomenclature network based on annotation config file.

riverplot_2categories(adata, categories[, ...])

Generate a riverplot/Sankey diagram between two categories.

flex_dotplot(df, X, Y, HUE, SIZE, title[, ...])

Generate a dot plot showing average expression and fraction positive cells


count_occurrence(adata[, count_variable, ...])

Generate dataframe containing the label counts/percentages of a specific column in adata.obs
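A label count with percentages can be sketched without any dependencies. The helper `count_labels` and the column names "Counts" and "Percentage" below are hypothetical and chosen for illustration; besca returns a dataframe whose exact layout is not shown here.

```python
from collections import Counter


def count_labels(labels, add_percentage=True):
    """Return {label: {"Counts": n, "Percentage": share in percent}}."""
    counts = Counter(labels)
    total = len(labels)
    table = {}
    for label, n in counts.items():
        row = {"Counts": n}
        if add_percentage:
            row["Percentage"] = 100 * n / total
        table[label] = row
    return table


# Toy stand-in for one column of adata.obs.
celltypes = ["T cell", "B cell", "T cell", "T cell"]
table = count_labels(celltypes)
```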

count_occurrence_subset(adata, subset_variable)

count occurrence of a label in adata.obs after subsetting adata object

count_occurrence_subset_conditions(adata, ...)

count occurrence of a label for each condition in adata.obs after subsetting adata object

annotate_cells_clustering(adata, ...[, ...])

Function to add annotation to adata.obs based on clustering. This function replaces the original cluster labels located in the column clustering_label with the new values specified in the list new_cluster_labels.
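Replacing cluster labels with names from a list amounts to building a cluster-to-name mapping. The sketch below assumes the i-th name belongs to the i-th cluster in numeric order, which is an assumption about the convention rather than something stated above; the helper `annotate_clusters` is hypothetical and operates on a plain list instead of adata.obs.

```python
def annotate_clusters(cluster_per_cell, new_labels):
    """Map each cell's cluster id to a new label, one label per cluster."""
    # Assumption: cluster ids are numeric strings and labels follow numeric order.
    clusters = sorted(set(cluster_per_cell), key=int)
    if len(clusters) != len(new_labels):
        raise ValueError("need exactly one new label per cluster")
    mapping = dict(zip(clusters, new_labels))
    return [mapping[c] for c in cluster_per_cell]


# Toy stand-in for a clustering column with three clusters.
louvain = ["0", "1", "0", "2"]
celltype = annotate_clusters(louvain, ["T cell", "B cell", "monocyte"])
```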

report(adata_pred, celltype, method, ...[, ...])

reports basic metrics, produces confusion matrices and plots umap of prediction

plot_confusion_matrix(y_true, y_pred, ...[, ...])

plots confusion matrices


batch correction

Collection of functions to perform batch correction.

batch_correct(adata, batch_to_correct)

function to perform batch correction

postprocess_mnnpy(adata, bdata)

postprocessing to generate a newly functional AnnData object

differential gene expression

Collection of functions to aid in differential gene expression analysis.

perform_dge(adata, design_matrix, ...[, ...])

Perform differential gene expression between two conditions over many adata subsets.

plot_interactive_volcano(top_table_path, outdir)

plot an interactive volcano plot based on toptable file.

get_de(adata, mygroup[, demethod, topnr, ...])

Get a table of significant DE genes at certain cutoffs. Based on an AnnData object and an annotation category (e.g.

signature scoring

Collection of functions to aid in signature scoring.

combined_signature_score(adata[, GMT_file, ...])

Super Wrapper function to compute combined signature score for UP and DN scores.

compute_signed_score(adata, signature_dict)

Compute signed score combining UP and DN for all signatures in signature_dict. This function combines geneset (signature) scores.
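One simple way to combine UP and DN genes into a single signed score is to subtract the mean expression of the DN genes from that of the UP genes. This is only a sketch of that idea for a single cell; the scoring method besca actually uses is not described here, and the helper `signed_score` and the `{"UP": [...], "DN": [...]}` layout are assumptions.

```python
def signed_score(cell_expression, signature):
    """Mean expression of UP genes minus mean expression of DN genes for one cell.

    cell_expression: {gene: value}; signature: {"UP": [...], "DN": [...]}.
    Genes missing from the cell are ignored; an empty direction contributes 0.
    """
    def mean_of(genes):
        present = [cell_expression[g] for g in genes if g in cell_expression]
        return sum(present) / len(present) if present else 0.0

    return mean_of(signature.get("UP", [])) - mean_of(signature.get("DN", []))


cell = {"ISG15": 4.0, "IFIT1": 2.0, "MYC": 1.0}
score = signed_score(cell, {"UP": ["ISG15", "IFIT1"], "DN": ["MYC"]})
```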

filter_by_set(strs, universe_set)

Remove strings from the list that are not in the universe set

filter_siggenes(adata, signature_dict)

Filter all signatures in signature_dict to remove genes not present in adata
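The two filtering helpers above can be pictured as a list filter applied to every gene list of every signature. The sketch below uses hypothetical names (`keep_known`, `filter_signatures`) and assumes each signature is a dict of direction to gene list; it takes a plain list of measured genes rather than an AnnData object.

```python
def keep_known(strs, universe_set):
    """Keep, in their original order, only the strings that occur in universe_set."""
    return [s for s in strs if s in universe_set]


def filter_signatures(signature_dict, measured_genes):
    """Apply keep_known to every direction of every signature."""
    universe = set(measured_genes)
    return {
        name: {
            direction: keep_known(genes, universe)
            for direction, genes in directions.items()
        }
        for name, directions in signature_dict.items()
    }


signatures = {"Tcell": {"UP": ["CD3E", "CD3D", "FOO"], "DN": ["CD19"]}}
measured = ["CD3E", "CD3D", "CD19", "MS4A1"]
filtered = filter_signatures(signatures, measured)
```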

convert_siggenes(signature_dict, conversion)

Convert signature genes with a ortholog conversion Series

read_GMT_sign(GMT_file[, UP_suffix, ...])

Read gmt file to extract signed genesets.
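A GMT file stores one geneset per line as tab-separated fields: name, description, then the genes. The following sketch parses such lines into signed genesets by stripping a direction suffix from the set name. The helper `parse_signed_gmt`, the nested output layout, and the choice to treat sets without a suffix as UP are assumptions for illustration, not besca's documented behaviour.

```python
def parse_signed_gmt(lines, up_suffix="_UP", dn_suffix="_DN"):
    """Parse GMT lines into {signature: {"UP": [...], "DN": [...]}}."""
    signatures = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[0]
        genes = [g for g in fields[2:] if g]
        if name.endswith(up_suffix):
            base, direction = name[: -len(up_suffix)], "UP"
        elif name.endswith(dn_suffix):
            base, direction = name[: -len(dn_suffix)], "DN"
        else:
            base, direction = name, "UP"  # assumption: undirected sets count as UP
        signatures.setdefault(base, {})[direction] = genes
    return signatures


# Toy GMT content; with a real file, pass the open file handle instead.
gmt_lines = [
    "IFN_UP\tinterferon response, induced\tISG15\tIFIT1\n",
    "IFN_DN\tinterferon response, repressed\tMYC\n",
    "Bcell\tB cell markers\tCD19\tMS4A1\n",
]
signed = parse_signed_gmt(gmt_lines)
```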

getset(df, signame_complete, threshold)

Handles missing signatures; auxiliary function for make_anno. Based on a dataframe of p-values, a signature name and a cutoff, checks whether the signature is present.

:param df: a dataframe of p-values per signature and cell cluster
:type df: pandas.DataFrame
:param signame_complete: signature name
:type signame_complete: str
:param threshold: cutoff used for cluster attributions
:type threshold: numpy.float64

score_mw(f, mymarkers)

Score clusters based on a set of immune signatures to generate a dataframe of p-values. Takes as input a dataframe with fractions per cluster and a dictionary of signatures. Performs a Mann-Whitney test for each column and signature; returns -10logpValues.

add_anno(adata, cnames, mycol[, clusters])

Adds annotation generated with make_anno to an AnnData object. Takes as input the AnnData object to which the annotation will be appended and the annotation pd. Generates a pd.Series that can be added as an adata.obs column with annotation at a level.

make_anno(df, sigscores, sigconfig, levsk[, ...])

Annotate cell types. Based on a dataframe of -log10pvals, a cutoff and a signature set, generates the cell annotation. Hierarchical model of immune cell annotation.


Reads the configuration file for signature-based hierarchical annotation.

match_cluster(adata, obsquery, obsqueryval)

Matches categories from adata.obs to each other.

obtain_new_label(nomenclature_file, cnames)

Matches the cnames obtained by the make_anno function or a list of label names to the db label (standardized label from a nomenclature file).

obtain_dblabel(nomenclature_file, cnames[, ...])

Matches the cnames obtained by the make_anno function to the db label (standardized label).

get_gems(setName, BASE_URL[, UP_suffix, ...])

Connect to GeMS, download the related genesets (specified by setName, which can be a prefix/suffix) and return them. This function combines the UP and DN genes of genesets (signatures). Non-directional genesets are considered UP by default.

:param setName: setName to find in GeMS (can be a subset)
:type setName: str
:param BASE_URL: GeMS URL for the API. Should look like: 'http://' + hostname + ':' + localport
:type BASE_URL: str
:param UP_suffix: suffix indicating that the signature is in the UP direction. This should be the end of the signature name ($)
:type UP_suffix: str | default = "_UP"
:param DN_suffix: suffix indicating that the signature is in the DN direction. This should be the end of the signature name ($)
:type DN_suffix: str | default = "_DN"

insert_gems(BASE_URL, genesets, params[, ...])

Insert genesets into the local GeMS server. url_host will depend on the GeMS deployment and could be stored in a credentials file.

:param BASE_URL: a string 'http://' + hostname + ':' + localport
:type BASE_URL: str
:param genesets: a list of dicts; each dict is a signature; the keys should map to the headers
:type genesets: list
:param params: the command-line arguments for GMTx file upload (see below) based on GeMS structure
:type params: list of str
:param headers: each element is a key of the GeMS setup in place. The minimal requirement for a geneset would be setName, desc and genes (minimal GMT)
:type headers: list of str

get_similar_geneset(request, BASE_URL[, ...])

Encapsulates a small similarity search. Looks for similarity within GeMS and the MongoDB collections and returns the associated genesets.

:param request: request specificity; if the hosted collection is large, one might need to specify the geneset in more detail
:type request: str
:param BASE_URL: GeMS URL for the API. Should look like: 'http://' + hostname + ':' + localport
:type BASE_URL: str
:param UP_suffix: suffix indicating that the signature is in the UP direction. This should be the end of the signature name ($)
:type UP_suffix: str | default = "_UP"
:param DN_suffix: suffix indicating that the signature is in the DN direction. This should be the end of the signature name ($)
:type DN_suffix: str | default = "_DN"

export_annotconfig(sigconfig, levsk, ...[, ...])

Export the configuration defined in sigconfig and levsk. The order might change compared to the original sig.

convert_to_directed(signature_dict[, direction])

Convert a simple dictionary into one with direction compatible with combined_signature_score
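Converting a simple dictionary into a directed one can be pictured as wrapping each gene list under a direction key. The nested `{name: {direction: genes}}` layout and the helper `to_directed` are assumptions made for illustration; the structure combined_signature_score expects should be checked against the besca documentation.

```python
def to_directed(signature_dict, direction="UP"):
    """Wrap each gene list as {direction: genes}, copying the list."""
    return {name: {direction: list(genes)} for name, genes in signature_dict.items()}


simple = {"Bcell": ["CD19", "MS4A1"], "Tcell": ["CD3E"]}
directed = to_directed(simple)
```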

make_gmtx(setName, desc, User, Source, ...)

Construct a gmtx file according to format conventions for import into GeMS.

:param setName: informative set name, e.g. Pembro_induced_MC38CD8Tcell, Plasma_mdb, TGFB_Stromal_i
:type setName: str
:param desc: informative and verbose signature description; for cell type signatures use nomenclature; if a coefficient is used, explain what it represents; link to the study if present; e.g. Genes higher expressed in Pembro vs. vehicle in non-naive CD8-positive T cells in MC38 in vivo exp. ID time T2; coefs are log2FC
:type desc: str
:param User: related to signature origin, e.g. Public (for literature-derived sets), own user ID for analysis-derived sets, rtsquad, scsquad, gred, other
:type User: str
:param Source: source of the signature, one of Literature scseq, Literature, besca, scseqmongodb, internal scseq, pRED, Chugai, gRED, other
:type Source: str
:param Subtype: specific subtype, e.g. onc, all, healthy, disease
:type Subtype: str
:param domain: one of pathway, biological process, cellular component, molecular function, phenotype, perturbation, disease, misc, microRNA targets, transcription factor targets, cell marker, tissue marker
:type domain: str
:param genesetname: shared across different signatures of a specific type, e.g. besca_marker, dblabel_marker, Pembro_induced_MC38CD8Tcell, FirstAuthorYearPublication
:type genesetname: str
:param genes: tab-separated list of genes with/without a coefficient, e.g. Vim | 2.4 Bin1 | 2.02 or Vim Bin1
:type genes: str
:param studyID: study name as in scMongoDB/bescaviz; only when source=internal scseq
:type studyID: str | default = None
:param analysisID: analysis name as in scMongoDB/bescaviz; only when source=internal scseq
:type analysisID: str | default = None
:param application: specify which application will read the geneset, e.g. rtbeda_CIT, bescaviz, celltypeviz
:type application: str | default = None
:param celltype: for cell markers, specify celltype according to dblabel_short convention to facilitate reuse
:type celltype: str | default = None
:param coef_type: specify what the coefficient corresponds to, e.g. logFC, gini, SAM, score, ...
:type coef_type: str | default = score

write_gmtx_forgems(signature_dict, GMT_file)

Writes a gmtx file that can later be uploaded to GeMS.

silhouette_computation(adata[, cluster, ...])

Compute the average and per-cell (i.e. per-sample) silhouette score for the cluster label (should be present in adata.obs; level 3 annotation), the computed level 2 annotation and a random cell assignment.

match_label(vector_label, nomenclature_file)

Return a table matching values in vector label.


Collection of functions to perform reclustering on selected subclusters.

recluster(adata, celltype[, celltype_label, ...])

Perform subclustering on specific celltype to identify subclusters.

annotate_new_cellnames(adata, ...[, ...])

annotate new cellnames to each of the subclusters identified by running recluster.


Collection of functions to perform auto-annot: annotating a single-cell dataset based on a reference one.

read_data(train_paths, train_datasets, ...)

Function to read in training and testing datasets

read_raw(train_paths, train_datasets, ...)

read from adata.raw and revert log1p normalization
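Reverting a log1p transformation means applying its inverse, expm1, to every value. The snippet below only demonstrates this round trip on a few numbers with the standard library; it does not read any AnnData object, and it assumes the natural-log variant of log1p.

```python
import math

# Linear-scale values, their log1p-transformed form, and the restored values.
counts = [0.0, 1.0, 9.0]
logged = [math.log1p(c) for c in counts]      # log(1 + x)
restored = [math.expm1(v) for v in logged]    # exp(x) - 1, the inverse of log1p
```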

read_adata(train_paths, train_datasets, ...)

read adata files of training and testing datasets

merge_data(adata_trains, adata_pred[, ...])

read adata files of training and testing datasets


concatenates training anndata objects

scanorama_merge(adata_trains, adata_pred, ...)

corrects datasets using scanorama and merge training datasets subsequently

remove_genes(adata_trains, adata_pred, ...)

removes all genes not in gene set

intersect_genes(adata_train, adata_pred)

removes all genes not in all datasets
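Keeping only the genes present in every dataset is a set intersection. The sketch below works on plain gene-name lists and preserves the order of the first list; the helper `shared_genes` is hypothetical and stands in for the AnnData subsetting that besca performs.

```python
def shared_genes(*gene_lists):
    """Return the genes present in every list, in the order of the first list."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [g for g in gene_lists[0] if g in common]


train_genes = ["CD3E", "CD19", "MS4A1", "LYZ"]
pred_genes = ["LYZ", "CD3E", "NKG7"]
common = shared_genes(train_genes, pred_genes)
```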

remove_nonshared(adata_train, adata_pred[, ...])

removes all celltypes not in all datasets

fit(adata_train, method, celltype[, njobs, ...])

fits classifier on training dataset

linear_svm(train, y_train)

fits linear svm on training dataset

rbf_svm(train, y_train)

fits radial basis function kernel svm on training dataset

sgd_svm(train, y_train)

fits linear svm on training dataset using stochastic gradient descent

random_forest(train, y_train, njobs)

fits a random forest of a thousand estimators with balanced class weights on training dataset.

logistic_regression(train, y_train, njobs)

multiclass crossvalidated logistic regression with balanced class weight.

logistic_regression_ovr(train, y_train, njobs)

multiclass crossvalidated logistic regression with balanced class weight.

logistic_regression_elastic(train, y_train, ...)

multiclass crossvalidated logistic regression with balanced class weight.

adata_predict(classifier, scaler, ...[, ...])

predicts on testing set using trained classifier

predict(classifier, scaler, adata_pred[, ...])

predicts on testing set using trained classifier

adata_pred_prob(classifier, scaler, ...[, ...])

predicts on testing set using trained classifier and returns class probability for every cell and every class

predict_proba(classifier, scaler, adata_pred)

predicts on testing set using trained classifier and returns probabilities

report(adata_pred, celltype, method, ...[, ...])

reports basic metrics, produces confusion matrices and plots umap of prediction Writes out a csv file containing all accuracy and f1 scores.
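For readers unfamiliar with the reported metrics, accuracy and a macro-averaged F1 score can be computed by hand as sketched below. Which F1 averages besca writes to the csv file is not stated above, so the macro average here is an assumption, and the helpers `accuracy` and `macro_f1` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_true = ["T", "T", "B", "B"]
y_pred = ["T", "B", "B", "B"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```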

scanvi_predict(adata_trains, adata_pred, ...)

merges all datasets and predicts on testing set with scANVI.

scvi_merge(adata_trains, adata_pred)

merges all datasets and stores learnt representation in obsm


plots a umap of all merged datasets coloured by dataset of origin.


read_mtx(filepath[, annotation, use_genes, ...])

Read matrix.mtx, genes.tsv, barcodes.tsv to AnnData object. By specifying an input folder, this function reads the contained matrix.mtx, genes.tsv and barcodes.tsv files to an AnnData object. If annotation = True, it also adds the annotation contained in metadata.tsv to the object.

:param filepath: filepath as string to the directory containing the matrix.mtx, genes.tsv, barcodes.tsv and, if applicable, metadata.tsv
:type filepath: str
:param annotation: boolean identifier of whether an annotation file is also located in the folder and should be added to the AnnData object
:type annotation: bool (default = True)
:param use_genes: either SYMBOL or ENSEMBL. Other gene names are not yet supported.
:type use_genes: str
:param species: string specifying the species; only needs to be used when no gene symbols are supplied and you only have the ENSEMBL gene ids to perform a lookup
:type species: str | default = 'human'
:param citeseq: string indicating whether only gene expression values ('gex_only') or only protein expression values ('citeseq_only') are read; everything is read if None is specified
:type citeseq: 'gex_only' or 'citeseq_only' or False or None | default = None

add_cell_labeling(adata, filepath[, label])

add a labeling written out in the FAIR format to adata.obs

assert_adata(adata[, attempFix])

Asserts that an adata object contains the information needed for the besca pipeline to run and export information.


X_to_mtx(adata[, outpath, write_metadata, ...])

export adata object to mtx format (matrix.mtx, genes.tsv, barcodes.tsv)

raw_to_mtx(adata[, outpath, write_metadata, ...])

export adata.raw to .mtx (matrix.mtx, genes.tsv, barcodes.tsv)

clustering(adata[, outpath, export_average, ...])

export mapping of cells to clusters to .tsv file

write_labeling_to_files(adata[, outpath, ...])

export mapping of cells to specified label to .tsv file

labeling_info([outpath, description, ...])

write out labeling info for uploading to database

analysis_metadata(adata[, outpath, ...])

export plotting coordinates to analysis_metadata.tsv

generate_gep(adata[, filename, column, ...])

Generate Gene Expression Profile (GEP) from scRNA-seq annotations

ranked_genes(adata[, type, outpath, ...])

export marker genes for each cluster to .gct file

pseudobulk(adata[, outpath, column, label, ...])

export pseudobulk profiles of cells to .gct files
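A pseudobulk profile aggregates the counts of all cells that share a label into one profile per label. The sketch below sums counts per group on a toy list-of-lists matrix; whether besca sums or averages, and the .gct layout it writes, are not covered here, and the helper `pseudobulk_sums` is hypothetical.

```python
def pseudobulk_sums(matrix, groups):
    """Sum the count vectors (rows) of all cells belonging to the same group."""
    profiles = {}
    for row, group in zip(matrix, groups):
        if group not in profiles:
            profiles[group] = list(row)
        else:
            profiles[group] = [a + b for a, b in zip(profiles[group], row)]
    return profiles


# Toy count matrix: 4 cells x 3 genes, with one label per cell.
counts = [
    [0, 4, 1],
    [2, 0, 1],
    [0, 0, 5],
    [1, 0, 2],
]
labels = ["T cell", "B cell", "T cell", "B cell"]
profiles = pseudobulk_sums(counts, labels)
```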


read_matrix(root_path[, citeseq, ...])

Read matrix file as expected for the standard workflow.

filtering_cells_genes_min(adata, ...)

filtering_mito_genes_max(adata, ...)

export_cp10k(adata, basepath)

Export raw cp10k to FAIR format for loading into database

export_regressedOut(adata, basepath)

Export regressedOut to FAIR format for loading into database

export_clustering(adata, basepath, method)

Export cluster to cell mapping to FAIR format for loading into database

export_metadata(adata, basepath[, n_pcs, ...])

Export metadata in FAIR format for loading into database

export_rank(adata, basepath[, type, ...])

Export ranked genes to FAIR format for loading into database

export_celltype(adata, basepath)

Export celltype annotation to cell mapping in FAIR format for loading into database

additional_labeling(adata, labeling_to_use, ...)

Standard Workflow function to export an additional labeling besides louvain to FAIR format.

celltype_labeling(adata, labeling_author, ...)

Standard Workflow function to export an additional labeling besides louvain to FAIR format.