besca

helper functions

besca.get_raw(adata)

Extract the AnnData object saved in adata.raw

besca.subset_adata(adata, filter_criteria[, ...])

Subset AnnData object into new object

besca.convert_ensembl_to_symbol(gene_list[, ...])

Convert ENSEMBL gene ids to SYMBOLS. Uses the Python package mygene to look up the supplied list of ENSEMBL ids and return the equivalent list of symbols.

besca.convert_symbol_to_ensembl(gene_list[, ...])

Convert SYMBOLS to ENSEMBL gene ids. Uses the Python package mygene to look up the supplied list of symbols and return the equivalent list of ENSEMBL gene ids.

besca.get_means(adata, mycat[, condition])

Calculates average and fraction expression per category in adata.obs

besca.get_ameans(adata, mycat[, condition])

Calculates average and fraction expression per category in adata.obs. Based on the arithmetic mean expression and the fraction of cells expressing the gene per category (works on linear scale).
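
The per-category summary described above can be illustrated without besca. The following is a minimal plain-Python sketch, not besca's implementation: the function name category_means and the toy data are hypothetical, and a cell is assumed to "express" a gene when its linear-scale value is above zero.

```python
from collections import defaultdict


def category_means(expr, categories):
    """Return {category: (arithmetic mean, fraction of cells with expr > 0)}."""
    grouped = defaultdict(list)
    for value, cat in zip(expr, categories):
        grouped[cat].append(value)
    return {
        cat: (sum(vals) / len(vals), sum(v > 0 for v in vals) / len(vals))
        for cat, vals in grouped.items()
    }


# Hypothetical linear-scale expression of one gene in five cells.
expr = [0.0, 2.0, 4.0, 0.0, 6.0]
celltype = ["T", "T", "T", "B", "B"]
summary = category_means(expr, celltype)
```

Each entry of summary pairs the mean expression of the gene with the fraction of expressing cells for one category.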

besca.concate_adata(adata1, adata2)

Concatenate two adata objects based on the observations

preprocessing

filter(adata[, max_genes, min_genes, ...])

Filter cell outliers based on counts, numbers of genes expressed, number of cells expressing a gene and mitochondrial gene content.

filter_gene_list(adata, filepath[, use_raw, ...])

Function to remove all genes specified in a gene list read from file.

frac_pos(adata[, threshold])

Calculate the fraction of cells positive for expression of a gene.

frac_reads(adata)

Calculate the fraction of reads being attributed to a specific gene.

mean_expr(adata)

Calculate the mean expression of a gene.
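
The three per-gene metrics above (fraction of positive cells, fraction of reads, mean expression) can be sketched on a small cells-by-genes matrix. This is an illustrative plain-Python version under stated assumptions, not besca's code: gene_qc is a hypothetical name, and a cell counts as positive when its value is strictly above threshold.

```python
def gene_qc(counts, threshold=0):
    """Compute per-gene QC metrics from a list of per-cell count rows."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    total = sum(sum(row) for row in counts)
    stats = []
    for g in range(n_genes):
        col = [row[g] for row in counts]
        stats.append({
            "frac_pos": sum(v > threshold for v in col) / n_cells,  # cells above threshold
            "frac_reads": sum(col) / total,                         # share of all counts
            "mean_expr": sum(col) / n_cells,                        # mean over all cells
        })
    return stats


# Hypothetical count matrix: four cells (rows) by two genes (columns).
counts = [[0, 4], [2, 4], [2, 0], [0, 8]]
qc = gene_qc(counts)
```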

top_expressed_genes(adata[, top_n])

Give out the genes most frequently expressed in cells.

fraction_counts(adata[, species, name, ...])

Function to calculate fraction of counts per cell from a gene list.

top_counts_genes(adata[, top_n])

Give out the genes that contribute the largest fraction to the total UMI counts.

normalize_geometric(adata)

Perform geometric normalization on CITEseq data.
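
One common reading of geometric normalization for CITE-seq data is the centered log-ratio (CLR) transform, which is equivalent to dividing each value by the geometric mean before taking logs. The sketch below assumes that reading; clr_transform and the pseudocount handling of zeros are illustrative choices, not necessarily what normalize_geometric does internally.

```python
import math


def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean log across all values."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical antibody counts for a single cell.
adt_counts = [1, 3, 7]
normalized = clr_transform(adt_counts)
```

By construction the transformed values of a cell sum to zero, which makes cells with different total antibody counts comparable.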

valOutlier(adata[, nmads, rlib_loc])

Estimates and returns the thresholds to use for gene/cell filtering based on outliers calculated from the deviation to the median QCs.
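
Thresholds based on deviation from the median are typically derived from the median absolute deviation (MAD). The following is a rough sketch of that idea only; mad_thresholds is a hypothetical helper, the MAD is left unscaled, and valOutlier itself may compute its thresholds differently (its rlib_loc argument suggests it delegates to R).

```python
from statistics import median


def mad_thresholds(values, nmads=3):
    """Return (lower, upper) bounds at nmads unscaled MADs around the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - nmads * mad, med + nmads * mad


# Hypothetical per-cell library sizes.
library_sizes = [1000, 1200, 1100, 900, 5000]
lower, upper = mad_thresholds(library_sizes)
outliers = [v for v in library_sizes if v < lower or v > upper]
```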

scTransform(adata[, hvg, n_genes, rlib_loc])

Function to call scTransform normalization or HVG selection from Python.

plotting

kp_genes(adata[, threshold, min_genes, ax, ...])

visualize the minimum gene per cell threshold.

kp_counts(adata[, min_counts, ax, figsize])

visualize the minimum UMI counts per cell threshold.

kp_cells(adata[, threshold, min_cells, ax, ...])

visualize the minimum number of cells expressing a gene threshold.

max_counts(adata[, max_counts, ax, figsize])

visualize maximum UMI counts per cell threshold.

max_genes(adata[, max_genes, ax, figsize])

visualize maximum number of genes per cell threshold.

max_mito(adata[, max_mito, annotation_type, ...])

visualize maximum mitochondrial gene percentage threshold.

dropouts(adata[, ax, bins, figsize])

Plot number of dropouts.

detected_genes(adata[, ax, bins, figsize])

Plot number of detected genes.

library_size(adata[, ax, bins, figsize])

Plot library size.

librarysize_overview(adata[, bins, figsize])

Generates overview figure of library size, dropouts and detected genes.

transcript_capture_efficiency(adata[, ax, ...])

Plot total gene counts vs detection probability.

top_genes_counts(adata[, top_n, ax, figsize])

plot top n genes that contribute to fraction of counts per cell

gene_expr_split(adata, genes[, ...])

visualize gene expression of two groups as a split violin plot

gene_expr_split_stacked(adata, genes, ...[, ...])

Stacked violin plot for visualization of genes expression.

box_per_ind(plotdata, y_axis, x_axis[, ...])

plot boxplot with values per individual.

stacked_split_violin(tidy_data, x_axis, ...)

plot stacked split violin plots.

celllabel_quant_boxplot(adata, ...[, ...])

generate a box and whisker plot with overlayed swarm plot of celltype abundances

celllabel_quant_stackedbar(adata, ...[, ...])

Generate a stacked bar plot of the percentage of labelcounts within each AnnData subset

dot_heatmap(adata, genes[, group_by, ...])

Generate a dot plot, filled with a heatmap of individual cells' gene expression.

dot_heatmap_split(adata, genes, split_by[, ...])

Generate a dot plot, filled with a heatmap of individual cells' gene expression, to compare two conditions.

dot_heatmap_split_greyscale(adata, genes, ...)

Generate a dot plot, filled with a heatmap of individual cells' gene expression, to compare two conditions (greyscale).

update_qualitative_palette(adata, palette[, ...])

Update adata object such that the umap will adhere to the palette provided.

nomenclature_network(config_file[, ...])

Plot a nomenclature network based on annotation config file.

riverplot_2categories(adata, categories[, ...])

Generate a riverplot/Sankey diagram between two categories.

flex_dotplot(df, X, Y, HUE, SIZE, title[, ...])

Generate a dot plot showing average expression and fraction positive cells

tools

count_occurrence(adata[, count_variable, ...])

Generate dataframe containing the label counts/percentages of a specific column in adata.obs
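
The label counts and percentages described above amount to a frequency table over one annotation column. A minimal plain-Python sketch follows, assuming the column is available as a list of labels; label_occurrence is a hypothetical name and the result is a dict rather than the dataframe count_occurrence returns.

```python
from collections import Counter


def label_occurrence(labels):
    """Return {label: {"count": n, "percentage": p}} for a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return {
        label: {"count": c, "percentage": 100 * c / n}
        for label, c in counts.items()
    }


# Hypothetical cell type annotation for four cells.
labels = ["T cell", "B cell", "T cell", "NK cell"]
occurrence = label_occurrence(labels)
```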

count_occurrence_subset(adata, subset_variable)

count occurrence of a label in adata.obs after subsetting adata object

count_occurrence_subset_conditions(adata, ...)

count occurrence of a label for each condition in adata.obs after subsetting adata object

annotate_cells_clustering(adata, ...[, ...])

Function to add annotation to adata.obs based on clustering. This function replaces the original cluster labels located in the column clustering_label with the new values specified in the list new_cluster_labels.

report(adata_pred, celltype, method, ...[, ...])

reports basic metrics, produces confusion matrices and plots umap of prediction

plot_confusion_matrix(y_true, y_pred, ...[, ...])

plots confusion matrices
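
The counts behind a confusion matrix can be computed in a few lines. The sketch below is illustrative only: confusion_counts is a hypothetical helper, and it assumes the common convention of true labels on rows and predicted labels on columns, which may differ from the orientation used in the plot.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts cells of true label i predicted as j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


# Hypothetical true and predicted cell types.
y_true = ["T", "T", "B", "B", "NK"]
y_pred = ["T", "B", "B", "B", "T"]
matrix = confusion_counts(y_true, y_pred, labels=["T", "B", "NK"])
```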

toolkits

batch correction

Collection of functions to perform batch correction.

batch_correct(adata, batch_to_correct)

function to perform batch correction

postprocess_mnnpy(adata, bdata)

postprocessing to generate a newly functional AnnData object

differential gene expression

Collection of functions to aid in differential gene expression analysis.

perform_dge(adata, design_matrix, ...[, ...])

Perform differential gene expression between two conditions over many adata subsets.

plot_interactive_volcano(top_table_path, outdir)

plot an interactive volcano plot based on toptable file.

get_de(adata, mygroup[, demethod, topnr, ...])

Get a table of significant DE genes at certain cutoffs. Based on an AnnData object and an annotation category (e.g.

signature scoring

Collection of functions to aid in signature scoring.

combined_signature_score(adata[, GMT_file, ...])

Super Wrapper function to compute combined signature score for UP and DN scores.

compute_signed_score(adata, signature_dict)

Compute signed score combining UP and DN for all signatures in signature_dict. This function combines geneset (signature) scores.

filter_by_set(strs, universe_set)

Remove strings from the list that are not in the universe set

filter_siggenes(adata, signature_dict)

Filter all signatures in signature_dict to remove genes not present in adata
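
The two filtering helpers above boil down to a membership test against the genes present in the dataset. A minimal sketch, assuming signature_dict maps each signature name to a flat list of gene names (the structure besca uses may be nested); keep_in_universe and filter_signatures are hypothetical names.

```python
def keep_in_universe(strs, universe_set):
    """Keep only the strings that occur in universe_set, preserving order."""
    return [s for s in strs if s in universe_set]


def filter_signatures(signature_dict, gene_names):
    """Apply keep_in_universe to every signature in the dictionary."""
    universe = set(gene_names)
    return {
        name: keep_in_universe(genes, universe)
        for name, genes in signature_dict.items()
    }


# Hypothetical signatures and the genes measured in a dataset.
signatures = {"Tcell": ["CD3E", "CD3D", "FOO1"], "Bcell": ["CD19", "MS4A1"]}
measured = ["CD3E", "CD3D", "MS4A1", "NKG7"]
filtered = filter_signatures(signatures, measured)
```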

convert_siggenes(signature_dict, conversion)

Convert signature genes with an ortholog conversion Series

read_GMT_sign(GMT_file[, UP_suffix, ...])

Read gmt file to extract signed genesets.
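
A GMT line holds a set name, a description and the member genes, separated by tabs. The sketch below shows one way signed genesets could be extracted from such lines; it is not besca's parser. The name parse_signed_gmt, the nested {"UP": [...], "DN": [...]} output and the rule that unsuffixed sets count as UP are assumptions for illustration, with the "_UP"/"_DN" defaults taken from the suffix parameters documented elsewhere in this reference.

```python
def parse_signed_gmt(lines, up_suffix="_UP", dn_suffix="_DN"):
    """Group GMT lines into {signature: {"UP": genes, "DN": genes}}."""
    signatures = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines without any genes
        name, genes = fields[0], [g for g in fields[2:] if g]
        if name.endswith(dn_suffix):
            base, direction = name[: -len(dn_suffix)], "DN"
        elif name.endswith(up_suffix):
            base, direction = name[: -len(up_suffix)], "UP"
        else:
            base, direction = name, "UP"  # assumption: undirected sets count as UP
        signatures.setdefault(base, {"UP": [], "DN": []})[direction].extend(genes)
    return signatures


# Hypothetical GMT content.
gmt_lines = [
    "Tcell_UP\tT cell markers\tCD3E\tCD3D",
    "Tcell_DN\tT cell negative markers\tCD19",
    "Cycle\tcell cycle\tMKI67",
]
signed = parse_signed_gmt(gmt_lines)
```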

getset(df, signame_complete, threshold)

Handles missing signatures; auxiliary function for make_anno. Based on a dataframe of p-values, a signature name and a cutoff, checks whether the signature is present.
:param df: a dataframe of p-values per signature and cell cluster
:type df: pandas.DataFrame
:param signame_complete: signature name
:type signame_complete: str
:param threshold: cutoff used for cluster attributions
:type threshold: numpy.float64

score_mw(f, mymarkers)

Score clusters based on a set of immune signatures to generate a df of p-values. Takes as input a dataframe with fractions per cluster and a dictionary of signatures. Performs a Mann-Whitney test for each column and signature; returns -10logpValues.

add_anno(adata, cnames, mycol[, clusters])

Adds annotation generated with make_anno to an AnnData object. Takes as input the AnnData object to which annotation will be appended and the annotation pd. Generates a pd.Series that can be added as an adata.obs column with annotation at a level

make_anno(df, sigscores, sigconfig, levsk[, ...])

Annotate cell types. Based on a dataframe of -log10pvals, a cutoff and a signature set, generate cell annotation. Hierarchical model of immune cell annotation.

read_annotconfig(configfile)

Reads the configuration file for signature-based hierarchical annotation.

match_cluster(adata, obsquery, obsqueryval)

Matches categories from adata.obs to each other.

obtain_new_label(nomenclature_file, cnames)

Matches the cnames obtained by the make_anno function or a list of label names to the db label (standardized label from a nomenclature file).

obtain_dblabel(nomenclature_file, cnames[, ...])

Matches the cnames obtained by the make_anno function to the db label (standardized label).

get_gems(setName, BASE_URL[, UP_suffix, ...])

Connect to GeMS, download the related genesets (specified by setName, which can be a prefix/suffix) and return them. This function combines the UP and DN genes of genesets (signatures). Non-directional genesets are considered UP by default.
:param setName: setName to find in GeMS (can be a subset)
:type setName: str
:param BASE_URL: GeMS url for the api. Should look like: 'http://' + hostname + ':' + localport
:type BASE_URL: str
:param UP_suffix: suffix indicating that the signature is in the UP direction. This should be the end of the signature names ($)
:type UP_suffix: str | default = "_UP"
:param DN_suffix: suffix indicating that the signature is in the DN direction. This should be the end of the signature names ($)
:type DN_suffix: str | default = "_DN"

insert_gems(BASE_URL, genesets, params[, ...])

Insert genesets into the local GeMS server. url_host will depend on the GeMS deployment and could be stored in credential files.
:param BASE_URL: a string 'http://' + hostname + ':' + localport
:type BASE_URL: str
:param genesets: a list of dicts; each dict is a signature whose keys should match the headers
:type genesets: list
:param params: the command-line arguments for GMTx file upload (see below) based on the GeMS structure
:type params: list of strings
:param headers: each element is a key of the GeMS setup in place. The minimal requirement for a geneset is setName, desc and genes (minimal GMT)
:type headers: list of strings

get_similar_geneset(request, BASE_URL[, ...])

Encapsulates a small similarity search. Looks for similarity within GeMS and the MongoDB collections and returns the associated genesets.
:param request: request specificity; if the hosted collection is large, one might need to specify the geneset in more detail
:type request: string
:param BASE_URL: GeMS url for the api. Should look like: 'http://' + hostname + ':' + localport
:type BASE_URL: str
:param UP_suffix: suffix indicating that the signature is in the UP direction. This should be the end of the signature names ($)
:type UP_suffix: str | default = "_UP"
:param DN_suffix: suffix indicating that the signature is in the DN direction. This should be the end of the signature names ($)
:type DN_suffix: str | default = "_DN"

export_annotconfig(sigconfig, levsk, ...[, ...])

Export the configuration defined in sigconfig and levsk. The order might change compared to the original sig.

convert_to_directed(signature_dict[, direction])

Convert a simple dictionary into one with direction compatible with combined_signature_score
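
The conversion above can be pictured as wrapping each gene list in a direction-keyed dictionary. The sketch below assumes the directed format is {signature: {"UP": [...], "DN": [...]}}; that structure and the name to_directed are assumptions for illustration and should be checked against what combined_signature_score actually expects.

```python
def to_directed(signature_dict, direction="UP"):
    """Wrap each flat gene list as {"UP": [...], "DN": [...]} with one side empty."""
    if direction not in ("UP", "DN"):
        raise ValueError("direction must be 'UP' or 'DN'")
    other = "DN" if direction == "UP" else "UP"
    return {
        name: {direction: list(genes), other: []}
        for name, genes in signature_dict.items()
    }


# Hypothetical undirected signatures.
simple = {"Tcell": ["CD3E", "CD3D"], "Bcell": ["CD19"]}
directed = to_directed(simple)
```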

make_gmtx(setName, desc, User, Source, ...)

Construct a gmtx file according to format conventions for import into GeMS.
:param setName: informative set name, e.g. Pembro_induced_MC38CD8Tcell, Plasma_mdb, TGFB_Stromal_i
:type setName: str
:param desc: informative and verbose signature description; for cell type signatures use nomenclature; if a coef is used, explain what it represents; link to the study if present; e.g. Genes higher expressed in Pembro vs. vehicle in non-naive CD8-positive T cells in MC38 in vivo exp. ID time T2; coefs are log2FC
:type desc: str
:param User: related to signature origin, e.g. Public (for literature-derived sets), own user ID for analysis-derived sets, rtsquad, scsquad, gred, other
:type User: str
:param Source: source of the signature, one of Literature scseq, Literature, besca, scseqmongodb, internal scseq, pRED, Chugai, gRED, other
:type Source: str
:param Subtype: specific subtype, e.g. onc, all, healthy, disease
:type Subtype: str
:param domain: one of pathway, biological process, cellular component, molecular function, phenotype, perturbation, disease, misc, microRNA targets, transcription factor targets, cell marker, tissue marker
:type domain: str
:param genesetname: shared across different signatures of a specific type, e.g. besca_marker, dblabel_marker, Pembro_induced_MC38CD8Tcell, FirstAuthorYearPublication
:type genesetname: str
:param genes: tab-separated list of genes with/without a coefficient, e.g. Vim | 2.4 Bin1 | 2.02 or Vim Bin1
:type genes: str
:param studyID: study name as in scMongoDB/bescaviz; only when source=internal scseq
:type studyID: str | default = None
:param analysisID: analysis name as in scMongoDB/bescaviz; only when source=internal scseq
:type analysisID: str | default = None
:param application: specify which application will read the geneset, e.g. rtbeda_CIT, bescaviz, celltypeviz
:type application: str | default = None
:param celltype: for cell markers, specify celltype according to dblabel_short convention to facilitate reuse
:type celltype: str | default = None
:param coef_type: specify what the coefficient corresponds to, e.g. logFC, gini, SAM, score, ...
:type coef_type: str | default = score

write_gmtx_forgems(signature_dict, GMT_file)

Writes a gmtx file that can later be uploaded to GeMS.

silhouette_computation(adata[, cluster, ...])

Compute the average and per-cell (i.e. per-sample) silhouette score for the cluster label (should be present in adata.obs) (level 3 annotation), the computed level 2 annotation and a random cell assignment.

match_label(vector_label, nomenclature_file)

Return a table matching values in vector label.

reclustering

Collection of functions to perform reclustering on selected subclusters.

recluster(adata, celltype[, celltype_label, ...])

Perform subclustering on specific celltype to identify subclusters.

annotate_new_cellnames(adata, ...[, ...])

annotate new cellnames to each of the subclusters identified by running recluster.

auto-annot

Collection of functions to perform auto-annot: annotating a single-cell dataset based on a reference one.

read_data(train_paths, train_datasets, ...)

Function to read in training and testing datasets

read_raw(train_paths, train_datasets, ...)

read from adata.raw and revert log1p normalization
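
Reverting a log1p transform means applying its inverse, expm1, to every value. The round trip can be sketched in plain Python as follows; revert_log1p is a hypothetical helper and operates on a flat list rather than on adata.raw.

```python
import math


def revert_log1p(values):
    """Invert log1p element-wise: expm1(log1p(x)) == x up to rounding."""
    return [math.expm1(v) for v in values]


# Hypothetical linear-scale values, their log1p transform, and the recovery.
linear = [0.0, 1.0, 9.0]
logged = [math.log1p(v) for v in linear]
recovered = revert_log1p(logged)
```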

read_adata(train_paths, train_datasets, ...)

read adata files of training and testing datasets

merge_data(adata_trains, adata_pred[, ...])

read adata files of training and testing datasets

naive_merge(adata_trains)

concatenates training anndata objects

scanorama_merge(adata_trains, adata_pred, ...)

corrects datasets using scanorama and merge training datasets subsequently

remove_genes(adata_trains, adata_pred, ...)

removes all genes not in gene set

intersect_genes(adata_train, adata_pred)

removes all genes not in all datasets
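
Restricting datasets to their shared genes is a set intersection that keeps a stable gene order. A minimal sketch on plain gene-name lists; shared_genes is a hypothetical helper, and the choice to follow the order of the first list is an assumption rather than documented behavior of intersect_genes.

```python
def shared_genes(*gene_lists):
    """Return genes present in every list, in the order of the first list."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [gene for gene in gene_lists[0] if gene in common]


# Hypothetical gene names of a training and a prediction dataset.
train_genes = ["CD3E", "CD19", "MS4A1", "NKG7"]
pred_genes = ["NKG7", "CD3E", "GNLY"]
common_genes = shared_genes(train_genes, pred_genes)
```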

remove_nonshared(adata_train, adata_pred[, ...])

removes all celltypes not in all datasets

fit(adata_train, method, celltype[, njobs, ...])

fits classifier on training dataset

linear_svm(train, y_train)

fits linear svm on training dataset

rbf_svm(train, y_train)

fits radial basis function kernel svm on training dataset

sgd_svm(train, y_train)

fits linear svm on training dataset using stochastic gradient descent

random_forest(train, y_train, njobs)

fits a random forest of a thousand estimators with balanced class weights on training dataset.

logistic_regression(train, y_train, njobs)

multiclass crossvalidated logistic regression with balanced class weight.

logistic_regression_ovr(train, y_train, njobs)

multiclass crossvalidated logistic regression with balanced class weight.

logistic_regression_elastic(train, y_train, ...)

multiclass crossvalidated logistic regression with balanced class weight.

adata_predict(classifier, scaler, ...[, ...])

predicts on testing set using trained classifier

predict(classifier, scaler, adata_pred[, ...])

predicts on testing set using trained classifier

adata_pred_prob(classifier, scaler, ...[, ...])

predicts on testing set using trained classifier and returns class probability for every cell and every class

predict_proba(classifier, scaler, adata_pred)

predicts on testing set using trained classifier and returns probabilities

report(adata_pred, celltype, method, ...[, ...])

reports basic metrics, produces confusion matrices and plots umap of prediction. Writes out a csv file containing all accuracy and f1 scores.

scanvi_predict(adata_trains, adata_pred, ...)

merges all datasets and predicts on testing set with scANVI.

scvi_merge(adata_trains, adata_pred)

merges all datasets and stores learnt representation in obsm

visualise_scvi_merge(adata_concat)

plots a umap of all merged datasets coloured by dataset of origin.

Import

read_mtx(filepath[, annotation, use_genes, ...])

Read matrix.mtx, genes.tsv, barcodes.tsv to AnnData object. By specifying an input folder, this function reads the contained matrix.mtx, genes.tsv and barcodes.tsv files to an AnnData object. If annotation = True, it also adds the annotation contained in metadata.tsv to the object.
:param filepath: filepath as string to the directory containing the matrix.mtx, genes.tsv, barcodes.tsv and, if applicable, metadata.tsv
:type filepath: str
:param annotation: boolean identifier indicating whether an annotation file is also located in the folder and should be added to the AnnData object
:type annotation: bool (default = True)
:param use_genes: either SYMBOL or ENSEMBL. Other gene names are not yet supported.
:type use_genes: str
:param species: string specifying the species; only needs to be used when no gene symbols are supplied and you only have the ENSEMBL gene ids to perform a lookup
:type species: str | default = 'human'
:param citeseq: string indicating whether only gene expression values ('gex_only') or only protein expression values ('citeseq_only') are read; everything is read if None is specified
:type citeseq: 'gex_only' or 'citeseq_only' or False or None | default = None

add_cell_labeling(adata, filepath[, label])

add a labeling written out in the FAIR formatting to adata.obs

assert_adata(adata[, attempFix])

Asserts that an adata object contains the information needed for the besca pipeline to run and export information.

export

X_to_mtx(adata[, outpath, write_metadata, ...])

export adata object to mtx format (matrix.mtx, genes.tsv, barcodes.tsv)
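
The matrix.mtx file uses the Matrix Market coordinate format: a header line, a size line with rows, columns and number of non-zero entries, then one 1-based "row column value" line per non-zero entry. The sketch below produces those lines for a small integer matrix; to_matrix_market is a hypothetical helper, and which axis holds genes versus cells in besca's export is not specified here.

```python
def to_matrix_market(matrix):
    """Return the lines of a Matrix Market coordinate file for an integer matrix."""
    entries = [
        (i + 1, j + 1, value)  # Matrix Market indices are 1-based
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{len(matrix)} {len(matrix[0])} {len(entries)}",
    ]
    lines += [f"{i} {j} {value}" for i, j, value in entries]
    return lines


# Hypothetical 2 x 2 count matrix with two non-zero entries.
mtx_lines = to_matrix_market([[0, 2], [1, 0]])
```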

raw_to_mtx(adata[, outpath, write_metadata, ...])

export adata.raw to .mtx (matrix.mtx, genes.tsv, barcodes.tsv)

clustering(adata[, outpath, export_average, ...])

export mapping of cells to clusters to .tsv file

write_labeling_to_files(adata[, outpath, ...])

export mapping of cells to specified label to .tsv file

labeling_info([outpath, description, ...])

write out labeling info for uploading to database

analysis_metadata(adata[, outpath, ...])

export plotting coordinates to analysis_metadata.tsv

generate_gep(adata[, filename, column, ...])

Generate Gene Expression Profile (GEP) from scRNA-seq annotations

ranked_genes(adata[, type, outpath, ...])

export marker genes for each cluster to .gct file

pseudobulk(adata[, outpath, column, label, ...])

export pseudobulk profiles of cells to .gct files
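
A pseudobulk profile aggregates the cells of each group into one expression vector. The sketch below assumes aggregation by summing raw counts per label, which is a common choice but not confirmed to be what pseudobulk does; pseudobulk_sum and the toy data are hypothetical.

```python
def pseudobulk_sum(counts, labels):
    """Sum per-cell count rows within each label to get one profile per group."""
    profiles = {}
    for row, label in zip(counts, labels):
        if label not in profiles:
            profiles[label] = [0] * len(row)
        profiles[label] = [a + b for a, b in zip(profiles[label], row)]
    return profiles


# Hypothetical counts for three cells (rows) and two genes (columns).
counts = [[1, 0], [2, 3], [0, 5]]
labels = ["T", "T", "B"]
profiles = pseudobulk_sum(counts, labels)
```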

standardworkflow

read_matrix(root_path[, citeseq, ...])

Read matrix file as expected for the standard workflow.

filtering_cells_genes_min(adata, ...)

filtering_mito_genes_max(adata, ...)

export_cp10k(adata, basepath)

Export raw cp10k to FAIR format for loading into database
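
cp10k stands for counts per 10,000: each cell's counts are rescaled so that they sum to 10,000. A minimal sketch of that scaling follows; the function name, the target argument and the handling of empty cells (left at zero) are illustrative assumptions, not besca's implementation.

```python
def cp10k(counts, target=10_000):
    """Scale each cell (row) so its counts sum to target; empty cells stay zero."""
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target / total if total else 0.0
        normalized.append([value * scale for value in row])
    return normalized


# Hypothetical counts for three cells, one of them empty.
raw = [[1, 3], [0, 0], [5, 5]]
scaled = cp10k(raw)
```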

export_regressedOut(adata, basepath)

Export regressedOut to FAIR format for loading into database

export_clustering(adata, basepath, method)

Export cluster to cell mapping to FAIR format for loading into database

export_metadata(adata, basepath[, n_pcs, ...])

Export metadata in FAIR format for loading into database

export_rank(adata, basepath[, type, ...])

Export ranked genes to FAIR format for loading into database

export_celltype(adata, basepath)

Export celltype annotation to cell mapping in FAIR format for loading into database

additional_labeling(adata, labeling_to_use, ...)

Standard Workflow function to export an additional labeling besides louvain to FAIR format.

celltype_labeling(adata, labeling_author, ...)

Standard Workflow function to export an additional labeling besides louvain to FAIR format.