besca
helper functions
|
Extract the AnnData object saved in adata.raw |
|
Subset AnnData object into new object |
|
Convert ENSEMBL gene IDs to SYMBOLS. Uses the Python package mygene to look up the supplied list of ENSEMBL IDs and return the equivalent list of symbols. |
|
Convert SYMBOLS to ENSEMBL gene IDs. Uses the Python package mygene to look up the supplied list of SYMBOLS and return the equivalent list of ENSEMBL gene IDs. |
|
Extract the AnnData object saved in adata.raw |
|
Calculates average and fraction expression per category in adata.obs |
|
Calculates average and fraction expression per category in adata.obs. Based on the arithmetic mean expression and the fraction of cells expressing the gene per category (works on linear scale). |
|
Concatenate two adata objects based on the observations |
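The per-category summary described above (arithmetic mean expression and fraction of expressing cells, on linear scale) can be illustrated without any single-cell library. This is a minimal sketch, not besca's implementation: the function name `mean_and_fraction_per_category` and the toy data are hypothetical, and a cell is assumed to "express" a gene when its value is greater than zero.

```python
from collections import defaultdict


def mean_and_fraction_per_category(expression, categories):
    """Return {category: (mean expression, fraction of cells with value > 0)}.

    Hypothetical helper for illustration; values are assumed to be on linear scale.
    """
    grouped = defaultdict(list)
    for value, category in zip(expression, categories):
        grouped[category].append(value)

    summary = {}
    for category, values in grouped.items():
        mean_expr = sum(values) / len(values)
        frac_pos = sum(1 for v in values if v > 0) / len(values)
        summary[category] = (mean_expr, frac_pos)
    return summary


# Toy data: linear-scale expression of one gene in six cells.
expression = [0.0, 2.0, 4.0, 0.0, 0.0, 3.0]
categories = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
summary = mean_and_fraction_per_category(expression, categories)
```

In a real analysis the category labels would come from a column of adata.obs and the values from the expression matrix.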
preprocessing
|
Filter cell outliers based on counts, numbers of genes expressed, number of cells expressing a gene and mitochondrial gene content. |
|
Function to remove all genes specified in a gene list read from file. |
|
Calculate the fraction of cells positive for expression of a gene. |
|
Calculate the fraction of reads attributed to a specific gene. |
|
Calculate the mean expression of a gene. |
|
Give out the genes most frequently expressed in cells. |
|
Function to calculate fraction of counts per cell from a gene list. |
|
Give out the genes that contribute the largest fraction to the total UMI counts. |
|
Perform geometric normalization on CITEseq data. |
|
Estimates and returns the thresholds to use for gene/cell filtering based on outliers calculated from the deviation to the median QCs. |
|
Function to call scTransform normalization or HVG selection from Python. |
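The threshold estimation entry above mentions outliers calculated from the deviation to the median of the QC metrics. One common way to do this is with the median absolute deviation (MAD). The sketch below assumes that approach; the function name `mad_thresholds` and the `n_mads` parameter are hypothetical and do not reflect besca's actual signature.

```python
from statistics import median


def mad_thresholds(values, n_mads=3.0):
    """Return (lower, upper) cutoffs at median -/+ n_mads * MAD.

    Assumption: "deviation to the median" is taken to be the median absolute deviation.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return center - n_mads * mad, center + n_mads * mad


# Toy QC metric: number of detected genes per cell, with one obvious outlier.
genes_per_cell = [900, 1000, 1100, 1000, 950, 1050, 5000]
lower, upper = mad_thresholds(genes_per_cell)

# Cells inside the estimated range are kept.
kept = [v for v in genes_per_cell if lower <= v <= upper]
```

The same helper could be applied separately to each QC metric (counts, genes, mitochondrial fraction) to obtain one pair of cutoffs per metric.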
plotting
|
visualize the minimum gene per cell threshold. |
|
visualize the minimum UMI counts per cell threshold. |
|
visualize the minimum number of cells expressing a gene threshold. |
|
visualize maximum UMI counts per cell threshold. |
|
visualize maximum number of genes per cell threshold. |
|
visualize maximum mitochondrial gene percentage threshold. |
|
Plot number of dropouts. |
|
Plot number of detected genes. |
|
Plot library size. |
|
Generates an overview figure of library size, dropouts and detected genes. |
|
Plot total gene counts vs detection probability. |
|
plot top n genes that contribute to fraction of counts per cell |
|
visualize gene expression of two groups as a split violin plot |
|
Stacked violin plot for visualization of genes expression. |
|
plot boxplot with values per individual. |
|
plot stacked split violin plots. |
|
generate a box and whisker plot with overlayed swarm plot of celltype abundances |
|
Generate a stacked bar plot of the percentage of labelcounts within each AnnData subset |
|
Generate a dot plot, filled with a heatmap of individual cells' gene expression. |
|
Generate a dot plot, filled with a heatmap of individual cells' gene expression, to compare two conditions. |
|
Generate a dot plot, filled with a heatmap of individual cells' gene expression, to compare two conditions (greyscale). |
|
Update adata object such that the umap will adhere to the palette provided. |
|
Plot a nomenclature network based on annotation config file. |
|
Generate a riverplot/Sankey diagram between two categories. |
|
Generate a dot plot showing average expression and fraction positive cells |
tools
|
Generate dataframe containing the label counts/percentages of a specific column in adata.obs |
|
count occurrence of a label in adata.obs after subsetting adata object |
|
count occurrence of a label for each condition in adata.obs after subsetting adata object |
|
Function to add annotation to adata.obs based on clustering. This function replaces the original cluster labels located in the column clustering_label with the new values specified in the list new_cluster_lables. |
|
reports basic metrics, produces confusion matrices and plots umap of prediction |
|
plots confusion matrices |
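The label counting entries above boil down to tallying how often each label occurs and expressing that as a percentage. A minimal sketch of the idea follows; the function name `label_counts` and the output layout are hypothetical, and a plain list stands in for a column of adata.obs.

```python
from collections import Counter


def label_counts(labels):
    """Return {label: {"count": n, "percentage": share of all cells in percent}}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: {"count": n, "percentage": 100.0 * n / total}
        for label, n in counts.items()
    }


# Toy labels standing in for one column of adata.obs.
labels = ["T cell", "T cell", "T cell", "B cell"]
table = label_counts(labels)
```

Counting per condition would apply the same tally to each subset of cells in turn.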
toolkits
batch correction
Collection of functions to perform batch correction.
|
function to perform batch correction |
|
postprocessing to generate a newly functional AnnData object |
differential gene expression
Collection of functions to aid in differential gene expression analysis.
|
Perform differential gene expression between two conditions over many adata subsets. |
|
plot an interactive volcano plot based on toptable file. |
|
Get a table of significant DE genes at certain cutoffs. Based on an AnnData object and an annotation category (e.g. |
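Selecting significant genes "at certain cutoffs" usually means filtering a result table by adjusted p-value and absolute log fold change. The sketch below shows that filter on a plain list of records. The column names `gene`, `logFC` and `padj`, the default cutoffs, and the function name `significant_genes` are all assumptions made for illustration, not besca's API.

```python
def significant_genes(table, pval_cutoff=0.05, logfc_cutoff=1.0):
    """Keep rows with padj <= pval_cutoff and |logFC| >= logfc_cutoff."""
    return [
        row
        for row in table
        if row["padj"] <= pval_cutoff and abs(row["logFC"]) >= logfc_cutoff
    ]


# Toy differential expression results (hypothetical genes and values).
de_table = [
    {"gene": "GeneA", "logFC": 2.5, "padj": 0.001},
    {"gene": "GeneB", "logFC": 0.3, "padj": 0.0001},
    {"gene": "GeneC", "logFC": -1.8, "padj": 0.02},
    {"gene": "GeneD", "logFC": 3.0, "padj": 0.2},
]
hits = [row["gene"] for row in significant_genes(de_table)]
```

Taking the absolute value of the fold change keeps both up- and down-regulated genes.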
signature scoring
Collection of functions to aid in signature scoring.
|
Super Wrapper function to compute combined signature score for UP and DN scores. |
|
Compute signed score combining UP and DN for all signatures in signature_dict. This function combines geneset (signature) scores. |
|
Remove strings from the list that are not in the universe set |
|
Filter all signatures in signature_dict to remove genes not present in adata |
|
Convert signature genes with an ortholog conversion Series |
|
Read gmt file to extract signed genesets. |
|
Handles missing signatures; auxiliary function for make_anno. Based on a dataframe of p-values, a signature name and a cutoff, checks whether the signature is present. :param df: a dataframe of p-values per signature and cell cluster :type df: pandas.DataFrame :param signame_complete: signature name :type signame_complete: str :param threshold: cutoff used for cluster attributions :type threshold: numpy.float64 |
|
Score clusters based on a set of immune signatures to generate a df of pvals. Takes as input a dataframe with fractions per cluster and a dictionary of signatures. Performs a Mann-Whitney test for each column and signature and returns -log10 p-values. |
|
Adds annotation generated with make_anno to an AnnData object. Takes as input the AnnData object to which the annotation will be appended and the annotation pd. Generates a pd.Series that can be added as an adata.obs column with annotation at a level |
|
Annotate cell types. Based on a dataframe of -log10pvals, a cutoff and a signature set, generate cell annotation. Hierarchical model of immune cell annotation. |
|
Reads the configuration file for signature-based hierarchical annotation. |
|
Matches categories from adata.obs to each other. |
|
Matches the cnames obtained by the make_annot function or a list of label names to the db label (standardized label from a nomenclature file). |
|
Matches the cnames obtained by the make_annot function to the db label (standardized label). |
|
Connect to GeMS, download the related genesets (specified by setName, which can be a prefix/suffix) and return them. This function combines the UP and DN genes of genesets (signatures). Non-directional genesets are considered UP by default. :param setName: setName to find in GeMS (can be a subset) :type setName: str :param BASE_URL: GeMS URL for the API. Should look like: 'http://' + hostname + ':' + localport :type BASE_URL: str :param UP_suffix: suffix indicating that the signature is in the UP direction. This should be the end of the signature name ($) :type UP_suffix: str | default = "_UP" :param DN_suffix: suffix indicating that the signature is in the DN direction. This should be the end of the signature name ($) :type DN_suffix: str | default = "_DN". |
|
Insert genesets into the local GeMS server. url_host will depend on the GeMS deployment and could be stored in credential files. :param BASE_URL: a string 'http://' + hostname + ':' + localport :type BASE_URL: class:str :param genesets: a list of dicts; each dict is a signature; keys should map to the headers :type genesets: list :param params: the command-line arguments for GMTx file upload (see below) based on GeMS structure :type params: list of strings. :param headers: each element is a key of the GeMS setup in place. Minimal requirement for a geneset would be setName, desc and genes (minimal GMT) :type headers: list of string. |
|
Encapsulates a small similarity search. Looks for similarity within GeMS and the MongoDB collections and returns the associated genesets. :param request: request specificity; if the hosted collection is large, one might need to specify the geneset in more detail. :type request: string :param BASE_URL: GeMS URL for the API. Should look like: 'http://' + hostname + ':' + localport :type BASE_URL: str :param UP_suffix: suffix indicating that the signature is in the UP direction. This should be the end of the signature name ($) :type UP_suffix: str | default = "_UP" :param DN_suffix: suffix indicating that the signature is in the DN direction. This should be the end of the signature name ($) :type DN_suffix: str | default = "_DN". |
|
Export the configuration defined in sigconfig and levsk. The order might change compared to the original sig. |
|
Convert a simple dictionary into one with direction compatible with combined_signature_score |
|
Construct a gmtx file according to format conventions for import into GeMS. :param setName: informative set name e.g. Pembro_induced_MC38CD8Tcell, Plasma_mdb, TGFB_Stromal_i :type setName: str :param desc: informative and verbose signature description; for cell type signatures use nomenclature, if coef used explain what it represents; link to study if present; e.g. Genes higher expressed in Pembro vs. vehicle in non-naive CD8-positive T cells in MC38 in vivo exp. ID time T2; coefs are log2FC :type desc: str :param User: related to signature origin e.g. Public (for literature-derived sets), own user ID for analysis-derived sets, rtsquad, scsquad, gred, other :type User: str :param Source: source of the signature, one of Literature scseq, Literature, besca, scseqmongodb, internal scseq, pRED, Chugai, gRED, other :type Source: str :param Subtype: specific subtype e.g. onc, all, healthy, disease :type Subtype: str :param domain: one of pathway, biological process, cellular component, molecular function, phenotype, perturbation, disease, misc, microRNA targets, transcription factor targets, cell marker, tissue marker :type domain: str :param genesetname: shared across different signatures of a specific type e.g. besca_marker, dblabel_marker, Pembro_induced_MC38CD8Tcell, FirstAuthorYearPublication :type genesetname: str :param genes: tab-separated list of genes with/without a coefficient e.g. Vim | 2.4 Bin1 | 2.02 or Vim Bin1 :type genes: str :param studyID: study name as in scMongoDB/bescaviz; only when source=internal scseq :type studyID: str | default = None :param analysisID: analysis name as in scMongoDB/bescaviz; only when source=internal scseq :type analysisID: str | default = None :param application: specify which application will read the geneset e.g. rtbeda_CIT, bescaviz, celltypeviz :type application: str | default = None :param celltype: for cell markers, specify celltype according to dblabel_short convention to facilitate reuse :type celltype: str | default = None :param coef_type: specify what the coefficient corresponds to, e.g. logFC, gini, SAM, score, ... :type coef_type: str | default = score. |
|
Writes a gmtx file that can later be uploaded to GeMS. |
|
Compute the average and per-cell (i.e. per-sample) silhouette score for the cluster label (should be present in dataobs) (level 3 annotation), the computed level 2 annotation and a random cell assignment. |
|
Return a table matching values in vector label. |
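Several entries above deal with signed genesets: reading them from a GMT file, treating names ending in "_UP" or "_DN" as directional, counting non-directional sets as UP, and dropping genes that are absent from the dataset. The sketch below strings these steps together on in-memory lines. It assumes the standard GMT layout (name, description, then genes, tab-separated); the function names `parse_signed_gmt` and `filter_to_universe` and the nested dictionary layout are hypothetical, not besca's own.

```python
def parse_signed_gmt(lines, up_suffix="_UP", dn_suffix="_DN"):
    """Parse GMT lines into {signature: {"UP": [...], "DN": [...]}}."""
    signatures = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[0]
        genes = [g for g in fields[2:] if g]
        if name.endswith(up_suffix):
            base, direction = name[: -len(up_suffix)], "UP"
        elif name.endswith(dn_suffix):
            base, direction = name[: -len(dn_suffix)], "DN"
        else:
            base, direction = name, "UP"  # non-directional sets count as UP
        entry = signatures.setdefault(base, {"UP": [], "DN": []})
        entry[direction].extend(genes)
    return signatures


def filter_to_universe(signatures, universe):
    """Drop genes that are not in the universe (e.g. the genes of a dataset)."""
    return {
        name: {d: [g for g in genes if g in universe] for d, genes in dirs.items()}
        for name, dirs in signatures.items()
    }


# Toy GMT content; "FAKE1" is deliberately absent from the universe.
gmt_lines = [
    "Tcell_UP\tdemo\tCD3E\tCD3D\tFAKE1",
    "Tcell_DN\tdemo\tCD19",
    "Bcell\tdemo\tCD79A\tMS4A1",
]
parsed = parse_signed_gmt(gmt_lines)
filtered = filter_to_universe(parsed, {"CD3E", "CD3D", "CD19", "CD79A"})
```

Grouping UP and DN under one base name keeps both directions of a signature together for later scoring.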
reclustering
Collection of functions to perform reclustering on selected subclusters.
|
Perform subclustering on specific celltype to identify subclusters. |
|
annotate new cellnames to each of the subclusters identified by running recluster. |
auto-annot
Collection of functions to perform auto-annot: annotating a single-cell dataset based on a reference dataset.
|
Function to read in training and testing datasets |
|
read from adata.raw and revert log1p normalization |
|
read adata files of training and testing datasets |
|
read adata files of training and testing datasets |
|
concatenates training anndata objects |
|
corrects datasets using scanorama and merge training datasets subsequently |
|
removes all genes not in gene set |
|
removes all genes not in all datasets |
|
removes all celltypes not in all datasets |
|
fits classifier on training dataset |
|
fits linear svm on training dataset |
|
fits radial basis function kernel svm on training dataset |
|
fits linear svm on training dataset using stochastic gradient descent |
|
fits a random forest of a thousand estimators with balanced class weights on training dataset. |
|
multiclass crossvalidated logistic regression with balanced class weight. |
|
multiclass crossvalidated logistic regression with balanced class weight. |
|
multiclass crossvalidated logistic regression with balanced class weight. |
|
predicts on testing set using trained classifier |
|
predicts on testing set using trained classifier |
|
predicts on testing set using trained classifier and returns class probability for every cell and every class |
|
predicts on testing set using trained classifier and returns probabilities |
|
reports basic metrics, produces confusion matrices and plots umap of prediction. Writes out a csv file containing all accuracy and f1 scores. |
|
merges all datasets and predicts on testing set with scANVI. |
|
merges all datasets and stores learnt representation in obsm |
|
plots a umap of all merged datasets coloured by dataset of origin. |
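The reporting entry above mentions accuracy and F1 scores for the predicted cell types. The sketch below computes both by hand on toy labels to show what these metrics measure. It is an illustration only: the function names are hypothetical, and a real pipeline would more likely rely on an established metrics library.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy true and predicted cell type labels for six cells.
y_true = ["T cell", "T cell", "T cell", "B cell", "B cell", "NK cell"]
y_pred = ["T cell", "T cell", "B cell", "B cell", "B cell", "T cell"]

acc = accuracy(y_true, y_pred)
f1 = f1_per_class(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)
```

The macro average weights every cell type equally, so a rare type that is never predicted correctly pulls the score down noticeably.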
Import
|
Read matrix.mtx, genes.tsv, barcodes.tsv to AnnData object. By specifying an input folder this function reads the contained matrix.mtx, genes.tsv and barcodes.tsv files to an AnnData object. In case annotation = True it also adds the annotation contained in metadata.tsv to the object. :param filepath: filepath as string to the directory containing the matrix.mtx, genes.tsv, barcodes.tsv and if applicable metadata.tsv :type filepath: str :param annotation: boolean identifier if an annotation file is also located in the folder and should be added to the AnnData object :type annotation: bool (default = True) :param use_genes: either SYMBOL or ENSEMBL. Other gene names are not yet supported. :type use_genes: str :param species: string specifying the species, only needs to be used when no gene symbols are supplied and you only have the ENSEMBL gene IDs to perform a lookup. :type species: str | default = 'human' :param citeseq: string indicating if only gene expression values ('gex_only') or only protein expression values ('citeseq_only') are read; everything is read if None is specified :type citeseq: 'gex_only' or 'citeseq_only' or False or None | default = None. |
|
add a labeling written out in the FAIR formatting to adata.obs |
|
Asserts that an adata object is containing information needed for the besca pipeline to run and export information. |
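To make the matrix.mtx / genes.tsv / barcodes.tsv layout more concrete, the sketch below parses the three files from in-memory lines using only the standard library. It assumes the usual Matrix Market coordinate format (comment lines starting with `%`, a size line, then 1-based `row col value` entries) with genes as rows and barcodes as columns, and a two-column genes.tsv (ID, then symbol). The function name `read_mtx_triplet` is hypothetical, and its `use_genes` argument only mimics the documented option; the toy gene IDs and barcodes are made up.

```python
def read_mtx_triplet(matrix_lines, gene_lines, barcode_lines, use_genes="SYMBOL"):
    """Return (gene names, barcodes, {(barcode, gene): count}) from text lines."""
    column = {"ENSEMBL": 0, "SYMBOL": 1}[use_genes]
    genes = [l.rstrip("\n").split("\t")[column] for l in gene_lines if l.strip()]
    barcodes = [l.strip() for l in barcode_lines if l.strip()]

    # Drop blank lines and '%' header/comment lines; the first remaining line is the size line.
    body = [l for l in matrix_lines if l.strip() and not l.startswith("%")]
    n_rows, n_cols, _n_entries = (int(x) for x in body[0].split())
    if n_rows != len(genes) or n_cols != len(barcodes):
        raise ValueError("matrix dimensions do not match genes/barcodes")

    counts = {}
    for line in body[1:]:
        row, col, value = line.split()
        # Matrix Market indices are 1-based.
        counts[(barcodes[int(col) - 1], genes[int(row) - 1])] = float(value)
    return genes, barcodes, counts


# Toy file contents (hypothetical identifiers).
matrix_lines = [
    "%%MatrixMarket matrix coordinate integer general",
    "2 3 3",
    "1 1 5",
    "2 1 1",
    "1 3 2",
]
gene_lines = ["GENE_ID_1\tGeneA", "GENE_ID_2\tGeneB"]
barcode_lines = ["AAAC-1", "AAAG-1", "AAAT-1"]

genes, barcodes, counts = read_mtx_triplet(matrix_lines, gene_lines, barcode_lines)
```

Only non-zero entries are stored, which is why the sparse dictionary has fewer items than the full genes-by-barcodes grid.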
export
|
export adata object to mtx format (matrix.mtx, genes.tsv, barcodes.tsv) |
|
export adata.raw to .mtx (matrix.mtx, genes.tsv, barcodes.tsv) |
|
export mapping of cells to clusters to .tsv file |
|
export mapping of cells to specified label to .tsv file |
|
write out labeling info for uploading to database |
|
export plotting coordinates to analysis_metadata.tsv |
|
Generate Gene Expression Profile (GEP) from scRNA-seq annotations |
|
export marker genes for each cluster to .gct file |
|
export pseudobulk profiles of cells to .gct files |
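A pseudobulk profile collapses the cells of each group (for example each cell type or sample) into a single expression vector. The sketch below sums raw counts per group; whether the real export sums or averages is not stated above, so summation is an assumption here, and the function name `pseudobulk` is hypothetical.

```python
def pseudobulk(counts_per_cell, groups):
    """Sum per-gene counts over all cells that share a group label.

    Assumption: profiles are built by summation; an average would be an equally
    plausible choice.
    """
    profiles = {}
    for cell_counts, group in zip(counts_per_cell, groups):
        profile = profiles.setdefault(group, [0] * len(cell_counts))
        for i, count in enumerate(cell_counts):
            profile[i] += count
    return profiles


# Toy data: three cells, three genes, two groups.
counts_per_cell = [
    [1, 0, 2],
    [3, 1, 0],
    [0, 4, 1],
]
groups = ["A", "A", "B"]
profiles = pseudobulk(counts_per_cell, groups)
```

Each resulting vector has one entry per gene and could then be written out as one column of a .gct file.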
standardworkflow
|
Read matrix file as expected for the standard workflow. |
|
|
|
|
|
Export raw cp10k to FAIR format for loading into database |
|
Export regressedOut to FAIR format for loading into database |
|
Export cluster to cell mapping to FAIR format for loading into database |
|
Export metadata in FAIR format for loading into database |
|
Export ranked genes to FAIR format for loading into database |
|
Export celltype annotation to cell mapping in FAIR format for loading into database |
|
Standard Workflow function to export an additional labeling besides louvain to FAIR format. |
|
Standard Workflow function to export an additional labeling besides louvain to FAIR format. |