# Tools

## annotate_loci

```
Description:
    This tool takes one input file with genomic coordinates in its first column
    and an additional file with gene locus information. It then tries to annotate
    all genomic regions from the first file, line by line, with the annotations
    from the second file by overlapping the coordinates.

Usage: annotate_loci -i FILE -loci FILE -format gct|topTable

    -i          input file with loci information (required), i.e. the first column
                must contain a coordinate string CHR:BEGIN-END, i.e. separated by
                colon and dash.
                If the input format is gct or topTable then all subsequent columns
                are sent to stdout
    -loci       input file with loci information (required), tab-delimited format:
                CHR BEGIN END STRAND GENE SYMBOL DESCRIPTION
    -format gct|topTable
                input file -i is in gct|topTable format (optional)
    -verbose    show more information (optional)

Report bugs and feedback to roland.schmucki@roche.com
```

## count2tpm

```
Description:
    Calculate normalized read counts for input GCT file.
    Source: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
    Note that NaN values are output as zero (0).

Usage: count2tpm -i GCT-file -l Length-file [-cpm|rpkm|tpm] [-log2|log10] [-col INT] [-digits INT]

Mandatory input parameters:
    -g          GCT file with read counts per gene (unique gene identifier in 1st column)
    -l          tab-delimited file with gene identifier in 1st and gene length in 2nd
                columns, respectively. These files can be found in the corresponding
                genome annotation folders, e.g. for human in folder //hg38/gtf/refseq/

Optional input parameters:
    -tpm        transcripts per million (default)
    -rpkm       reads per kilobase of exon per million reads mapped
    -cpm        counts per million mapped reads
    -log2       log2 transform output (adding 0.01)
    -log10      log10 transform output (adding 0.01)
    -col        if the input length file contains several columns, then specify the
                column number with this index (default last column)
    -digits     number of digits after the decimal point for output (default 3)

Report bugs and feedback to roland.schmucki@roche.com
```

## expression2gct

```
Description:
    Convert biokit expression gene count files into GCT format.

Usage: expression2gct -infile='list of files' -outfile-prefix STRING

Mandatory input parameters:
    -infile:          either a file containing the paths to input expression files
                      OR a list of space/comma-separated files,
                      e.g. -infile='sample1.expression,sample2.expression'
    -outfile-prefix:  there are 2 - 4 output files; already existing files of the
                      same name will be overwritten:
                      STRING_rpkm.gct
                      STRING_count.gct

Optional input parameters:
    -use-unique-counts   (use the unique rpkm/read counts, default is multiple)
    -old-biokit-format   (use if the expression file was generated with Biokit v3.8
                         or earlier; the annotation is in column #7 instead of #8)

Report bugs and feedback to roland.schmucki@roche.com
```

## extract_sequence

```
Description:
    Extract sequences from an input fasta or fastq file by ids given in another
    input file.

Usage: extract_sequence [-verbose] [-delimiter='. TAB'] [-useEntireIdLine] [-quick] [-not] -ids ids_file -fasta|fastq fasta_file

Mandatory parameters:
    -ids               file name containing sequence ids
    -fasta|fastq       file name containing the fasta or fastq sequences

Optional parameters:
    -delimiter         delimiter on the sequence id line
    -useEntireIdLine   use the entire line as id and do not split the line by
                       -delimiter into fields
    -quick             stop search after the first match
    -not               invert the search, i.e. output sequences that are not in the
                       ids file
    -verbose           output additional information

Report bugs and feedback to roland.schmucki@roche.com
```

## make_cls

```
Description:
    Create a phenotype CLS file from a given GCT and annotation file.
    The CLS format is defined here:
    https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Phenotype_Data_Formats

Usage: make_cls -gct FILE -i FILE

Mandatory input parameters:
    -gct GCT_FILE         input file in GCT format
    -i ANNOTATION_FILE    input file with sample annotations

The annotation file is a 2 column tab-delimited file
(comments or header are marked with #)
    column 1: sample name as given in input GCT file
    column 2: sample group

Report bugs and feedback to roland.schmucki@roche.com
```

## make_design_contrast_matrix

```
Description:

Usage: make_design_contrast_matrix [-prefix STRING] -gct FILE -i FILE

Mandatory parameters:
    -gct GCT_FILE         input file in GCT format
    -i ANNOTATION_FILE    input file with sample annotations with 2 columns (see below)

Optional parameters:
    -prefix STRING        a string for the output prefix

The annotation file is a 2 column tab-delimited file
(comments or header are marked with #)
    column 1: sample name as given in input GCT file
    column 2: sample group

Report bugs and feedback to roland.schmucki@roche.com
```

## mean

```
Description:
    Calculate means per sample condition.

Usage: mean [-skip INT] [-gzip] -i INFILE -s SAMPLE_ANNOTATIONS

Mandatory parameters:
    -i FILE    input file with sample data, e.g. read counts
    -s FILE    input file with sample annotations

The SAMPLE_ANNOTATIONS file is a tab-delimited input file with at least 2 columns:
    column 1: sample name
    column 2: sample condition

The read counts from INFILE are averaged (mean) for each sample condition.
The INFILE headers should match the sample names specified in the
SAMPLE_ANNOTATIONS file.
Note that the header line must begin with '#' or with 'ID'

Optional parameters:
    -skip INT  denotes how many columns from the INFILE should be skipped and not
               used for calculation, e.g. skip ID or description columns, default 6
    -gzip      use if INFILE is gzipped

Report bugs and feedback to roland.schmucki@roche.com
```

## merge_fastq

```
Description:
    Merge reads from several fastq files into one fastq file.
    IMPORTANT: input files for mate R1 and R2 reads must be in the same ORDER.

Usage: merge_fastq [-sbatch] [-t INT] [-old-version] [-script-prefix STR] [-bsub-path STR] -i FILE

Mandatory parameters:
    -i input_file     tab-delimited file with 2 columns:
                      input gzipped fastq file, output gzipped fastq file

Optional parameters:
    -sbatch           use "sbatch" for submitting to queue, default is "bsub"
    -t integer        number of minutes for queuing system,
                      default 360 = 6 hours, only for "sbatch"
    -old-version      use old version which is much slower
    -script-prefix    prefix for temp scripts, e.g. path, default ./merge_fastq
    -bsub-path        path to bsub command on the shpc, default bsub

Report bugs and feedback to roland.schmucki@roche.com
```

## merge_gct

```
Usage: merge_gct [-h] FILE1 FILE2 [FILE3 ...]

Merge GCT files

Optional arguments
    -h    display this help and exit

Contact roland.schmucki@roche.com
```

## minmax_gct

```
Filter away all features from a GCT file if the row MIN or MAX is
lower/greater/lower equal (MIN-EQUAL)/greater equal (MAX-EQUAL) than a
user given threshold.
Use MIN-/MAX-REVERSE to output the reversed comparison.
Results are redirected to the standard output.

3 input arguments required:
    1. input GCT file
    2. threshold value (real number)
    3. MIN or MAX or MIN-EQUAL or MAX-EQUAL or MIN-REVERSE or MAX-REVERSE

Contact roland.schmucki@roche.com
```

## parse_gtf

```
Description:
    Parse a GTF file and output attributes to stdout.

Usage: parse_gtf [-h] -gtf INFILE + 1 Optional argument from below

Optional arguments
    -output-fields="gene_name,gene_synonym,product"
        The above example will output all gtf fields named
        "gene_name", "gene_synonym", and "product"

    -refseq
        This option will work on a gtf from refseq and do the following
        0) discard lines that are not exons
        1) get gene number from the field Dbxref "GeneID: and replace the
           input gene_id value with the gene number
        2) remove transcript_id fields if they contain "rnaN" numbers
           where N is an integer
        3) remove version number from transcript_id field
           (e.g. transcript_id "NR_046018.2" --> transcript_id "NR_046018")
        4) remove duplicated transcript_id fields
        Note there will be a warning if there is no proper transcript
        accession id (mostly miRNAs)

Report bugs and feedback to roland.schmucki@roche.com
```

## reorder_gct

```
Usage: reorder_gct [-h] -g GCT_FILE -s SAMPLE_FILE

Re-order samples (columns) in the GCT file by the names given in the SAMPLE file.

    -g    input GCT file
    -s    input SAMPLE file with re-ordered sample names
          (file must contain exactly one column)

Optional arguments
    -h    display this help and exit

Contact roland.schmucki@roche.com
```

## replace_header_gct

```
Usage: replace_header_gct [-h] -g GCT_FILE -s SAMPLE_FILE

Replace the sample names in the GCT file by the names given in the SAMPLE file.

    -g    input GCT file
    -s    input SAMPLE file
          2 columns required:
          1st: sample names in input GCT file
          2nd: sample names in output GCT file

Optional arguments
    -h    display this help and exit

Contact roland.schmucki@roche.com
```

## sort_gct

```
Usage: sort_gct [-h] -g GCT_FILE [-c 1|2] [-n] [-r]

Sorts input GCT file by column 1 (default) or 2 in numeric or alphabetic (default) order

    -g    input GCT file
    -c    column 1 or 2 (default is 1)
    -n    order numerically (default is alphabetical order)
    -r    reverse order

Optional arguments
    -h    display this help and exit

Contact roland.schmucki@roche.com
```

## subset_gct

```
Usage: subset_gct [-h] -g GCT_FILE -k KEYS_FILE -s SAMPLES_FILE

Creates a subset of the input GCT file

    -g    input GCT file
    -k    input KEYS file
          1 column required:
          1st: keys (e.g. Genes) in input GCT file
    -s    input SAMPLES file
          1 column required:
          1st: sample names in input GCT file to output

Optional arguments
    -h    display this help and exit

Contact roland.schmucki@roche.com
```
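
To illustrate what a subset of a GCT file looks like, here is a minimal Python sketch. It is not the implementation of `subset_gct`. It assumes the standard GCT v1.2 layout (a `#1.2` line, a dimensions line, then a header starting with `Name` and `Description`), and the function name `subset_gct_text` is hypothetical. Whether the real tool keeps the column order of the SAMPLES file or of the GCT file is not stated in the help text; this sketch uses the order of the samples list and keeps rows in their GCT order.

```python
def subset_gct_text(gct_text, keys, samples):
    """Return GCT text restricted to the given row keys and sample columns.

    Assumption: GCT v1.2 layout; hypothetical helper, not the tool's code.
    Raises ValueError if a requested sample is not in the GCT header.
    """
    lines = gct_text.strip().split("\n")
    version = lines[0]                     # e.g. "#1.2"
    header = lines[2].split("\t")          # Name, Description, sample names...
    sample_names = header[2:]
    # Column indices of the requested samples, in the requested order.
    keep_cols = [2 + sample_names.index(s) for s in samples]
    wanted = set(keys)

    rows = []
    for line in lines[3:]:
        fields = line.split("\t")
        if fields[0] in wanted:
            rows.append(fields[:2] + [fields[i] for i in keep_cols])

    # The dimensions line must reflect the new number of rows and samples.
    out = [version,
           f"{len(rows)}\t{len(samples)}",
           "\t".join(header[:2] + list(samples))]
    out += ["\t".join(r) for r in rows]
    return "\n".join(out) + "\n"
```

The point worth noting is the second line of the output: a GCT file records its own row and column counts, so any subsetting step has to rewrite that line as well as drop data.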
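
As a note on the `count2tpm` section above, the three normalizations it offers can be sketched in Python for a single sample. This follows the usual definitions of CPM, RPKM and TPM as reviewed in the blog post linked in that section; it is an illustration under those assumptions, not the tool's code. The function name `normalize_counts` is hypothetical, and returning 0 for a zero denominator is an assumption meant to mirror the note that NaN values are output as zero.

```python
def normalize_counts(counts, lengths, method="tpm"):
    """Normalize read counts of one sample.

    counts  : read counts per gene
    lengths : gene lengths in bases, same order as counts
    method  : "tpm", "rpkm" or "cpm"
    Hypothetical helper; zero denominators yield 0.0 (assumption).
    """
    total = sum(counts)
    if method == "cpm":
        # Counts per million mapped reads.
        return [c / total * 1e6 if total else 0.0 for c in counts]
    if method == "rpkm":
        # Reads per kilobase of gene length per million mapped reads.
        return [c * 1e9 / (l * total) if l and total else 0.0
                for c, l in zip(counts, lengths)]
    if method == "tpm":
        # Length-normalized rates, rescaled so they sum to one million.
        rates = [c / l if l else 0.0 for c, l in zip(counts, lengths)]
        rate_sum = sum(rates)
        return [r / rate_sum * 1e6 if rate_sum else 0.0 for r in rates]
    raise ValueError(f"unknown method: {method}")
```

For example, `normalize_counts([10, 20, 30], [1000, 2000, 3000], "tpm")` gives three equal values, because all three genes have the same number of reads per base. The TPM values of a sample always sum to one million, which is what makes them comparable across samples.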