Tools

annotate_loci

Description: 

This tool takes one input file with genomic coordinates in its first column and 
an additional file with gene locus information. It then tries to annotate all 
genomic regions from the first file, line by line, 
with the annotations from the second file by overlapping the coordinates. 

Usage: annotate_loci -i FILE -loci FILE -format gct|topTable 

	-i                    input file with loci information (required), i.e. the first column 
	                      must contain a coordinate string 
	                      CHR:BEGIN-END, i.e. separated by colon and dash. If the input 
	                      format is gct or topTable, then all subsequent columns are sent to stdout 
	-loci                 input file with loci information (required), tab-delimited format: 
	                      CHR   BEGIN   END   STRAND   GENE   SYMBOL  DESCRIPTION 
	-format gct|topTable  input file -i is in gct|topTable format (optional) 
	-verbose              show more information (optional) 
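
The overlap step described above can be sketched in Python as follows. This is
only an illustration, not the tool's implementation: the names Region,
parse_coordinate and overlaps are hypothetical, and the sketch assumes that
BEGIN and END are both inclusive, which the help text does not state.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    begin: int
    end: int


def parse_coordinate(text: str) -> Region:
    """Parse a 'CHR:BEGIN-END' string into a Region."""
    # Split on the last colon so the chromosome name is kept whole.
    chrom, _, span = text.strip().rpartition(":")
    begin, _, end = span.partition("-")
    return Region(chrom, int(begin), int(end))


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same chromosome and share a position.

    Assumption: BEGIN and END are both inclusive.
    """
    return a.chrom == b.chrom and a.begin <= b.end and b.begin <= a.end


query = parse_coordinate("chr1:1000-2000")
locus = Region("chr1", 1500, 3000)
print(overlaps(query, locus))
```

Strand, gene and symbol columns from the loci file are left out here; a fuller
version would carry them along and print them next to each matching input line.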

Report bugs and feedback to roland.schmucki@roche.com 

count2tpm

Description: 

Calculate normalized read counts for input GCT file. 
Source: 
https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ 
Note that NaN values are output as 0. 

Usage: count2tpm -i GCT-file -l Length-file [-cpm|rpkm|tpm] [-log2|log10] [-col INT] [-digits INT] 


Mandatory input parameters: 

	-g     GCT file with read counts per gene (unique gene identifier in 1st column) 
	-l     tab-delimited file with gene identifier in 1st and gene length in 
	       2nd columns, respectively. 
	       These files can be found in the corresponding genome annotation folders, 
	       e.g. for human in folder /<path to genomes folder>/hg38/gtf/refseq/ 


Optional input parameters: 

	-tpm     transcripts per million (default) 
	-rpkm    reads per kilobase of exon per million reads mapped 
	-cpm     counts per million mapped reads 
	-log2    log2 transform output (adding 0.01) 
	-log10   log10 transform output (adding 0.01) 
	-col     if input length file contains several columns, then specify 
	         the column number with this index (default last column) 
	-digits  number of digits after the decimal point for output (default 3) 
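
For orientation, the three units can be computed as in the Python sketch below.
It follows the standard definitions of CPM, RPKM and TPM and is not taken from
the tool's source. The function names are hypothetical, plain gene lengths are
used rather than effective lengths, and mapping a division by zero to 0.0 is an
assumption modelled on the NaN note above.

```python
import math


def cpm(counts):
    """Counts per million: each count scaled by the library size."""
    total = sum(counts)
    return [c / total * 1e6 if total else 0.0 for c in counts]


def rpkm(counts, lengths):
    """Reads per kilobase per million: CPM divided by the length in kb."""
    return [x / (n / 1e3) if n else 0.0 for x, n in zip(cpm(counts), lengths)]


def tpm(counts, lengths):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rates = [c / n if n else 0.0 for c, n in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 if total else 0.0 for r in rates]


def log2_transform(values, pseudocount=0.01):
    """log2 transform after adding the pseudocount named in the help text."""
    return [math.log2(v + pseudocount) for v in values]


counts = [10, 20, 30]
lengths = [1000, 2000, 3000]
for name, values in (
    ("cpm", cpm(counts)),
    ("rpkm", rpkm(counts, lengths)),
    ("tpm", tpm(counts, lengths)),
):
    print(name, [round(v, 3) for v in values])
```

In this toy input every gene has the same count-to-length ratio, so the RPKM
values are identical and the TPM values split one million evenly.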


Report bugs and feedback to roland.schmucki@roche.com 

expression2gct

Description: 

Convert biokit expression gene count files into GCT format. 

Usage: expression2gct -infile='list of files' -outfile-prefix STRING 

Mandatory input parameters: 

        -infile: either a file containing the paths to input expression files OR 
                 list of space/comma-separated files, e.g. 
                 -infile='sample1.expression,sample2.expression' 

        -outfile-prefix: 2 to 4 output files are written; existing files of the same 
                         name will be overwritten:
                         STRING_rpkm.gct 
                         STRING_count.gct 

Optional input parameters: 

       -use-unique-counts  (use the unique rpkm/read counts, default is multiple) 

       -old-biokit-format  (use if expression file was generated with Biokit v3.8 or 
                           earlier; the annotation is in column #7 instead of #8) 


Report bugs and feedback to roland.schmucki@roche.com 

extract_sequence

Description: 

Extract sequences from an input fasta or fastq file by the ids listed in another input file.

Usage: extract_sequence [-verbose] [-delimiter='. TAB'] [-useEntireIdLine] 
          [-quick] [-not] -ids ids_file -fasta|fastq fasta_file 

Mandatory parameters: 

	  -ids              file name containing sequence id's 
	  -fasta|fastq      file name containing the fasta or fastq sequences 

Optional parameters: 

	  -delimiter        delimiter on the sequence id line 
	  -useEntireIdLine  use the entire line as id and not split line by 
	                    -delimiter into fields 
	  -quick            stop search after the first match 
	  -not              invert the search, i.e. output sequences that are 
	                    not in the ids file 
	  -verbose          output additional information 
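
The selection logic for the fasta case can be sketched in Python as below. It
is an illustration only: read_fasta and select are hypothetical names, and the
sketch assumes that the id is the first field of the header line after
splitting by the delimiter (whitespace when none is given), which the help text
does not spell out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def select(records, ids, delimiter=None, invert=False):
    """Yield records whose id is in ids, or not in ids if invert is True.

    Assumption: the id is the first field of the header line.
    """
    wanted = set(ids)
    for header, seq in records:
        fields = header.split(delimiter)
        key = fields[0] if fields else ""
        if (key in wanted) != invert:
            yield header, seq


fasta = [">seq1 first example", "ACGT", "GG", ">seq2", "TTTT"]
for header, seq in select(read_fasta(fasta), ["seq1"]):
    print(f">{header}\n{seq}")
```

Passing invert=True corresponds to the behaviour described for -not.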


Report bugs and feedback to roland.schmucki@roche.com 

make_cls

Description: 

Create a phenotype CLS file from a given GCT and annotation file. 
The CLS format is defined here: 
https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Phenotype_Data_Formats 

Usage: make_cls -gct FILE  -i FILE 

Mandatory input parameters: 

	  -gct GCT_FILE         input file in GCT format 
	  -i   ANNOTATION_FILE  input file with sample annotations 

	The annotation file is a 2-column tab-delimited file (mark comment or header lines with #) 
	  column 1: sample name as given in input GCT file 
	  column 2: sample group 
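
A categorical CLS file has three lines: the number of samples and classes
followed by 1, a '#' line with the class names, and one class label per sample.
The Python sketch below builds such a file from sample names and a
sample-to-group mapping. It is not the tool's code: make_cls as a function name
is hypothetical, and writing the labels as 0-based indices in order of first
appearance is one valid choice among those the format allows.

```python
def make_cls(sample_names, sample_to_group):
    """Return the text of a categorical CLS file.

    sample_names: sample order as in the GCT header.
    sample_to_group: mapping from sample name to group (annotation file).
    """
    groups = [sample_to_group[name] for name in sample_names]
    # Keep classes in order of first appearance, without duplicates.
    classes = list(dict.fromkeys(groups))
    index = {group: i for i, group in enumerate(classes)}
    lines = [
        f"{len(groups)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(str(index[group]) for group in groups),
    ]
    return "\n".join(lines) + "\n"


annotation = {"s1": "ctrl", "s2": "treated", "s3": "ctrl"}
print(make_cls(["s1", "s2", "s3"], annotation), end="")
```

A sample that is missing from the annotation raises a KeyError here, which
makes mismatches between the GCT header and the annotation file visible early.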


Report bugs and feedback to roland.schmucki@roche.com 

make_design_contrast_matrix

Description: 

Usage: make_design_contrast_matrix [-prefix STRING] -gct FILE -i FILE 

Mandatory parameters: 

	-gct GCT_FILE         input file in GCT format 
	-i   ANNOTATION_FILE  input file with sample annotations with 2 columns (see below) 

Optional parameters: 

	-prefix STRING        a string for the output prefix 

	The annotation file is a 2-column tab-delimited file (mark comment or header lines with #) 
	  column 1: sample name as given in input GCT file 
	  column 2: sample group 

Report bugs and feedback to roland.schmucki@roche.com

mean

Description: 

Calculate means per sample conditions. 

Usage: mean [-skip INT] [-gzip]  -i INFILE  -s SAMPLE_ANNOTATIONS 

Mandatory parameters: 
	  -i FILE  input file with sample data, e.g. read counts 
	  -s FILE  input file with sample annotations 

The SAMPLE_ANNOTATIONS file is a tab-delimited input file with at least 2 columns: 
	 column 1: sample name 
	 column 2: sample condition 

The read counts from INFILE are averaged (mean) for each sample condition. 
The INFILE headers should match the sample names specified in the SAMPLE_ANNOTATIONS file. 
Note that the header line must begin with '#' or with 'ID'. 

Optional parameters: 

	  -skip INT     denotes how many columns from the INFILE should be skipped and 
	                not used for calculation, e.g. skip ID or description columns, default 6 
	  -gzip         use if INFILE is gzipped
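
The averaging for a single data row can be sketched in Python as follows. This
is an illustration under assumptions, not the tool's code: condition_means is a
hypothetical name, samples missing from the annotation are silently ignored,
and the example uses skip=1 for brevity although the tool's default is 6.

```python
from collections import defaultdict
from statistics import mean


def condition_means(header, row, sample_to_condition, skip=1):
    """Average the values of one row per sample condition.

    header, row: lists of fields from the header line and one data line.
    skip: number of leading columns that hold no sample data.
    """
    by_condition = defaultdict(list)
    for name, value in zip(header[skip:], row[skip:]):
        condition = sample_to_condition.get(name)
        if condition is not None:
            by_condition[condition].append(float(value))
    return {cond: mean(values) for cond, values in by_condition.items()}


header = ["ID", "s1", "s2", "s3", "s4"]
row = ["geneA", "10", "20", "30", "50"]
conditions = {"s1": "ctrl", "s2": "ctrl", "s3": "treated", "s4": "treated"}
print(condition_means(header, row, conditions))
```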

Report bugs and feedback to roland.schmucki@roche.com 

merge_fastq

Description: 

Merge reads from several fastq files into one fastq file. 
IMPORTANT: input files for mate R1 and R2 reads must be in the same ORDER. 

Usage: merge_fastq [-sbatch] [-t INT] [-old-version] [-script-prefix STR] [-bsub-path STR] -i FILE 

Mandatory parameters: 

  -i  input_file  tab-delimited file with 2 columns: input gzipped fastq file, 
                  output gzipped fastq file 


Optional parameters: 

  -sbatch         use "sbatch" for submitting to queue, default is "bsub" 
  -t  integer     number of minutes for queuing system, default 360 = 6 hours, only for "sbatch" 
  -old-version    use old version which is much slower 
  -script-prefix  prefix for temp scripts, e.g. path, default ./merge_fastq 
  -bsub-path      path to bsub command on the shpc, default bsub 


Report bugs and feedback to roland.schmucki@roche.com 

merge_gct

  Usage: merge_gct [-h] FILE1 FILE2 [FILE3 ...]
  
  Merge GCT files

  Optional arguments

    -h   display this help and exit

   Contact roland.schmucki@roche.com

minmax_gct

  Remove features (rows) from a GCT file if the row MIN or MAX is
  lower than (MIN), greater than (MAX), lower than or equal to (MIN-EQUAL),
  or greater than or equal to (MAX-EQUAL) a user-given threshold.
  Use MIN-REVERSE or MAX-REVERSE to output the reversed comparison.
  Results are written to standard output.

  3 input arguments required:

    1. input GCT file
    2. threshold value (real number)
    3. MIN or MAX or MIN-EQUAL or MAX-EQUAL or MIN-REVERSE or MAX-REVERSE

  Contact roland.schmucki@roche.com

parse_gtf

Description: 

Parse a GTF file and output attributes to stdout. 

Usage: parse_gtf [-h] -gtf INFILE  + 1 Optional argument from below 

Optional arguments

	-output-fields="gene_name,gene_synonym,product" 

The above example will output all gtf fields named "gene_name", "gene_synonym", and "product" 

	-refseq 

This option will work on a gtf from refseq and do the following  
             0) discard lines that are not exons 
             1) get gene number from the field Dbxref "GeneID:" 
                and replace the input gene_id value with the gene number 
             2) remove transcript_id fields if they contain "rnaN" numbers where 
                N is an integer 
             3) remove version number from transcript_id field 
                (e.g. transcript_id "NR_046018.2" --> transcript_id "NR_046018") 
             4) remove duplicated transcript_id fields 

Note there will be a warning if there is no proper transcript accession id (mostly miRNAs) 
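
Two of the steps above, reading the attribute column and removing the version
number from a transcript_id, can be sketched in Python as below. The function
names are hypothetical and the regular expression assumes attributes of the
form key "value"; it is a simplified illustration, not the tool's parser. The
gene_id value "gene0" is a made-up example.

```python
import re

# Matches one attribute of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(column):
    """Return the attributes of a GTF line as a list of (key, value) pairs.

    A list is used instead of a dict so duplicated keys stay visible.
    """
    return ATTRIBUTE.findall(column)


def strip_version(transcript_id):
    """Remove a trailing '.N' version suffix, e.g. NR_046018.2 -> NR_046018."""
    return re.sub(r"\.\d+$", "", transcript_id)


column = 'gene_id "gene0"; transcript_id "NR_046018.2"; gene_name "gene0";'
for key, value in parse_attributes(column):
    if key == "transcript_id":
        value = strip_version(value)
    print(key, value)
```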


Report bugs and feedback to roland.schmucki@roche.com

reorder_gct

  Usage: reorder_gct [-h] -g GCT_FILE -s SAMPLE_FILE
  
  Re-order samples (columns) in the GCT file by
  the names given in the SAMPLE file.

    -g   input GCT file
    -s   input SAMPLE file with re-ordered sample names (file must contain exactly one column)

  Optional arguments

    -h   display this help and exit
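
  The column re-ordering can be sketched in Python as below. It assumes the
  usual GCT layout (version line, dimensions line, header line with two
  leading annotation columns, then data rows). reorder_gct as a function name
  is hypothetical, and this is an illustration rather than the tool's code.

```python
def reorder_gct(lines, sample_order):
    """Return GCT lines with sample columns arranged as in sample_order."""
    version, _dims, header, *rows = [
        line.rstrip("\n").split("\t") for line in lines
    ]
    position = {name: i for i, name in enumerate(header)}
    # Always keep the two leading annotation columns, then the samples.
    keep = [0, 1] + [position[name] for name in sample_order]
    dims = [str(len(rows)), str(len(sample_order))]
    table = [[fields[i] for i in keep] for fields in [header] + rows]
    return ["\t".join(fields) for fields in [version, dims] + table]


gct = [
    "#1.2",
    "2\t3",
    "NAME\tDescription\ts1\ts2\ts3",
    "g1\tna\t1\t2\t3",
    "g2\tna\t4\t5\t6",
]
print("\n".join(reorder_gct(gct, ["s3", "s1", "s2"])))
```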

   Contact roland.schmucki@roche.com

replace_header_gct

  Usage: replace_header_gct [-h] -g GCT_FILE -s SAMPLE_FILE
  
  Replace the sample names in the GCT file by
  the names given in the SAMPLE file.

    -g   input GCT file
    -s   input SAMPLE file
         2 columns required: 
           1st: sample names in input GCT file
           2nd: sample names in output GCT file

  Optional arguments

    -h   display this help and exit

   Contact roland.schmucki@roche.com

sort_gct

  Usage: sort_gct [-h] -g GCT_FILE [-c 1|2] [-n] [-r]

  Sorts input GCT file by column 1 (default) or 2 in numeric or alphabetic (default) order
  

    -g   input GCT file
    -c   column 1 or 2 (default is 1)
    -n   order numerically (default is alphabetically) 
    -r   reverse order

  Optional arguments

    -h   display this help and exit

   Contact roland.schmucki@roche.com

subset_gct

  Usage: subset_gct [-h] -g GCT_FILE -k KEYS_FILE -s SAMPLES_FILE

  Creates a subset of the input GCT file 
  
    -g   input GCT file
    -k   input KEYS file
         1 column required: 
           1st: keys (e.g. Genes) in input GCT file
    -s   input SAMPLES file
         1 column required:
           1st: sample names in input GCT file to output
  Optional arguments

    -h   display this help and exit

   Contact roland.schmucki@roche.com