miRanalyzer standalone version
April 26, 2011
1 Introduction
The miRanalyzer standalone version requires the following third party software that
needs to be installed first:
- Vienna RNA package for RNA Secondary Structure Prediction and
Comparison (Vienna package)
- Bowtie - An ultrafast memory-efficient short read aligner (Bowtie)
- Weka: Data Mining Software in Java (Weka). The tested version is weka
3.5.3. NOTE that the “weka.jar” file (named exactly like this, i.e. without
version) must be in the same folder as the miRanalyzer.jar (see below).
2 Downloads
In this section the available programs and Perl scripts can be downloaded. Please
read the next section on how to install them.
- groupReads: this script can be used to convert fastq to RC (read-count)
format.
- makeSeqObj: this Java jar can be used to prepare the genome sequences
(see below). Please note that the current version uses memory very
inefficiently, so you may need to assign several gigabytes of memory to the
Java VM (for example -Xmx5000m).
- miRanalyzer jar file: the newest version of the miRanalyzer algorithm
(miRanalyzer_0.2, 02/14/2011)
- models for version 0.2: the newest model files (they may only work with
the newest version!)
- start-up DB: the start-up database (see below)
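As an illustration of what the fastq to RC conversion does, here is a minimal Python sketch. It is not the groupReads script itself: it assumes 4-line fastq records and writes one 'read sequence' 'read frequency' pair per line separated by a tab (the separator is an assumption, and the function names are hypothetical).

```python
from collections import Counter
import io

def fastq_to_read_counts(handle):
    """Count identical read sequences in a 4-line-per-record fastq stream."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            seq = line.strip()
            if seq:
                counts[seq] += 1
    return counts

def format_read_counts(counts):
    """One 'sequence<TAB>count' line per distinct read, most frequent first.

    The tab separator is an assumption, not taken from the manual.
    """
    return "\n".join(f"{seq}\t{n}" for seq, n in counts.most_common())

# Small in-memory example with made-up reads.
example = io.StringIO(
    "@r1\nTGAGGTAGTAGGTTGTATAGTT\n+\nIIIIIIIIIIIIIIIIIIIIII\n"
    "@r2\nTGAGGTAGTAGGTTGTATAGTT\n+\nIIIIIIIIIIIIIIIIIIIIII\n"
    "@r3\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n"
)
rc = fastq_to_read_counts(example)
print(format_read_counts(rc))
```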
3 Installation
miRanalyzer relies on a large number of libraries such as mature microRNAs,
chromosome sequences, bowtie indexes, etc. These data need to be stored in a
local file-based database. Before using miRanalyzer, this database needs to be
generated.
3.1 General structure of the database
The easiest way to generate the basic structure of the database is to download the
miRanalyzer start-up DB following these steps:
- download the start-up DB to your miRanalyzer directory (like
/home/user)
- extract the tar: tar xvzf miRanalyzerDB.tgz
Now you will see a directory named miRanalyzerDB, which is the base directory. Inside
the base directory you will find 4 directories, the miRanalyzer jar file, the
makeSeqObj jar file, the fastq conversion Perl script and an example input file from
Drosophila melanogaster. Please note that all folders must be named exactly as
in this manual, otherwise miRanalyzer cannot access the data. The folders are
the following:
- bowtie: the folder where the bowtie indexes must be (see below)
- model: this folder holds the model files for the prediction of new
microRNAs. Download all model files here: miRanalyzer model files
- out: the default output folder, a folder with the name of the input file will
be generated in this directory.
- seqOBJ: the genome sequences in miRanalyzer format (to download see
the table below).
The bowtie folder needs several subfolders which need to be named exactly as
follows:
- genome: the indexes of the whole genome sequences. The names must be
the same as used for the genome sequences (in seqOBJ folder)
- mature: the indexes for the mature microRNAs. Download all indexes here
and put them into the ’mature’ folder.
- maturestar: the indexes for the maturestar microRNAs. Download all
indexes here and put them into the ’maturestar’ folder.
- maturestarunobs: the indexes for the maturestar microRNAs which are
not in miRBase, i.e. that have not been observed before. Download all
indexes here and put them into the ’maturestarunobs’ folder.
- hairpin: the indexes for the hairpin (precursor) sequences of the
microRNAs. Download all indexes here and put them into the ’hairpin’
folder.
- translibs: the indexes of the other libraries which should be used. The files
can be generated with bowtie-build. The basename of colorspace libraries
must end in ’_c’, for example rfam_c*. Note, however, that the name given
in the input would just be ’rfam’, as miRanalyzer adds the ’_c’ itself when
color space sequences have been indicated in the input.
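The folder layout described above can be checked or created with a short script. The following Python sketch only uses the folder names listed in this section; the function names are hypothetical and the script is not part of miRanalyzer.

```python
import os
import tempfile

# Folder names as listed in this manual (they must match exactly).
TOP_LEVEL = ["bowtie", "model", "out", "seqOBJ"]
BOWTIE_SUBFOLDERS = [
    "genome", "mature", "maturestar", "maturestarunobs", "hairpin", "translibs",
]

def expected_dirs(base):
    """All directories the base directory is expected to contain."""
    dirs = [os.path.join(base, d) for d in TOP_LEVEL]
    dirs += [os.path.join(base, "bowtie", d) for d in BOWTIE_SUBFOLDERS]
    return dirs

def missing_dirs(base):
    """Expected directories that do not exist yet."""
    return [d for d in expected_dirs(base) if not os.path.isdir(d)]

def create_layout(base):
    """Create any missing directory of the layout."""
    for d in expected_dirs(base):
        os.makedirs(d, exist_ok=True)

# Demonstration inside a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "miRanalyzerDB")
    before = len(missing_dirs(base))
    create_layout(base)
    after = len(missing_dirs(base))
print(before, after)
```

Running missing_dirs on your own base directory before the first analysis is a cheap way to catch a misspelled folder name.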
3.2 Populate the database
3.2.1 Bowtie indexes
The bowtie indexes for miRBase can be downloaded above. Please note that
the mature microRNA sequences are too short to work with normal read
lengths. Therefore 25 Gs are added in order to be able to use Bowtie. Please
make sure to use the latest version of the miRBase indexes provided here, or
take into account the 25 Gs added when preparing your own microRNA
references. For the genome assemblies and transcribed libraries, please use
the bowtie-build program or download the bowtie indexes from the bowtie
page.
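If you prepare your own microRNA references, the padding with 25 Gs could be done along the following lines. This is a sketch under assumptions: the manual does not state at which end the Gs are added, so appending them to the end of each sequence is a guess that should be checked against the provided indexes. The record names and sequences are made up.

```python
PAD = "G" * 25  # the 25 Gs mentioned in the manual

def read_fasta(lines):
    """Yield (name, sequence) pairs from the lines of a (multi)fasta file."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def pad_records(records, pad=PAD):
    """Append the padding to each sequence (the position is an assumption)."""
    return [(name, seq + pad) for name, seq in records]

fasta = [">example-miR-1", "TGAGGTAGTAGGTTGTATAGTT", ">example-miR-2", "ACGTACGT", "ACGT"]
padded = pad_records(read_fasta(fasta))
for name, seq in padded:
    print(">" + name)
    print(seq)
```

The padded fasta file would then be indexed with bowtie-build as usual.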
3.2.2 Genome sequence data
For the prediction of new microRNAs, miRanalyzer needs to extract sequences from
the genome assemblies. In order to facilitate a rapid access to this information,
miRanalyzer uses preprocessed fasta files. Table 1 shows the assemblies which are
currently available and provides the download links. The user can generate these files
for other species or unpublished assemblies using the makeSeqObj Java program. The
program takes a (multi)fasta as input and generates the corresponding zip
file.
3.2.3 Models
The latest models for animal and plant prediction of new microRNAs can be found
here (if not downloaded along with the start-up DB). They must be placed in the
“models” folder within the miRanalyzer directory.
4 Using miRanalyzer
Once the database is generated, miRanalyzer can be used. In general, the command
line parameters must be given in the following format: parameterName=value. The
program uses the following input parameters:
4.1 Mandatory parameters
- input=file: the name of the input file. The input can be given in read-count
format (’read sequence’ ’read frequency’) or in multi fasta format (see the
manual for more information)
- dbPath=path: the absolute (complete) path to the miRanalyzer database
- species=’assembly name’: Assembly/basename of the bowtie index, e.g.
hg18, mm8, rn4, etc.
- speciesShort=’short name of species’: This must be the abbreviation of
the species used in miRBase for example hsa for Homo sapiens or mmu
for Mus musculus.
- kingdom=plant or kingdom=animal: In order to set the models and
features for the prediction of new microRNAs the program needs to know
whether the species is a plant or animal.
- bowtiePath=path: the path with the Bowtie binaries
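When miRanalyzer is launched from a script, it can help to check the mandatory parameters before starting Java. The following Python sketch builds the argument list in the parameterName=value format; the helper name build_command and its defaults are hypothetical, and the values are those of the example in section 4.3.

```python
MANDATORY = ["input", "dbPath", "species", "speciesShort", "kingdom", "bowtiePath"]

def build_command(params, jar="miRanalyzer.jar", max_heap="2000m"):
    """Return the java argument list, or raise if a mandatory parameter is missing."""
    missing = [k for k in MANDATORY if k not in params]
    if missing:
        raise ValueError("missing mandatory parameters: " + ", ".join(missing))
    if params["kingdom"] not in ("plant", "animal"):
        raise ValueError("kingdom must be 'plant' or 'animal'")
    cmd = ["java", "-Xmx" + max_heap, "-jar", jar]
    # Every parameter is passed as parameterName=value.
    cmd += [f"{key}={value}" for key, value in params.items()]
    return cmd

params = {
    "input": "SRR069503.rc",
    "dbPath": "/home/user/miRanalyzer",
    "species": "dm3",
    "speciesShort": "dme",
    "kingdom": "animal",
    "bowtiePath": "/usr/local/bin/bowtie64",
    "translibs": "rfam:RFam:15:5",
}
cmd = build_command(params)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) once dbPath and bowtiePath point to your local installation.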
4.2 Optional parameters
- translibs=’parameter string’: the additional libraries which should be used.
This parameter actually takes into account four different values which are
’encoded’ into a parameter string. The parameter string has the following
format: ’name of bowtie basename’:’name of library’:’maximum number
of considered matches’:’remove value’. Only the basename of the bowtie
index is mandatory, the other three are optional. If not set, the default
values are: a) ’name of library’ is set to the name of the bowtie basename
(miRanalyzer uses this name in the output), b) ’maximum number of
considered matches’ is set to 20 (this value is passed to bowtie as the -k
parameter, i.e. the maximum number of reported alignments) and c) ’remove value’
is set to 5, i.e. if a read matches more than 5 reference sequences it is
removed and therefore not used in the prediction of new microRNAs.
- solid=true: if this parameter is set, then the input is treated as color space
format.
- output=’output folder’: by default the output is written in the ’out’ folder
in the miRanalyzer directory. The default folder is named after the input.
For example, if the input is called “hsa_G0.txt”, then within the ’out’
folder a subfolder will be generated named ’hsa_G0’ where the output is
written.
- minReadLength=value: the minimum read length. Note that this length is
also used as seed length in the bowtie alignment against known microRNAs
and the genome (default: 17).
- minReadLengthTrans=value: the seed length in bowtie alignments against
the transcribed libraries (default: 20).
- maxReadLength=value: the maximum read length. miRanalyzer performs
an internal re-grouping with this length (default: 26).
- noMM=value: allowed number of mismatches to known microRNAs
(default: 1)
- noMMTrans=value: allowed number of mismatches to transcribed libraries
(default: 1)
- noMMGenome=value: allowed number of mismatches to the genome
(default: 1)
- score=value: the minimum score required for a candidate to be considered a
new microRNA. The value must be between 0 and 1; very low values will lead
to drastic overprediction (default: 0.9).
- minNoPositives=value: miRanalyzer predicts using 5 models (5 different
negative sets). This parameter determines the minimum number of models
that must predict a candidate to be a new microRNA (default: 3).
- maxNoGenome=value: the maximum number of considered matches to
genome (-k option in bowtie) (default: 8)
- maxNoKnown=value: the maximum number of considered matches to
known microRNAs (-k option in bowtie) (default: 10)
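The translibs parameter string and its defaults can be made explicit with a small parser. This Python sketch follows the format ’name of bowtie basename’:’name of library’:’maximum number of considered matches’:’remove value’ described above. Treating an empty field as "use the default" is an assumption, and the names TransLib and parse_translib are hypothetical.

```python
from typing import NamedTuple

class TransLib(NamedTuple):
    basename: str     # basename of the bowtie index (mandatory)
    name: str         # library name used in the output
    max_matches: int  # passed to bowtie as -k
    remove: int       # reads matching more references than this are removed

def parse_translib(spec):
    """Parse one translibs parameter string, filling in the documented defaults."""
    parts = spec.split(":")
    if not parts[0] or len(parts) > 4:
        raise ValueError("expected basename[:name[:maxMatches[:remove]]]")
    parts += [""] * (4 - len(parts))
    basename, name, max_matches, remove = parts
    return TransLib(
        basename,
        name or basename,                      # default: the bowtie basename
        int(max_matches) if max_matches else 20,  # default: 20
        int(remove) if remove else 5,             # default: 5
    )

full = parse_translib("rfam:RFam:15:5")
short = parse_translib("rfam")
print(full)
print(short)
```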
4.3 An example
The following example launches miRanalyzer on the example data file provided with
the start-up DB. Note that you need to adapt dbPath and bowtiePath to your local
values. For very big input files, it might be necessary to increase the memory
(-Xmx option of the Java VM).
java -Xmx2000m -jar miRanalyzer.jar input=SRR069503.rc \
dbPath=/home/user/miRanalyzer species=dm3 speciesShort=dme kingdom=animal \
bowtiePath=/usr/local/bin/bowtie64 translibs=rfam:RFam:15:5
5 miRanalyzer output
miRanalyzer produces several different output files. In the following, *
can be mature, maturestar, matureunobs, hairpin or any of the transcribed
libraries.
5.1 *_unique.txt
A summary of the reference sequences which have been uniquely matched. That
means all of their reads map only to this reference sequence and not to any
other. The following columns exist:
- name: the name of the reference sequence
- #uniqueReads: the number of unique reads mapped
- readCount: the read count sum of all unique reads
- norm_expressed_all: the read count divided by the total read count in
the experiment, times 100 (percentage)
- norm_expressed_mapped: the read count divided by the read count of all
unique reads mapped to this library, times 100 (percentage)
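The two normalized columns are simple percentages. The following Python lines work through one example with made-up numbers (250 reads for one microRNA, 100000 reads in the experiment, 5000 uniquely mapped reads in this library); the function name is hypothetical.

```python
def percent(read_count, total):
    """Read count as a percentage of a total read count."""
    if total <= 0:
        raise ValueError("total read count must be positive")
    return 100.0 * read_count / total

read_count = 250            # readCount of one reference sequence (made up)
total_reads = 100000        # total read count in the experiment (made up)
unique_mapped_reads = 5000  # unique reads mapped to this library (made up)

norm_expressed_all = percent(read_count, total_reads)
norm_expressed_mapped = percent(read_count, unique_mapped_reads)
print(norm_expressed_all, norm_expressed_mapped)
```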
5.2 *_ambig.txt
Sometimes a read maps with the same quality (number of mismatches and length) to
more than one reference sequence. In such cases, miRanalyzer groups those
reference sequences together and reports them as ambiguous matches. We do so
because the members of a family sometimes have very similar sequences and cannot
be readily distinguished. However, by means of these ambiguous matches at least the
microRNA family can be detected. The format of the file is the same as the
“*_unique” file.
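The distinction between unique and ambiguous matches can be pictured with the following Python sketch. It is one plausible reading of the description, not the miRanalyzer implementation: each read carries the set of reference sequences it matches with equal best quality, reads with a single reference count as unique, and all others are summed per group of references. All names and numbers are made up.

```python
from collections import defaultdict

def split_unique_ambiguous(best_hits):
    """best_hits maps read -> (read count, set of equally good reference names)."""
    unique = defaultdict(int)
    ambiguous = defaultdict(int)
    for read, (count, refs) in best_hits.items():
        if len(refs) == 1:
            # The read maps to exactly one reference sequence.
            unique[next(iter(refs))] += count
        else:
            # The references are grouped and reported together.
            ambiguous[tuple(sorted(refs))] += count
    return dict(unique), dict(ambiguous)

best_hits = {
    "read1": (10, {"ref-a"}),
    "read2": (4, {"ref-a", "ref-b"}),
    "read3": (1, {"ref-b", "ref-a"}),
}
unique, ambiguous = split_unique_ambiguous(best_hits)
print(unique)
print(ambiguous)
```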
5.3 *_reads.txt
A summary at the read level, with the following columns:
- ReadSequence: the sequence of the read
- ReadCount: the read count (number of copies)
- NoMatches: the number of reference sequences the read has mapped to
- alignLength: the length of the longest alignment
- name: the names of the reference sequences the read maps to
- readID: a unique ID assigned by miRanalyzer
5.4 *_parsed.txt
The parsed bowtie output for the library. It holds just the longest alignments with
the lowest number of mismatches (see Bowtie manual for more details).
5.5 *_grouped.txt
All reference sequences mapped by at least one read (no distinction between unique
and ambiguous matches!).
- name: the name of the reference sequence
- readCountSum: the sum of all read counts mapped to this reference
sequence
- uniqueReadSum: the number of unique reads mapped to this reference
- readsString: an encoded string with information on all reads mapped to
this sequence. The information on the different reads is separated by ’#’.
For each read the read sequence, the read count, and the position in the
alignment (0-based) is given (separated by ’:’).
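The readsString column can be decoded with a few lines of Python. This sketch assumes exactly the layout described above (reads separated by ’#’, then read sequence, read count and 0-based position separated by ’:’); the example string is made up and the function name is hypothetical.

```python
def parse_reads_string(reads_string):
    """Return a list of (sequence, read count, 0-based position) tuples."""
    reads = []
    for item in reads_string.split("#"):
        if not item:
            continue  # tolerate a leading or trailing '#'
        seq, count, position = item.split(":")
        reads.append((seq, int(count), int(position)))
    return reads

reads = parse_reads_string("TGAGGTAGTAGG:12:0#GAGGTAGTAGGT:3:1")
total = sum(count for _, count, _ in reads)
print(reads, total)
```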
5.6 newMicroRNA.txt & newMicroRNA_pure.txt
The “newMicroRNA.txt” and “newMicroRNA_pure.txt” files hold a summary of the
newly predicted microRNAs. Depending on the configuration, some reads which
mapped to the transcriptome or RFam are also used for the prediction of
new microRNAs. Those reads might be more prone to cause false positive
predictions and therefore, in the file “newMicroRNA_pure.txt”, only those
microRNAs with previously unmapped reads are reported. The columns of the files
are:
- name: the name of the candidate
- chrom: the chromosome
- chromStart: the start coordinate
- chromEnd: the end coordinate
- strand: the strand
- #unique reads: the number of unique reads
- read count: the read count sum of all unique reads
- #posPredictions: the number of times this new microRNA has been
predicted (by any of the 5 models)
- norm. read count: the normalized read count (divided by the total read
count of all reads in the analysis)
- cluster sequence: the sequence formed by all reads
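The #posPredictions column relates to the score and minNoPositives parameters of section 4.2. The following Python sketch shows how such a vote over the 5 models could work; it is an interpretation, not the miRanalyzer code. In particular, comparing each model score with ">=" against the score threshold is an assumption, and the scores are made up.

```python
def count_positive_predictions(model_scores, score=0.9):
    """Number of models whose score reaches the threshold (default: 0.9)."""
    return sum(1 for s in model_scores if s >= score)

def is_new_microrna(model_scores, score=0.9, min_no_positives=3):
    """A candidate is kept if enough models predict it (default: 3 of 5)."""
    return count_positive_predictions(model_scores, score) >= min_no_positives

accepted_scores = [0.95, 0.92, 0.91, 0.50, 0.30]  # 3 models above threshold
rejected_scores = [0.95, 0.92, 0.50, 0.50, 0.30]  # only 2 models above threshold

print(count_positive_predictions(accepted_scores), is_new_microrna(accepted_scores))
print(count_positive_predictions(rejected_scores), is_new_microrna(rejected_scores))
```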
5.7 Candidates.txt and Candidates_pure.txt
The “Candidates.txt” and “Candidates_pure.txt” files contain additional information on
the newly predicted microRNAs. The first column is the name of the candidate. Then
follows a string which holds, separated by “;”: the precursor sequence, the secondary
structure, the minimum free energy of the structure, the chromosome, the chromosomal
start coordinate, the chromosomal end coordinate, the number of times it was predicted
and the reads which compose this new microRNA. The different reads are separated
by ’:’. The information for each read is the following (separated by ’@’): read
sequence, position in precursor (0-based), read count and alignment length.
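The encoded string of the Candidates files can be unpacked as in the following Python sketch. It assumes that the eight fields appear in exactly the order listed above and that the numeric fields are plain numbers; how the name column is separated from the string is not covered here. The example string and the function name are made up.

```python
def parse_candidate_info(info):
    """Split the ';'-separated candidate string into named fields."""
    fields = info.split(";")
    if len(fields) != 8:
        raise ValueError("expected 8 ';'-separated fields, got %d" % len(fields))
    precursor, structure, energy, chrom, start, end, n_pred, reads_field = fields
    reads = []
    for item in reads_field.split(":"):  # reads are separated by ':'
        if not item:
            continue
        seq, position, count, align_length = item.split("@")
        reads.append({
            "sequence": seq,
            "position": int(position),       # 0-based position in the precursor
            "count": int(count),
            "align_length": int(align_length),
        })
    return {
        "precursor": precursor,
        "structure": structure,
        "free_energy": float(energy),
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "predictions": int(n_pred),
        "reads": reads,
    }

info = "ACGUACGU;((....));-3.2;chr2L;100;108;4;ACGU@0@10@4:CGUA@1@2@4"
candidate = parse_candidate_info(info)
print(candidate["chrom"], candidate["predictions"], len(candidate["reads"]))
```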