miRanalyzer standalone version

1 Introduction

The miRanalyzer standalone version requires the following third-party software, which must be installed first:

2 Downloads

In this section the available programs and Perl scripts can be downloaded. Please read the next section for installation instructions.

3 Installation

miRanalyzer relies on a large number of libraries such as mature microRNAs, chromosome sequences, bowtie indexes, etc. These data need to be stored in a local file-based database. Before using miRanalyzer, this database needs to be generated.

3.1 General structure of the database

The easiest way to generate the basic structure of the database is to download the miRanalyzer start-up DB by following these steps:
You will now see a directory named miRanalyzerDB, which is the base directory. Inside the base directory you will find four directories, the miRanalyzer jar file, the makeSeqObj jar file, the fastq conversion Perl script and an example input file from Drosophila melanogaster. Please note that all folders must be named exactly as in this manual; otherwise miRanalyzer cannot access the data. The folders are the following:
The bowtie folder needs several subfolders which need to be named exactly as follows:

3.2 Populate the database

3.2.1 Bowtie indexes

The bowtie indexes for miRBase can be downloaded above. Please note that the mature microRNA sequences are too short to work with normal read lengths. Therefore, 25 Gs are added to each sequence so that Bowtie can be used. Please make sure to use the latest version of the miRBase indexes provided here, or take the 25 added Gs into account when preparing your own microRNA references. For the genome assemblies and transcribed libraries, please build the indexes with bowtie-build or download them from the Bowtie page.
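
The padding described above can be sketched in a few lines of Python for users who prepare their own microRNA references. This is only an illustration, not the script used to build the provided indexes: the function names are hypothetical, and the sketch assumes that the 25 Gs are appended to the 3' end of each sequence, which this manual does not state explicitly.

```python
from typing import Iterable, Iterator, Tuple

PAD = "G" * 25  # the 25 Gs mentioned in the manual


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pad_records(lines: Iterable[str], pad: str = PAD) -> Iterator[str]:
    """Yield FASTA lines with `pad` appended to every sequence.

    Assumption: the padding goes on the 3' end of the sequence.
    """
    for header, seq in read_fasta(lines):
        yield f">{header}"
        yield seq + pad


# Hypothetical record, for illustration only.
example = [">example-miR", "UGGAAUGUAAAGAAGUAUGGAG"]
padded = list(pad_records(example))
```

The padded FASTA lines could then be written to a file and indexed with bowtie-build.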

3.2.2 Genome sequence data

For the prediction of new microRNAs, miRanalyzer needs to extract sequences from the genome assemblies. To allow rapid access to this information, miRanalyzer uses preprocessed fasta files. Table 1 shows the assemblies which are currently available and provides the download links. The user can generate these files for other species or unpublished assemblies using the makeSeqObj Java program. The program takes a (multi)fasta file as input and generates the corresponding zip file.
Species (short name)            Assembly/Version            Database                      Download link
Homo sapiens (hsa)              hg18, hg19                  UCSC                          hg18, hg19
Mus musculus (mmu)              mm8, mm9                    UCSC                          mm8, mm9
Rattus norvegicus (rno)         rn4                         UCSC                          rn4
Pan troglodytes (ptr)           panTro2                     UCSC                          pantro2
Macaca mulatta (rma)            rheMac2                     UCSC                          rhemac2
Bos taurus (bta)                bosTau4                     UCSC                          bostau4
Canis familiaris (cfa)          canFam2                     UCSC                          canfam2
Gallus gallus (gga)             galGal3                     UCSC                          galgal3
Gasterosteus aculeatus (gac)    gasAcu1                     UCSC                          gasacu1
Xenopus tropicalis (xtr)        xenTro2                     UCSC                          xentro2
Danio rerio (dre)               danRer6                     UCSC                          danrer6
Taeniopygia guttata (tgu)       taeGut1                     UCSC                          taegut1
Tetraodon nigroviridis (tni)    tetNig2                     UCSC                          tetnig2
Monodelphis domestica (mdo)     monDom5                     UCSC                          mondom5
Anopheles gambiae (aga)         anoGam1                     UCSC                          anogam1
Apis mellifera (ame)            apiMel3                     UCSC                          apimel3
Drosophila melanogaster (dme)   dm3                         UCSC                          dm3
Caenorhabditis elegans (cel)    ce6                         UCSC                          ce6
Bombyx mori (bmo)               bm2 (silkworm_genome_v2.0)  SilkDB                        bm2
Pea Aphid (pap)                 peaAph2                     peaAph2                       peaaph2
Arabidopsis thaliana (ath)      tair9                       TAIR                          tair9
Zea mays (zma)                  zm1 (ZmB73_AGPv1_genome)    PlantGDB                      zm1
Vitis vinifera (vvi)            vv12x (Genoscope 12x)       PlantGDB                      vv12x
Oryza sativa (osa)              OSgenomeV6.1 (osa6)         PlantGDB                      osv61
Medicago truncatula (mtr)       mt3                         M. truncatula Genome Project  mt3

3.2.3 Models

The latest models for animal and plant prediction of new microRNAs can be found here (if not downloaded along with the start-up DB). They must be placed in the “models” folder within the miRanalyzer directory.

4 Using miRanalyzer

Once the database is generated, miRanalyzer can be used. In general, the command line parameters must be given in the following format: parameterName=value. Note that the program inspects the input file extension to infer the format. If the input file is in fasta format, it needs the ’fasta’ or ’fa’ extension. The program can also read fastq files (these need the ’fastq’ extension). In this case, an adapter sequence needs to be given.
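
To make the parameterName=value convention and the extension rule concrete, here is a minimal Python sketch. It does not reproduce miRanalyzer's own code: the function names are hypothetical, the extension comparison is assumed to be case-insensitive, and any extension not mentioned above is simply reported as "unknown".

```python
import os
from typing import Dict, List

# Extension-to-format mapping taken from the paragraph above.
FORMAT_BY_EXTENSION = {"fasta": "fasta", "fa": "fasta", "fastq": "fastq"}


def parse_parameters(args: List[str]) -> Dict[str, str]:
    """Split each 'parameterName=value' argument at its first '='."""
    params = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep or not name:
            raise ValueError(f"expected parameterName=value, got {arg!r}")
        params[name] = value
    return params


def infer_format(path: str) -> str:
    """Guess the input format from the file extension.

    Assumption: the comparison ignores case; other extensions are
    reported as 'unknown' in this sketch.
    """
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return FORMAT_BY_EXTENSION.get(ext, "unknown")


params = parse_parameters(["input=reads.fastq", "species=dm3"])
input_format = infer_format(params["input"])
```

Splitting at the first '=' only keeps values such as rfam:RFam:15:5 intact, since they contain no further '=' but do contain other separators.
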
The program uses the following input parameters:

4.1 Mandatory parameters

4.2 Optional parameters

4.3 An example

The following command analyses the example data file provided with the start-up DB. Note that you need to adapt dbPath and bowtiePath to your local values. For very large input files, it may be necessary to increase the available memory (the -Xmx option of the Java VM).
java -Xmx2000m -jar miRanalyzer.jar input=SRR069503.rc dbPath=/home/user/miRanalyzer species=dm3 speciesShort=dme kingdom=animal bowtiePath=/usr/local/bin/bowtie64 translibs=rfam:RFam:15:5
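
When miRanalyzer is launched from a script, the same command can be assembled from a dictionary of parameters. The following Python sketch only builds the argument list; the helper name build_command is hypothetical, and the parameter values are the ones from the example above, which must be adapted to the local installation.

```python
from typing import Dict, List


def build_command(params: Dict[str, str],
                  jar: str = "miRanalyzer.jar",
                  max_heap: str = "2000m") -> List[str]:
    """Return the java argument list for a set of name=value parameters."""
    cmd = ["java", f"-Xmx{max_heap}", "-jar", jar]
    cmd += [f"{name}={value}" for name, value in params.items()]
    return cmd


# Values copied from the example command; adapt dbPath and bowtiePath.
params = {
    "input": "SRR069503.rc",
    "dbPath": "/home/user/miRanalyzer",
    "species": "dm3",
    "speciesShort": "dme",
    "kingdom": "animal",
    "bowtiePath": "/usr/local/bin/bowtie64",
    "translibs": "rfam:RFam:15:5",
}
command = build_command(params)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to start the analysis.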

5 miRanalyzer output

miRanalyzer produces several different output files. In the following, * can be mature, maturestar, matureunobs, hairpin or any of the transcribed libraries.

5.1 *_unique.txt

A summary of the reference sequences which have been uniquely matched, meaning that all of their reads map to this reference sequence and to no other. The following columns exist:

5.2 *_ambig.txt

Sometimes a read maps with the same quality (number of mismatches and length) to more than one reference sequence. In such cases, miRanalyzer groups those reference sequences together and reports them as ambiguous matches. We do so because the members of a family sometimes have very similar sequences and cannot be readily distinguished. However, by means of these ambiguous matches at least the microRNA family can be detected. The format of the file is the same as that of the “*_unique” file.
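
The grouping idea can be illustrated with a short Python sketch. It is a simplification rather than miRanalyzer's actual implementation: it classifies each read on its own, the function name is hypothetical, and the input is assumed to list, for every read, the reference sequences it matches with equal best quality.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def split_unique_and_ambiguous(
    best_hits: Dict[str, Iterable[str]],
) -> Tuple[Dict[str, List[str]], Dict[FrozenSet[str], List[str]]]:
    """Separate reads with one best reference from reads with several.

    Reads with several equally good references are collected under the
    set of those references, so similar family members form one group.
    """
    unique = defaultdict(list)
    ambiguous = defaultdict(list)
    for read, refs in best_hits.items():
        ref_set = frozenset(refs)
        if len(ref_set) == 1:
            unique[next(iter(ref_set))].append(read)
        elif ref_set:
            ambiguous[ref_set].append(read)
    return dict(unique), dict(ambiguous)


# Hypothetical read and reference names, for illustration only.
hits = {
    "read1": ["mir-a"],
    "read2": ["mir-b1", "mir-b2"],
    "read3": ["mir-b2", "mir-b1"],
}
unique, ambiguous = split_unique_and_ambiguous(hits)
```

In this example, read2 and read3 end up in the same group because they hit the same set of references, regardless of the order in which the hits are listed.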

5.3 *_reads.txt

A summary at the read level.

5.4 *_parsed.txt

The parsed bowtie output for the library. It holds just the longest alignments with the lowest number of mismatches (see Bowtie manual for more details).
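
A filter of this kind could look like the following Python sketch. The Alignment record and the function name are hypothetical, and the sketch assumes that alignment length takes priority over the number of mismatches; the manual does not specify the order in which the two criteria are applied.

```python
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read: str
    reference: str
    length: int
    mismatches: int


def keep_best(alignments: List[Alignment]) -> List[Alignment]:
    """Keep, per read, the longest alignments with the fewest mismatches.

    Assumption: length is compared first, mismatches second.
    """
    best: Dict[str, Tuple[int, int]] = {}
    for aln in alignments:
        key = (aln.length, -aln.mismatches)
        if aln.read not in best or key > best[aln.read]:
            best[aln.read] = key
    return [a for a in alignments
            if (a.length, -a.mismatches) == best[a.read]]


# Hypothetical alignments, for illustration only.
alignments = [
    Alignment("read1", "refA", 22, 0),
    Alignment("read1", "refB", 22, 1),
    Alignment("read1", "refC", 20, 0),
    Alignment("read2", "refA", 18, 1),
    Alignment("read2", "refD", 18, 1),
]
best_alignments = keep_best(alignments)
```

Ties are kept on purpose: read2 retains both of its alignments, which is the situation that leads to ambiguous matches.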

5.5 *_grouped.txt

All reference sequences mapped by at least one read (no distinction between unique and ambiguous matches!).

5.6 newMicroRNA.txt

The “newMicroRNA.txt” files hold a summary of the newly predicted microRNAs. Depending on the configuration, some reads which mapped to the transcriptome or RFam are also used for the prediction of new microRNAs. Those reads might be more prone to cause false positive predictions; therefore, the file “newMicroRNA_pure.txt” reports only those microRNAs that consist of previously unmapped reads. The columns of the files are:

5.7 Candidates.txt

“Candidates.txt” files contain additional information on the newly predicted microRNAs. The first column is the name of the candidate. It is followed by a string which holds, separated by “;”: the precursor sequence, the secondary structure, the minimum free energy of the structure, the chromosome, the chromosomal start coordinate, the chromosomal end coordinate, the number of times it was predicted and the reads which compose this new microRNA. The different reads are separated by ’:’. The information for each read is the following (separated by ’@’): read sequence, position in precursor (0-based), read count and alignment length.
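
The nested layout described above can be read with a small parser such as the following Python sketch. The class and function names are hypothetical, and the sketch assumes that the candidate name and the “;”-separated string are divided by a tab, which is not stated in this manual. The example line is made up for illustration.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    sequence: str
    position: int          # 0-based position in the precursor
    count: int
    alignment_length: int


class Candidate(NamedTuple):
    name: str
    precursor: str
    structure: str
    free_energy: float
    chromosome: str
    start: int
    end: int
    times_predicted: int
    reads: List[Read]


def parse_candidate(line: str) -> Candidate:
    """Parse one line of a Candidates.txt file.

    Assumption: a tab separates the name from the ';'-separated string.
    """
    name, info = line.rstrip("\n").split("\t", 1)
    fields = info.split(";", 7)
    if len(fields) != 8:
        raise ValueError(f"expected 8 ';'-separated fields, got {len(fields)}")
    reads = []
    for entry in fields[7].split(":"):
        seq, pos, count, aln_len = entry.split("@")
        reads.append(Read(seq, int(pos), int(count), int(aln_len)))
    return Candidate(name, fields[0], fields[1], float(fields[2]), fields[3],
                     int(fields[4]), int(fields[5]), int(fields[6]), reads)


# Made-up example line, not real miRanalyzer output.
line = "cand_1\tGGGAAAUCCC;(((....)));-1.5;chr2L;100;109;2;GGGAAA@0@7@6:AUCCC@5@3@5"
candidate = parse_candidate(line)
```

Each Read keeps its 0-based position, so candidate.precursor[r.position:r.position + len(r.sequence)] can be compared with r.sequence as a quick consistency check.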