Reference manual

Introduction

MethFlow is an optimized, open-source pipeline that performs DNA methylation profiling, detection of sequence variants, differential methylation analysis and full integration with our methylation database, NGSmethDB. Briefly, the pipeline performs the following steps:

Format conversion: convert SRA files to FASTQ by means of SRA Toolkit. This only applies if the input data comes from the Sequence Read Archive (SRA) public repository.
Adapter and low-quality base trimming by means of Trimmomatic.
Alignment against one or two assemblies: firstly, short reads are aligned against the first assembly (assembly 1 from now on) producing uniquely-mapped, multiple-mapped and unmapped reads. Uniquely-mapped reads are kept to use in the next step. Secondly, multiple-mapped and/or unmapped reads are aligned against the second assembly (assembly 2 from now on) producing uniquely-mapped, multiple-mapped and unmapped reads. Uniquely-mapped reads are merged with previously obtained uniquely-mapped reads and used in the next step. Bismark is used as aligner.
Elimination of known technical artifacts by BSeQC.
Detection of DNA methylation and sequence variants by MethylExtract.
Retrieval of methylation maps from NGSmethDB.
Differential methylation analysis by methylKit and MOABS, followed by generation of a consensus of both.

Implementations

The MethFlow pipeline was implemented in three ways. First, we provide the software optimized to run in a powerful and user-friendly cloud environment. Second, for users requiring the maximal level of data privacy, we developed MethFlow^VM, a ready-to-use, fully configured virtual machine which is able to run on most operating systems (Windows, Linux or Mac). With MethFlow^VM the user no longer needs to upload private data to any public server. Finally, advanced users can download the source code from a public repository, which allows installing and customizing MethFlow on any operating system. See Links and downloads for links to connect to the cloud app or to download the virtual machine or the standalone programs. The cloud app contains an intuitive menu which facilitates its use. The instructions in this manual are for the command line of the VM and standalone programs.

MethFlow app in the cloud

Connect to precisionFDA.

MethFlow Virtual Machine

Install VirtualBox.
Install VirtualBox Extension Pack.
Download MethFlow^VM (mirror).
Import MethFlow^VM to VirtualBox by double-clicking.
Optional: add a shared folder (strongly recommended).
Run MethFlow^VM.

MethFlow standalone programs

Dependencies

All these programs must be in the PATH.

Local Installation

Execute the following commands:

git clone https://github.com/bioinfoUGR/MethFlow.git
cd MethFlow
chmod +x MethFlow MethFlow_api MethFlow_diffmeth MethFlow_manager Trimmomatic.sh

In the Trimmomatic.sh file, replace the value of TRIMMOMATIC_PATH with the path to the Trimmomatic.jar file.
Add Trimmomatic.sh to the PATH.
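
A quick way to confirm that the required programs can be found is to query the PATH for each executable. The following is a minimal sketch, not part of MethFlow: the helper name check_in_path is hypothetical, and the executable names in the usage comment (fastq-dump, bismark, bowtie2, fastqc) are assumptions based on the usual names of those tools, so adjust them to your installation.

```shell
# check_in_path: hypothetical helper, not part of MethFlow.
# Prints every given program name that cannot be found in the PATH
# and returns a non-zero status if at least one is missing.
check_in_path() {
    missing=0
    for prog in "$@"; do
        if ! command -v "$prog" >/dev/null 2>&1; then
            echo "not in PATH: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (program names are assumptions, see above):
#   check_in_path MethFlow Trimmomatic.sh fastq-dump bismark bowtie2 fastqc
```

If a program is reported as missing, add its folder to the PATH (for example in your shell start-up file) and run the check again.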

The local database

Set your working folder

At first startup, you will be asked which working folder you want to use. If you ignore this question, your home folder (/home/methflow in MethFlow^VM) will be used as the working folder. We strongly recommend using a shared folder as the working folder.

If you want to change the working folder, open a terminal and type the following command:

MethFlow_manager working_folder

Set your assembly collection

Tell MethFlow where the assembly collection is by typing:

MethFlow_manager assembly_collection Assemblies

This command looks for a folder named Assemblies inside the working folder. If the desired folder is outside the working folder, use the --out option.

Set your adapter collection

Tell MethFlow where the adapter collection is by typing:

MethFlow_manager adapter_collection Adapters

This command looks for a folder named Adapters inside the working folder. If the desired folder is outside the working folder, use the --out option.

Set your root input folder

Tell MethFlow where to look for the input folders:

MethFlow_manager root_input_folder Inputs

This command looks for a folder named Inputs inside the working folder. If the desired folder is outside the working folder, use the --out option.

Set your root output folder

Tell MethFlow where to keep the output folders:

MethFlow_manager root_output_folder Outputs

This command creates a folder named Outputs inside the working folder. If the desired folder is outside the working folder, use the --out option.

Set intermediates folder

Tell MethFlow where to keep the intermediate output folders:

MethFlow_manager intermediates_folder Intermediates

This command creates a folder named Intermediates inside the root output folder. If the desired folder is outside the root output folder, use the --out option.

Set plots folder

Tell MethFlow where to keep the plot output folders:

MethFlow_manager plots_folder Plots

This command creates a folder named Plots inside the root output folder. If the desired folder is outside the root output folder, use the --out option.

Set meth folder

Tell MethFlow where to keep the methylation maps:

MethFlow_manager meth_folder Meth

This command creates a folder named Meth inside the root output folder. If the desired folder is outside the root output folder, use the --out option.

Set diffmeth folder

Tell MethFlow where to keep the differential methylation maps:

MethFlow_manager diffmeth_folder Diffmeth

This command creates a folder named Diffmeth inside the root output folder. If the desired folder is outside the root output folder, use the --out option.
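
The folder layout described in this section can be pictured with the following sketch. It only creates directories with the example names used above; the WORKING variable is a placeholder (a temporary directory by default), and the MethFlow_manager calls are listed as comments because they need a MethFlow installation.

```shell
# WORKING is a placeholder; by default a temporary directory is used
# so that the sketch can be tried without touching real data.
WORKING="${WORKING:-$(mktemp -d)}"

# Folders that MethFlow looks for inside the working folder.
mkdir -p "$WORKING/Assemblies" "$WORKING/Adapters" "$WORKING/Inputs"

# Output folders. MethFlow_manager creates these itself, so making
# them by hand is optional and shown only to illustrate the layout.
mkdir -p "$WORKING/Outputs/Intermediates" "$WORKING/Outputs/Plots" \
         "$WORKING/Outputs/Meth" "$WORKING/Outputs/Diffmeth"

# With MethFlow installed, the commands from this section would be:
#   MethFlow_manager working_folder
#   MethFlow_manager assembly_collection Assemblies
#   MethFlow_manager adapter_collection Adapters
#   MethFlow_manager root_input_folder Inputs
#   MethFlow_manager root_output_folder Outputs
#   MethFlow_manager intermediates_folder Intermediates
#   MethFlow_manager plots_folder Plots
#   MethFlow_manager meth_folder Meth
#   MethFlow_manager diffmeth_folder Diffmeth
```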

Launch MethFlow

Using default options

Now, launch MethFlow with default options:

MethFlow

This command looks inside the working folder and asks you for:

Assembly 1 folder. This folder should contain FASTA or multiFASTA files and must be inside the assembly collection folder. Optionally, it could contain Bismark Bowtie2 indexes.
Adapter file. This file must be a multiFASTA file inside the adapter collection.
Input data folder. This folder should contain all the input datasets of a sample in SRA, FASTQ, SAM or BAM format (all files must be in the same format) and must be inside the root input folder.
Output data folder. This folder will be created inside the root output folder.

Methylation maps calculated by MethFlow are located in the meth folder inside the root output folder.

Using two assemblies

If you want to use a second assembly, launch MethFlow as follows:

MethFlow --assembly2

In this case, MethFlow asks you for:

Assembly 1 folder. This folder should contain FASTA or multiFASTA files and must be inside the assembly collection folder. Optionally, it could contain Bismark Bowtie2 indexes.
Assembly 2 folder. This folder should contain FASTA or multiFASTA files and must be inside the assembly collection folder. Optionally, it could contain Bismark Bowtie2 indexes.
The type of reads you want to align against assembly 2: multiple-mapped reads, unmapped reads or both.
Adapter file. This file must be a multiFASTA file inside the adapter collection.
Input data folder. This folder should contain all the input datasets of a sample in SRA, FASTQ, SAM or BAM format (all files must be in the same format) and must be inside the root input folder.
Output data folder. This folder will be created inside the root output folder.

Methylation maps calculated by MethFlow are located in the meth folder inside the root output folder.

Enable NGSmethDB API client

Use the option --enable_api to activate the NGSmethDB API client functionality. MethFlow asks you which samples you want to download from NGSmethDB. Methylation maps downloaded from NGSmethDB are located in the meth folder inside the root output folder.

Enable differential methylation analysis

Use the option --enable_diffmeth to activate the differential methylation analysis functionality. MethFlow asks you which samples to compare in the differential methylation analysis. Differential methylation maps calculated by MethFlow are located in the Diffmeth folder inside the root output folder.

Analyze the results files

The output folder of every analyzed input sample directory contains a number of folders:

Methylation_Maps folder. With three folders inside:
- MethylExtract folder. It contains between one and three methylation map files, one for each analyzed methylation context (see Change parameters and launch options): CG.output, CHG.output and CHH.output. These files contain the methylation profiling results at single-cytosine resolution: the methylation context, the position on the genome, the number of reads where this cytosine is methylated, the coverage and the sequencing quality. For a full description of this format visit the manual of MethylExtract.
- methylKit folder. The methylation profiling results in methylKit input format. This format can also be used by MethylSig. For a full description of this format visit the manual of methylKit.
- methylKit_plots folder. The methylation ratio and coverage distributions of the input sample, plotted by methylKit. Files are in PDF format. To get these plots in other formats, see Downstream analysis.
Differential_Methylation_Maps folder. With three folders inside:
- methylKit_DMC_maps. It contains one file for each pair of methylation maps analysed.
- MOABS_DMC_maps. It contains one file for each pair of methylation maps analysed.
- consensus_DMC_maps. It contains one file for each pair of methylation maps analysed.

SNVs folder. There is only one file here: SNVs.vcf. This file contains the sequence variants detected in the input sample against the reference genome assembly. The VCF format specifications can be seen here.
Logs folder. Contains a folder for each program used during the pipeline. Each of these folders has two logs for every processed file: one log recording the standard output and the other recording the standard error.
CITE.txt file. A text file with the references that you should cite if you use MethFlow, including all references to third-party software used in a particular process.
FASTQ folder (for SRA input files). It stores the FASTQ files converted from the original SRA files. There will be either one or two FASTQ files for each SRA file, depending on whether the sequencing reads are single-end or paired-end. Only if the input sample format is SRA or FASTQ and --adapter_trimmed is not specified.
trimmed_FASTQ folder. It stores the trimmed FASTQ files, i.e. the Trimmomatic output files. Only if the input sample format is SRA or FASTQ.
FastQC folder. With two folders inside:
- FASTQ_FASTQC folder. Contains the quality report of the FASTQ files before trimming. There is one report for each FASTQ file. Only if input sample format is SRA or FASTQ.
- trimmed_FASTQ_FastQC folder. Contains the quality report of the FASTQ files after trimming. There is one report for each FASTQ file. Only if the input sample format is SRA or FASTQ and --adapter_trimmed is not specified.
ambiguous_FASTQ folder. It stores the FASTQ files ambiguously mapped against the first assembly. There is one file for each FASTQ used during alignment against the first assembly. This folder appears if the input format is SRA or FASTQ and a second assembly is used.
unmapped_FASTQ folder. It stores the FASTQ files with unmapped reads against the first assembly. There is one file for each FASTQ used during alignment against the first assembly. This folder appears if the input format is SRA or FASTQ and a second assembly is used.
BAM folder. Contains BAM files coming from the alignment against the first assembly and, if applicable, the second assembly. There is only one file for each dataset (paired-end data no longer have two files). BAM files from the second assembly alignment are merged, if applicable. Only if the input sample format is SRA or FASTQ.
fixed_SAM folder. Contains SAM files after bisulfite bias fixing. There is only one file for each dataset. Only if --bisulfite_bias_fixed is not set on the command line.
BSeQC_plots folder. Contains plots about the bisulfite bias. There is only one folder for each dataset. Only if --bisulfite_bias_fixed is not set on the command line.

Local settings

The easiest way to use MethFlow is to set the value of certain parameters by means of a settings file. This file can be found in your home directory: $HOME/.methflowrc (note the dot at the beginning), where $HOME is your home directory (e.g. /home/methflow). It is not listed by ls unless you add the -a option.

This file is a text file that can be edited with any plain text editor such as vim or nano:

nano $HOME/.methflowrc

It should contain eight variables:

working: the path of the shared folder to be used.
assemblies: the path of the assemblies folder (see Set your assembly collection).
adapters: the path of the adapter collection.
output: the path of the base output folder.
intermediates: the path where intermediate output files are kept.
plots: the path where plot output files are kept.
meth: the path where methylation maps are kept.
diffmeth: the path where differential methylation maps are kept.

You can modify the variables. If a specified path does not exist or a parameter is missing altogether, MethFlow will ask again on the command line.

Input data

It is highly recommended to provide the input data from the shared folder. The data from different samples must go into separate folders. The input files located within the same folder are interpreted as different runs from the same sample. Accepted formats are SRA, FASTQ, SAM and BAM.

The directory with the sample dataset to be used can be specified in a configuration file (not to be confused with the settings file, see Local settings) or on the command line when you launch MethFlow (see Change parameters and launch options). Otherwise, you will be asked.
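
Because all files of a sample must be in the same format, it can help to check a sample folder before launching the pipeline. The sketch below is a hypothetical helper, not part of MethFlow; it assumes that the format can be recognized from the file extension alone, which will not hold for compressed files or files without an extension.

```shell
# same_format: hypothetical helper, not part of MethFlow.
# Succeeds when every regular file in the given sample folder has the
# same extension (compared case-insensitively), for example all .fastq.
same_format() {
    exts=$(for f in "$1"/*; do
               if [ -f "$f" ]; then
                   b=${f##*/}                  # file name without the folder
                   printf '%s\n' "${b##*.}"    # text after the last dot
               fi
           done | tr '[:upper:]' '[:lower:]' | sort -u)
    # Exactly one distinct extension: non-empty list with no second line.
    [ -n "$exts" ] && [ -z "$(printf '%s\n' "$exts" | sed -n '2p')" ]
}

# Usage (the folder name is a placeholder):
#   same_format Inputs/sample1 || echo "mixed formats in sample folder"
```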

Prepare the assemblies

Each assembly must go into a separate folder inside the assemblies folder. The assembly may consist of a multi-FASTA file or several FASTA files, all contained in the same directory. It may or may not contain Bismark Bowtie2 indexes. If not, the Bismark Bowtie2 indexes will be calculated the first time MethFlow uses the assembly.

The directory with the assembly can be specified in a configuration file (not to be confused with the settings file) or on the command line when you launch MethFlow (see Change parameters and launch options). Otherwise, you will be asked.

You can download some assemblies (including Bismark Bowtie2 indexes) with this command:

MethFlow_manager get_assemblies

The data is then downloaded to the assemblies folder set in .methflowrc.

Launch options

To run the MethFlow pipeline, execute the command MethFlow [arguments] together with the relevant arguments. If you do not specify any arguments, the program enters quick mode, where you will be asked interactively.
There is an auxiliary command, MethFlow_configure [arguments], which can be used to create a configuration file (not to be confused with the settings file). This command does not launch MethFlow; it generates a configuration file with the parameters specified on the command line.
MethFlow can be used in three ways:

Interactive: MethFlow. The program asks you the mandatory arguments through dialogs.
Configuration file: you indicate the arguments to be used in a configuration file created by MethFlow and edited by you. Type MethFlow --config configuration_file to use this mode, where configuration_file is the configuration file previously generated by MethFlow or edited by you.
Command line: MethFlow [arguments]. The arguments are given when launching the program. If any mandatory arguments are missing you will be asked interactively. This mode can be combined with the configuration file mode. In case of conflict, the command-line value of the conflicting argument is used.

Mandatory arguments

Some parameters must be indicated by the user:

input: the path of the input data folder. It must be indicated in a configuration file, on the command line or through a dialog. During the pipeline several properties are detected: the format of the input files, whether they have single-end or paired-end reads, whether they use phred33 or phred64, and the maximum and minimum read length.
adapters: the path of the adapter collection. It must be indicated in the settings file, in a configuration file, on the command line or through a dialog.
assembly: the path of the first assembly folder. It must be indicated in a configuration file, on the command line or through a dialog. During the pipeline it is checked for Bismark Bowtie2 indexes. If there are no indexes within the folder, they will be calculated.
output: the path of the base output folder. It must be indicated in the settings file, in a configuration file, on the command line or through a dialog.

If --assembly2 is used there will be two extra mandatory arguments:

assembly2: the path of the second assembly folder. It must be indicated in a configuration file, on the command line or through a dialog. During the pipeline it is checked for Bismark Bowtie2 indexes. If there are no indexes within the folder, they will be calculated.
use_assembly2_for: indicates which kind of reads will be used for the mapping against the second assembly (ambiguously mapped reads against first assembly, unmapped reads or both). It must be indicated in a configuration file, on the command line or through a dialog. For example, to use it at the command line:

MethFlow --assembly2 <assembly2_path> --use_assembly2_for [ambiguous, unmapped or both]

where <assembly2_path> is the path in the virtual machine for the assemblies you want to use. It is highly recommended that the assemblies folder is within the shared folder.

In addition, in this command you have to choose whether to use ambiguous, unmapped or both kinds of reads.

Optional arguments

Most arguments are optional. When they are not given, MethFlow either calculates or tries to estimate them (like minimum_read_length and threads) or uses the default values. Note that when the parameters are used on the command line, '--' must precede the parameter name. For example, parameter name: 'adapter_trimmed' ➜ on command line: --adapter_trimmed.

adapter_trimmed: a boolean argument (default: off). When on, adapter trimming is skipped.
bisulfite_bias_fixed: a boolean argument (default: off). When on, bisulfite bias fixing is skipped.
library: indicates whether the sequencing library is directional, non-directional or PBAT (options: directional, non_directional or pbat; default: directional). Unfortunately, this argument cannot be estimated before aligning. If you observe a high number of unmapped reads, try changing this argument.
rrbs: a boolean argument (default: off). Indicates that the sequencing technique used is Reduced Representation Bisulfite Sequencing (RRBS). This is taken into account during bisulfite bias fixing. It is recommended to combine it with the not_remove_duplicate argument.
not_seed_mismatch: a boolean argument (default: off). When on, no mismatches are allowed in the seed during alignment. When off, one mismatch is allowed.
seed_length: the length of the seed used during alignment (minimum: 8; maximum: 32; default: 32).
not_remove_duplicate: a boolean argument (default: off). When on, duplicate reads are not removed during profiling. Recommended for RRBS data.
minimum_phred_score: the minimum accepted phred score during trimming and profiling (default: 20). To set it separately for each step, use advanced arguments (see Manipulate the configuration file).
minimum_read_length: the minimum accepted read length during trimming (default: calculated as half of the original length of the reads).
minimum_coverage: the minimum accepted coverage during profiling (default: 1).
methylation_context: the methylation context to analyze during profiling (options: CG, CHG, CHH or ALL; default: CG).
threads: the maximum number of threads to be used (default: calculated as the number of CPUs of the virtual machine; minimum: 2).
intermediates: the path where intermediate output files are kept (default: as part of the output folder).
disable_plots: a boolean option to switch off the plotting functions (not used by default).
plots: the path where plot output files are kept (default: as part of the output folder).
methylomes: the path where methylation maps are kept (default: as part of the output folder).
diffmeth: the path where differential methylation maps are kept (default: as part of the output folder).
enable_api: a boolean option to switch on the NGSmethDB API client (not used by default).
api_conf: use an NGSmethDB API configuration file instead of asking for the samples to download (not used by default).
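
The default of minimum_read_length is described above as half of the original read length. How MethFlow measures the original length is not stated in this manual; the sketch below assumes it is the longest read in a FASTQ file with four lines per record and uses integer division. The function name is hypothetical and the code is an illustration, not MethFlow's own implementation.

```shell
# half_max_read_length: hypothetical helper, not part of MethFlow.
# Reads a FASTQ file with four lines per record and prints half of the
# longest read length (integer division). Assumption: the "original
# length" is the longest read found in the file.
half_max_read_length() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print int(max / 2) }' "$1"
}

# Usage (the file name is a placeholder):
#   half_max_read_length Inputs/sample1/run1.fastq
```

The resulting number could be passed explicitly with --minimum_read_length if you prefer not to rely on the default.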

Use command line arguments

All arguments described above can be used in the command to run or configure MethFlow, adding a double hyphen before the name of the argument. For example:

--input <path>, --adapter_trimmed, --library non_directional or --threads 16.

If you run MethFlow without specifying a configuration file, you will be asked for all mandatory arguments not specified on the command line or in the settings file. Optional arguments not indicated will take their default value.
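
As an illustration, the following sketch assembles a complete command line from arguments documented in this manual and prints it instead of running it. The paths are hypothetical placeholders, and how MethFlow resolves relative paths is not specified here, so treat the values as examples only.

```shell
# Hypothetical paths; replace them with your own folders.
INPUT="Inputs/sample1"
ASSEMBLY="Assemblies/assembly1"

# Combine a few documented arguments, each preceded by a double hyphen.
CMD="MethFlow --input $INPUT --assembly $ASSEMBLY --library non_directional --methylation_context CG --threads 16"

# Print the command so the sketch works without a MethFlow
# installation; copy the printed line into a terminal to launch it.
echo "$CMD"
```

Any mandatory argument left out of such a command (here, the adapters and the output folder, unless they are set in the settings file) would be requested interactively.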

Downstream analysis

In this section, we explain how to do several downstream analyses with the virtual machine and standalone implementations.

methylKit downstream analysis

Using methylKit you can do many downstream analyses, such as comparing the methylation maps of different samples by means of a Pearson correlation matrix and sample clustering.

MethFlow automatically converts the methylation maps from MethylExtract output format to methylKit input format during the pipeline (see Analyze the results files). In any case, you can convert MethylExtract output files into methylKit input files at any time by typing this command:

me2mk -i MethylExtract_Output_File -o methylKit_Input_File -c Methylation_Context (CG, CHG or CHH) [--destrand]

-i, -o and -c are mandatory arguments. Optionally, you can use --destrand to merge the data from both Watson and Crick strands (default: off).

To do a quick descriptive analysis using methylKit you can use the following commands:

Methylation ratio distribution:

mk_methRatio_distribution -i methylKit_Input_File -o Image_Output_File [-f format (pdf, ps, svg, png, jpeg, bmp or tiff)] [-a assembly] [-m methylation_context]

-i and -o are mandatory arguments. -f takes pdf as default value.

Coverage distribution:

mk_coverage_distribution -i methylKit_Input_File -o Image_Output_File [-f format (pdf, ps, svg, png, jpeg, bmp or tiff)] [-a assembly] [-m methylation_context]

-i and -o are mandatory arguments. -f takes pdf as default value.

Pearson Correlation Matrix:

mk_pearson_correlation -i methylKit_Input_Files -o Image_Output_File [-f format (pdf, ps, svg, png, jpeg, bmp or tiff)] [-a assembly] [-m methylation_context]

-i and -o are mandatory arguments. -f takes pdf as default value. You should indicate more than one input file, separated by spaces.

Clustering Tree:

mk_clustering -i methylKit_Input_Files -o Image_Output_File [-f format (pdf, ps, svg, png, jpeg, bmp or tiff)] [-a assembly] [-m methylation_context]

-i and -o are mandatory arguments. -f takes pdf as default value. You should indicate more than one input file, separated by spaces.

Principal Component Analysis:

mk_pca -i methylKit_Input_Files -o Image_Output_File [-x a_PC_for_x-axis] [-y another_PC_for_y-axis] [--screenplot] [-f format (pdf, ps, svg, png, jpeg, bmp or tiff)] [-a assembly] [-m methylation_context]

-i and -o are mandatory arguments. -f takes pdf as default value. -x and -y take 1 and 2 as default values, respectively. You should indicate more than one input file, separated by spaces. You can add the optional argument --screenplot to get the scree plot. Otherwise you get the PC indicated with -x plotted against the PC indicated with -y.

For further details on the output, see the manual of methylKit.

Convert to BED and other formats

In addition to methylKit input format, you can convert MethylExtract output files to other formats such as BedGraph, BED6 or bigWig:

BedGraph, BED6 and BED6+6:

me2bed -i MethylExtract_Output_File -o BED_Output_File -f Output_Format (bedgraph, bed6, bed6+6 or ucsc) -c Methylation_Context (CG, CHG or CHH)

-i, -o, -f and -c are mandatory arguments.

You can find the specifications of BED and BedGraph formats here.

The score column of the BED file and the dataValue column of the BedGraph file contain a numerical value from 0 to 1000. This value is the methylation level, where 0 is completely unmethylated and 1000 is completely methylated. The six additional columns of the BED6+6 are:

Watson METH: number of reads methylated for this cytosine (referred to the Watson strand).
Watson COVERAGE: reads covering the cytosine in this sequence context (referred to the Watson strand).
Watson QUAL: PHRED score average for the reads covering the cytosine (referred to the Watson strand).
Crick METH: number of reads methylated for this cytosine (referred to the Crick strand).
Crick COVERAGE: reads covering the cytosine in this context (referred to the Crick strand).
Crick QUAL: PHRED score average for the reads covering the cytosine (referred to the Crick strand).

For more details of these values visit the manual of MethylExtract.
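
The relation between read counts and the 0 to 1000 score can be sketched as follows. This assumes that the score is the methylated fraction (methylated reads divided by coverage) scaled to 1000 and rounded to the nearest integer; the exact formula and rounding used by me2bed are not documented in this manual, and the function name is hypothetical.

```shell
# bed_score: hypothetical helper, not part of MethFlow.
# Arguments: number of methylated reads, coverage.
# Prints the methylation level on a 0 to 1000 scale, assuming
# score = round(1000 * methylated / coverage).
bed_score() {
    awk -v meth="$1" -v cov="$2" 'BEGIN {
        if (cov <= 0) exit 1        # undefined without coverage
        printf "%d\n", int(1000 * meth / cov + 0.5)
    }'
}

# Usage: a cytosine methylated in 3 of 4 covering reads.
#   bed_score 3 4
```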

bigBed:

First of all, convert your file to BED6 format. Then get the chromosome sizes file from the assembly multi-FASTA file:

faidx multi-FASTA_input_file -i chromsizes > chrom.sizes

Finally, convert the BED6 file to bigBed:

bedToBigBed -type=bed6 bed6_input_file chrom.sizes bigBed_output_file

You can find the specifications of bigBed format here.
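
If faidx is not available, the chromosome sizes file can also be derived directly from the multi-FASTA file. The following awk sketch is an alternative to the faidx command, not something MethFlow provides; the function name is hypothetical. It assumes that the chromosome name is the first word of each header line and prints the name and the sequence length separated by a tab, which is the two-column layout of a chrom.sizes file.

```shell
# fasta_chrom_sizes: hypothetical helper, not part of MethFlow.
# Prints "<name><TAB><length>" for every sequence of a multi-FASTA file.
fasta_chrom_sizes() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        { sub(/\r$/, ""); len += length($0) }   # ignore Windows line ends
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage (file names are placeholders):
#   fasta_chrom_sizes multi-FASTA_input_file > chrom.sizes
```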

bigWig:

First of all, convert your file to BedGraph format. Then get the chromosome sizes file from the assembly multi-FASTA file:

faidx multi-FASTA_input_file -i chromsizes > chrom.sizes

Finally, convert the BedGraph file to bigWig:

bedGraphToBigWig bedGraph_input_file chrom.sizes bigWig_output_file

You can find the specifications of bigWig format here.

Send to a Galaxy Server

Galaxy gives us the opportunity to do a myriad of analyses. An easy way to send your data to Galaxy from inside the virtual machine is to use the helper tool upload2galaxy.

To use this tool, you first need to get your API key from the Galaxy Server you want to use. To get your API key, open a web browser, go to the URL of the Galaxy Server you want to use and log in. In the top menu, go to

User ➜ Preferences ➜ Manage your API keys

Now click on the button Generate a new key now and copy the Current API key. You will use this key to send files to your Galaxy account.

To send a file to your Galaxy account, type:

upload2galaxy [-u URL_of_a_Galaxy_Server] -k API_Key_of_your_Galaxy_Account -i Path_of_the_File_to_Upload [-n Name_of_the_Galaxy_History_to_be_created]

-i and -k are mandatory arguments. By default -u is the URL of the Galaxy Main Server and -n is MethFlow.

To check your uploaded file, in the Galaxy website go to

User ➜ Saved Histories

And click on MethFlow or the name indicated with -n.

Upload to UCSC Genome Browser

One of the best ways to visualize your data is with the UCSC Genome Browser. There you can view your data along chromosomes and compare it with a myriad of other genomic annotations.

By following these instructions, you can upload your files to UCSC Genome Browser:

Convert your data to UCSC BED6 format by typing:

me2bed -i MethylExtract_Output_File -o BED_Output_File -f ucsc -c Methylation_Context (CG, CHG or CHH)

Note that -f ucsc is required for you to visualize your data correctly.

Open a browser and go to the UCSC Genome Browser website
Go to My Data ➜ My Sessions in the top menu and log in or create an account (your data will remain online after you log out).
Once logged in, go to My Data ➜ Custom Tracks in the top menu.
Select a file from your local disk and submit it.
Once uploaded, select view in Genome Browser and click on go.

Now you can browse your data. If you want to upload more files:

Go to My Data ➜ Custom Tracks in the top menu.
Click on add custom tracks.
Select a file from your local disk and submit it.
Once uploaded, select view in Genome Browser and click on go.

External program manuals and documentation

fastq-dump (SRA Toolkit): we use this program to convert files in SRA format to FASTQ format. Documentation.
Trimmomatic: we use this program to remove the adapter and low quality bases at the 3’ end. Manual.
Bismark: we use this program to align reads against three-letter reference assemblies. Manual.
Bowtie2: it is the aligner that we use in Bismark. Manual.
BSeQC: we use this program to fix the bisulfite bias due to technical factors. Documentation.
MethylExtract: the core of MethFlow. We use this program to profile methylation levels and single nucleotide variants. Manual.
FastQC: the program that we use to check the quality of FASTQ and trimmed FASTQ files. Documentation.
methylKit: one of the programs used in differential methylation analysis and the main program used in downstream analysis. Manual.
MOABS: one of the programs used in differential methylation analysis. Documentation.