Manual

[How to cite NGSmethDB]

Introduction

NGSmethDB is a dedicated database to store whole-genome methylation maps or methylomes (1–3). Methylomes are obtained by single-cytosine methylation profiling based on high-throughput sequencing (NGS) of sodium-bisulfite treated DNA (4, 5). Furthermore, NGSmethDB includes two additional datasets: 1) methylation segments (i.e. genome regions of homogeneous methylation); and 2) differentially methylated single-cytosines.

Methylation profiling

Publicly available short-read data sets from NGS bisulfite sequencing projects for different cell lines, primary and pathological tissues are downloaded mainly from NCBI GEO (6) and the ROADMAP project (7).

The same pipeline, MethFlow (8), was used to produce high-quality methylomes under uniform conditions, which enables comparative downstream analyses. MethFlow integrates the stringent quality controls of MethylExtract (9) with several third-party tools. The first step was to run Trimmomatic (10) for adapter trimming and removal of low-quality 3' ends. The reads were then aligned to a three-letter genome with Bismark (11), which uses Bowtie2 (12) as the aligner. The next step was to run BSeQC (13) to eliminate known technical artefacts. Finally, MethylExtract was run for methylation calling and genotyping. MethylExtract minimizes several important error sources, such as sequencing errors, bisulfite failure, clonal reads and single-nucleotide variants. The result of the entire process is a high-quality, whole-genome methylation map.

Database back-end

Single-cytosine methylation and differential methylation data are stored hierarchically in MongoDB (https://www.mongodb.com/), a NoSQL database with JSON-formatted documents and dynamic schemas. This makes joining and comparing data from different samples easier and faster. Each assembly is stored in its own database, and inside every database there is a collection for each chromosome. Within a collection, each JSON-like document represents a cytosine and contains, hierarchically, all genotype, differential methylation and methylation data of all individuals and samples. The first level is the data type (genotype, methylation or differential methylation), the second is the individual, the third is the sample and the fourth holds the data themselves.
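The hierarchy can be pictured as nested dictionaries. The sketch below is only an illustration: the document is a hand-made stand-in modeled on the genotype field that the API returns (see the HTTP access example in this manual), not a verbatim MongoDB document, and lookup is a hypothetical helper that is not part of NGSmethDB.

```python
# Hypothetical cytosine document: data type -> individual -> sample -> value.
# The layout mirrors the "genotype" field of the API output; the stored
# MongoDB documents may contain additional or differently named fields.
cytosine = {
    "chrom": "chr1",
    "pos": 30000112,
    "genotype": {
        "STL001": {"gastric": "CR"},
        "STL002": {"gastric": "CG"},
    },
}


def lookup(doc, data_type, individual, sample, default=None):
    """Walk data type -> individual -> sample and return the stored value."""
    node = doc
    for key in (data_type, individual, sample):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


print(lookup(cytosine, "genotype", "STL001", "gastric"))
```

Returning a default instead of raising reflects that not every individual or sample has data at every cytosine.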

Data access and visualization

The data stored in NGSmethDB can be accessed as follows.

Downloading database dumps for entire methylomes

Complete methylomes compressed with bzip2 (http://www.bzip.org/) can be downloaded from the Database dumps page. Once decompressed, you get a tab-delimited file that can be opened directly in any spreadsheet for downstream analyses.
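Whole-genome dumps can be too large for a spreadsheet, so it may be convenient to stream them instead. The following is a minimal sketch that reads a bzip2-compressed, tab-delimited file row by row without decompressing it to disk first. The function name read_dump is hypothetical, and no assumption is made about the number or meaning of the columns.

```python
import bz2
import csv


def read_dump(path, max_rows=None):
    """Yield rows (lists of strings) from a bzip2-compressed, tab-delimited file."""
    with bz2.open(path, mode="rt", newline="") as handle:
        for index, row in enumerate(csv.reader(handle, delimiter="\t")):
            if max_rows is not None and index >= max_rows:
                break
            yield row
```

For instance, list(read_dump(path, max_rows=5)) would return the first five rows of the dump stored at path, which is a quick way to inspect the column layout.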

Using the web form

On the database web form (Database access -> Web access) you can select a chromosome or chromosome region to get a table in BED format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) with the corresponding cytosine methylation levels. The table can be downloaded and opened directly in any spreadsheet for downstream analyses.

Using track hubs

A third way to access NGSmethDB data is through standard track hubs (14), which provide an efficient mechanism for visualizing remotely hosted, Internet-accessible collections of genome annotations. Hub datasets can then be fully integrated into the University of California Santa Cruz (UCSC) Genome Browser (15). In this way, NGSmethDB data can be visualized and compared with a plethora of third-party annotations. In addition, UCSC tools such as the Table Browser or Data Integrator provide a way to 1) retrieve detailed NGSmethDB datasets for any genome, chromosome, genome region, gene, SNP or any other genome element; 2) combine methylation data and any third-party annotation into a single dataset based on a specific join criterion (for example, to find the methylation state of cytosines that intersect with CpG islands); and 3) directly upload NGSmethDB datasets to public bioinformatics platforms such as Galaxy (16), GenomeSpace (17) or GREAT (18) for further downstream analyses.

Programmatic access to the database: the NGSmethDB API server

A Node.js (https://nodejs.org/en/) application (NGSmethDB API server) has been implemented on our server, which provides access to the MongoDB database via a RESTful API (19). This allows for three interactive or programmatic ways to access NGSmethDB data:

HTTP access

A first way to access the NGSmethDB API Server is by issuing simple HTTP queries in the navigation bar of a web browser. The server responds by sending you a JSON file. Examples of the available commands follow.

This access method is recommended only for retrieving data on a single position or on regions of moderate size, such as exons or genes. Querying larger regions may overload your browser; for those, use the track hubs or the programmatic access instead.

Get API server content information
  • Get API server and API client current versions

http://bioinfo2.ugr.es:3333/NGSmethAPI/version

  • Get API server and API client change log

http://bioinfo2.ugr.es:3333/NGSmethAPI/changelog

  • Get list of available assemblies

http://bioinfo2.ugr.es:3333/NGSmethAPI/info

  • Get list of available individuals and samples

http://bioinfo2.ugr.es:3333/NGSmethAPI/<assembly>/samples

Where <assembly> is the selected assembly (e.g. hg38).

  • Get list of available chromosomes and chromosome length

http://bioinfo2.ugr.es:3333/NGSmethAPI/<assembly>/chroms

Where <assembly> is the selected assembly (e.g. hg38).
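The information endpoints listed above can also be queried from a script. The sketch below builds the URLs exactly as shown in this manual and downloads the JSON reply with the Python standard library. The function names info_url and fetch_json are hypothetical, and the structure of each reply is not assumed.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://bioinfo2.ugr.es:3333/NGSmethAPI"


def info_url(endpoint, assembly=None):
    """Build the URL of an information endpoint.

    'version', 'changelog' and 'info' take no assembly;
    'samples' and 'chroms' require one (e.g. 'hg38').
    """
    if endpoint in ("version", "changelog", "info"):
        return f"{BASE_URL}/{endpoint}"
    if endpoint in ("samples", "chroms"):
        if not assembly:
            raise ValueError(f"'{endpoint}' needs an assembly, e.g. 'hg38'")
        return f"{BASE_URL}/{assembly}/{endpoint}"
    raise ValueError(f"unknown endpoint: {endpoint!r}")


def fetch_json(url, timeout=30):
    """Download a URL and decode its JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A call such as fetch_json(info_url("chroms", "hg38")) would then return the decoded reply for the hg38 chromosomes, provided the server is reachable.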

Single-cytosine methylation and differentially methylated cytosines
  • Get data from a region

http://bioinfo2.ugr.es:3333/NGSmethAPI/<assembly>/<chrom:start-end>

Where <assembly> is the selected assembly (e.g. hg38) and <chrom:start-end> is the selected genomic region (e.g. chr1:30000000-30000100).

  • Filter by sample

http://bioinfo2.ugr.es:3333/NGSmethAPI/<assembly>/<chrom:start-end>?samples=<sample list separated by commas>

Where <assembly> is the selected assembly (e.g. hg38), <chrom:start-end> is the selected genomic region (e.g. chr1:30000000-30000100) and <sample list separated by commas> is the list of selected samples (e.g. samples=STL001.adipose,STL002.adipose,STL003.adipose).
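Region queries with a sample filter can be assembled programmatically. The following sketch concatenates the pieces in the order described above; region_url is a hypothetical helper, and the sample names are joined with plain commas (rather than percent-encoded) to match the URLs shown in this manual.

```python
BASE_URL = "http://bioinfo2.ugr.es:3333/NGSmethAPI"


def region_url(assembly, chrom, start, end, samples=None, fmt=None):
    """Build a single-cytosine query URL for chrom:start-end."""
    if start > end:
        raise ValueError("start must not exceed end")
    url = f"{BASE_URL}/{assembly}/{chrom}:{start}-{end}"
    args = []
    if samples:
        # Sample names are joined with plain commas, as in the manual.
        args.append("samples=" + ",".join(samples))
    if fmt:
        # The manual documents format=csv as an optional argument.
        args.append(f"format={fmt}")
    return url + ("?" + "&".join(args) if args else "")


print(region_url("hg38", "chr1", 30000000, 30000500,
                 samples=["STL001.gastric", "STL002.gastric"]))
```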

Meaning of query symbols in constructing HTTP queries

Symbol  Meaning                                                                  Example
?       Indicates the beginning of the optional arguments                        …/chr1:100-5000?samples=STL001.gastric
=       Separates an optional argument (left) from its value or values (right)   …/chr1:100-5000?samples=STL001.gastric
,       Separates the values of an optional argument                             …/chr1:100-5000?samples=STL001.gastric,STL002.gastric
&       Separates the optional arguments                                         …/chr1:100-5000?samples=STL001.gastric&format=csv

Example

In the navigation bar of your web browser, type

http://bioinfo2.ugr.es:3333/NGSmethAPI/hg38/chr1:30000000-30000500?samples=STL001.gastric,STL002.gastric

to get methylation data for gastric tissue of individuals STL001 and STL002 in the region 30000000-30000500 of human chromosome 1. The first lines of the result follow:

[{"meth_cg":{"w":{"coverage":{"STL002":{"gastric":6},"STL001":{"gastric":9}},"phredScore":{"STL002":{"gastric":38},"STL001":{"gastric":38}},"methylatedReads":{"STL002":{"gastric":4},"STL001":{"gastric":8}}},"c":{"coverage":{"STL002":{"gastric":8},"STL001":{"gastric":6}},"phredScore":{"STL002":{"gastric":36},"STL001":{"gastric":36}},"methylatedReads":{"STL002":{"gastric":8},"STL001":{"gastric":4}}}},"pos":30000112,"genotype":{"STL002":{"gastric":"CG"},"STL001":{"gastric":"CR"}},"chrom":"chr1"},{"diffmeth_cg":{"STL001#STL002":{"gastric#gastric":{"MOABS_sim":"0.0164"}}},"meth_cg":{"w":{"coverage":{"STL002":{"gastric":12},"STL001":{"gastric":14}},"phredScore":{"STL002":{"gastric":37},"STL001":{"gastric":38}},"methylatedReads":{"STL002":{"gastric":4},"STL001":{"gastric":7}}},"c":{"coverage":{"STL002":{"gastric":5},"STL001":{"gastric":12}},"phredScore":{"STL002":{"gastric":36},"STL001":{"gastric":39}},"methylatedReads":{"STL002":
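To show how such a reply can be processed, the sketch below takes the first record of the example output above, written as a Python literal, and derives a methylation ratio per strand, individual and sample. Two points are assumptions rather than statements from this manual: that "w" and "c" denote the two DNA strands, and that the ratio is computed as methylatedReads divided by coverage. The function name methylation_ratios is hypothetical.

```python
# First record of the example reply, copied as a Python literal.
record = {
    "meth_cg": {
        "w": {
            "coverage": {"STL002": {"gastric": 6}, "STL001": {"gastric": 9}},
            "phredScore": {"STL002": {"gastric": 38}, "STL001": {"gastric": 38}},
            "methylatedReads": {"STL002": {"gastric": 4}, "STL001": {"gastric": 8}},
        },
        "c": {
            "coverage": {"STL002": {"gastric": 8}, "STL001": {"gastric": 6}},
            "phredScore": {"STL002": {"gastric": 36}, "STL001": {"gastric": 36}},
            "methylatedReads": {"STL002": {"gastric": 8}, "STL001": {"gastric": 4}},
        },
    },
    "pos": 30000112,
    "genotype": {"STL002": {"gastric": "CG"}, "STL001": {"gastric": "CR"}},
    "chrom": "chr1",
}


def methylation_ratios(rec, context="meth_cg"):
    """Return {(strand, individual, sample): methylatedReads / coverage}."""
    ratios = {}
    for strand, fields in rec.get(context, {}).items():
        for individual, by_sample in fields["coverage"].items():
            for sample, coverage in by_sample.items():
                methylated = fields["methylatedReads"][individual][sample]
                if coverage > 0:  # skip positions without reads
                    ratios[(strand, individual, sample)] = methylated / coverage
    return ratios


for key, ratio in sorted(methylation_ratios(record).items()):
    print(key, round(ratio, 3))
```

A full reply is a JSON list of such records, so the same function could be applied to each element after decoding the reply with json.loads.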

Methylation segments
  • Get data from an entire chromosome

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom>/complete

Where <percentile> is 90, 95 or 99, <assembly> is the selected assembly (e.g. hg38) and <chrom> is the selected chromosome (e.g. chr1).

  • Get data from a region

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom:start-end>

Where <percentile> is 90, 95 or 99, <assembly> is the selected assembly (e.g. hg38) and <chrom:start-end> is the selected genomic region (e.g. chr1:30000000-30000100).

  • Get segments present in more than one sample

Segments present in <n> samples

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom:start-end>?sampleCount=<n>

Segments present in at least <n> samples

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom:start-end>?sampleCount_min=<n>

Segments present in no more than <n> samples

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom:start-end>?sampleCount_max=<n>

Segments present in from <m> to <n> samples

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom:start-end>?sampleCount_min=<m>&sampleCount_max=<n>

Where <percentile> is 90, 95 or 99, <assembly> is the selected assembly (e.g. hg38), <chrom:start-end> is the selected genomic region (e.g. chr1:30000000-30000100), <sample list separated by commas> is the list of selected samples (e.g. samples=STL001.adipose,STL002.adipose,STL003.adipose), and <m> and <n> are integers from 1 to 13.

  • Filter by sample

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/<percentile>/<assembly>/<chrom:start-end>?samples=<sample list separated by commas>

Where <percentile> is 90, 95 or 99, <assembly> is the selected assembly (e.g. hg38), <chrom:start-end> is the selected genomic region (e.g. chr1:30000000-30000100) and <sample list separated by commas> is the list of selected samples (e.g. samples=STL001.adipose,STL002.adipose,STL003.adipose).

Partially overlapping results will also be reported.
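The segment queries above follow one pattern, which can be captured in a small builder. This is a sketch under stated assumptions: segments_url is a hypothetical helper, the rule that sampleCount cannot be mixed with sampleCount_min or sampleCount_max is a guess made for safety, and whether the server accepts a samples filter together with a sample count is not stated in this manual.

```python
BASE_URL = "http://bioinfo2.ugr.es:3333/NGSmethAPI"
PERCENTILES = (90, 95, 99)


def segments_url(percentile, assembly, region, samples=None,
                 count=None, count_min=None, count_max=None):
    """Build a methylation-segments query URL.

    region is either 'chrom:start-end' or 'chrom/complete'.
    """
    if percentile not in PERCENTILES:
        raise ValueError("percentile must be 90, 95 or 99")
    # Assumption: an exact count and a min/max range are mutually exclusive.
    if count is not None and (count_min is not None or count_max is not None):
        raise ValueError("use either count or count_min/count_max")
    args = []
    if samples:
        args.append("samples=" + ",".join(samples))
    if count is not None:
        args.append(f"sampleCount={count}")
    if count_min is not None:
        args.append(f"sampleCount_min={count_min}")
    if count_max is not None:
        args.append(f"sampleCount_max={count_max}")
    url = f"{BASE_URL}/segments/{percentile}/{assembly}/{region}"
    return url + ("?" + "&".join(args) if args else "")


print(segments_url(95, "hg38", "chr1:30000000-30000100", count_min=2, count_max=5))
```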

Get results in CSV format

The results can be displayed in CSV format instead of JSON if you add format=csv to your query. It works for any type of query to the NGSmethDB API Server.

For one sample:

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/95/hg38/chr1/complete?samples=STL003.adipose&format=csv

For all samples:

http://bioinfo2.ugr.es:3333/NGSmethAPI/segments/95/hg38/chr1/complete?format=csv
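CSV replies are convenient to load into a script. The sketch below downloads a format=csv query and splits it into rows with the standard csv module. The function names are hypothetical, and two details are assumptions: that the reply is comma-delimited and that it is UTF-8 encoded. The column layout is not assumed.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv(text):
    """Split CSV text into a list of rows (each a list of strings)."""
    return list(csv.reader(io.StringIO(text, newline="")))


def fetch_csv(url, timeout=60):
    """Download a format=csv query and parse it (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_csv(response.read().decode("utf-8"))
```

Passing one of the two URLs above to fetch_csv would return the segment table as a list of rows, with the first row presumably holding the column names.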

Programmatic access by the NGSmethDB API Standalone Client

A script written in Python (NGSmethDB API Client) also allows programmatic access to NGSmethDB. This script runs on Linux, Mac OS X and other UNIX systems. Windows users should use the API client Virtual Machine (see below).

  • Download the Standalone Client here.
  • Mac OS X users only: use homebrew to install dependencies.
  • Install dependencies:
    • Python 3.4 or higher.
    • pip for Python 3.
    • requests (install with pip).
    • pythondialog (install with pip).
    • PyZenity for Python 3.
  • To use the API Client, open a terminal and type:

python3 <path/to/NGSmethDB_API_client.py> -i <BED file> -o <output>

The API Client will ask for the assembly and the samples to query. Save the configuration file to use this selection in the future.

  • To use a configuration file type:

python3 <path/to/NGSmethDB_API_client.py> -i <BED file> -o <output> -c <config>

See “Output directory contents” below.
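When several BED files have to be processed, the client can be launched from another Python script. The sketch below only wraps the command line shown above, using the -i, -o and -c options documented in this manual; the function names client_command and run_client are hypothetical. Because the client asks for the assembly and samples when no configuration file is given, unattended runs should pass a saved configuration file.

```python
import subprocess


def client_command(client_path, bed_file, output_dir, config=None):
    """Return the argument list for one NGSmethDB API Client run."""
    command = ["python3", client_path, "-i", bed_file, "-o", output_dir]
    if config:
        command += ["-c", config]
    return command


def run_client(client_path, bed_file, output_dir, config=None):
    """Run the client; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        client_command(client_path, bed_file, output_dir, config), check=True
    )
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.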

Programmatic access by the NGSmethDB API Virtual Machine

The NGSmethDB API Client has been encapsulated in a downloadable VirtualBox machine, on which all the dependencies have been preconfigured. The NGSmethDB API Client VM is platform independent, and it can be run on Linux, Windows or Mac desktops.

  • Download and install a hypervisor: VirtualBox (recommended), VMware or XenServer.
  • Download the VM here and import it into the hypervisor.
  • It is highly recommended to configure a shared folder to store input and output files.
  • Start the VM. Password is not required.
  • When the VM is started, a web browser appears displaying this help. A terminal is also opened.
  • Type NGSmethDB_API_client into the terminal. A help message about the API Client will appear.
  • To use the client, type:

NGSmethDB_API_client -i <BED file> -o <output>

Replace <BED file> with the path of a BED file containing the regions you want to query. <output> is the path of the output directory. The API Client will ask for the assembly and the samples to query. Save the configuration file to reuse this selection in the future.

  • To use a configuration file:

NGSmethDB_API_client -i <BED file> -o <output> -c <config>

Where <config> is the path of the configuration file. Nothing will be asked during the process.

There is a test BED file in /home/meth/Test/hg38_chr22_exons.bed. Use the hg38 assembly to test it.
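An input BED file for your own regions can be written with a few lines of Python. This sketch produces the minimal three-column form of the BED format (chromosome, start, end, tab-delimited, with 0-based start and exclusive end). The helper name write_bed is hypothetical, and whether the API Client expects or uses additional BED columns is not stated in this manual.

```python
def write_bed(path, regions):
    """Write (chrom, start, end) tuples as a three-column, tab-delimited BED file.

    BED coordinates are 0-based and half-open: start is inclusive, end exclusive.
    """
    with open(path, "w", newline="\n") as handle:
        for chrom, start, end in regions:
            if not 0 <= start < end:
                raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
            handle.write(f"{chrom}\t{start}\t{end}\n")
```

The resulting file can then be passed to the client with the -i option.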

See “Output directory contents” below.

Keep your VM up to date

You can change the keyboard layout of your VM with the command:

sudo dpkg-reconfigure keyboard-configuration

To change your time zone, type the following command in a terminal:

sudo tzselect

To upgrade the NGSmethDB_API_client, type the following command in a terminal:

python3 /opt/NGSmethDB_API_client/upgrade_NGSmethDB_API_client.py

To upgrade third-party software, type the following command in a terminal:

sudo apt-get update && sudo apt-get dist-upgrade -y && sudo apt-get autoremove -y

Output directory contents
  • NGSmethDB_API_client.log file. It contains detailed information about the process. It is recommended to consult it if something goes wrong.
  • meth directory. Methylation data organized on two levels:
    1. Region directory (one for each queried region with data).
    2. Inside the former, sample file (one for each selected sample with data). It contains methylation data of the sample.
  • stat directory. Methylation ratio statistics organized on two levels:
    1. Region directory (one for each queried region with data).
    2. Inside the former:
      • summary_stat.tsv file.
      • histogram.tsv file.
  • diffmeth directory (if there are data). Differential methylation data organized on two levels:
    1. Region directory (one for each queried region with data).
    2. Inside the former, differential methylation files:
      • interindividual.tsv file (if there are data).
      • intraindividual.tsv file (if there are data).
  • segments directory (if there are data). Methylation segments files (one for each queried region with data).
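The layout described above lends itself to simple post-processing. As a minimal sketch, the function below maps every region directory under meth to the sample files it contains. The helper name list_meth_files is hypothetical, and no assumption is made about how region directories or sample files are named.

```python
from pathlib import Path


def list_meth_files(output_dir):
    """Map each region directory under meth/ to its sorted sample file names."""
    meth_dir = Path(output_dir) / "meth"
    if not meth_dir.is_dir():
        return {}
    return {
        region.name: sorted(f.name for f in region.iterdir() if f.is_file())
        for region in sorted(meth_dir.iterdir())
        if region.is_dir()
    }
```

The same pattern applies to the stat and diffmeth directories, which are also organized by region.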

References

  1. Hackenberg,M., Barturen,G. and Oliver,J.L. (2011) NGSmethDB: a database for next-generation sequencing single-cytosine-resolution DNA methylation data. Nucleic Acids Res, 39, D75–9.
  2. Geisen,S., Barturen,G., Alganza,Á.M., Hackenberg,M. and Oliver,J.L. (2014) NGSmethDB: an updated genome resource for high quality, single-cytosine resolution methylomes. Nucleic Acids Res, 42, D53–9.
  3. Song,Q., Decato,B., Hong,E.E., Zhou,M., Fang,F., Qu,J., Garvin,T., Kessler,M., Zhou,J. and Smith,A.D. (2013) A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PloS one, 8, e81148.
  4. Pomraning,K.R., Smith,K.M. and Freitag,M. (2009) Genome-wide high throughput analysis of DNA methylation in eukaryotes. Methods (San Diego, Calif.), 47, 142–50.
  5. Laird,P.W. (2010) Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11, 191–203.
  6. Barrett,T., Wilhite,S.E., Ledoux,P., Evangelista,C., Kim,I.F., Tomashevsky,M., Marshall,K.A., Phillippy,K.H., Sherman,P.M., Holko,M., et al. (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic acids research, 41, D991–5.
  7. Consortium,R.E., Kundaje,A., Meuleman,W., Ernst,J., Bilenky,M., Yen,A., Heravi-Moussavi,A., Kheradpour,P., Zhang,Z., Wang,J., et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.
  8. Lebrón,R., Barturen,G., Gómez-Martín,C., Oliver,J.L. and Hackenberg,M. (2016) MethFlowVM: a virtual machine for the integral analysis of bisulfite sequencing data. bioRxiv: http://biorxiv.org/content/early/2016/07/31/066795.
  9. Barturen,G., Rueda,A., Oliver,J.L. and Hackenberg,M. (2013) MethylExtract: High-Quality methylation maps and SNV calling from whole genome bisulfite sequencing data. F1000Research, 2, 217.
  10. Bolger,A.M., Lohse,M. and Usadel,B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 30, 2114–20.
  11. Krueger,F. and Andrews,S.R. (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics (Oxford, England), 27, 1571–2.
  12. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods, 9, 357–9.
  13. Lin,X., Sun,D., Rodriguez,B., Zhao,Q., Sun,H., Zhang,Y. and Li,W. (2013) BSeQC: quality control of bisulfite sequencing experiments. Bioinformatics, 29, 3227–3229.
  14. Raney,B.J., Dreszer,T.R., Barber,G.P., Clawson,H., Fujita,P.A., Wang,T., Nguyen,N., Paten,B., Zweig,A.S., Karolchik,D., et al. (2014) Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics (Oxford, England), 30, 1003–5.
  15. Kent,W.J., Sugnet,C.W., Furey,T.S., Roskin,K.M., Pringle,T.H., Zahler,A.M. and Haussler,D. (2002) The Human Genome Browser at UCSC. Genome Research, 12, 996–1006.
  16. Afgan,E., Baker,D., van den Beek,M., Blankenberg,D., Bouvier,D., Čech,M., Chilton,J., Clements,D., Coraor,N., Eberhard,C., et al. (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic acids research, 10.1093/nar/gkw343.
  17. Qu,K., Garamszegi,S., Wu,F., Thorvaldsdottir,H., Liefeld,T., Ocana,M., Borges-Rivera,D., Pochet,N., Robinson,J.T., Demchak,B., et al. (2016) Integrative genomic analysis by interoperation of bioinformatics tools in GenomeSpace. Nature Methods, 13, 245–247.
  18. McLean,C.Y., Bristor,D., Hiller,M., Clarke,S.L., Schaar,B.T., Lowe,C.B., Wenger,A.M. and Bejerano,G. (2010) GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28, 495–501.
  19. Fielding,R.T. (2000) Architectural styles and the design of network-based software architectures. Doctoral Dissertation, University of California, Irvine.