gCluster Manual
Preparation of the protocol
You can follow the protocol in two ways:
- Install the executables files.
- Use gClusterVM, a virtual machine with all that you need.
Both options work on most of operating system (Windows, Mac OS X, Linux and Solaris).
Using gClusterVM
- Download and install VirtualBox (Windows, Mac OS X, Linux and Solaris are supported). Install also the VirtualBox Extension Pack.
- Download gClusterVM OVA file and import it on VirtualBox by double-clicking.
- Configures the machine virtual: allocated memory (minimum: 4 GB), CPUs assigned (minimum: 2) and folder shared for exchange files with the machine virtual (highly recommended).
- Run the virtual machine
Using standalone executables
To use standalone executables, first you must follow the following steps:
- Download and install Java 8 or higher.
- Download and install Python 3.4 or higher. Important: in Windows check the option ‘Add to the PATH’ during the installation.
- Download and extract the standalone executables.
- Open a terminal to execute the commands of the protocol or in this manual. If you are using windows, you can use CMD or PowerShell.
In the following sections we will describe in detail the features of all programs with usage examples.
IMPORTANT: All commands in this manual are for standalones. If you use gClusterVM you must omit the interpreter name (i.e. “python”, “java”) and the extension of the script (i.e. “.py”, “.jar”).
E.g. of command using standalones:
python prepareAssembly.py -u hg38 -o /home/gcluster/sequences -l hg38
E.g. of command using gClusterVM:
prepareAssembly -u hg38 -o /home/gcluster/sequences -l hg38
Prepare the sequences: prepareAssembly.py, makeSeqObj.jar and randomizer.jar
prepareAssembly.py
It can obtains an assembly (from UCSC) and extract canonical sequences from it. It also works with local assembly files.
Input parameters
- -i <path or URL>: Path or URL to multi-FASTA file (.fa, .fa.gz,.fa.tar.gz, .bz2)
- -u <UCSC id>: UCSC assembly ID (e.g. hg38)
- -l <label>: Label for output files. (Default: ‘assembly’)
- -o <outdir>: Path to output directory, default current directory
- -r <regex>: REGEX to filter canonical or non-canonical chromosomes. REGEX to filter canonical chromosomes must be precede$
Output files
In the output folder two files will be generated:
- ‘label’_canonical.fa: multi-FASTA file that contains all canonical chromosomes sequence.
- ‘label’_noncanonical.fa: multi-FASTA file that contains all non-canonical chromosomes sequence.
Extract canonical sequences from assembly (from local file)
python prepareAssembly.py -i /home/gcluster/sequences/hg38.fa -o /home/gcluster/sequences -l hg38
Obtain assembly and extract canonical sequences (from UCSC)
python prepareAssembly.py -u hg38 -o /home/gcluster/sequences -l hg38
makeSeqObj.jar
It index the fasta input files in order to make them faster accessible. With an ‘assembly.fa’ input file it generates an ‘assembly.zip’ file which, in turn, is the input file for the programs described below.
Input parameters
- ‘assembly’_canonical.fa: Multi-FASTA file of the genome assembly.
Output files
- ‘assembly’.zip file, that should be used for all the subsequent analyses.
- ‘assembly’_canonical.N : file with the coordinates of the N-runs in the assembly.
- ‘assembly’_canonical.chromSize : chromosome size file.
All of them will be created in the same folder as input
java -jar makeSeqObj.jar /home/gcluster/sequences/hg38_canonical.fa
randomizer.jar
Randomizes DNA sequences preserving the dinucleotide frequencies.
Input parameters
- ‘assembly’.zip file generated by makeSeqObj.jar
Output files
- ‘assembly’_random.zip file. It would be created in the same folder as input.
java -jar randomizer.jar /home/gcluster/sequences/hg38_canonical.zip
Determine local clusters of DNA words and its global clustering properties: gCluster
gCluster could determine the local clusters of a given DNA word (CpG islands if the DNA word is ‘CG’) and its global clustering properties. It can work on both strands (for non-palindromic words) and accepts any combination of DNA words (like CAG:CTG:CCG for the methylation context CHG).
The following parameters are all possible options of gCluster. In the above sections it would be detailed which parameters must be set for each analysis.
Input parameters:
- genome=<path>: the indexed multi-FASTA file (see ‘Prepare the sequences’ section above).
- output=<output folder>: the output folder where the results will be written.
- pattern=<the k-mers>: the k-mers (DNA words) that should be analysed, eg. Pattern=CG to obtain CpG islands. For detection of CWG cluster, pattern=CAG:CTG should be used.
- writedistribution=<boolean>: write out the observed and expected distance distribution (default writedistribution=false)
- chromStat=<boolean>: if true, the program writes out additional information of the chromosome sequences like G+C content, CpG frequencies (observed/expected ratios), lengths, etc. (default: chromStat=false)
Output files:
- txt: The clusters identified by its chromosomal coordinates and p-value adding basic compositional statistics like G+C content and O/E ratios of the pattern (the number of observed patterns divided by the number of expected pattern). In case of a compound pattern like CAG:CTG, the mean O/E ratio is calculated.
- txt: The file holds the normalized coefficient of variation for each chromosome.
- *.distr files: The distance distribution, the observed and expected frequencies as a function of next-neighbour distance.
- txt: A log file
- txt: A basic statistic as a function of chromosome, i.e pattern frequency, G+C content, lengths and contig length (the sequence length minus the sum of all N’s.
Calculate clustering properties
mkdir /home/gcluster/results (generates the ‘results’ output directory)
java -jar gCluster.jar genome=/home/gcluster/sequences/hg38_canonical.zip pattern=CG output=/home/gcluster/results/hg38_CG writedistribution=true chromStat=true
IMPORTANT: writedistribution and chromStat parameters must be set to ‘True’
Detect the clusters of DNA words
CpG Islands
With the following command (same as above), the clusters are automatically determined and written to the cluster.txt file. The distance threshold is determined by the genome intersection.
java -jar gCluster.jar genome=/home/gcluster/sequences/hg38_canonical.zip pattern=CG output=/home/gcluster/results/hg38_CG writedistribution=true chromStat=true
Clusters of other biologically relevant k-mers
Like mentioned before, in plants cytosine can be methylated as well in other contexts like CWG (CAG, CTG), CCG or CHH.
The clustering of these contexts can be calculated with the following command.
java -jar gCluster.jar genome=/home/gcluster/sequences/ hg38_canonical.zip pattern=CAG:CTG output=/home/gcluster/results/hg38_CWG writedistribution=true chromStat=true
Clusters of clusters: GenomeCluster.pl
GenomeCluster determines the local clusters of genome elements identified by its chromosome coordinates. It could be used to cluster the clusters predicted by gCluster, in order to make clusters of cluster.
Input parameters (the arguments must be given in this order):
- Argument 1: the distance model <element; start; middle; end>.
- Argument 2: the BED file within the chromosome coordinates of whatever genome element (e.g. CpG islands).
- Argument 3: the distance threshold model <gi: genome intersection; ci: chromosome intersection; N (integer): percentile>.
- Argument 4: the p-value threshold (by default 1E-5 as in gCluster).
- Argument 5: the file with the N-runs created by makeSeqObj.jar (see ‘Prepare the sequences’ section above).
- Argument 6: the maximum number of allowed Ns between two elements. It must be an integer greater or equal to 0 (default: 0).
Output files:
- *_genomeIntersec_start_GenomeCluster.txt: there are as many of these files as chromosomes in the assembly. They are tabular files with the columns described below.
Columns in outfile | Description |
Chrom | Chromosome where the local cluster belongs. |
From | Start chromosome coordinate of the local cluster. |
To | End chromosome coordinate of the local cluster. |
Length | Length of the local cluster. |
Count | Number of genome elements within the local cluster. |
PValue | P-value of the local cluster. |
logPValue | Decimal logarithm of the p-value |
perl GenomeCluster.pl start /home/gcluster/results/hg38_CG/cluster.txt gi 1E-5 /home/gcluster/sequences/hg38_canonical.N 0
Determine the methylation of CpG Islands: NGSmethDB_API_client.py
In order to calculate differentially methylation CGIs, we will have to obtain first the methylation values for the CpG islands from our database and second, compare the methylation values between two samples. To obtain the methylomes, we will use NGSmethDB_API_client.py executable.
Input parameters:
- -i <input bed file>: in general, BED3 files will be accepted (only the first 3 columns will be considered), including the ‘cluster.txt’ output files generated by gCluster
- -o <output folder>: output folder
Output files:
- <sample>.CG.meth.tsv: Contains the CG methylation data for all regions defined in the bed data input file. There is one <sample>.CG.meth.tsv file for each selected sample. Each line corresponds to one region.
Columns in outfile | Description |
ID | chrom_start_end |
refPatt | Number of CpGs within the region in the reference genome |
indPatt | Number of CpGs in the region for the genotype of the sample |
meanMR | Mean of CpG methylation ratio distribution |
sdMR | Standard Deviation of CpG methylation ratio distribution |
p10MR | Percentile 10 of CpG methylation ratio distribution |
q1MR | Quartil 1 of CpG methylation ratio distribution |
q2MR | Median of CpG methylation ratio distribution |
q3MR | Quartil 3 of CpG methylation ratio distribution |
p90MR | Percentile 90 of CpG methylation ratio distribution |
totalMC | Total number of methylcytosines in the region |
totalC | Total number of mapped reads within the region |
- <sample>.CHG.meth.tsv (plants only): Same as above, but for CHG.
python3 NGSmethDB_API_client.py -i /home/gcluster/results/hg38_CG/cluster.txt -o /home/gcluster/results/methylationData
Calculate differential methylation: calcDMIs.py
Takes the output files from NGSmethDB_API_client.py for two samples and calculates statistically significant differentially methylated CpG islands between them. The statistical significance is assessed by means of a Fisher exact test.
Input parameters:
- -a <sample1 file> and –b <sample2 file>: output methylation files from NGSmethDB_API_client.py
- -o <outfile>: output file. Default is ‘outputDMIs.txt’
- -p <pvalue>: p-value threshold for DMIs. Default is 1E-5.
Output files:
- <outfile>: a file containing the differentially methylated CpG islands.
Columns in outfile | Description |
ID | chrom_start_end |
mC_Sample1 | Number of methylcytosines for each cluster in sample 1 |
reads_Sample1 | Number of reads for each cluster in sample 1 |
mRatio_Sample1 | methRatio (mC_Sample1/reads_Sample1) for each cluster in sample 1 |
mC_Sample2 | Number of methylcytosines for each cluster in sample 2 |
reads_Sample2 | Number of reads for each cluster in sample 2 |
mRatio_Sample2 | methRatio (mC_Sample2/reads_Sample2) for each cluster in sample 2 |
diffMeth | Absolute difference between methRatio_Sample1 and methRatio_Sample2 |
pvalue | Statistical significance assessed by means of the Fisher exact test |
python calc_DMIs.py –a home/gcluster/results/methylationData/STL003.gastric.CG.meth.tsv -b /home/gcluster/results/methylationData/STL003.pancreas.CG.meth.tsv -o /home/gcluster/results/methylationData/STL003.gastric.STL003.pancreas.DMIs.txt -p 1E-5F