  1. Detecting clusters of DNA words
  2. Detecting clusters of genomic elements
  3. Results screen
  4. Calculate the statistical significance

WordCluster is an algorithm that detects clusters of words, or of any other genomic elements, in DNA sequences based on the distance between neighboring occurrences, and then assigns a statistical significance to each cluster. Detection relies on a distance threshold, which can be calculated by different methods: words or elements separated by less than this threshold are grouped into clusters. A p-value is then assigned to each cluster, which allows the clusters to be ranked by statistical significance.

On the left-hand side of every screen, the user can find four links:

  • “Restart”: sends the user back to the main page.
  • “Help”: shows this help page.
  • “See example CAG/CAT cluster”: shows example output for word clusters of the k-mers CAG/CTG. These words are of interest because of the recent discovery that the cytosine in these contexts can be methylated.
  • “See example OR genes”: shows example output for clusters of genomic elements. The example uses the olfactory receptor genes, a multigene family that is usually clustered in the genome.

On the main page, the user can choose between looking for clusters of DNA words (k-mers) or clusters of genomic elements.

  1. Detecting clusters of DNA words

    The option window is divided into two squares:

    • “Detect DNA words”, where the user chooses either one of the genome assemblies from our database or uploads their own sequence (in FASTA format), and enters the word (k-mer) or combination of words to be searched for clusters.

      The following genome assemblies are now available in our database:

      • Human (hg18)
      • Mouse (mm8)
      • Rat (rn4)
      • Fruit fly (dm3)
      • Anopheles gambiae (anogam1)
      • Honey bee (apimel2)
      • Cow (bosTau4)
      • Dog (canFam2)
      • C. briggsae (cb3)
      • C. elegans (ce6)
      • Sea squirt (ci2)
      • Zebrafish (danrer5)
      • Chicken (galgal3)
      • Stickleback (gasacu1)
      • Medaka (orylat2)
      • Chimp (pantro2)
      • Rhesus macaque (rhemac2)
      • S. cerevisiae (saccer1)
      • Tetraodon (tetnig1)
    • “General parameters”, where the user can choose the strand, the distance parameters and the p-value threshold.

      In the first selection box, the user can choose to search for clusters on a single strand (the forward strand) or on both strands.

      In the next square, four options determine the distance used to group words into clusters:

      • “Chromosome intersection”: for each chromosome, the intersection between the observed and the theoretical distance distributions is calculated and used to form the clusters. The maximum distance between words will therefore vary between chromosomes.
      • “Genome intersection”: the intersection is calculated with the observed distance distribution for the whole genome. This option can be very slow.
      • “Percentile”: the user fixes the percentile of the distance distribution to be used as the maximum distance.
      • “Fixed distance”: the user sets a fixed distance (in bp) to be used as the maximum distance.

      The last option is the p-value field, which sets the threshold for statistical significance. It filters the output so that only clusters with a p-value below the given cut-off are shown.
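
    The “Fixed distance” and “Percentile” options can be illustrated with a minimal Python sketch. It is not the WordCluster implementation: the function names, the use of start-to-start distances, the nearest-rank percentile, the inclusive comparison with the maximum distance and the minimum cluster size of two words are all assumptions made for illustration.

```python
import math


def neighbor_distances(positions):
    """Start-to-start distances between neighboring occurrences (assumed definition)."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]


def percentile_threshold(distances, percentile):
    """Nearest-rank percentile of the observed distances (assumed method)."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def group_into_clusters(positions, max_distance, min_size=2):
    """Group occurrences whose neighbor distance is <= max_distance.

    min_size is a hypothetical parameter: runs shorter than this are dropped.
    """
    clusters, current = [], []
    for p in sorted(positions):
        # A gap larger than the threshold closes the current run.
        if current and p - current[-1] > max_distance:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(p)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Illustrative word start positions on one chromosome (made-up numbers).
positions = [100, 110, 125, 900, 905, 3000]

# "Fixed distance": the user supplies the maximum distance in bp.
fixed_clusters = group_into_clusters(positions, max_distance=20)

# "Percentile": the maximum distance is read off the observed distribution.
threshold = percentile_threshold(neighbor_distances(positions), 50)
percentile_clusters = group_into_clusters(positions, max_distance=threshold)
```

    With these illustrative positions, both settings group the first three and the next two occurrences and leave the isolated occurrence at 3000 out. The two intersection options would differ only in how the maximum distance is derived.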

  2. Detecting clusters of genomic elements

    When searching for clusters of genomic elements, the option screen is divided into two squares:

    • “Analyze a list of genomic entities”, where the user uploads a list of the genomic elements to analyze and selects a genome assembly from our database of 19 genomes. The list of genomic elements must be in BED format, as specified on the right-hand side of the selection cascade.

    • “General parameters”, which work as they do when searching for word clusters.
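
    As an illustration of the expected input, the following Python sketch reads the first three BED columns (chromosome, start, end) and computes the distance between neighboring elements on one chromosome. The function names, the example records and the choice of measuring the gap from the end of one element to the start of the next are assumptions, not a description of the server's own code.

```python
from collections import defaultdict


def parse_bed(text):
    """Parse BED text into {chrom: [(start, end), ...]}, sorted by start."""
    elements = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        elements[chrom].append((start, end))
    return {chrom: sorted(items) for chrom, items in elements.items()}


def neighbor_gaps(intervals):
    """Gap in bp between consecutive elements (0 when they overlap)."""
    return [max(0, nxt[0] - prev[1]) for prev, nxt in zip(intervals, intervals[1:])]


# Hypothetical records; the names in the fourth column are placeholders.
bed_text = """\
chr11\t1000\t1300\tOR_a
chr11\t1550\t1850\tOR_b
chr11\t9000\t9300\tOR_c
chr1\t500\t800\tOR_d
"""

by_chrom = parse_bed(bed_text)
gaps = neighbor_gaps(by_chrom["chr11"])
```

    Here the first two elements on chr11 are 250 bp apart, while the third lies 7150 bp further on, so only the first two would be grouped under a small distance threshold.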

  3. Results screen

    Once the completed form is submitted, the web server responds with a waiting screen that monitors the process and provides a permanent link to the results. The user can bookmark this link and come back to it later.

    A parameter summary of the analysis is provided at the top left of the final results page, showing the identifier of the process, the analysis mode, the pattern or file used to search for clusters, the genome assembly and, in the case of word clusters, the strand used.

    By default, the basic statistics dialogue is shown at the bottom of the page, with summary statistics for key parameters of the clusters.

    From the square titled “Results” at the top right of the screen, the user can also choose other analyses performed by the program, or download a text file with all the clusters found. The available analyses are:

    • "Basic Statistic": a table containing basic statistics, such as the mean, standard deviation, minimum and maximum, for the length, number of words, p-value, GC content, observed/expected ratio and word density of the clusters found.

    • "Basic Statistic as a function of the chromosome": the same table as above, broken down by chromosome.

    • "See all clusters": this link allows the user to see all the clusters found and to rank them by any of the fields in the table.

    • "Show co-localization with different gene regions": this table shows the basic statistics of the clusters that overlap gene regions from the refGene database, such as R1 (TSS of the gene), R3 (a region +/- 200 bp from the TSS), R13 (a region -500/+1500 bp from the TSS), F5UTR, F3UTR, cdsExon and cdsIntron.

    • "Show functional depletion/enrichment": an enrichment analysis for Gene Ontology (GO) terms, related to the different regions mentioned above. In the region header, the user can select which regions are shown in the table.

    • "Show basic compositional and parameter statistic": shows basic compositional statistics for the clusters.

  4. Calculate the statistical significance

    Significance is calculated as the cumulative distribution function of the negative binomial distribution, in a similar way to the CpGcluster algorithm. The distribution is defined in terms of n, the number of elements or k-mers within the cluster; nf, the number of "failures", i.e. other entities within the cluster that differ from the considered k-mer or element; and p, the success probability, i.e. the probability of finding a k-mer or genomic element. For example, when detecting clusters of AGCT, all k-mers different from AGCT are considered failures.
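
    For reference, the textbook cumulative distribution function of the negative binomial distribution, written with the symbols n, nf and p introduced above, is shown below. This is the standard form and is given here as an assumption; the exact expression used by WordCluster may be written differently.

```latex
P(n, n_f) \;=\; \sum_{j=0}^{n_f} \binom{n + j - 1}{j}\, p^{\,n}\, (1 - p)^{\,j}
```

    Read this way, a small value means that n successes were reached with unusually few failures, i.e. the cluster is denser than expected by chance.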

    While the negative binomial distribution is defined in the same way for k-mers and genomic elements, the number of "failures" and the success probability are calculated differently in the two cases.

    For k-mers, the number of failures is simply the length of the cluster (Lc) minus the number of k-mers in the cluster (n) times the k-mer length (k), i.e. nf = Lc - n × k.

    A k-mer can overlap with itself and with other k-mers. Here, we consider only non-overlapping occurrences. In that case, the success probability for k-mers is calculated from N, the number of non-overlapping occurrences of the k-mers in the sequence, k, the length of the k-mer, and Ls, the sequence length.
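
    The k-mer case can be sketched in Python as follows. The failure count follows the description above. The success probability p = N / (Ls - N × (k - 1)) is an assumption: it is obtained by applying the same failure count to the whole sequence, and may not match the expression WordCluster actually uses. The function names and the example numbers are likewise hypothetical.

```python
import math


def negative_binomial_cdf(n, nf, p):
    """P(at most nf failures before the n-th success), for 0 < p < 1.

    Each term is evaluated in log space so that p**n does not underflow.
    """
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(nf + 1):
        # log of the binomial coefficient C(n + j - 1, j)
        log_coeff = math.lgamma(n + j) - math.lgamma(j + 1) - math.lgamma(n)
        total += math.exp(log_coeff + n * log_p + j * log_q)
    return min(total, 1.0)


def kmer_success_probability(N, k, Ls):
    """Assumed form: successes / (successes + failures) over the whole sequence."""
    return N / (Ls - N * (k - 1))


def kmer_cluster_pvalue(n, Lc, k, N, Ls):
    """p-value of a cluster of n k-mers spanning Lc bases."""
    nf = Lc - n * k  # failures: cluster bases not covered by the k-mers
    if nf < 0:
        raise ValueError("cluster is shorter than its non-overlapping k-mers")
    return negative_binomial_cdf(n, nf, kmer_success_probability(N, k, Ls))


# Made-up numbers: 5 copies of a 4-mer in a 60 bp cluster, with 1000
# non-overlapping copies in a sequence of 1,003,000 bp.
p = kmer_success_probability(N=1000, k=4, Ls=1_003_000)
pvalue = kmer_cluster_pvalue(n=5, Lc=60, k=4, N=1000, Ls=1_003_000)
```

    Summing the terms one by one is adequate for short clusters; for very long clusters a library routine for the negative binomial distribution would be the more practical choice.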

    For genomic elements, it is less clear how to define the number of failures. For example, consider a cluster of 5 elements with a mean length of 300 bp and an average distance of 250 bp between neighbors. The question is how many "no-elements", i.e. failures, this cluster contains. The number of failures is defined from Lno, the number of bases in the cluster that do not belong to a genomic element, and Lmean, the mean length of the genomic elements, so that it approximates the number of "no-elements" within the cluster. The success probability is in turn calculated from Ls, the length of the sequence, Lmean, the mean length of the genomic elements, and n, the number of genomic elements.
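
    One way to write these two quantities that is consistent with the definitions above is given below. Both expressions are assumptions reconstructed from the description, not confirmed formulas: the failures are taken as the non-element bases measured in units of the mean element length, and the success probability applies the same idea to the whole sequence, with n here denoting the number of genomic elements in the sequence.

```latex
n_f \approx \frac{L_{no}}{L_{mean}},
\qquad
p \;=\; \frac{n}{\,n + \dfrac{L_s - n\,L_{mean}}{L_{mean}}\,}
  \;=\; \frac{n\,L_{mean}}{L_s}

% Worked example for the cluster described above, assuming the
% 5 elements do not overlap and the only non-element bases are
% the 4 gaps between neighbors:
L_{no} = 4 \times 250 = 1000 \ \text{bp},
\qquad
n_f \approx \frac{1000}{300} \approx 3.3
```

    Under these assumptions the example cluster holds 5 elements and roughly 3 "no-elements", which is the pair of counts that would enter the negative binomial distribution.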