TextKeywords database

Using a generalization of the level statistics analysis of quantum disordered systems, we have developed an approach that automatically extracts keywords from literary texts [1]. Our approach takes into account not only the frequencies of the words in the text but also their spatial distribution along the text. It is based on the fact that relevant words are significantly clustered (i.e., they attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works in generic symbolic sequences (continuous texts without spaces), which suggests its general applicability.
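The clustering idea can be illustrated with a minimal Python sketch. It assumes a simple spacing statistic: the standard deviation of the gaps between consecutive occurrences of a word, divided by the mean gap. This is only an illustration, not the measure defined in [1], whose exact normalization is not reproduced here. The function name `spacing_ratio`, the parameter `min_count`, and the toy token sequence are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def spacing_ratio(tokens, min_count=3):
    """Return {word: std/mean of the gaps between consecutive occurrences}.

    Illustrative statistic only (an assumption, not the measure of [1]).
    Values well above 1 suggest clustering; evenly spaced words give 0.
    """
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    ratios = {}
    for word, pos in positions.items():
        if len(pos) < min_count:
            continue  # too few occurrences to say anything about spacing
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        ratios[word] = pstdev(gaps) / mean(gaps)
    return ratios


# Toy sequence (made up): "the" is evenly spaced, "gene" comes in two bursts.
tokens = [f"w{i}" for i in range(48)]
for i in range(0, 48, 4):
    tokens[i] = "the"
for j in (1, 2, 3, 41, 42, 43):
    tokens[j] = "gene"

ratios = spacing_ratio(tokens)
print(ratios)
```

In this toy sequence the bursty word gets a ratio above 1 and the evenly spaced word gets 0, so ranking by the ratio puts the clustered word first. No reference corpus is involved: only the positions of each word in the single input sequence are used.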

We systematically used the measure C to analyze a large collection of texts (novels, poetry, scientific books). C can be used in two ways: i) to rank the words according to their C values, and ii) to rank the words according to their SIGMAnor values, but only those words with a C value larger than a threshold C0, which fixes the statistical significance considered. Both approaches work extremely well for many texts in different languages.
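The two ranking modes can be sketched as follows, assuming the C and SIGMAnor values have already been computed for each word. The `WordStats` record, the function names, the placeholder words, and all numbers below are made up for illustration and are not taken from the database.

```python
from typing import NamedTuple


class WordStats(NamedTuple):
    word: str
    c: float          # C value of the word
    sigma_nor: float  # SIGMAnor value of the word


def rank_by_c(stats):
    """Mode i): rank all words by decreasing C."""
    return [s.word for s in sorted(stats, key=lambda s: s.c, reverse=True)]


def rank_by_sigma_nor(stats, c0):
    """Mode ii): keep only words with C > c0, then rank by decreasing SIGMAnor."""
    significant = [s for s in stats if s.c > c0]
    return [s.word for s in
            sorted(significant, key=lambda s: s.sigma_nor, reverse=True)]


# Invented values, for illustration only.
stats = [
    WordStats("alpha", c=12.0, sigma_nor=1.4),
    WordStats("beta", c=3.5, sigma_nor=2.1),
    WordStats("gamma", c=0.8, sigma_nor=2.6),
    WordStats("delta", c=25.0, sigma_nor=1.2),
]

print(rank_by_c(stats))                  # ['delta', 'alpha', 'beta', 'gamma']
print(rank_by_sigma_nor(stats, c0=3.0))  # ['beta', 'alpha', 'delta']
```

In the second mode, "gamma" has the largest SIGMAnor but is dropped because its C value falls below the threshold, which is how C0 acts as a significance filter.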

The Origin of Species by Means of Natural Selection is a good example for understanding the effect of C. Using SIGMAnor alone, the highly relevant word “species” (n = 1922 occurrences) has SIGMAnor = 1.905 and appears only in the 505th place of the SIGMAnor ranking. With the C measure, however, this word has C = 39.97 and is in the 5th place of the C ranking, after “sterility,” “hybrids,” “varieties,” and “instincts”.

[1] Carpena P, Bernaola-Galván P, Hackenberg M, Coronado AV, Oliver JL. 2009. Level statistics of words: finding keywords in literary texts and symbolic sequences. Physical Review E 79: 035102(R) (1-4).
See the news item in New Scientist: Could quantum mathematics shake up Google?