The practical challenge of creating a Hungarian e-mail reader initiated our work on statistical text analysis. The starting point was statistical analysis for automatically identifying the language of a text. The work was later extended to automatic regeneration of diacritic signs and to a more detailed analysis of language structure. A parallel study of three languages (Hungarian, German and English), using text corpora of similar size, makes it possible to explore both similarities and differences. The corpora were compiled from publicly available Internet sources, and the corpus size was the same for all languages (approximately 20 Mbytes, 2.5-3.5 million word forms). Besides traditional corpus coverage, word length and occurrence statistics, some new features related to prosodic boundaries (sentence-initial and sentence-final positions, and positions preceding and following a comma) were also computed. Among other results, it was found that the coverage of the corpora by the most frequent words follows a parallel logarithmic rule for all three languages in the 40-85% coverage range, a regularity known in linguistics as Zipf's law. The coverage functions for English and German lie much closer to each other than to the Hungarian one. Further conclusions are also drawn. The language detection and diacritic regeneration applications are discussed in detail, together with their implications for Hungarian speech generation. Diverse further application domains are also foreseen, such as predictive text input, word hyphenation, language modelling in speech recognition and corpus-based speech synthesis.
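The coverage statistic mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `coverage_curve` and `words_needed`, the whitespace tokenisation and the toy corpus are all assumptions made for illustration. The sketch computes the share of a corpus covered by its N most frequent word forms and the number of word forms needed to reach a given coverage level.

```python
from collections import Counter


def coverage_curve(tokens):
    """Return cumulative corpus coverage (0..1) after each frequency rank.

    curve[i] is the share of all tokens covered by the i+1 most
    frequent word forms. (Hypothetical helper, not from the paper.)
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    curve = []
    for _, freq in counts.most_common():
        covered += freq
        curve.append(covered / total)
    return curve


def words_needed(curve, target):
    """Smallest number of most frequent word forms reaching `target` coverage."""
    for rank, cov in enumerate(curve, start=1):
        if cov >= target:
            return rank
    return len(curve)


# Toy corpus (assumption): a real study would use millions of word forms.
tokens = "a a a a b b b c c d".split()
curve = coverage_curve(tokens)
print(words_needed(curve, 0.85))
```

Plotting such a curve against the logarithm of the rank for each language would be one way to compare the coverage functions over the 40-85% range discussed in the abstract.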
 Gibbon, Dafydd - Roger Moore - Richard Winski 1998. Spoken language characterisation. Mouton de Gruyter, The Hague.
 Németh, Géza - Csaba Zainkó - László Fekete - Gábor Olaszy - Gábor Endrédi - Péter Olaszi - Géza Kiss - Péter Kis 2000. The design, implementation, and operation of a Hungarian e-mail reader. In: International Journal of Speech Technology 3: 217-36.
 Roukos, Salim 1996. Language representation. In: Ronald A. Cole - Joseph Mariani - Hans Uszkoreit - Annie Zaenen - Victor Zue (eds) Survey of state of the art in human language technologies. Cambridge University Press, Cambridge.
 Váradi, Tamás 1999. On developing the Hungarian National Corpus. In: Špela Vintar (ed.) Proceedings of the Workshop Language Technologies-Multilingual Aspects, 32nd Annual Meeting of the Societas Linguistica Europaea, Ljubljana, Slovenia, 57-63. Faculty of Arts, University of Ljubljana, Ljubljana.