  • 1 Department of Telecommunications and Telematics, Budapest University of Technology and Economics, 1117 Budapest, Magyar tudósok körútja 2.
  • 2 Department of Telecommunications and Telematics, Budapest University of Technology and Economics, 1117 Budapest, Magyar tudósok körútja 2.

The practical challenge of creating a Hungarian e-mail reader initiated our work on statistical text analysis. The starting point was statistical analysis for automatically identifying the language of a text. The work was later extended to automatic regeneration of diacritics and to a more detailed analysis of language structure. A parallel study of three languages (Hungarian, German and English) using text corpora of similar size makes it possible to explore both similarities and differences. The corpora were compiled from publicly available Internet sources. The corpus size was the same for all languages: approximately 20 Mbytes, or 2.5-3.5 million word forms. Besides the traditional corpus coverage, word length and occurrence statistics, some new features related to prosodic boundaries (sentence-initial and sentence-final positions, and positions preceding and following a comma) were also computed. Among other findings, the coverage of the corpora by the most frequent words follows a parallel logarithmic rule in the 40-85% coverage range for all three languages, a regularity known in linguistics as Zipf's law. The coverage functions for English and German lie much closer to each other than to the function for Hungarian. Further conclusions are also drawn. The language detection and diacritic regeneration applications are discussed in detail, together with their implications for Hungarian speech generation. Diverse further application domains are also foreseen, such as predictive text input, word hyphenation, language modelling in speech recognition and corpus-based speech synthesis.
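The coverage statistic mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenisation rule (lowercased runs of word characters), the function names `coverage_curve` and `words_needed`, and the toy sample text are all assumptions made for illustration.

```python
from collections import Counter
import re


def coverage_curve(text):
    """Cumulative share of the corpus covered by the k most frequent words.

    Assumption: a word form is a lowercased run of Unicode word characters.
    The study itself may have used a different tokenisation.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = len(words)
    covered = 0
    curve = []
    # Walk the word forms from most to least frequent, accumulating coverage.
    for _, freq in counts.most_common():
        covered += freq
        curve.append(covered / total)
    return curve


def words_needed(curve, target):
    """Smallest number of top-ranked word forms reaching the target coverage."""
    return next(rank + 1 for rank, c in enumerate(curve) if c >= target)


# Toy sample (hypothetical); a real run would read a multi-megabyte corpus.
sample = "the cat and the dog and the bird"
curve = coverage_curve(sample)
print(curve)
print(words_needed(curve, 0.6))
```

Plotting such a curve against the logarithm of the rank for each language is one way to compare how quickly the most frequent word forms cover a corpus.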
