  • 1 Department of Telecommunications and Telematics, Budapest University of Technology and Economics 1117 Bp. Magyar tudósok körútja 2.
  • 2 Department of Telecommunications and Telematics, Budapest University of Technology and Economics 1117 Bp. Magyar tudósok körútja 2.

The practical challenge of creating a Hungarian e-mail reader initiated our work on statistical text analysis. The starting point was statistical analysis for automatic identification of the language of a text. The work was later extended to automatic regeneration of diacritic marks and to a more detailed analysis of language structure. A parallel study of three languages (Hungarian, German and English), using text corpora of similar size, makes it possible to explore both similarities and differences. The corpora were compiled from publicly available Internet sources, and the corpus size was approximately the same for all three languages (about 20 Mbytes, 2.5-3.5 million word forms). Besides traditional corpus coverage, word length and occurrence statistics, some new features related to prosodic boundaries (sentence-initial and sentence-final positions, and positions preceding and following a comma) were also computed. Among other results, it was found that the coverage of the corpora by the most frequent words follows a parallel logarithmic rule for all three languages in the 40-85% coverage range, a regularity known in linguistics as Zipf's law. The coverage functions for English and German are much closer to each other than to the function for Hungarian. Further conclusions are also drawn. The language detection and diacritic regeneration applications are discussed in detail, together with their implications for Hungarian speech generation. Further application domains are also foreseen, such as predictive text input, word hyphenation, language modelling in speech recognition, and corpus-based speech synthesis.
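The coverage statistic mentioned in the abstract (the share of all word tokens accounted for by the N most frequent word forms) can be illustrated with a short sketch. This is not the authors' tooling: the function name `coverage_by_top_words`, the naive regex tokenizer, and the toy sample sentence are all assumptions made for illustration only.

```python
import re
from collections import Counter


def coverage_by_top_words(text, ranks):
    """Return {n: fraction of tokens covered by the n most frequent word forms}.

    Hypothetical helper for illustration; tokenization is an assumption.
    """
    # Naive tokenizer (assumption): lowercase runs of Unicode letters,
    # so accented Hungarian and German characters are kept intact.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    total = len(tokens)
    # Word-form frequencies sorted from most to least frequent.
    counts = sorted(Counter(tokens).values(), reverse=True)
    return {n: (sum(counts[:n]) / total if total else 0.0) for n in ranks}


# Toy input (assumption); a real study would use a multi-million-word corpus.
sample = "the cat and the dog and the bird saw the cat"
coverage = coverage_by_top_words(sample, [1, 2, 3])
for n, share in coverage.items():
    print(f"top {n} word forms cover {share:.1%} of tokens")
```

Plotting such coverage values against the logarithm of N for each language would be one way to compare the curves that the abstract describes, assuming the same tokenization is applied to every corpus.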


Acta Linguistica Hungarica
Language: English
Year of Foundation: 1951
Publication Programme: changed title
Founder: Magyar Tudományos Akadémia
Founder's Address: H-1051 Budapest, Hungary, Széchenyi István tér 9.
Publisher: Akadémiai Kiadó
Publisher's Address: H-1117 Budapest, Hungary 1516 Budapest, PO Box 245.
Responsible Publisher: Chief Executive Officer, Akadémiai Kiadó
ISSN 1216-8076 (Print)
ISSN 1588-2624 (Online)