The practical challenge of creating a Hungarian e-mail reader initiated our work on statistical text analysis. The starting point was statistical analysis for automatic discrimination of the language of texts. The work was later extended to the automatic regeneration of diacritic signs and to a more detailed analysis of language structure. A parallel study of three languages (Hungarian, German and English) using text corpora of similar size makes it possible to explore both similarities and differences. The corpora were compiled from publicly available Internet sources. The corpus size was the same (approximately 20 Mbytes, 2.5-3.5 million word forms) for all languages. Besides traditional corpus coverage, word length and occurrence statistics, some new features related to prosodic boundaries (sentence-initial and sentence-final positions, and positions preceding and following a comma) were also computed. Among other results, it was found that the coverage of the corpora by the most frequent words follows a parallel logarithmic rule, known in linguistics as Zipf's law, for all languages in the 40-85% coverage range. The coverage functions of English and German are much closer to each other than to that of Hungarian. Further conclusions are also drawn. The language detection and diacritic regeneration applications are discussed in detail, with implications for Hungarian speech generation. Diverse further application domains, such as predictive text input, word hyphenation, language modelling in speech recognition and corpus-based speech synthesis, are also foreseen.
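The coverage statistic mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a toy sentence instead of the 20-Mbyte corpora. The function names `coverage_curve` and `words_needed` are hypothetical and are not taken from the study.

```python
from collections import Counter


def coverage_curve(tokens):
    """Cumulative share of the corpus covered by the k most frequent word forms.

    Element k-1 of the returned list is the coverage reached by the top k forms.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    curve = []
    # most_common() yields word forms in order of decreasing frequency
    for _, freq in counts.most_common():
        covered += freq
        curve.append(covered / total)
    return curve


def words_needed(curve, target):
    """Smallest number of top-ranked word forms whose coverage reaches `target`."""
    for rank, cov in enumerate(curve, start=1):
        if cov >= target:
            return rank
    return None  # target above 1.0 can never be reached


# Toy input (an assumption for illustration only, not corpus data)
toy = "the cat sat on the mat and the dog sat on the rug".split()
curve = coverage_curve(toy)
print(words_needed(curve, 0.5))
```

Plotting such a curve against the logarithm of the rank for each language would be one way to compare the coverage functions over the 40-85% range.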
This paper reports on a project that aimed to explore how the proportion of newly introduced word-types and lemmas varies in different adaptations of the same text. The term ‘adaptation’ is used here to include both intralingual and interlingual adaptation, whether or not it involves a reduction in text size, as well as what we usually class as (interlingual) translation. The first part of the study looks at the way lemmatization affects the appearance of new words in a text. It was found that there are only minor differences between the appearance of word-types and that of lemmas, which means that lemmatization is not absolutely necessary in an analysis of the introduction of new words. In the next part of the study another type of adaptation, i.e. translation into foreign languages, is analyzed. It was found that changes on the discourse level are independent of the language. This is equivalent to saying that if there are differences between the translations, they must be due to inadequate translation. The third type of adaptation examined comprised two condensed versions of the original text. In this case the question was the extent to which the vocabulary was affected. It was found that the condensed versions eliminated exactly those text-slices of the original text which made it unique. As a result, there are only minor differences between the condensed versions and a statistical model that assumes a hypergeometric distribution of words. The method used made it possible to determine the source of the second-order adaptation.
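The kind of comparison described above can be sketched as follows. The abstract does not specify the model, so this is an assumption: the sketch counts the word-types first introduced in each consecutive text-slice, and it computes the expected number of distinct types when tokens are drawn without replacement, which is one common hypergeometric formulation. The names `observed_new_types` and `expected_types` are hypothetical.

```python
from collections import Counter
from math import comb


def observed_new_types(tokens, slice_size):
    """Number of word-types appearing for the first time in each consecutive slice."""
    seen = set()
    new_per_slice = []
    for start in range(0, len(tokens), slice_size):
        fresh = set(tokens[start:start + slice_size]) - seen
        new_per_slice.append(len(fresh))
        seen |= fresh
    return new_per_slice


def expected_types(tokens, n):
    """Expected number of distinct types among n tokens drawn without replacement.

    A type with frequency f is absent from the sample with hypergeometric
    probability C(N - f, n) / C(N, n), where N is the text length.
    """
    counts = Counter(tokens)
    N = len(tokens)
    denom = comb(N, n)
    # math.comb returns 0 when n > N - f, i.e. the type cannot be missed
    return sum(1 - comb(N - f, n) / denom for f in counts.values())


# Toy token sequence (an assumption for illustration only)
tokens = "a b a c b d a e".split()
print(observed_new_types(tokens, 4))
print(expected_types(tokens, 4))
```

Slices in which the observed count exceeds the expectation by a wide margin would be candidates for the passages that make a text unique, under the assumptions stated above.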