  • 1 Ludwig Maximilian University of Munich
  • 2 Ludwig Maximilian University of Munich

In this paper we describe the data processing procedures and preliminary results of the Ob-Ugric database (OUDB) project, a web-based framework for developing corpus-based descriptive resources for Khanty and Mansi dialects. Using established language documentation and annotation tools, OUDB provides interlinked corpus and lexicon data from digitized texts and recent fieldwork in a uniform IPA transcription, together with the corresponding audio recordings. This makes these less-described languages of the Ob-Ugric branch of the Finno-Ugric language family accessible to researchers and to the language community, and archives the raw data for documentation, linguistic evaluation and possible future use in building resources for language technology applications.
