In this paper we describe the data processing procedures and the preliminary results of the Ob-Ugric database (OUDB) project, a web-based framework for developing corpus-based descriptive resources for Khanty and Mansi dialects. Using established language documentation and annotation tools, OUDB provides interlinked corpus and lexicon data from digitized texts as well as recent fieldwork, in a uniform IPA transcription and together with the corresponding audio recordings. This makes these less-described languages of the Ob-Ugric branch of the Finno-Ugric language family accessible to researchers as well as to the language community. It also archives the raw data for documentation, linguistic evaluation and possible future use in building resources for language technology applications.