Abstract

We investigate the cost-effectiveness of special-purpose crawled corpora versus more focused corpora for automatic terminology extraction (ATE). Our focus is on medical terminology on heart failure in two languages: English, for which more web and specialized resources are available, and the less-resourced Dutch. We show that, although term density is higher in the dedicated corpora for both languages, the potential for term extraction is higher in the crawled corpora. Furthermore, in a set of experiments in which we evaluate both types of corpora while keeping size constant, we observe that the “noisy” crawled corpus covers more Gold Standard (GS) terms than a dedicated corpus of the same size.
