Author: Gábor Prószéky, Hungarian Research Centre for Linguistics, Hungary

ORCID: https://orcid.org/0000-0002-1082-8202

This special issue of Acta Linguistica Academica is a collection of selected contributions presenting new theoretical and practical results in Hungarian computational linguistics. A central problem in language modeling today is how to learn a language model from examples, for instance how a model of Hungarian can be learned from a set of Hungarian sentences. In this short introduction we outline the basic language modeling methods, which are language independent and can be applied to Hungarian with slight modifications. The authors of this volume show that although these methods rely on a technical and mathematical apparatus, their linguistic relevance is an important theoretical issue in its own right.

Machine learning algorithms make the computer perform accurately on unseen examples after it has been exposed to a sample data set called training data. In other words, a model built from the training data makes predictions or decisions without being explicitly programmed with task-specific steps.
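
To make the training/prediction distinction concrete, here is a minimal, hedged sketch in Python using scikit-learn on a made-up toy dataset; the data and the choice of classifier are purely illustrative and do not come from any system discussed in this volume.

```python
# A minimal sketch of the training/prediction split described above: the model
# is fitted on training data and then makes predictions on examples it has not
# seen, without any task-specific rules being programmed by hand.
# The data and classifier are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

X_train = [[0.0], [1.0], [2.0], [3.0]]   # training examples (one feature each)
y_train = [0, 0, 1, 1]                   # their labels
model = LogisticRegression().fit(X_train, y_train)

print(model.predict([[0.5], [2.5]]))     # predictions on unseen examples, e.g. [0 1]
```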

Neural networks are a family of powerful machine learning models. In this volume we focus on the application of neural network models to natural language corpora, on the difficulties in their development, and on their evaluation. The key notion of neural networks is the (artificial) neuron. We distinguish three main types of layers: the input layer of an artificial neural network (ANN) contains input neurons, which send information to the so-called hidden layer, and the hidden layer sends data to the output layer. If there is more than one hidden layer between the input and output layers, the learning process is called deep learning. Each neuron forms a weighted sum of its inputs and passes the resulting scalar value through a so-called activation function, which defines the output for a given input. Commonly used activation functions include the linear, step, sigmoid and tanh functions (for more, see Baheti 2022). Training is the process of optimizing the weights so that the prediction error is minimized and the network reaches a specified level of accuracy. The method generally applied to determine the error contribution of each neuron is called back-propagation.
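
As an illustration of the weighted sum and activation function described above, the following minimal Python sketch computes the output of a single artificial neuron; the inputs, weights and function names are invented for the example and do not come from any system discussed in this volume.

```python
# A single artificial neuron: a weighted sum of the inputs passed through an
# activation function. All values here are hand-picked for illustration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs followed by the activation function."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, -0.2])
print(neuron(x, w, bias=0.1))                       # sigmoid output in (0, 1)
print(neuron(x, w, bias=0.1, activation=np.tanh))   # tanh output in (-1, 1)
```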

One of the most widely used neural network architectures in natural language processing is (a variant of) the so-called recurrent neural network (RNN), whose output depends not only on the current input but also on the neuron states of the previous step. Long short-term memory (LSTM) is a specific version of the RNN designed mainly to model temporal sequences and their long-range dependencies, which are rather typical phenomena in natural languages. The most recent sequence-to-sequence (seq2seq) models are based on converting sequences from one domain into sequences in another domain, and consist of two RNNs: an encoder that processes the input and a decoder that produces the output. This approach is well suited to human language processing and its applications where text has to be generated, such as machine translation, automatic summarization, question answering or speech recognition. Seq2seq methods are easiest to apply when the input and output sequences have the same length. In the general case, however, they have different lengths, so the entire input sequence is required before prediction of the target can start: the encoder layer processes the input sequence and generates a fixed-size context vector, which represents a semantic summary of the input sequence. This vector is given as input to the decoder, which is trained to predict the next element of the target sequence given its previous elements.
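
The encoder-decoder division can be made concrete with a minimal PyTorch sketch (toy vocabulary sizes and dimensions are assumed; this is not any of the models described in this volume): the encoder compresses the input sequence into a fixed-size context, which initializes the decoder that scores the next target element given the previous ones.

```python
# A minimal LSTM-based seq2seq sketch: the encoder's final states serve as the
# fixed-size context that initialises the decoder.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        _, (h, c) = self.rnn(self.embed(src))
        return h, c                      # fixed-size context vectors

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)          # scores for the next target token

# Toy usage: a batch of 2 source sequences (length 5) and target prefixes (length 4).
src = torch.randint(0, 100, (2, 5))
tgt = torch.randint(0, 100, (2, 4))
encoder, decoder = Encoder(100), Decoder(100)
logits = decoder(tgt, encoder(src))
print(logits.shape)                      # torch.Size([2, 4, 100])
```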

An important notion in the most recent seq2seq models is attention. The encoder's behavior is essentially unchanged by attention: it still receives one word at a time and produces a hidden state, which is used in the next step. The difference is on the decoder's side: not only the last hidden state but all the hidden states are passed to the decoder. The attention layer can thus access all previous states and weigh them according to a learned measure of relevance, providing relevant information even about far-away tokens.
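
The weighting step can be illustrated with a short, self-contained sketch (illustrative names, random vectors): the decoder state is scored against every encoder hidden state, the scores are normalized with a softmax, and the context is the resulting weighted sum of the encoder states.

```python
# A minimal dot-product attention sketch over a set of encoder hidden states.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """decoder_state: shape (d,); encoder_states: shape (seq_len, d)."""
    scores = encoder_states @ decoder_state            # one relevance score per input position
    weights = softmax(scores / np.sqrt(len(decoder_state)))
    context = weights @ encoder_states                 # weighted sum over all positions
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))    # six input positions, hidden size 8
dec = rng.normal(size=8)
context, weights = attend(dec, enc)
print(weights.round(2), context.shape)
```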

Transformers are designed to handle sequential input data (like RNNs), but they do not necessarily process the data in order (unlike RNNs), because their attention mechanism provides context for any position in the input sequence. Transformers identify the context that confers meaning on each word in the sentence, which allows for more parallelization than recurrent neural networks, thereby reducing training times, so larger datasets can be used than ever before. This led to the notion of so-called pre-trained systems, which are trained on massive datasets, contain pre-trained weights and require enormous computational resources. These pre-trained models can then be fine-tuned for specific tasks, which means that if we use a pre-trained model, we only need to train it further on a dataset specific to our task. Fine-tuning is therefore a way of utilizing transfer learning: it takes a model that has already been trained for one task and tunes it to perform a second, similar task. A look at the current literature of natural language processing may convince anybody that the field has recently been revolutionized by the adoption of the above-mentioned technologies.
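
As a hedged illustration of this pre-train-then-fine-tune workflow, the following sketch uses the Hugging Face transformers library to load a pre-trained checkpoint and attach a fresh classification head; the checkpoint name and the two-label task are assumptions made only for the example.

```python
# Loading a pre-trained transformer and preparing it for fine-tuning on a
# task-specific (here: two-label classification) dataset. The checkpoint name
# is an illustrative assumption.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"   # any pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tuning would now continue with task-specific training (e.g. the Trainer
# API or a standard PyTorch loop) on the labelled dataset; here we only run a
# forward pass to show the shape of the task-specific output.
inputs = tokenizer("Ez egy példa mondat.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # (1, 2): one score per label, before fine-tuning
```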

The first attempts to apply neural methods in the computational processing of Hungarian were the first word-embedding models (Novák 2016; Siklósi 2018). The idea is old, only its efficient computational treatment is new: the theoretical foundations of word embedding can be traced back to the early fifties, in particular to Harris (1954) and Firth (1957). Static models like word embeddings have fixed computational graphs and parameters at the inference stage. Dynamic networks, in contrast, can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy and computational efficiency. Dynamic models like the transformer-encoder-based BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2019) and the transformer-decoder-based GPT (Generative Pre-trained Transformer) (Bussler 2020) are the best-known transformers. They were implemented for Hungarian rather quickly after their debut in the international research world. The first BERT architecture trained for Hungarian was huBERT (Nemeskey 2020a), followed by emBERT (Nemeskey 2020b), which enabled the integration of modern contextualized embedding-based classifiers into the e-magyar pipeline (Váradi et al. 2018). Various experimental models were then developed by Yang & Váradi (2021). The first Hungarian BERT-Large model, HILBERT, is described by Feldmann et al. (2021).
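
To illustrate what a static embedding model is, the following sketch trains a tiny word2vec model with the gensim library on a made-up corpus: every word receives a single fixed vector regardless of context, in contrast to the dynamic models mentioned above. The corpus and the parameters are purely illustrative and unrelated to the Hungarian models cited here.

```python
# A toy static word-embedding model in the spirit of the distributional
# hypothesis (Harris 1954; Firth 1957). Sentences and settings are invented.
from gensim.models import Word2Vec

toy_corpus = [
    ["a", "kutya", "ugat"],
    ["a", "macska", "nyávog"],
    ["a", "kutya", "és", "a", "macska", "állat"],
]
model = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, epochs=50)

# Each word gets a single, fixed ("static") vector, regardless of context.
print(model.wv["kutya"].shape)                 # (50,)
print(model.wv.similarity("kutya", "macska"))  # cosine similarity of the two vectors
```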

The papers in this volume introduce the reader to language models recently developed for Hungarian, to their applications and to the datasets created for their evaluation. The authors belong to the following research groups: the Hungarian Language Technology Research Group at the Faculty of Information Technology and Bionics of the Pázmány Péter Catholic University; the Institute for Language Technologies and Applied Linguistics and the Institute for Lexicology of the Hungarian Research Centre for Linguistics (the former Institute for Linguistics of the Hungarian Academy of Sciences); and the Department of Telecommunications and Media Informatics at the Faculty of Electrical Engineering and Informatics of the Budapest University of Technology and Economics.

Attila Novák and Borbála Novák show in their paper Cross-lingual transfer of knowledge in distributional language models: experiments in Hungarian that the distributional approach to language description, as used by neural models, some of which are introduced in this volume, is not only as good as the well-known models of the generative school, but can also naturally handle ambiguity and achieve human-like linguistic performance. What is more, the training material of these models consists of noisy raw language data without any multimodal grounding or external supervision, refuting Chomsky's argument that no generic neural architecture can arrive at the linguistic performance exhibited by humans given the limited input available to children. The authors demonstrate in experiments with Hungarian (as the target language) that the shared internal representations in multilingually trained versions of these models make it possible to transfer specific linguistic skills, including structured annotation skills, from one language to another remarkably efficiently.

Most of the papers in this volume demonstrate that transformer-based language models have achieved state-of-the-art results in tasks like text classification or text generation. However, the layers of these models do not output any explicit representations for text units larger than tokens (e.g., sentences), although such representations are required to perform text classification. Sentence encodings are usually obtained by applying a pooling technique during fine-tuning on a specific task. Bence Nyéki introduces a new sentence encoder in his paper BiVaSE: A Bilingual Variational Sentence Encoder with Randomly Initialized Transformer Layers. Relying on an auto-encoder architecture, the system was trained on bilingual data to learn sentence representations from the very beginning of its training. The representations were evaluated in downstream and linguistic probing tasks. It must be noted that the new encoder generally performs worse than the well-known transformer-based encoders, but the author shows that the system could learn to incorporate linguistic information into the sentence representations.
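
The pooling step mentioned above can be sketched as follows (a hedged example with an assumed multilingual checkpoint, not the encoder proposed in the paper): the transformer outputs one vector per token, and a fixed-size sentence representation is obtained by averaging the token vectors while ignoring padding.

```python
# Mean pooling of token vectors into sentence vectors. The checkpoint name is
# an illustrative assumption.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

batch = tokenizer(["Ez az első mondat.", "Ez pedig a második."],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**batch).last_hidden_state   # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding tokens
sentence_vectors = (token_vectors * mask).sum(1) / mask.sum(1)
print(sentence_vectors.shape)                           # (2, hidden_size)
```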

Győző Zijian Yang describes the capability of neural technologies to produce summaries of long texts in his paper Neural Text Summarization for Hungarian. Text summarization is currently one of the hottest topics both in research and in industry. The two basic approaches are extractive summarization, which searches for the most relevant sentences in the text, and abstractive summarization, which generates a shorter text based on the content of the original. Naturally, the first solutions for both approaches were developed for English; this paper demonstrates the first running applications for Hungarian text summarization, both extractive and abstractive. The author used various transformer-based methods for this task and evaluated them one by one.
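
As a toy illustration of the extractive idea only (the paper itself evaluates neural, transformer-based methods), the following sketch scores sentences by the frequencies of their words and keeps the highest-scoring ones in their original order.

```python
# A frequency-based extractive summarizer: purely illustrative, not the method
# evaluated in the paper.
from collections import Counter

def extractive_summary(sentences, k=2):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)
    scored = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w.lower()] for w in sentences[i].split()),
                    reverse=True)
    keep = sorted(scored[:k])            # keep the original sentence order
    return [sentences[i] for i in keep]

doc = ["A kutya ugat.", "A macska alszik.",
       "A kutya és a macska barátok.", "Ma szép idő van."]
print(extractive_summary(doc, k=2))
```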

When speaking of natural language processing, the question soon arises: how good are the most recent machine translation techniques? László János Laki and Győző Zijian Yang give an overview of the currently existing Hungarian solutions for machine translation in their paper Neural Machine Translation for Hungarian and show how academic and industrial systems cope with the difficulties caused by the rather free constituent order of the morphologically very rich Hungarian language. The authors demonstrate that the Marian NMT and BART models they trained for the English–Hungarian language pair perform significantly better than the solutions offered even by the market-leading multinational companies. The paper also shows some promising results obtained by fine-tuning different pre-trained multilingual models such as mT5, mBART and M2M100 for English–Hungarian translation.
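
For readers who want to try neural machine translation directly, the following hedged sketch runs a publicly available Marian-style English-to-Hungarian checkpoint through the transformers library; the checkpoint name is an assumption and it is not one of the models trained by the authors.

```python
# Translating with an assumed publicly available English-to-Hungarian
# Marian-style checkpoint; not the authors' models.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Helsinki-NLP/opus-mt-en-hu"      # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

batch = tokenizer(["The weather is nice today."], return_tensors="pt")
generated = model.generate(**batch, max_new_tokens=40)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```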

Meaning discrimination is a difficult task and still a hot topic in natural language processing. The Word-in-Context dataset focuses on a specific sense disambiguation task: it must be decided whether the same target word, appearing in two different contexts, is used in the same sense or not. One of the biggest problems is that the meaning discrimination task is not well defined even for humans, as shown by relatively low inter-annotator agreement. The paper A Proof-of-Concept Meaning Discrimination Experiment to Compile a Word-in-Context Dataset for Adjectives: A Graph-based Distributional Approach by Enikő Héja and Noémi Ligeti-Nagy introduces a method in which both sparse and dense vector representations served as input. Their algorithm can anchor the semantic information to contextual data, and it is therefore able to provide clear and explicit criteria as to when the same meaning should be assigned to the occurrences. The approach seems usable for Hungarian and also for other low-density languages.
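
The decision the dataset asks for can be caricatured in a few lines: given contextual vectors for the two occurrences of the target word (however they were obtained), the occurrences are judged to share a sense if their cosine similarity exceeds a threshold. The vectors and the threshold below are made up for illustration and do not reproduce the authors' graph-based method.

```python
# A toy Word-in-Context decision based on cosine similarity of contextual
# vectors; vectors and threshold are illustrative.
import numpy as np

def same_sense(vec_a, vec_b, threshold=0.7):
    cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos >= threshold, cos

rng = np.random.default_rng(1)
v1, v2 = rng.normal(size=32), rng.normal(size=32)
print(same_sense(v1, v1))   # identical contexts: same sense
print(same_sense(v1, v2))   # unrelated vectors: likely different sense
```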

Ágnes Kalivoda's contribution entitled PrevDistro: An Open-access Dataset of Hungarian Preverb Constructions gives an overview of the productive predicate formation system of Hungarian that combines a preverb and a verb. One of its interesting features is that the preverb, depending on the construction in question, may preserve its separability to some extent. The paper introduces the reader to an open-access dataset of preverb distributions (called PrevDistro) that contains more than forty million corpus occurrences of almost fifty preverb construction types and was developed by the author single-handedly. A detailed explanation of the design considerations, the basic methodology and the dataset's main characteristics is also given.

The Winograd Schema Challenge is a novel kind of Turing test relying on anaphora resolution with commonsense reasoning and world knowledge. The resolution of anaphora is a rather difficult task in computational linguistics. The authors, Noémi Vadász and Noémi Ligeti-Nagy, introduce their Hungarian equivalent of the original Winograd schemata in their paper Winograd Schemata and Other Datasets for Anaphora Resolution in Hungarian. A parallel corpus is also provided, containing all the translations in which the schemata are currently available. The challenges the authors faced during the adaptation process are also described in detail.

The first seven papers in this volume deal with the processing of written language only, but the paper Morphology Aware Data Augmentation with Neural Language Models for Online Hybrid ASR by Balázs Tarján, Péter Mihajlik and Tibor Fegyó introduces the reader to spoken language processing. Recognition of Hungarian conversational telephone speech is challenging because of the informal style and the morphological richness of Hungarian. The known neural methods are almost inapplicable here: the authors show that well-known methods that work well for isolating languages cause a vocabulary explosion in a morphologically rich language. A subword-based method is introduced which significantly improves the Word Error Rate (WER) while greatly reducing vocabulary size and memory requirements. Combining subword-based modeling and neural language model-based data augmentation, the authors achieved an 11% relative WER reduction while preserving the real-time operation of their conversational telephone speech recognition system. Furthermore, subword-based neural text augmentation outperforms the word-based approach in the recognition of Out-of-Vocabulary (OOV) words as well.
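
The subword idea can be illustrated with a hedged sketch using the Hugging Face tokenizers library: a small BPE vocabulary is learned from a toy corpus, so that rare, morphologically complex word forms are segmented into frequent subword units instead of inflating the vocabulary. The corpus and the vocabulary size are invented for the example and do not correspond to the authors' setup.

```python
# Learning a tiny BPE subword vocabulary from a toy Hungarian corpus.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["megnézhetnénk a házakat", "a házainkban lakunk",
          "megnézném a házat", "nézzük meg a házaitokat"]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(
    corpus, trainer=BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# An unseen, morphologically complex form is split into known subword pieces.
print(tokenizer.encode("megnézhetnénk a házainkat").tokens)
```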

Following the research papers, Bálint Sass's contribution can be read in the Discussion Note section. He gives an introduction to corpus querying for those who are not really familiar with computer programming. The paper Principles of Corpus Querying presents some basic methods for collecting linguistic data soundly and effectively from text corpora by direct data manipulation, which can be done even without knowledge of sophisticated programming languages. In brief, it is an introduction for non-computational linguists on how to obtain correct and complete data from corpora and use them in further linguistic research.

ACKNOWLEDGEMENT

I would like to thank all the contributors to this volume for their inspiring papers and all the reviewers for their very helpful and constructive reviews. Special thanks go to Prof. András Cser, the editor-in-chief of Acta Linguistica Academica, for making this special issue possible and for inviting me to be its guest editor.

References

  • Baheti, Pragati. 2022. Activation functions in neural networks (12 types and use cases). https://www.v7labs.com/blog/neural-networks-activation-functions [Retrieved: October 6, 2022].
  • Bussler, Frederik. 2020. Will GPT-3 kill coding? https://towardsdatascience.com/will-gpt-3-kill-coding-630e4518c04d [Retrieved: October 14, 2022].
  • Devlin, Jacob, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the ACL, Vol. 1. Minneapolis: Association for Computational Linguistics. 4171–4186.
  • Feldmann, Ádám, Róbert Hajdu, Balázs Indig, Bálint Sass, Márton Makrai, Iván Mittelholcz, Dávid Halász, Zijian Győző Yang and Tamás Váradi. 2021. HILBERT, magyar nyelvű BERT-large modell tanítása felhő környezetben. XVII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged: SZTE. 29–36.
  • Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis. Oxford: Philological Society. 1–32.
  • Harris, Zellig. 1954. Distributional structure. Word 10(2–3). 146–162.
  • Nemeskey, Dávid Márk. 2020a. Natural language processing methods for language modeling. PhD thesis. Budapest: ELTE.
  • Nemeskey, Dávid Márk. 2020b. Egy emBERT próbáló feladat. XVI. Magyar Számítógépes Nyelvészeti Konferencia. Szeged: SZTE. 409–418.
  • Novák, Attila. 2016. Improving corpus annotation quality using word embedding models. Polibits 53. 49–53.
  • Siklósi, Borbála. 2018. Using embedding models for lexical categorization in morphologically rich languages. Computational Linguistics and Intelligent Text Processing (LNTCS Vol. 9623). Springer. 115–126.
  • Váradi, Tamás, Eszter Simon, Bálint Sass, Iván Mittelholcz, Attila Novák, Balázs Indig, Richárd Farkas and Veronika Vincze. 2018. E-magyar – A digital language processing system. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation. Miyazaki: ELRA. 1307–1312.
  • Yang, Zijian Győző and Tamás Váradi. 2021. Training language models with low resources: RoBERTa, BART and ELECTRA experimental models for Hungarian. In Proceedings of the 12th IEEE International Conference on Cognitive Infocommunications. IEEE. 279–285.
