Abstract
This contribution addresses the challenging issue of building corpus resources for the study of news translation, a domain in which the coexistence of radical rewriting and close translation makes the use of established corpus-assisted analytical techniques problematic. In an attempt to address these challenges, we illustrate and test two related methods for identifying translated segments within trilingual (Spanish, French and English) sets of dispatches issued by the global news agency Agence France-Presse. One relies on machine translation and semantic similarity scores, the other on multilingual sentence embeddings. To evaluate these methods, we apply them to a benchmark dataset of translations from the same domain and perform manual evaluation of the dataset under study. We finally leverage the cross-linguistic equivalences thus identified to build a ‘comparallel’ corpus, which combines the parallel and comparable corpus architectures, highlighting its affordances and limitations for the study of news translation. We conclude by discussing the theoretical and methodological implications of our findings both for the study of news translation and more generally for the study of contemporary, novel forms of translation.
1 Introduction
This contribution addresses the challenging issue of building corpus resources for the study of news translation. This field of study has grown substantially in the past two decades, but still raises methodological and theoretical issues. The multilingual output of news producers is typically characterized by the coexistence of radical rewriting and close translation, as detailed in section 2. In this scenario, the fundamental corpus linguistics distinction between comparable and parallel corpus setups (Bernardini, 2022) is hardly applicable. In an attempt to address the challenges described by news translation scholars (Caimotto & Gaspari, 2018; Davier & van Doorslaer, 2018), we focus on global news agencies as a typical discourse production setting in which translation plays a fundamental but typically unacknowledged, almost intangible role.
We propose two related methods for identifying translated segments within trilingual sets of dispatches issued by the same news agency (Agence France-Presse, AFP) in Spanish (ES), French (FR) and English (EN): one relying on machine translation and semantic similarity scores, the other on multilingual sentence embeddings. To evaluate the feasibility of our general approach, and to establish a tentative ‘translationality’ threshold, we perform manual evaluation of the dataset under study and apply the proposed methods to a separate gold-standard translation dataset from the same domain (news). Based on these two separate rounds of evaluation, we obtain a set of candidate translated sentence pairs that are subjected to qualitative spot checks. These reveal different discursive practices at work, with monolingual text reuse becoming more prominent as events unfold and texts are written about them. As a final step, we describe a ‘comparallel’ corpus setup which combines the parallel and comparable corpus architectures, and highlight its affordances and limitations.
While our work is still at the proof-of-concept stage, it does point to the feasibility of applying corpus methods to study translation practices in settings where boundaries between multilingual text production and translation are blurred. As such settings are likely to increase in the future, as a consequence of “the full range of interactions involving the production and transfer of meaning in fluid genres” (Gambier, 2022, p. 103), work along these lines seems fundamental if the product-oriented study of translation is to remain aligned with current priorities and practices in the field.
The structure of our contribution is as follows. Section 2 provides an overview of the literature on news translation and reviews previous work on building corpora of translated news. In section 3 we illustrate the method to identify translation in news texts, in section 4 we present the results of the quantitative and qualitative evaluation of its performance, and in section 5 the comparallel corpus obtained by leveraging them. Section 6 concludes the article by discussing the theoretical and methodological implications of our findings, and reflecting on the ways in which our approach could be extended and refined to build ever larger, representative corpora for the study of news translation.
2 Previous work
2.1 News translation and global news agencies
The subfield of ‘journalistic translation’ (Valdeón, 2015) or ‘news translation’ (Holland, 2013) has received increasing attention from scholars in translation studies in the past two decades. More recent work has referred to this object of study as ‘news media translation’ (Zanettin, 2021) and ‘translation and/in/of media’ (Bielsa, 2022, p. 1), underscoring both the flexible nature of media nowadays, which may encompass newspapers, online news providers, national or global news agencies, amongst others, and the multiple roles translation plays in them. The inherent flexibility and complexity of news production is well summarized by the notion of ‘media convergence’, accounting for news production across multiple platforms (Davier & Conway, 2019), often in multimodal and multilingual format.
While translation is no doubt a key element of current news production, playing multiple roles and adapting to the changing landscape of media communication, its ties to journalism are in fact far from recent, dating back to the very origin of the latter (Davier, 2022; Valdeón, 2022a). In essence, news translation happens at the point where ‘news crosses national boundaries’ (Palmer, 2011, p. 186). It is therefore one of the pillars of global news agencies, also known as newswire services, i.e., news organizations that sell their output to a range of ‘retail’ clients worldwide (Boyd-Barrett & Rantanen, 1998), such as newspapers and television broadcasters. They have even been described as vast translation agencies (Bielsa, 2007), structurally designed to provide fast and reliable translations of considerable volumes of information on a daily basis. For instance, Agence France-Presse (AFP), the news agency that is the subject of our study, has had in place a “network of correspondents and translators” (AFP, 2022) since its early days, setting up a work environment where journalism and translation combine to produce multilingual news output. As the first and one of the most authoritative institutions of its kind, AFP has set the standard for other news agencies, which have adopted similar working practices. Headquartered in Paris, with regional offices across the globe, AFP publishes news in Arabic, English, French, German, Portuguese and Spanish. News coverage and translation services originating in Spanish, as is the case with the dispatches we address in this contribution, are provided by offices in Uruguay, which centralize news production from the Americas (Rodriguez Blanco, 2024a).
Even though a strong tie has always existed between translation and journalism, translation in news settings remains largely invisible, and text producers do not seem to view their role as related to that of translators. Early work by Hernando (1999) highlighted the limited interest in translation in journalism research, and this was confirmed more recently by Valdeón (2023, p. 250), who has suggested that, in journalism studies, “translation remains under-researched and restricted to literal language transfer”. This contrasts with the attention devoted to news translation within translation studies, a field that defines itself in terms of its interdisciplinarity, and has thus been able to underscore the importance of interactions between multimedia and news translation (Valdeón, 2023).
Nonetheless, some publications specializing in journalism and media studies have recently started to include the perspective of translation. For instance, a prominent journal in the field, i.e. Journalism, has published works dealing with gatekeeping (Valdeón, 2022b), journalistic translation and interdisciplinarity (Kalantari, 2022; Valdeón, 2022a), translation of digital narratives (Hernández Guerrero, 2022), and foreign reporting about Latin America in the German press (Cazzamatta, 2022). This journal had also previously published a special issue about translation seen from media studies, focusing on the BBC multilingual services (Baumann, Gillespie, & Sreberny, 2011). Recently an entry on journalistic translation was included in the Sage Encyclopedia of Journalism (Valdeón, 2022c), where the author reviews the historic ties between translation and journalism in the production of international news. All these developments point to an increased awareness of, and attention to, translation within media studies.
Moving on to the study of translated news products, fundamental challenges emerge since translated news products are typically characterized “by a high degree of transformation and rewriting” (Bielsa, 2010, p. 48). Indeed, texts produced by ‘journalators’ (Van Doorslaer, 2012) are often “far” from “an accurate translation that is true to the original” (Hernández Guerrero, 2022, p. 232). News translation is not always characterized solely by variation though. Examining bilingual cultural news coverage about Bolivia, Rodriguez Blanco (2024b) finds that practices of close and distant translation coexist: in fact, close renditions dominate as a macrostrategy, possibly suggesting that cultural coverage has different priorities and constraints than other types of news.
The coexistence of seemingly contradictory practices such as radical rewriting and close translation (Davier, 2021), or the addition and recycling of information as a practice of so-called ‘patchwork’ (Davier, 2017) are indeed constituent features of multilingual news production. This is because, in this setting, cross-linguistic translation of a more or less literal kind is but one aspect of the more general process that media studies scholars have referred to as ‘traduction journalistique’ (Lagneau, 2007). Not to be confused with the translation studies notion of ‘journalistic translation’, ‘traduction journalistique’ refers to the process through which journalists ‘translate’ an event into news coverage, through processes of selection and hierarchization, constructing newsworthiness by means of discursive devices that underscore specific news values (Bednarek & Caple, 2017).
Decisions concerning literalness vs. rewriting in multilingual news production are thus not set a priori, but rather depend on the complex interweaving of the agency's and journalists' agendas and their assumptions about audience expectations. The variation observed may also be due to contingent factors intrinsic to newsroom dynamics. Hernández Guerrero (2019, p. 387) has mentioned the limitations of time and space that ‘journalists-translators’ are faced with when translating news items, while Bielsa (2010, p. 41) has pointed to the lack of specific training in translation among journalists whose priority is to ensure maximum impact.
Against this backdrop, corpus-assisted studies of “translational phenomena in the news” (Davier, 2022) thus pose theoretical and methodological challenges, as well as opportunities, of more general import. Basic steps in the study of ‘prototypical’ translation, such as tracing down source and target texts (Caimotto & Gaspari, 2018) and defining equivalence, translation units and even authorship (Bielsa & Bassnett, 2009; Gambier, 2022) are far from straightforward in this field. At the same time, as suggested by Scammell (2021, p. 302), the “continually evolving global media landscape presents opportunities for research that specifies the involvement of translation in a multitude of developing news contexts”. For all their challenge and complexity, theoretical and analytical frameworks that allow us to study news translation “reframed as a form of intercultural interaction” (Gambier, 2022, p. 99) are necessary, if corpus-assisted translation studies are to remain relevant to society and able to account for an object of study whose boundaries are becoming increasingly fluid and blurred.
2.2 Bridging the gap between comparable and parallel corpora in the study of news translation
Corpus studies of news translation have typically relied on multilingual comparable corpora (Zanettin, 2021), which in this field can be defined as collections of journalistic accounts of the same event for which a translation relation may be posited based on external evidence (Davier & van Doorslaer, 2018). Parallel corpora, or collections of texts which can be aligned since they result from a translation process, would clearly offer a more powerful resource, yet in this domain their creation is hindered by the difficulty of identifying source-target pairs (Davier & van Doorslaer, 2018), and of singling out sub-textual translation units within them (Gambier, 2022). While the first issue can be partly addressed using paratextual and ethnographic approaches, identifying translation units (i.e., translated sentences) remains highly problematic. In the words of Caimotto and Gaspari (2018, p. 216) “adopting straightforward conventional parallel corpus methodologies seems hardly feasible in most news translation scenarios”.
This problem has been addressed in two main ways in previous work, not specifically related to news translation. Computational techniques have been employed to mine parallel sentences from comparable corpora and noisy parallel corpora (Barrón-Cedeño, España-Bonet, Boldoba, & Màrquez, 2015; Gete et al., 2022). Extracting parallel sentences from similar multilingual corpora is a well-known problem, addressed as a necessary step when gathering data for training and testing of machine translation systems, as well as for cross-lingual information retrieval algorithms. Early approaches relied on metadata from web crawls and searches, as described in the seminal STRAND paper (Resnik, 1999). More recent methods have focused on the textual content instead, learning a classifier or using alignment information, as well as applying machine translation to one side of the corpus and then using a similarity score to identify parallel sentences (Abdul-Rauf & Schwenk, 2009; Bouamor & Sajjad, 2018). With the advent of encoder-decoder based systems for sentence embeddings, new methods have emerged, even using multilingual sentence embeddings alone (Chaudary et al., 2019; Guo et al., 2018; Schwenk, 2018). These methods have proven effective in identifying translation matches from quasi-parallel corpora (Artetxe & Schwenk, 2019).
Within corpus linguistics, attempts have been made to create hybrid comparallel corpora, i.e., collections of texts aligned at the document level, where the actual correspondence of translation units at the sub-textual level, if any, is left for the researcher to work out (Bernardini, Castagnoli, Ferraresi, Gaspari, & Zanchetta, 2010; Gaspari, 2015). This solution is, however, less than ideal, since the translation units have to be painstakingly identified by the researcher manually.
Recent work by Pęzik and Grabowski (2023) has approached this issue from a perspective similar to the one described here. Their aim is to extend the Polish-English Paralela corpus (Pęzik, 2016) with bilingual data that are translated, or at least convey a similar message in the two languages. For this purpose, they use multilingual sentence embeddings, then setting similarity thresholds based on manual assessment of candidate parallel sentences and of conditional inference trees – the latter method being applied to headlines only. The result is a near-parallel corpus containing both translationally and thematically equivalent text portions. However, only these near-parallel portions are included in the corpus, thus effectively turning the corpus into a pseudo translation memory, with no user access to full texts, and thus to discursive practices that cross the boundaries between translation and rewriting/patchwork.
3 Method
This study describes a proof-of-concept attempt to develop a comparallel corpus of multilingual news dispatches by the global news agency Agence France-Presse (AFP). Three sets of trilingual (Spanish, French and English) news dispatches from AFP dealing with a single key political event (the general elections of 2020 in Bolivia) were downloaded from the LexisNexis database (https://www.lexisnexis.com). Triplets of dispatches in the three languages, published between the 14th and the 20th of October 2020, were manually identified as being related to each other. Relatedness was defined on the basis of textual and paratextual cues, including thematic similarity, length, date and time of publication, and authorship (see Table 1 below). The Spanish version in each triplet is published first, then the French one and finally the English one, consistent with newsroom practices at AFP (Rodriguez Blanco, 2024a).
Table 1. Textual and paratextual cues for the selection of candidate dispatches
No. | Multilingual Headlines | Dateline | Length (words) | Paratexts (initials) |
1aES | Bolivia, el país de América con mayor cantidad de indígenas | La Paz, October 15, 2020 7:18 PM GMT | 738 | ber-mm/fj/ll |
1bFR | Bolivie: un des pays du continent à la plus forte proportion d'Amérindiens | La Paz, October 16, 2020 5:05 AM GMT | 713 | ber-mm-ang/cds/jb/roc |
1cEN | Bolivia: Turmoil in Latin America's indigenous heartland | La Paz, October 16, 2020 10:10 AM GMT | 672 | bur-ang/jmy/fg/bc/dw/leg |
2aES | Candidato de Evo Morales se impone en primera vuelta de presidenciales de Bolivia | La Paz, October 19, 2020 5:14 AM GMT | 789 | val-pb/jac/fj/yow |
2bFR | Bolivie: Luis Arce, dauphin d'Evo Morales, vainqueur de la présidentielle | La Paz, October 19, 2020 5:49 AM GMT | 791 | bur-jb/ahe |
2cEN | Bolivia 'has recovered democracy' says Arce as exit poll suggests win | La Paz, October 19, 2020 8:30 AM GMT | 741 | val-pb/jac/fj/lda/bc/st |
3aES | Arce tomará las riendas de una Bolivia polarizada y en crisis económica | La Paz, October 20, 2020 2:44 AM GMT | 790 | val-fj/rsr |
3bFR | Bolivie: Arce sera le futur président, Morales “tôt ou tard” dans le pays | La Paz, October 20, 2020 3:17 AM GMT | 767 | bur-jb/bds/am |
3cEN | Morales says will return to Bolivia after ally's election victory | La Paz, October 20, 2020 3:29 AM GMT | 751 | bur-fj/db/st/to/jh |
As a first step, the FR and ES dispatches were automatically translated into EN using the machine translation system Modern MT (https://www.modernmt.com/); we then used sentence-transformer models (Reimers & Gurevych, 2019) from the Hugging Face platform (https://huggingface.co) to obtain a vector representation for each original and each translated sentence, and computed cosine similarity between each sentence pair. As an alternative method, we used a language-agnostic sentence embedding model (LaBSE, Feng, Yang, Cer, Arivazhagan, & Wang, 2022) to obtain a vector representation for each sentence, comparing them directly across language pairs.
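The scoring step can be illustrated with a minimal sketch that computes a cosine similarity matrix between two sets of sentence vectors. This is not the code used in the study: the three-dimensional vectors below are invented stand-ins for the embeddings that a sentence-transformer model or LaBSE would return, and the names `cosine`, `similarity_matrix`, `es_vecs` and `en_vecs` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(src_vecs, tgt_vecs):
    """One row per source sentence, one column per target sentence."""
    return [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]


# Invented toy vectors (assumption): real embeddings have hundreds of dimensions.
es_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
en_vecs = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

scores = similarity_matrix(es_vecs, en_vecs)
for row in scores:
    print([round(x, 3) for x in row])
```

Each row corresponds to a source sentence and each column to a target sentence, so the matrix has the shape of a sentence-by-sentence score grid for one dispatch pair.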
Benchmark data were obtained in two ways. First, three authors evaluated the three dispatch triplets (T) independently, categorizing each sentence as either ‘translated’ or ‘not translated’. Interrater agreement as measured by Krippendorff's α varied for the three triplets, pointing to agreement levels from moderate (T2: 0.462; T3: 0.456) to substantial (T1: 0.646); cf. Artstein and Poesio (2008, p. 576). The irr package (Gamer, Lemon, Fellows, & Puspendra, 2019) in R (R Core Team, 2021) was used for this purpose.
Second, our methods were applied to a ‘prototypical’ translation dataset from the same textual domain: the NTREX dataset (Federmann, Kocmi, & Xin, 2022), which contains news data originally in English and professionally translated into 128 languages. We selected 30 sentences in English, Spanish, and French and computed sentence similarity between each pair of sentences using the same methods as in the main study.
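For readers unfamiliar with the agreement measure used for the manual annotation, the sketch below shows how Krippendorff's α for nominal labels can be computed from a coincidence matrix. The study itself used the irr package in R; this Python function is only an illustrative reimplementation under simple assumptions (no special handling of missing ratings beyond skipping units with fewer than two labels), and the toy annotations are invented.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list of labels per sentence pair, one label per annotator."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings within a unit counts 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        return 1.0  # convention chosen here: only one category was ever used
    return 1.0 - (n - 1) * observed / expected


# Invented toy data: 1 = 'translated', 0 = 'not translated', three annotators.
toy_units = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))
```

In this toy set the annotators disagree on one unit out of four, which is enough to pull α well below 1 given the small sample.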
Based on the scores obtained for translated sentences from the NTREX dataset (Fig. 1), sentence pairs receiving a score of 0.8 and above seem likely to result from translation. This was confirmed by manual perusal, which further suggested that matches which score in the region of 0.6 may result from general heavy rewriting/patchwork, or from the presence of translated and non-translated sub-sentential segments (see examples 1 and 2 in section 4.1). Since close translations like the ones in NTREX are rare in this setting, and patchwork matches are also relevant for our purposes, we relaxed the 0.8 threshold and selected for further consideration matches scoring 0.6 and higher.
Figure 1. Similarity scores for the Spanish/English NTREX data
Citation: Across Languages and Cultures 25, 2; 10.1556/084.2024.00905
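The two tentative thresholds derived from the NTREX scores can be expressed as a simple banding function. The sketch below is an illustration only: the function name and band labels are hypothetical, and a score on its own cannot establish that a pair is a translation. The scores 0.65, 0.66, 0.43 and 0.64 are those reported for examples 1 to 4 in section 4.1, while 0.91 is an invented value standing in for a close translation.

```python
CLOSE_THRESHOLD = 0.8      # scores at or above this suggest close translation
PATCHWORK_THRESHOLD = 0.6  # relaxed threshold used to select candidates


def label_match(score):
    """Assign a tentative band to a similarity score."""
    if score >= CLOSE_THRESHOLD:
        return "close translation"
    if score >= PATCHWORK_THRESHOLD:
        return "rewriting/patchwork candidate"
    return "below threshold"


# 0.91 is invented; the other scores are reported for examples 1-4.
example_scores = {"invented": 0.91, "ex1": 0.65, "ex2": 0.66,
                  "ex3": 0.43, "ex4": 0.64}
labels = {name: label_match(s) for name, s in example_scores.items()}
for name, label in labels.items():
    print(name, label)
```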
To evaluate the accuracy of the two automatic scoring methods against the manually annotated benchmark we calculated their precision and recall. These are widely used measures in computational linguistics and natural language processing to evaluate a system's ability to find only cases that annotators marked as ‘good’ cases (precision) and to find all of them (recall). We consider as a true positive, i.e., a case in which the automatic method matches the manual evaluation, each sentence pair which received a score of 0.6 or higher from the respective automatic scoring system and was marked as a translation by at least one annotator. Precision is then calculated as the number of true positives out of all sentence pairs with scores of 0.6 or higher, and recall as the number of true positives out of all sentence pairs which were considered as matches by at least one annotator. This quantitative analysis was complemented by a more qualitative one, looking at cases of substantial divergence between the automatic and human evaluation.
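The definitions of precision and recall given above can be sketched as follows. The function and variable names are hypothetical, and the six (score, votes) pairs are invented for illustration; `votes` stands for the number of annotators, out of three, who marked the pair as a translation.

```python
def precision_recall(pairs, threshold=0.6, min_votes=1):
    """pairs: list of (similarity_score, annotator_votes) tuples."""
    selected = [(s, v) for s, v in pairs if s >= threshold]
    true_pos = sum(1 for _s, v in selected if v >= min_votes)
    gold = sum(1 for _s, v in pairs if v >= min_votes)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / gold if gold else 0.0
    return precision, recall


# Invented toy data (assumption), not taken from the AFP dispatches.
toy_pairs = [
    (0.92, 3),  # high score, all annotators agree: true positive
    (0.66, 2),  # above threshold, two votes: true positive
    (0.64, 0),  # above threshold, no votes: false positive
    (0.55, 1),  # below threshold, one vote: missed match
    (0.43, 3),  # below threshold, three votes: missed match
    (0.30, 0),  # correctly ignored
]
precision, recall = precision_recall(toy_pairs)
print(round(precision, 3), round(recall, 3))
```

Because a single annotator vote is enough for a pair to count as a match, the set of reference matches is larger than it would be under a majority rule, which tends to favour precision and penalize recall.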
The matches identified using the best automatic scoring method were leveraged to build a prototype comparallel corpus, which was then aligned and indexed using the Sketch Engine corpus query tool (https://www.sketchengine.eu/), as described in section 5.
4 Results
4.1 Evaluation of the automatic scoring methods
Figure 2 displays scores for potentially translated sentences from the first dispatch in Spanish and English. The dark blue cells and bold font indicate scores above 0.8, with lighter shades indicating lower scores. Direct observation confirms that matches which score in the region of 0.6 may result from general heavy rewriting/patchwork, or from the presence of translated and non-translated sub-sentential segments, as exemplified by examples 1 and 2.
Figure 2. Similarity scores for the Spanish/English versions of dispatch nr. 1 from our dataset (1aES vs. 1cEN)
Example (1) shows a sentence pair (specifically the headlines of the first dispatch in French and English) that received a score of 0.65. Based on the structural similarity of the two headlines we might conclude that this is a case of heavy rewriting, or else, based on references to Bolivie/Bolivia, continent/Latin America, Amérindiens/indigenous, conclude that the two sentences are simply linked by thematic similarity. Indeed, one scorer opted for the former interpretation, identifying it as a translation-relevant match, and two for the latter, excluding it from the selection.
(1) |
Bolivie: un des pays du continent à la plus forte proportion d'Amérindiens (1bFR, s1) |
Bolivia: Turmoil in Latin America's indigenous heartland (1cEN, s1) |
Example (2) is another case of a low-scoring match (0.66), which was however identified as a case of translation by two out of three human scorers. In this case the similarity seems mainly due to the final part being a relatively close translation (the reference to nationalization of gas and oil), while the rest of the two sentences are only partly thematically similar.
(2) |
El crecimiento del PIB boliviano es uno de los más importantes de la región en los años de Evo Morales, especialmente gracias a una nacionalización de los hidrocarburos, en 2006. (1aES, s. 18) |
Previously the country had enjoyed average annual growth of 4.9 percent between 2004 and 2014, thanks to Morales's nationalization in 2006 of the gas and oil sectors. (1cEN, s. 21) |
Precision and recall measured on the basis of sentences obtaining scores ≥ 0.6 from the two methods show noticeable differences between different dispatches (see Table 2): while the first dispatch contains several sentences that are clearly recognized as translations by both automated methods and human annotators, the other dispatches present more ambiguous content similarities. As human annotators do not agree on identifying potential alignments in dispatches 2 and 3 (as testified by the lower interrater agreement; cf. section 3), recall provided by the automated methods is low, showing the limits of sentence embeddings in capturing paraphrasing.
Table 2. Precision and recall for the two methods across the three AFP dispatches
Dispatch | Precision (LaBSE) | Precision (MT+BERT) | Recall (LaBSE) | Recall (MT+BERT) |
Dispatch 1 | 0.969 | 0.954 | 0.733 | 0.721 |
Dispatch 2 | 0.900 | 0.815 | 0.300 | 0.367 |
Dispatch 3 | 0.844 | 0.771 | 0.375 | 0.375 |
Total | 0.923 | 0.874 | 0.495 | 0.509 |
In general, we observe that LaBSE, the multilingual model (or so-called ‘language-agnostic’ model), performs better than the MT+BERT method, which uses machine translation into English coupled with an English-only model. Indeed, LaBSE obtains higher precision for every set of dispatches, as shown in Table 2. That said, MT+BERT tends to produce slightly more matches, and thus slightly higher overall recall (0.509 vs. 0.495), which helps marginally in the more ambiguous situations, but with no improvement in precision.
These results seem encouraging, despite the substantial variation observed in terms of results obtained for closer vs. more distant dispatches. Given the complexity of the task for human scorers as well, and thus the limited reliability of the benchmark data, we performed qualitative spot checks to get a feeling for the actual performance of our model beyond quantitative results. Examples (3) and (4) show cases of mismatch between the automatic and the manual scoring. In particular, the automatic similarity score for (3) was below the 0.6 threshold (0.43), even though the three human evaluators had identified the pair as translated. The two sentences convey approximately the same meaning, but formally they are quite different, one being substantially longer than the other, which may have resulted in the low score.
(3) |
Hasta 1982 la nación vivió inmersa en una gran inestabilidad política. (1aES, s. 12) |
Bolivia has a history of political instability. (1cEN, s. 16) |
Example (4) is rather more puzzling, since the similarity score is above the threshold (0.64), albeit barely, yet no human evaluator had identified these sentences as translation matches. Indeed, the similarities are scarce and local (Arce, felicitación/félicité, Luis, the set of parentheses, possibly reelección/victoire), such that this match can safely be categorized as a false positive.
(4) |
Arce recibió además la felicitación y los deseos de “éxito” de Luis Almagro, secretario general de la Organización de Estados Americanos (OEA), entidad cuyo lapidario informe sobre los comicios de 2019 estimuló las protestas que condujeron a la dimisión de Morales, tras una polémica nueva reelección. (6aES, s. 19) |
Le président vénézuélien Nicolas Maduro a “félicité le peuple frère de Bolivie à l'occasion du large et indiscutable triomphe du Mouvement vers le socialisme (MAS)” et de l'“éclatante victoire” de Luis Arce. (6bFR, s. 31) |
4.2 A note on intralingual text reuse
The three sets of dispatches in our dataset seem to differ in terms of the main discursive practices they employ. As an event unfolds, and texts are produced about it, news workers have more material at their disposal to pick and mix from. In our dataset, segments from the first/earlier set of dispatches align quite closely, while those from the later ones drift apart: journalists appear to have reused segments from their own previous texts in the same language, as well as translating from the more recent, ‘matching’ ones available in other language(s).
One striking example of intralingual text reuse was observed in the English dispatch from our third triplet. This contains one sentence about former Bolivian president Evo Morales that is found in neither the Spanish nor the French corresponding dispatch, but appears instead to have been recycled almost verbatim from the second English dispatch in our dataset (or from a previous one not included).
(5) |
He is still being investigated for “rape and trafficking” over allegations he had relationships with underage girls, and even fathered a child with one. (3cEN, s. 32) |
(6) |
He is also being investigated for alleged “rape and trafficking” over allegations he had relationships with underage girls. (2cEN, s. 27) |
The serious allegations contained in these segments, which lack information sources despite the presence of seeming quotations in inverted commas, exist solely in the English dispatches. Their newsworthiness for the English-speaking audience thus transpires only thanks to the simultaneous comparison with previous English texts and with matching Spanish and French texts. Neither the monolingual nor the translational perspective would, on their own, tell the whole story about how newsworthiness is construed in these texts.
5 Corpus creation: the comparallel corpus setup
The correspondences identified on the basis of the method presented in section 4.1 were leveraged to build a prototype comparallel corpus. Like a multilingual comparable corpus, such a corpus setup makes it possible to compare news dispatches in multiple languages ensuring a high degree of comparability: every text in the corpus matches at least one other text in another language in terms of date, genre, and topic or event reported on. At the same time, the corpus is also parallel, insofar as each sentence which is automatically identified as having a translational near-equivalent in another language is aligned to the matching sentence and can thus be accessed through parallel concordances.
In terms of corpus building, once cross-linguistic matches are identified, corpus data have to be processed to allow comparable and parallel access through a corpus query tool. The most salient differences with a prototypical parallel corpus are that not all sentences are alignable, and that those that are alignable do not necessarily follow the same order across texts, such that a sentence could appear at the beginning of a news dispatch in language A while its equivalent is found further down the matching text in language B. The inclusion of more than two languages in the corpus complicates matters further, as multiple alignments must be created, and sentence order is likely to vary depending on the language pair considered.
The solution we have identified relies on the corpus building and alignment functionalities made available by Sketch Engine. For each of the three languages, a sub-corpus was created which preserves the order of sentences in texts, thus guaranteeing text integrity, and making it possible to consult the three sub-corpora as a traditional multilingual comparable corpus. Each of the three sub-corpora was then aligned at sentence level with the parallel target sentences to which LaBSE had assigned a score of 0.6 or higher. The order of sentences in the target corpus thus follows the order of sentences in the source corpus. Figure 3 displays the corpus input format in its raw version (i.e., an Excel file), which illustrates the issue.
Figure 3. Raw format of the comparallel news dispatch corpus
Sentences in the ES sub-corpus, which is considered here as a source, are preserved in their original order, as testified by their sentence IDs (ES_id). If a sentence has an equivalent in a target language, this is recorded on the same line in the column pertaining to that language (e.g., ES sentence n. 10 has equivalents both in FR and EN); if no equivalence is present, the relative cell remains empty (e.g., ES sentence n. 1 has no FR equivalent, and sentence n. 13 has no EN equivalent). The order of sentences in the FR and EN target corpora is entirely based on the order of the ES text, and not all sentences in the FR and EN texts are included in these target language corpora. To overcome the problem, the alignment procedure is replicated making each of the three languages a source for the others.
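The row layout just described, and its replication with each language acting as source, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the corpus: the names `build_rows`, `best_match` and `transposed` are hypothetical, the toy sentences and scores are invented, similarity scores are taken as given rather than computed with LaBSE, and keeping only the single best-scoring target per source sentence is a simplifying assumption.

```python
from typing import Dict, List, Tuple

THRESHOLD = 0.6  # similarity cut-off used for alignment in the text

# Hypothetical score table: (source index, target index) -> similarity
Scores = Dict[Tuple[int, int], float]


def transposed(scores: Scores) -> Scores:
    """Swap source and target indices (cosine similarity is symmetric)."""
    return {(t, s): value for (s, t), value in scores.items()}


def best_match(src_idx: int, targets: List[str], scores: Scores) -> str:
    """Return the highest-scoring target at or above THRESHOLD, or ''."""
    best_score, best_sentence = THRESHOLD, ""
    for tgt_idx, sentence in enumerate(targets):
        score = scores.get((src_idx, tgt_idx), 0.0)
        if score >= best_score:
            best_score, best_sentence = score, sentence
    return best_sentence


def build_rows(source: str, texts: Dict[str, List[str]],
               pair_scores: Dict[Tuple[str, str], Scores]) -> List[dict]:
    """One row per source sentence, in source order; empty cell if no match."""
    rows = []
    for idx, sentence in enumerate(texts[source]):
        row = {f"{source}_id": idx + 1, source: sentence}
        for lang, sentences in texts.items():
            if lang != source:
                row[lang] = best_match(idx, sentences, pair_scores[(source, lang)])
        rows.append(row)
    return rows


# Toy data, invented for illustration only
texts = {
    "ES": ["ES sentence 1", "ES sentence 2", "ES sentence 3"],
    "FR": ["FR sentence 1", "FR sentence 2"],
    "EN": ["EN sentence 1", "EN sentence 2"],
}
pair_scores = {
    ("ES", "FR"): {(1, 0): 0.85, (2, 1): 0.40},
    ("ES", "EN"): {(0, 1): 0.70, (1, 0): 0.90},
    ("FR", "EN"): {(0, 0): 0.75},
}
# Add the reverse direction for every language pair
for (a, b), table in list(pair_scores.items()):
    pair_scores[(b, a)] = transposed(table)

# Replicate the procedure with each language as source
tables = {lang: build_rows(lang, texts, pair_scores) for lang in texts}
for row in tables["ES"]:
    print(row)
```

In this toy example "FR sentence 2" has no match above the threshold and is therefore absent from the ES-sourced and EN-sourced tables; it is only retained in the table where FR acts as source, which is the reason for repeating the procedure three times.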
The final corpus, which was obtained by exploiting the Sketch Engine indexing function for Excel files, thus consists of three comparable sub-corpora in Spanish, French and English, each of which is aligned to two parallel sub-corpora in the other languages. Figure 4 shows a multi-parallel concordance for the word ‘Bolivia’ in the Spanish sub-corpus.
Multi-parallel concordance for the word ‘Bolivia’ in the Spanish sub-corpus as indexed in Sketch Engine
6 Discussion and conclusion
In this contribution we have addressed the issue of identifying sub-textual translation units within trilingual news dispatches. This is a critical step if one is to approach news translation through corpus-based methods, but its implementation has typically been hindered by obstacles both of a theoretical nature (what counts as translation in these settings?) and of a methodological one (how do we identify and represent this peculiar type of translation in a corpus?). The methods we propose rely on state-of-the-art techniques in natural language processing, namely similarity scores computed over (machine-translated) monolingual and multilingual sentence embeddings. We tested these methods on three triplets of news dispatches in Spanish, English and French by the AFP news agency: based on preliminary evaluation carried out on a benchmark dataset and on manual spot checks, we concluded that similarity scores higher than 0.8 point to cases of ‘close’ translation and that scores above 0.6 point to cross-lingual rewriting/patchwork. The scores obtained in this way are used to build a corpus combining a multilingual comparable component, made of the trilingual texts aligned at document level, and a parallel component, made of the subset of sentences aligned across languages.
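The two score bands mentioned above can be expressed as a simple decision rule. The sketch below is illustrative only: the function name `classify` and the label strings are hypothetical, and the treatment of the boundary values (0.8 falling in the lower band, 0.6 counting as a match) is an assumption based on the wording "higher than 0.8" and "0.6 or higher".

```python
def classify(score: float, close: float = 0.8, related: float = 0.6) -> str:
    """Map a sentence-pair similarity score to one of three bands."""
    if score > close:
        return "close translation"
    if score >= related:
        return "rewriting/patchwork"
    return "no match"


# Invented scores, for illustration only
for score in (0.91, 0.72, 0.35):
    print(score, classify(score))
```

Exposing the two cut-offs as parameters leaves room for users to adopt different thresholds according to their research priorities.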
Translation is known to occur in news agencies, and in journalistic settings in general, but it is not consistently acknowledged as such and coexists with other discursive practices such as rewriting and patchwork. Corpus approaches to news translation research are typically faced with the problem of identifying source and target texts, and translation units within them. This was confirmed in our study by the relatively modest inter-annotator agreement we obtained when trying to tell apart translated and non-translated sentence pairs, despite our familiarity with the languages and the domain. And yet the identification of translation units cannot be bypassed if one wishes to study translation through a parallel corpus: a situation that is necessarily fluid must be forced into a tight methodological grid. To quote Davier and van Doorslaer (2018, p. 241), this amounts to adapting “the multisource and multi-author situation of translation in journalism to [the] non- (or only partially) identifiable character of the source text–target text relationship”.
With this contribution we have tried to identify ways in which that grid can be relaxed, while preserving the affordances of corpus methods. In particular, the corpus setup we have described combines the parallel and comparable perspectives, allowing researchers to observe what is reused through translation, as well as what is added and what is removed, when drafting a news dispatch. As an event unfolds, and textual material is produced about it, sources multiply. While identifying a single starting point for any journalistic account would be pointless anyway, the more dispatches become available, the more likely it is that news workers pick and mix from them, following their own agenda. This process was observed in our dataset, where the earlier multilingual dispatches from the first triplet stayed closer to each other, while the later ones drifted further apart, as journalists recycled from their own previous texts in the same language as well as translating from the more recent ones available in the other language(s).
The example concerning serious unsupported allegations against the Bolivian former president (section 4.2), recycled across two English dispatches but absent from the corresponding French and Spanish ones, underscores the way in which newsworthiness is construed differently for audiences accessing the information in different languages, particularly around political matters of consequence. In turn, it points to the value of innovative corpus setups through which the reuse of textual segments can be traced both at the intralingual and interlingual/translational levels.
The comparallel model we are proposing is one in which a multilingual comparable corpus accounts for the news coverage of a single event by a single news producer. Within this comparable corpus, the translation-relatedness of the individual texts is estimated relying on textual and paratextual cues, and the translation-relatedness of individual sub-textual units is estimated based on multilingual sentence embeddings, leading to a parallel corpus. Yet neither perspective is obscured, giving researchers in translation studies, discourse analysis, and media studies the means to gauge empirically the relative weight of translation and other discursive practices under varying constraints and meeting different agendas.
The main affordance of the comparallel corpus design proposed here is that it makes it possible to preserve text integrity in its monolingual comparable component, while at the same time taking heed of the fragmentary and nonlinear nature of cross-linguistic correspondences typical of news writing/translation in its parallel part. As such, the design lends itself to representing other forms of non-prototypical translation (one may think, e.g., of Wikipedia texts; cf. Section 2.2), in which only partial translational equivalences are observed across languages, and where non-translated text portions are to be maintained. Thanks to multiple alignments, the design can also easily be extended to more than two languages, as is the case in our prototype.
This flexibility, however, comes at a cost, as it calls into question key tenets traditionally associated with comparable and parallel corpora, such as the notion of source and target sub-corpora, of texts as minimal units of a corpus, as well as assumed bidirectionality in parallel corpora. In the corpus setup we are describing, the notion of ‘source’ is mostly instrumental in signalling which sub-corpus should be considered as a starting point for the analysis, without the implication that target sub-corpora contain translations produced based on those texts. These parallel corpora are thus more similar to translation memories than to corpora as traditionally conceived of, since they are opportunistically created on the basis of the identified translation matches, leaving out ‘untranslated’ sentences and disregarding original sentence order.
To move beyond the proof-of-concept stage, the most pressing next step is the identification of more automatic ways of looking for similarities, and hence potential translational relations, at document level, so that the preliminary manual screening of texts carried out here could be speeded up. In this respect, one could imagine using cross-language information retrieval techniques along with paratextual cues to identify sets of near-equivalent news dispatches, within which sentence-level similarity can then be computed. A further advancement over the proposed method would be the identification of objective, data-driven ways to establish thresholds of ‘translationality’. This could be done by applying conditional inference trees, along the lines of Pęzik and Grabowski (2023), or by identifying dynamic, document-based thresholds, along the lines of Artetxe and Schwenk (2019). In both cases one could also allow corpus users more freedom to decide on what they perceive as appropriate thresholds given their research priorities, e.g., by including sentence similarity scores as searchable corpus metadata. Lastly, a finer classification of ambiguous cases should be explored, for example by finding sub-sentential matches through a word-by-word similarity matrix. In this way one could also bypass the implicit reliance on the sentence as a correspondence unit, which forced us to disregard news-relevant framing devices such as paragraph structure and positioning within the text (Tankard, 2001).
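By way of illustration, one possible form of dynamic scoring is the ratio margin of Artetxe and Schwenk (2019), in which the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbours. The sketch below is a minimal pure-Python rendering under stated assumptions: the function names are hypothetical, the two-dimensional vectors are invented stand-ins for real sentence embeddings, and restricting the neighbourhood to the sentences of a single document pair is our reading of "document-based", not a procedure tested in this study.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_top_k(values: List[float], k: int) -> float:
    """Average of the k largest values (the k nearest neighbours)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def margin_scores(src: List[Vector], tgt: List[Vector], k: int = 2) -> List[List[float]]:
    """Ratio-margin score for every source/target sentence pair."""
    sims = [[cosine(x, y) for y in tgt] for x in src]
    src_avg = [mean_top_k(row, k) for row in sims]
    tgt_avg = [mean_top_k([row[j] for row in sims], k) for j in range(len(tgt))]
    scores = []
    for i, row in enumerate(sims):
        scored_row = []
        for j, sim in enumerate(row):
            denom = (src_avg[i] + tgt_avg[j]) / 2
            scored_row.append(sim / denom if denom > 0 else 0.0)
        scores.append(scored_row)
    return scores


# Toy embeddings, invented for illustration only
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[1.0, 0.1], [0.1, 1.0]]
scores = margin_scores(src, tgt)
for row in scores:
    print([round(s, 3) for s in row])
```

A pair that stands out from its neighbourhood scores above 1, whereas a sentence that is moderately similar to every candidate (the third source vector here) scores below 1 despite a fairly high raw cosine, so a single cut-off on the margin adapts to the density of each document pair.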
As machine translation becomes better at taking care of the more routine translation tasks, more discourse types are produced on flexible multimedia supports, and an increasing number of people actively engage with multiple languages, we expect to see more fluid translation practices “stimulated by technology and users” (Gambier, 2022, p. 98), involving different levels of linguistic and cultural adaptation. Methods are therefore required that allow for the observation of translation practices in such non-prototypical settings, where well-established concepts such as source text, target text and equivalence do not necessarily apply. We believe that innovative corpus designs such as the one described in this contribution can offer more adequate methods than hitherto available for addressing current research priorities and practices, taking into account the complexity of current multilingual discursive practices, and highlighting the essential role translation plays as a discourse creation device in a multitude of fields.
References
Abdul-Rauf, S., & Schwenk, H. (2009). On the use of comparable corpora to improve SMT performance. In A. Lascarides, C. Gardent, & J. Nivre (Eds.), Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009) (pp. 16–23). Association for Computational Linguistics.
Agence France Presse (2022). AFP in dates. https://www.afp.com/en/agency/afp-dates.
Artetxe, M., & Schwenk, H. (2019). Margin-based parallel corpus mining with multilingual sentence embeddings. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 3197–3203). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1309.
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596. https://doi.org/10.1162/coli.07-034-R2.
Barrón-Cedeño, A., España-Bonet, C., Boldoba, J., & Màrquez, L. (2015). A factory of comparable corpora from Wikipedia. In P. Zweigenbaum, S. Sharoff, & R. Rapp (Eds.), Proceedings of the eighth workshop on building and using comparable corpora (pp. 3–13). Association for Computational Linguistics. https://doi.org/10.18653/v1/W15-3402.
Baumann, G., Gillespie, M., & Sreberny, A. (2011). Transcultural journalism and the politics of translation: Interrogating the BBC world service. Journalism, 12(2), 135–142. https://doi.org/10.1177/1464884910388580.
Bednarek, M., & Caple, H. (2017). The discourse of news values: How news organizations create newsworthiness. Oxford University Press.
Bernardini, S. (2022). How to use corpora for translation. In A. O’Keeffe, & M. J. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 485–498). Routledge.
Bernardini, S., Castagnoli, S., Ferraresi, A., Gaspari, F., & Zanchetta, E. (2010). Introducing Comparapedia: A new resource for corpus-based translation studies. In Paper presented at the International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS 2010). Edge Hill University.
Bielsa, E. (2007). Translation in global news agencies. Target, 19(1), 135–155. https://doi.org/10.1075/target.19.1.08bie.
Bielsa, E. (2010). Translating news: A comparison of practices in news agencies. In R. Valdeón (Ed.), Translating information (pp. 31–49). Universidad de Oviedo.
Bielsa, E. (Ed.) (2022). The Routledge handbook of translation and media. Routledge.
Bielsa, E., & Bassnett, S. (2009). Translation in global news. Routledge.
Bouamor, H., & Sajjad, H. (2018). H2@bucc18: Parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In R. Rapp, P. Zweigenbaum, & S. Sharoff (Eds.), Proceedings of the 11th workshop on building and using comparable corpora (BUCC) (pp. 43–47). European Language Resources Association.
Boyd-Barrett, O., & Rantanen, T. (1998). The globalization of news. SAGE.
Caimotto, M. C., & Gaspari, F. (2018). Corpus-based study of news translation: Challenges and possibilities. Across Languages and Cultures, 19(2), 205–220. https://doi.org/10.1556/084.2018.19.2.4.
Cazzamatta, R. (2022). The role of wire services in the new millennium: An examination of the foreign-reporting about Latin America in the German press. Journalism, 23(5), 1044–1063. https://doi.org/10.1177/1464884920944745.
Chaudhary, V., Tang, Y., Guzmán, F., Schwenk, H., & Koehn, P. (2019). Low-resource corpus filtering using multilingual sentence embeddings. In O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, … K. Verspoor (Eds.), Proceedings of the fourth conference on machine translation (volume 3: Shared task papers, day 2) (pp. 261–266). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5435.
Davier, L. (2017). Les enjeux de la traduction dans les agences de presse. Presses Universitaires du Septentrion.
Davier, L. (2021). Translation in the news agencies. In E. Bielsa (Ed.), The Routledge handbook of translation and media (pp. 183–198). Routledge.
Davier, L. (2022). Translating news. In K. Malmkjær (Ed.), The Cambridge handbook of translation (1st ed., pp. 401–420). Cambridge University Press.
Davier, L., & Conway, K. (2019). Journalism and translation in the era of convergence. Benjamins.
Davier, L., & van Doorslaer, L. (2018). Translation without a source text: Methodological issues in news translation. Across Languages and Cultures, 19(2), 241–257. https://doi.org/10.1556/084.2018.19.2.6.
Federmann, C., Kocmi, T., & Xin, J. (2022). NTREX-128 – News Test references for MT evaluation of 128 languages. In K. Ahuja, A. Anastasopoulos, B. Patra, G. Neubig, M. Choudhury, S. Dandapat, … V. Chaudhary (Eds.), Proceedings of the first workshop on scaling up multilingual evaluation (pp. 21–24). Association for Computational Linguistics.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2022). Language-agnostic BERT sentence embedding. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 878–891). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.62.
Gambier, Y. (2022). Revisiting certain concepts of translation studies through the study of media practices. In E. Bielsa (Ed.), The Routledge handbook of translation and media (pp. 91–107). Routledge.
Gamer, M., Lemon, J., Fellows, I., & Puspendra, S. (2019). irr: Various coefficients of interrater reliability and agreement. R package version 0.84.1. https://cran.r-project.org/package=irr.
Gaspari, F. (2015). Exploring Expo Milano 2015: A cross-linguistic comparison of food-related phraseology in translation using a comparallel corpus approach. The Translator, 21(3), 327–349. https://doi.org/10.1080/13556509.2015.1103099.
Gete, H., Etchegoyhen, T., Ponce, D., Labaka, G., Aranberri, N., Corral, A., Saralegi, X., Ellakuria, I., & Martin, M. (2022). Tando: A corpus for document-level machine translation. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, … S. Piperidis (Eds.), Proceedings of the thirteenth language resources and evaluation conference (pp. 3026–3037). European Language Resources Association.
Guo, M., Shen, Q., Yang, Y., Ge, H., Cer, D., Hernandez Abrego, G., … Kurzweil, R. (2018). Effective parallel corpus mining using bilingual sentence embeddings. In O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, … K. Verspoor (Eds.), Proceedings of the third conference on machine translation: Research papers (pp. 165–176). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6317.
Hernández Guerrero, M. J. (2019). Journalistic translation. In R. Valdeón, & C. A. Vidal (Eds.), The Routledge handbook of Spanish translation studies (pp. 386–401). Routledge.
Hernández Guerrero, M. J. (2022). The translation of multimedia news stories: Rewriting the digital narrative. Journalism, 23(7), 1488–1508. https://doi.org/10.1177/14648849221074517.
Hernando, B. M. (1999). Traducción y periodismo o el doble y misterioso escepticismo. Estudios Sobre el Mensaje Periodístico, 5, 129–141.
Holland, R. (2013). News translation. In C. Millán, & F. Bartrina (Eds.), The Routledge handbook of translation studies (pp. 332–346). Routledge.
Kalantari, E. (2022). Journalistic translation: A gate at which journalism studies and translation studies meet. Journalism, 23(7), 1411–1429. https://doi.org/10.1177/14648849221074516.
Lagneau, É. (2007). Dépêches de campagne: Ce que l’AFP fait pendant (/à) une élection. Le Temps des Médias, 7, 104–125. https://doi.org/10.3917/tdm.007.0104.
Palmer, J. (2011). News gathering and dissemination. In M. Baker, & G. Saldanha (Eds.), The Routledge Encyclopedia of translation studies (2nd ed., pp. 186–189). Routledge.
Pęzik, P. (2016). Paralela corpus and search engine. CLARIN-PL Digital Repository. http://hdl.handle.net/11321/276.
Pęzik, P., & Grabowski, Ł. (2023). Towards a near-parallel corpus of news texts: An experiment in using multilingual sentence embeddings. In Paper presented at the PACOR 2023 conference. University of León.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing version 4.3.2. https://www.r-project.org/.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv:1908.10084. https://doi.org/10.18653/v1/D19-1410.
Resnik, P. (1999). Mining the web for bilingual text. In Proceedings of the 37th annual meeting of the association for computational linguistics (pp. 527–534). https://doi.org/10.3115/1034678.1034757.
Rodriguez Blanco, N. (2024a). Distance and closeness in translated global news coverage: Bilingual representations of culture-bound themes from Bolivia to the world. Perspectives, 1–19. https://doi.org/10.1080/0907676X.2023.2299709.
Rodriguez Blanco, N. (2024b). Plurilingual perspectives, pluricultural contexts. Exchanges. University of Warwick, 11(2), 107–132. https://doi.org/10.31273/eirj.v11i2.1137.
Scammell, C. (2021). Translation and the globalization/localization of news. In E. Bielsa, & D. Kapsaskis (Eds.), The Routledge handbook of translation and globalization (pp. 293–305). Routledge.
Schwenk, H. (2018). Filtering and mining parallel data in a joint multilingual space. In I. Gurevych, & Y. Miyao (Eds.), Proceedings of the 56th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 228–234). Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2037.
Tankard, J. W. (2001). The empirical approach to the study of media framing. In S. D. Reese, O. H. Gandy, & A. E. Grant (Eds.), Framing public life (pp. 95–105).
Valdeón, R. A. (2015). Fifteen years of journalistic translation research and more. Perspectives, 23(4), 634–662. https://doi.org/10.1080/0907676X.2015.1057187.
Valdeón, R. A. (2022a). Interdisciplinary approaches to journalistic translation. Journalism, 23(7), 1397–1410. https://doi.org/10.1177/14648849221074531.
Valdeón, R. A. (2022b). Gatekeeping, ideological affinity and journalistic translation. Journalism, 23(1), 117–133. https://doi.org/10.1177/1464884920917296.
Valdeón, R. A. (2022c). Journalistic translation. In G. A. Borchard (Ed.), The SAGE Encyclopedia of journalism (second edition) (pp. 901–903). SAGE Publications.
Valdeón, R. A. (2023). On the cross-disciplinary conundrum: The conceptualization of translation in translation and journalism studies. Translation Studies, 16(2), 244–260. https://doi.org/10.1080/14781700.2022.2162573.
Van Doorslaer, L. (2012). Translating, narrating and constructing images in journalism with a test case on representation in Flemish TV news. Meta, 57(4), 1046–1059. https://doi.org/10.7202/1021232ar.
Zanettin, F. (2021). News media translation. Cambridge University Press.