Clear Sky Science · en

Multilingual news dataset about Ukraine (2022–2025): data collection and documentation

2026-03-17 · Back to index

Why this news collection matters

Since Russia’s full-scale invasion of Ukraine in 2022, the war has been fought not only on the ground but also on screens and social feeds. What people around the world read about the conflict shapes how they understand it, whom they trust, and which side they support. This article presents a large, carefully organized collection of online news stories about Ukraine from 2022 to 2025, designed to help researchers study this information battlefield and build better tools to spot misleading claims.

The challenge of truth in wartime

The authors begin by outlining how the war unleashed a wave of propaganda and false stories. Russian state outlets and online networks have pushed recurring claims about supposed “neo-Nazis” in Ukraine, secret laboratories, or staged war crimes. At the same time, fact-checkers and scholars have shown that even when people are successfully corrected on specific myths, their broader political views often remain unchanged. Studies across Eastern Europe and beyond reveal that belief in conspiracy theories about COVID-19 often goes hand in hand with belief in pro-Kremlin narratives about the war, especially among those who distrust mainstream media and governments and prefer alternative information spaces.

How news shapes public understanding

News coverage of the war looks very different depending on where you are. Comparative research has found that Ukrainian and Western outlets tend to highlight human suffering and resistance, while Russian media portray the enemy as monstrous and their own actions as justified. In parts of Asia and the Global South, coverage may focus more on global power struggles or the role of NATO than on civilians. These different angles influence how local audiences see the conflict and the actors involved. Against this backdrop, having a transparent, shared source of news articles becomes essential for understanding which themes dominate coverage and how narratives shift over time.

Building a shared pool of news articles

To meet this need, the authors created a multilingual dataset of 120,617 news articles related to Ukraine, published between 2022 and 2025. They designed an automated pipeline that, for each day in the chosen period, constructs website addresses, downloads news pages, and extracts article headlines and full texts. When articles appear in other languages, a machine translation step produces Ukrainian versions so that material can be compared more easily. Each item is then assigned to a broad theme using keyword rules (for example, whether the story focuses on Ukraine’s leaders, Russia’s domestic situation, or international reactions). The final result is a large table where each row represents one article and includes its link, date, original text, translated text when available, and a rough topic label.

What the dataset looks like

The collection is dominated by Ukrainian sources and language, reflecting where the team focused their efforts and the centrality of Ukrainian outlets in covering the war. Most headlines and main texts are in Ukrainian, with small shares in Russian, English, and several European languages. Article lengths vary widely—from brief updates to very long analytical pieces—though typical news stories fall in the range of a few thousand characters. The largest share of articles deals with how Ukraine appears in the information space of the Russian Federation, followed by coverage of Ukraine’s political and military leadership and reports on Russia’s own internal situation. The dataset is stored in a simple comma-separated file so that it can be loaded by common analysis tools without special software.

Checking quality and limits

Because this collection is meant as a research foundation rather than a finished analysis, the authors emphasize careful technical checks. They removed articles whose web pages could not be loaded or that were exact duplicates. They verified that language labels made sense on spot checks, inspected missing values, and ensured that machine-translated texts were complete. At the same time, they stress that the topic labels are only rough guides based on keywords, not definitive expert judgments about what each article “really” means. Likewise, they did not try to correct any translation errors, which may matter in politically sensitive passages.

What this opens up for the future

For non-specialists, the key takeaway is that this project provides a public, reusable map of how news about Ukraine has been written during some of the most turbulent years in its modern history. Journalists, social scientists, and computer scientists can all draw on the same shared pool of stories to study media bias, track the spread of misleading narratives, or train language technologies that help flag suspicious content. By documenting the collection process in detail and making both the data and code openly available, the authors aim to support transparent, reproducible work on information warfare and, ultimately, to strengthen society’s ability to withstand manipulation in times of crisis.

Citation: Lipianina-Honcharenko, K., Komar, M., Ihnatiev, I. et al. Multilingual news dataset about Ukraine (2022–2025): data collection and documentation. Sci Data 13, 701 (2026). https://doi.org/10.1038/s41597-026-07033-5

Keywords: Ukraine war media, disinformation, news dataset, multilingual journalism, information warfare