Clear Sky Science · en

A high-precision catalogue of landslide events in China based on news text mining with large language model

Why this landslide map matters

Landslides kill thousands of people and destroy homes, roads, and farmland every year, yet basic facts about when and where they happen can be surprisingly hard to find. This study builds a detailed catalogue of more than 1,500 landslides across mainland China by teaching a computer system to read years of news reports. The result is a public dataset that can help improve warning systems, guide safer construction, and support smarter disaster planning.

Figure 1. Turning thousands of scattered news reports into a precise nationwide map of landslides in China.

From scattered stories to a national picture

Until now, China had only partial records of landslides. Official bulletins counted how many events occurred each year or in each province but rarely included exact locations or times. International catalogues focused mainly on the biggest or deadliest events worldwide and often missed local reports in Chinese. This left researchers without a clear, fine-grained picture of landslides across the country, making it difficult to judge where slopes are most dangerous or how risk is changing over time.

Letting computers read the news

The authors turned to China News Network, a major national news site that publishes stories around the clock from across the country. They scraped more than 33,000 articles mentioning the word “landslide” from 2008 to 2024, then filtered out pieces that used the term as a metaphor, such as for an election or a market crash. Next they used a large language model, a type of advanced artificial intelligence trained on massive amounts of text, to pull out key facts from each genuine disaster report. For every event, the system tried to identify the time it occurred, the place, what triggered it, and how many people were killed, injured, or missing.
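To make the extraction step concrete, here is a minimal Python sketch of how one model reply could be turned into a structured record. It is an illustration under stated assumptions, not the authors' code: it assumes the model was asked to answer in JSON, and the field names, the `LandslideRecord` class, and the `parse_llm_reply` helper are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LandslideRecord:
    """One extracted event. Field names are hypothetical, not the paper's schema."""
    time: Optional[str]     # occurrence time as written in the report
    place: Optional[str]    # location description as written in the report
    trigger: Optional[str]  # e.g. "rainfall" or "earthquake"
    killed: Optional[int]
    injured: Optional[int]
    missing: Optional[int]


def _to_text(value):
    """Return a stripped non-empty string, or None if the value is unusable."""
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def _to_count(value):
    """Return a non-negative integer, or None if the value is absent or unusable."""
    if isinstance(value, bool):
        return None
    if isinstance(value, int) and value >= 0:
        return value
    if isinstance(value, str) and value.strip().isdecimal():
        return int(value.strip())
    return None


def parse_llm_reply(reply: str) -> Optional[LandslideRecord]:
    """Parse a JSON reply (assumed format) into a record; None if it is not a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return LandslideRecord(
        time=_to_text(data.get("time")),
        place=_to_text(data.get("place")),
        trigger=_to_text(data.get("trigger")),
        killed=_to_count(data.get("killed")),
        injured=_to_count(data.get("injured")),
        missing=_to_count(data.get("missing")),
    )
```

Keeping missing values as `None` rather than guessing lets later steps drop records that lack a clear time or place.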

Cleaning, checking, and pinning events on the map

Raw AI output is not perfect, so the team added several layers of checking. They removed records without clear time or place information and dropped reports that only named a broad region, like a province, without useful detail. They also handled the common problem of multiple stories covering the same disaster by comparing how close events were in time and how similar their location descriptions were, then merging likely duplicates. Human experts reviewed all remaining records and corrected mistakes. To turn written place names into map coordinates, the authors used an online mapping service and custom rules to choose the best match, followed again by manual checks for doubtful cases.
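The duplicate-merging idea described above can be sketched in a few lines of Python. This is only an illustration: the 48-hour window, the 0.8 similarity threshold, the record layout, and the use of the standard library's `difflib.SequenceMatcher` for comparing location text are assumptions made for the example, not the values or method reported by the authors.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher

# Hypothetical thresholds chosen for illustration only.
MAX_GAP = timedelta(hours=48)
MIN_SIMILARITY = 0.8


def is_duplicate(a, b):
    """Treat two records as the same event if they are close in time
    and their location descriptions are similar."""
    close_in_time = abs(a["time"] - b["time"]) <= MAX_GAP
    similarity = SequenceMatcher(None, a["place"], b["place"]).ratio()
    return close_in_time and similarity >= MIN_SIMILARITY


def group_duplicates(records):
    """Greedily group records; each group is a list of likely duplicates."""
    groups = []
    for rec in sorted(records, key=lambda r: r["time"]):
        for group in groups:
            if any(is_duplicate(rec, other) for other in group):
                group.append(rec)
                break
        else:
            # No existing group matched, so this record starts a new one.
            groups.append([rec])
    return groups


if __name__ == "__main__":
    reports = [
        {"time": datetime(2020, 7, 8, 4, 0), "place": "Example Village, Example County"},
        {"time": datetime(2020, 7, 8, 10, 0), "place": "Example Village, Example County"},
        {"time": datetime(2021, 8, 1, 9, 0), "place": "Riverside Town, Other County"},
    ]
    for group in group_duplicates(reports):
        print(len(group), group[0]["place"])
```

Each group would then be merged into one record and, as in the study, passed to human reviewers, since a purely automatic rule can both over-merge and under-merge.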

Figure 2. Stepwise filtering of news stories by AI to produce accurately timed and located records of individual landslides.

What the new catalogue reveals

The final dataset includes 1,582 landslides with unusually precise information. About half of the events are dated to the exact hour or even the minute, and more than 80 percent are located at village scale or a specific site such as a road cut or hillside. Most recorded landslides were triggered by heavy rain, especially in southern China, while quake-related events cluster near the eastern edge of the Tibetan Plateau. When compared with two widely used global landslide databases, this new catalogue contains about two and a half times more events in China over the same years and locates them more precisely in both time and space.

How reliable is AI at reading the news

To test accuracy, the team compared their AI-extracted records with official reports on well-known disasters and with detailed local geological surveys. They found that the system was very good at pulling out basic details like when and where a landslide occurred and what triggered it, but less reliable at counting deaths, injuries, and missing people, which often change as emergencies unfold. Overall, the news reports themselves closely matched government sources on timing and location, confirming that they are a trustworthy base for building such a catalogue.
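A check of this kind can be expressed as a field-by-field comparison between extracted records and trusted reference records. The Python sketch below is a simplified illustration that assumes the two lists are already paired event by event and that a field counts as correct only on an exact match; the function name and record layout are hypothetical, and the study's actual scoring rules may differ.

```python
def field_accuracy(extracted, reference, fields):
    """Return, for each field, the share of paired records whose values match exactly.

    `extracted` and `reference` are equal-length lists of dicts, paired by position.
    """
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must be paired one-to-one")
    scores = {}
    for field in fields:
        matches = sum(
            1 for ext, ref in zip(extracted, reference)
            if ext.get(field) == ref.get(field)
        )
        # An empty comparison set yields 0.0 rather than a division error.
        scores[field] = matches / len(reference) if reference else 0.0
    return scores
```

Scoring each field separately is what makes it possible to see a pattern like the one reported here, with time, place, and trigger scoring well while casualty counts lag behind.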

What this means for future safety

For non-specialists, the key message is that computers can now sift through years of news coverage to create clear, detailed maps of where dangerous slopes have failed. This Chinese landslide catalogue is not a complete record of every event, especially small ones that left little trace in the media, and casualty numbers should be treated with care. Even so, its fine timing and location make it a powerful tool for scientists testing warning models, for planners deciding where to build roads and towns, and for officials preparing for future storms and earthquakes.

Citation: Zhao, B., Zhang, L., Liu, Z. et al. A high-precision catalogue of landslide events in China based on news text mining with large language model. Sci Data 13, 722 (2026). https://doi.org/10.1038/s41597-026-07066-w

Keywords: landslide catalogue, China hazards, news text mining, large language model, disaster risk data