Clear Sky Science

Evaluating literary translation by large language models: a multidimensional quality assessment of Shen Congwen’s Border Town


Why this study matters for readers and writers

As tools like ChatGPT and other large language models become part of everyday life, people are beginning to ask a simple question: can these systems really replace human translators, especially for beloved novels? This study takes a close look at that question by examining how several leading AI models translate a classic Chinese book, Shen Congwen’s Border Town, into English and comparing their work to a respected human translation.

A village story meets artificial intelligence

Border Town is famous for its gentle portrait of rural life in southwest China, its poetic language, and its dense web of local customs and beliefs. These features make it an ideal test case: any translator has to capture not just who did what, but the feel of mist over river boats, the rhythm of folk songs, and the weight of traditional values. The authors selected the first two chapters of the novel and gathered five English versions: four produced by large language models (GPT-4, GPT-4o, Gemini, and the Chinese system WXYY 4.0 Turbo) and one by the human scholar–translator Jeffrey Kinkley, whose 2009 version is widely praised for its sensitivity to style and culture.

Figure 1.

How the translations were judged

To move beyond gut feelings about what “sounds right,” the researchers used a detailed framework called Multidimensional Quality Metrics. Instead of just checking whether the wording matches the original, this approach sorts errors into types and rates how serious they are. The team focused on three big questions: Is the meaning accurate? Does the version stay faithful to the author’s tone and storytelling style? And does it handle cultural details in a way that makes sense to readers without erasing their original flavor? With these in mind, two trained annotators compared each sentence of the Chinese text to each translation, flagging five main error types: mistranslation, omission, over-translation (adding unnecessary material), cultural mistranslation, and broader discourse-level problems that harm the flow of the story.
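The annotation scheme described above can be sketched in a few lines of code. The five error types come from the study itself; the severity labels, their numeric weights, and the tallying logic below are illustrative assumptions in the spirit of MQM-style scoring, not the authors' actual procedure.

```python
from collections import Counter

# The five error types used in the study's MQM-based annotation.
ERROR_TYPES = {
    "mistranslation",
    "omission",
    "over-translation",
    "cultural-mistranslation",
    "discourse",
}

# Illustrative severity weights (assumed, not from the paper):
# MQM-style schemes typically penalise severe errors more heavily.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5}

def score_annotations(annotations):
    """Tally annotated errors by type and compute a weighted penalty.

    `annotations` is a list of (error_type, severity) pairs produced
    by an annotator comparing source and translation sentences.
    """
    counts = Counter()
    penalty = 0
    for error_type, severity in annotations:
        if error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {error_type}")
        counts[error_type] += 1
        penalty += SEVERITY_WEIGHTS[severity]
    return counts, penalty

# Example: two minor mistranslations and one major omission
# yield a penalty of 1 + 1 + 5 = 7.
counts, penalty = score_annotations([
    ("mistranslation", "minor"),
    ("mistranslation", "minor"),
    ("omission", "major"),
])
```

A scheme like this lets per-model error counts be compared category by category, which is how the patterns in the next section are reported.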

Where the machines stumble

The results show clear patterns. All four AI systems produced fluent English, but they often slipped on crucial nuances. Mistranslation was the most common problem across the board: for example, old copper coins became modern-sounding “cash,” quietly changing the historical feel of the village. Gemini dropped the most material, sometimes skipping descriptive phrases that help tie characters together or build atmosphere. GPT-4 most often added extra judgmental language, turning a discreet hint of romance into a full-blown “affair,” which shifts how readers view the characters. Cultural references were particularly fragile: everyday objects tied to ritual life, like incense and candles, or the name of a legendary hero, were frequently flattened, modernized, or handled too literally. On the level of whole paragraphs, some models subtly rearranged who was central to a metaphor or scene, weakening key relationships, such as the emotional bond between the young girl Cuicui and her grandfather.

A closer look at relative strengths

Not all systems performed equally. GPT-4o, a newer and more optimized model, consistently made fewer mistakes than GPT-4 in nearly every category, suggesting that careful tuning can matter more than sheer model size. It omitted less content and mistranslated fewer phrases, and it tended to keep the story more intact across sentences. Gemini, by contrast, showed its greatest weakness in leaving things out, especially in passages thick with imagery and cultural hints. WXYY 4.0 Turbo, despite being trained in a Chinese context, did not clearly surpass its foreign counterparts on culture-heavy passages; it still treated some historical and ritual terms as if they were ordinary modern objects. Across all of these machine versions, the human translation remained the most reliable in weaving together meaning, mood, and culture.

Figure 2.

What this means for the future of reading in translation

For everyday tasks and straightforward texts, large language models already offer impressive help. But this study shows that when it comes to literary works like Border Town, they still miss vital layers of sense and feeling. The best-performing model, GPT-4o, comes closer than others yet still needs human oversight, especially where culture and story structure are concerned. The authors argue that better prompts, more focused training, and systematic human post-editing are essential if AI is to support, rather than replace, literary translators. For readers, the message is clear: machine output can be a useful draft or aid, but the full emotional and cultural life of a novel still depends on human artistry.

Citation: Yang, W., Yang, M. Evaluating literary translation by large language models: a multidimensional quality assessment of Shen Congwen’s Border Town. Humanit Soc Sci Commun 13, 628 (2026). https://doi.org/10.1057/s41599-026-06868-y

Keywords: literary translation, large language models, machine translation quality, Chinese literature, cultural nuance