Clear Sky Science

Fidelity-driven data augmentation for multimodal large language model on architectural heritage interpretation


Why old buildings need smart digital helpers

Across many historic cities, ornate street arcades and weathered building fronts are disappearing or being radically rebuilt. Experts race to document and protect this architectural heritage, but the work is slow and requires deep knowledge of style, structure, and history. This study explores how a new kind of artificial intelligence—multimodal large language models that can look at pictures and read text—might help, and what kind of carefully crafted training data they need in order to truly understand old buildings rather than simply guess about them.

Figure 1.

When AI looks at buildings and gets them wrong

The authors begin by testing several state-of-the-art AI systems on photos of historic shop-house arcades in Guangzhou, China. These buildings, known as Qilou, mix Chinese and Western influences and form long, continuous streetscapes. Specialists created a benchmark of 50 façade images and thousands of multiple-choice questions about what appears in each scene: how many floors a balcony spans, whether certain decorative supports are of one type or another, what material the window frames are made from, and how damage should be assessed. Even the best commercial systems, including some of the largest available models, regularly misread these images—placing balconies on the wrong floor, confusing key architectural elements, or calling modern aluminum windows “wooden” based mainly on color.
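The article does not publish the benchmark's data schema, but a multiple-choice visual question-answering benchmark of this kind is typically stored as image-linked records. As an illustration only (all field names and values below are hypothetical, not taken from the paper), one entry might look like this:

```python
# Hypothetical record layout for one entry in a façade VQA benchmark.
# Field names and values are illustrative, not from the paper.
record = {
    "image": "qilou_facade_017.jpg",      # one of the 50 benchmark photos
    "skill": "spatial_reasoning",         # perception / spatial / contextual
    "question": "How many floors does the balcony span?",
    "choices": ["1", "2", "3", "4"],
    "answer": "2",                        # expert-verified ground truth
}

def is_correct(rec, prediction):
    """Score a model's multiple-choice answer against the ground truth."""
    return prediction.strip() == rec["answer"]

print(is_correct(record, "2"))  # True
print(is_correct(record, "3"))  # False
```

Storing the targeted skill alongside each question is what lets the authors report per-skill error rates rather than a single accuracy number.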

Breaking down how people read a façade

To understand these failures, the researchers map heritage interpretation into three human-like skills. First is visual perception: noticing what is present, such as windows, columns, and surface materials. Second is spatial reasoning: understanding how parts of the façade line up and repeat, including symmetry and the vertical and horizontal rhythm of openings. Third is contextual reasoning: deciding what the building’s condition and history imply, for instance whether peeling paint signals serious deterioration or only mild wear. Their tests show that today’s AI systems especially struggle with the second and third skills—precise spatial layout and nuanced meaning—because they have rarely seen carefully labeled heritage examples while being trained.

Teaching AI with made-up images that still tell the truth

Simply collecting more real photos and expert labels would be extremely costly. Instead, the team builds a data “amplifier” that creates convincing synthetic façade images plus matching question–answer pairs. The key idea is to treat two aspects of a façade separately: its spatial skeleton (the exact arrangement and proportions of openings and ornaments) and its semantic flavor (materials, historical style, and weathering). Using a modern image-generation engine, they add one specialized module that locks in the geometry by following edge maps drawn from real buildings, and another that controls stylistic details via lightweight adapters trained on small, coherent style groups. By mixing and matching layouts and styles, the system produces over 1,400 new façade variations from just 208 originals, while keeping the look and feel tightly grounded in real architecture.
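The summary says the structure module follows edge maps drawn from real buildings (in the spirit of ControlNet-style conditioning), but does not specify the edge operator. Purely as a sketch, assuming a simple Sobel filter stands in for whatever extractor the real pipeline uses, the "spatial skeleton" such a module consumes could be produced like this:

```python
import numpy as np

def sobel_edge_map(gray, threshold=0.25):
    """Return a binary edge map from a 2-D grayscale image in [0, 1].

    Illustrative stand-in for the edge extraction a structure-locking
    generation module would consume; the paper's actual operator and
    thresholds are not given in this summary.
    """
    # Sobel kernels for horizontal and vertical intensity gradients.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak > 0:
        magnitude /= peak          # normalize to [0, 1]
    return magnitude > threshold   # keep only strong edges

# Toy "façade": one bright rectangular opening on a dark wall.
wall = np.zeros((16, 16))
wall[4:12, 5:11] = 1.0
edges = sobel_edge_map(wall)
print(edges.any())  # True: edges detected along the opening's outline
```

Because the edge map records only where openings and ornaments sit, not what they are made of, the generator can recombine one building's geometry with another's materials and weathering, which is exactly the layout/style factorization described above.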

Figure 2.

Checking whether the synthetic world matches the real one

The authors then ask: do these artificial façades behave like real heritage data? They compare structural similarity, semantic closeness in a learned feature space, and the judgments of human experts. Quantitative scores show that the structure-focused module sharply improves how well the layout of synthetic buildings matches real examples, while the style-focused module increases diversity without drifting away from authentic regional character. Expert reviewers rate the augmented images as far more plausible and stylistically faithful than those produced by a standard generator, and, crucially, find that they preserve enough detail for reliable question answering about materials, elements, and damage.
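The summary names structural similarity and semantic closeness in a learned feature space as the quantitative checks, without spelling out the exact metrics or embedding model. The two standard candidates, a (here deliberately simplified, single-window) SSIM for structure and cosine similarity between feature vectors for semantics, can be sketched as:

```python
import numpy as np

def global_ssim(x, y, c1=0.01**2, c2=0.03**2):
    """Single-window SSIM over whole images in [0, 1].

    Illustrative only: production SSIM averages over local sliding
    windows; the paper's exact metric is not given in this summary.
    """
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2)
    )

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors, e.g. embeddings
    from a pretrained vision encoder (hypothetical choice here)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
real = rng.random((32, 32))
print(round(global_ssim(real, real), 4))                     # 1.0 for identical images
print(round(cosine_similarity(np.ones(4), np.ones(4)), 4))   # 1.0 for identical vectors
```

High structural similarity to real layouts plus diverse-but-close semantic embeddings is the quantitative signature the authors look for: geometry preserved, style varied.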

Smaller tuned models that outperform bigger general ones

Armed with this expanded dataset, the team fine-tunes a mid-sized open-source vision–language model, then tests it on mixed Chinese and European façade benchmarks. Despite having far fewer internal parameters than leading commercial systems, the tuned model now beats them across almost all task types, especially in reading symmetry, counting and aligning elements, and distinguishing materials. Expert audits of its step-by-step explanations show a shift from wild “hallucinations” toward grounded, building-aware reasoning: the model cites real visual evidence, applies architectural rules more consistently, and makes fewer logical leaps. Analysis of its remaining mistakes points to new frontiers—such as better representing perspective distortions and encoding professional standards for when visible decay actually demands intervention.

How this helps protect historic streets

For a non-specialist reader, the deeper message is that more AI power alone is not enough to safeguard architectural heritage. What matters at least as much is the fidelity and structure of the data we feed into these systems. By generating synthetic façades that carefully preserve the geometry and meaning of real buildings, this study shows how a compact, openly available model can become a more trustworthy partner for experts. Such systems could eventually scan whole districts, flag risky alterations, and support repair decisions at scale, helping cities keep their distinctive historic streetscapes alive in the face of rapid change.

Citation: Huang, R., Lin, HC. & Zeng, W. Fidelity-driven data augmentation for multimodal large language model on architectural heritage interpretation. npj Herit. Sci. 14, 179 (2026). https://doi.org/10.1038/s40494-026-02446-2

Keywords: architectural heritage, multimodal AI, data augmentation, historic facades, cultural preservation