Clear Sky Science

Hierarchical multi-agent reinforcement learning for retrieval-augmented industrial document question answering

2026-03-14

Smarter Help from Complex Manuals

Modern industries like power grids and manufacturing rely on thick manuals, circuit diagrams, and parameter tables to keep equipment running safely. When operators have urgent questions—such as why an alarm is sounding or which switch to flip—the answer is often buried somewhere in these long, mixed-format documents. This paper introduces a new AI system, called MARL‑RAGDoc, that is designed to dig through such tangled information and deliver accurate, well‑grounded answers instead of guesses.

Why Ordinary AI Gets Lost in Real Manuals

Most current question‑answering systems work well when all the information is plain text, like an online article. Industrial documents are very different: they mix text, diagrams, flowcharts, and tables laid out across dozens of pages. Different questions rely on different parts: diagrams may matter for wiring, while tables matter for ratings or settings. Existing systems usually treat all types of content the same, pull a fixed number of snippets, and then generate an answer. Because they cannot adjust, question by question, how much weight each type of content gets or how deeply to search, they often miss crucial evidence, retrieve a lot of irrelevant material, and sometimes “hallucinate” answers that are not supported by the documents.

A Team of Specialized AI Helpers

MARL‑RAGDoc tackles this problem by treating document search as a cooperative game played by several AI “agents,” each with a different role. First, the system breaks a document collection into many small pieces: text blocks, images, and tables, each tagged with its position on the page and its role (such as title or caption). These pieces are mapped into a shared mathematical space so that related items from different formats end up close together. Then, for a given question, the system builds shortlists of promising candidates within each format—like the top text blocks, images, and tables that might contain the answer.
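To make the shortlist step concrete, here is a minimal Python sketch, assuming each piece already has a vector in the shared space. The cosine-similarity scoring, the two-dimensional toy vectors, and names such as `shortlist_per_modality` are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def shortlist_per_modality(query_vec, chunks, k=2):
    """Return the ids of the k most similar chunks within each modality."""
    scored_by_modality = defaultdict(list)
    for chunk in chunks:
        score = cosine(query_vec, chunk["vec"])
        scored_by_modality[chunk["modality"]].append((score, chunk["id"]))
    return {
        modality: [cid for _, cid in sorted(scored, reverse=True)[:k]]
        for modality, scored in scored_by_modality.items()
    }

# Toy data (hypothetical): real embeddings would have hundreds of dimensions.
chunks = [
    {"id": "t1", "modality": "text", "vec": [1.0, 0.0]},
    {"id": "t2", "modality": "text", "vec": [0.0, 1.0]},
    {"id": "t3", "modality": "text", "vec": [0.7, 0.7]},
    {"id": "img1", "modality": "image", "vec": [0.9, 0.1]},
    {"id": "tab1", "modality": "table", "vec": [0.2, 0.8]},
    {"id": "tab2", "modality": "table", "vec": [0.8, 0.2]},
]
shortlists = shortlist_per_modality([1.0, 0.0], chunks, k=2)
print(shortlists)
```

Keeping a separate shortlist per format means a strong table match is never crowded out by many moderately good text matches, which is the property the later agents rely on.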

A Coordinator That Learns Where to Look

At the heart of MARL‑RAGDoc is a high‑level coordinator agent that decides how much attention to give to each type of content and how many steps of search are needed. Under this coordinator sit three specialized agents, one each for text, images, and tables. These agents choose which candidates to keep, when to look at neighboring material (such as the rest of a table row or the caption under an image), and when to stop searching. Crucially, all of these decisions are learned through reinforcement learning: the agents receive rewards based on both how well they retrieved relevant evidence and how good the final answer is. Over time, the system learns strategies such as relying more on tables for numeric queries or more on diagrams for questions about spatial layout.
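The coordinator's two jobs, weighting the content types and rewarding the agents, can be sketched roughly as follows. This is only an illustration under stated assumptions: the softmax weighting, the budget-splitting rule, and the mixing factor `alpha` are hypothetical choices, since this summary does not give the paper's actual formulas.

```python
import math

def softmax(logits):
    """Turn per-modality preference scores into weights that sum to 1."""
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def allocate_budget(weights, total_items):
    """Split a retrieval budget across modalities in proportion to weights."""
    raw = {name: w * total_items for name, w in weights.items()}
    alloc = {name: int(value) for name, value in raw.items()}
    leftover = total_items - sum(alloc.values())
    # Hand remaining slots to the modalities with the largest fractional part.
    by_fraction = sorted(raw, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc

def combined_reward(retrieval_score, answer_score, alpha=0.5):
    """Blend retrieval quality and final-answer quality (alpha is assumed)."""
    return alpha * retrieval_score + (1 - alpha) * answer_score

# Hypothetical coordinator output for a numeric question: tables are favored.
weights = softmax({"text": 0.0, "image": 0.0, "table": math.log(2)})
budget = allocate_budget(weights, total_items=8)
reward = combined_reward(retrieval_score=0.8, answer_score=0.6)
print(weights, budget, reward)
```

In a trained system the preference scores would come from a learned policy and would shift with each question; here they are fixed by hand purely to show the data flow.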

From Evidence to Reliable Answers

Once the agents have assembled their best evidence, a large language model takes in the question together with the selected text, images, and tables, weighted by their importance. It then produces an answer and a quality score reflecting how complete and well supported that answer appears to be. If the score is low, the system can trigger another round of retrieval, asking the agents to gather supplemental material before trying again. This “retrieve–reason–reflect” loop lets MARL‑RAGDoc correct itself when the first attempt is unsure, reducing the risk that it will fill gaps with unsupported guesses. The same loop also feeds back into training, teaching the agents which retrieval patterns tend to lead to strong answers.
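The loop described above can be sketched in a few lines of Python. The threshold of 0.7, the cap of three rounds, and the stand-in `toy_retrieve` and `toy_answer` functions are assumptions for illustration; in the real system these roles are played by the learned agents and a large language model.

```python
def retrieve_reason_reflect(question, retrieve, answer_with_score,
                            threshold=0.7, max_rounds=3):
    """Retrieve evidence, answer, and retry while the quality score is low."""
    evidence, answer, score, rounds = [], None, 0.0, 0
    while rounds < max_rounds:
        # Each round adds supplemental evidence to what was already gathered.
        evidence = evidence + retrieve(question, evidence, rounds)
        answer, score = answer_with_score(question, evidence)
        rounds += 1
        if score >= threshold:
            break  # The answer looks well supported, so stop searching.
    return answer, score, rounds

# Toy stand-ins (hypothetical): one new fact is found per round.
FACTS = ["fact one", "fact two"]

def toy_retrieve(question, evidence, round_idx):
    return [FACTS[round_idx]] if round_idx < len(FACTS) else []

def toy_answer(question, evidence):
    # Score is the share of the needed facts that the evidence covers.
    return "; ".join(evidence), len(evidence) / len(FACTS)

answer, score, rounds = retrieve_reason_reflect(
    "toy question", toy_retrieve, toy_answer
)
print(answer, score, rounds)
```

The round cap matters in practice: without it, a question the documents cannot answer would keep the loop searching indefinitely.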

Putting the System to the Test

The researchers evaluated MARL‑RAGDoc on three demanding collections of multimodal documents, including two public benchmarks and a new power‑industry dataset they constructed from real manuals, guidelines, and technical reports. Across all three, the new system outperformed a range of strong competitors, from powerful general‑purpose multimodal models to specialized document understanding and retrieval‑augmented systems. It delivered improvements of roughly 5–9 percentage points in overall accuracy and similar gains in stricter measures that require exact matches and early ranking of correct answers. The benefits were especially clear for very long, multi‑page documents and questions that required combining information from text, tables, and diagrams.

What This Means for Real‑World Operators

In everyday terms, MARL‑RAGDoc is like a team of trained assistants who know how to skim huge binders of technical material, consult the right diagrams or tables for each question, and double‑check their work before answering. By dynamically deciding which parts of a document matter most and learning from feedback, it offers more accurate and better‑justified answers than one‑size‑fits‑all approaches. While the study focuses on power‑system documents, the same framework could help workers in many fields—from factory technicians to hospital staff—navigate complex manuals quickly and safely.

Citation: Qian, Y., Han, B., Yuan, Y. et al. Hierarchical multi-agent reinforcement learning for retrieval-augmented industrial document question answering. Sci Rep 16, 13512 (2026). https://doi.org/10.1038/s41598-026-41684-z

Keywords: industrial document QA, multimodal retrieval, reinforcement learning agents, retrieval-augmented generation, technical manuals