Clear Sky Science

Semantic clause retrieval for trademark law using transformer encoders and lexical baselines: a cross-domain agri-robotics compliance case study

2026-03-05

Why Smarter Legal Search Matters

Finding the one crucial rule inside hundreds of pages of legal text is a daily headache for lawyers, regulators, and companies. As laws become more complex and technologies like farm robots and drones spread across borders, people need faster ways to locate the exact clauses that govern what they are allowed—or required—to do. This paper shows how recent advances in artificial intelligence can make clause-by-clause legal search more accurate and transferable across different legal domains, from trademark law to agri-robotics safety rules.

From Keyword Guessing to Meaning-Based Search

Traditional legal search tools behave like very fast card catalogues: users type in a few keywords, and the system looks for documents that contain those words. This works only if the user guesses the right terminology and if the law is written in similar language. In practice, important obligations and exceptions are often buried deep inside sections and subsections, and different countries use different labels for similar ideas. The authors argue that what really matters to practitioners is not whether the exact words match, but whether a clause answers a concrete question—such as how to renew a trademark, or what standards apply to an autonomous tractor.

How the New Search Engine Works

The study builds an application-oriented search pipeline that focuses on clauses—the level at which legal decisions are usually made—rather than whole documents. First, the system breaks statutes and regulations into individual clauses and converts each one into a numerical “fingerprint” that captures its meaning. This is done using pre-trained transformer models, a family of AI systems originally developed for natural language tasks like translation. Instead of training new models from scratch, the authors rely on existing legal-specialized encoders, including versions tailored to international legal texts and to Pakistani legal language.
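The two steps described above, splitting a statute into clauses and ranking them by vector similarity, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the sample statute text is invented, the splitting rule assumes clauses start with markers such as "(1)", and `toy_embed` is a hypothetical stand-in that hashes words into a small vector. A real system would replace `toy_embed` with a pre-trained legal transformer encoder, which captures meaning rather than word overlap.

```python
import hashlib
import math
import re

DIM = 64  # toy vector size (assumption); real encoders use hundreds of dimensions

# Invented sample text for illustration; not quoted from any statute.
SAMPLE = (
    "(1) The registration of a trade mark may be renewed for further periods "
    "of ten years. (2) A request for renewal shall be accompanied by the "
    "prescribed fee. (3) Any person who infringes a registered trade mark "
    "shall be liable to a penalty."
)


def split_clauses(text):
    """Split text into clauses at numbered markers such as '(2)'."""
    return re.split(r"\s+(?=\(\d+\)\s)", text.strip())


def toy_embed(text):
    """Hypothetical stand-in encoder: hashed bag of words, unit length."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k(query, clauses, k=5):
    """Return the k clauses most similar to the query as (score, clause) pairs."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, toy_embed(c))), c) for c in clauses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for score, clause in top_k("How do I renew a trade mark?", split_clauses(SAMPLE), k=2):
        print(f"{score:.3f}  {clause}")
```

Because only `toy_embed` knows how text becomes a vector, swapping in a different encoder leaves the clause splitting and ranking code unchanged, which is what makes it practical to compare several pre-trained models on the same corpus.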

Comparing AI Search with Classic Methods

To see whether semantic search really helps, the authors compare their transformer-based system with two widely used keyword methods known as TF–IDF and BM25. All methods are tested under the same conditions: for each natural-language query, the system returns the top five clauses from the relevant corpus, and legal experts judge whether each clause is truly helpful for a decision. The main benchmark is the Pakistan Trademark Ordinance of 2001, using ten practitioner-style questions about issues such as confusion between marks, foreign registration, renewal procedures, and infringement penalties. A smaller set of three questions targets regulations and standards for agricultural robots and drones, giving an early look at cross-domain transfer.
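For readers curious about what a keyword baseline actually computes, here is a compact sketch of BM25 scoring over a handful of clauses. It is an illustration under stated assumptions, not the configuration used in the study: the clause texts are invented, the tokenizer is a simple lowercase word splitter, and `k1=1.5` and `b=0.75` are commonly used defaults rather than values reported by the authors. The inverse document frequency uses the smoothed form that keeps scores non-negative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, clauses, k1=1.5, b=0.75):
    """Score every clause against the query with Okapi BM25."""
    docs = [tokenize(c) for c in clauses]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of clauses in which each term appears at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_k_indices(scores, k=5):
    """Indices of the k highest-scoring clauses, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    # Invented clause texts for illustration only.
    clauses = [
        "Renewal of registration of a trade mark",
        "Penalty for infringement of a trade mark",
        "Appointment of the Registrar",
    ]
    scores = bm25_scores("renewal fee", clauses)
    for i in top_k_indices(scores, k=2):
        print(f"{scores[i]:.3f}  {clauses[i]}")
```

The sketch also shows the limitation described earlier: a clause scores above zero only if it shares at least one word with the query, so a question phrased in different vocabulary from the statute can miss the relevant clause entirely.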

What the Results Reveal

Across the trademark tasks, a transformer model trained on Pakistani legal text (Pak-Legal-BERT) provides the best overall ranking of useful clauses, outscoring both more generic legal transformers and the classic keyword baselines. However, the study also finds that BM25, a refined keyword method, remains surprisingly strong and even slightly outperforms one of the transformer models. Detailed analysis of individual queries shows a recurring challenge: all models sometimes rank clauses highly because they contain similar procedural phrases, even when those clauses do not actually resolve the user’s legal question. This “high-similarity but wrong answer” pattern underscores the need for careful evaluation and transparent reporting of how systems behave, query by query.
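Query-by-query reporting of this kind can be sketched as follows. This summary does not say which ranking metrics the authors report, so precision at k and reciprocal rank are used here as assumptions; both are common choices when experts mark each returned clause as helpful or not. The query names and judgment lists below are made-up placeholders, not results from the study.

```python
def precision_at_k(judgments, k=5):
    """Fraction of the top-k returned clauses that an expert marked helpful."""
    return sum(judgments[:k]) / k


def reciprocal_rank(judgments):
    """1 / rank of the first helpful clause, or 0.0 if none is helpful."""
    for rank, helpful in enumerate(judgments, start=1):
        if helpful:
            return 1.0 / rank
    return 0.0


def per_query_report(runs, k=5):
    """Map each query to its (precision@k, reciprocal rank) pair."""
    return {
        query: (precision_at_k(judged, k), reciprocal_rank(judged))
        for query, judged in runs.items()
    }


if __name__ == "__main__":
    # Placeholder expert judgments (1 = helpful, 0 = not), in ranked order.
    runs = {
        "renewal procedure": [1, 1, 0, 0, 0],
        "infringement penalty": [0, 0, 1, 0, 0],
    }
    for query, (p_at_k, rr) in per_query_report(runs).items():
        print(f"{query}: P@5={p_at_k:.2f}  RR={rr:.2f}")
```

Listing the scores per query, rather than only an average, is what exposes cases where a system ranks procedurally similar but unhelpful clauses near the top.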

Extending to Robots in the Fields

To test whether the same approach can support newer areas like agri-robotics compliance, the authors assemble a focused corpus of regulations and standards covering drone operations, robotic tractor safety, and ethical data practices for farm robots. Using the same top-five retrieval and expert-judgment protocol, they find that keyword methods achieve reasonable performance and that the transformer-based pipeline can surface relevant drone and safety provisions. At the same time, the authors stress that the current agri-robotics benchmark is small and should be viewed as evidence of feasibility rather than proof of broad generalization across all jurisdictions and technologies.

What This Means for Everyday Legal Work

Overall, the study indicates that meaning-aware clause search can reduce the effort required to pinpoint decision-ready legal provisions, especially when models are adapted to the language and drafting style of a given legal system. Instead of guessing the right keywords, practitioners can pose questions in natural language and receive a short, ranked list of likely clauses. Strong keyword tools are not obsolete: BM25 outscored one of the transformer models on the trademark benchmark, and keyword methods perform well wherever query words closely match the text of the law. Transformer-based semantic search is best seen as a complement to them, particularly for complex or cross-domain questions. With larger benchmarks, multi-expert review, and careful handling of failure cases, such systems could become a practical backbone for future legal and compliance research across industries.

Citation: Asfand E Yar, M., Hashir, Q., Tanveer, M.H. et al. Semantic clause retrieval for trademark law using transformer encoders and lexical baselines: a cross-domain agri-robotics compliance case study. Sci Rep 16, 12327 (2026). https://doi.org/10.1038/s41598-026-43098-3

Keywords: semantic legal search, trademark law, sentence embeddings, agri-robotics compliance, transformer encoders