Clear Sky Science
Ontology-driven association rule mining for biomedical entity relationships: integrating hierarchical knowledge to improve gene-disease discovery
Why hidden gene–disease links matter
Modern medicine increasingly depends on finding which genes are connected to which diseases. These links can reveal why illnesses arise, suggest new drug targets, and point to people at higher risk. Yet most computer tools look only for genes and diseases that appear together in the same sentence or paper, missing many subtle but important connections. This study introduces a new way to mine the biomedical literature that taps into expert-built knowledge hierarchies, aiming to uncover both well-known and overlooked gene–disease relationships more reliably.
From raw text to candidate links
The authors start by gathering a large collection of scientific articles from PubMed and breaking each article into sentences. Each sentence is treated like a small “basket” of items that may contain one or more gene names and one or more disease names. Using established data mining algorithms (Apriori, FP-Growth, and Eclat), they scan millions of these baskets to find gene–disease pairs that tend to appear together more often than expected by chance. This first step, called entity-specific association, captures the direct co-occurrences that most existing tools rely on. It already reveals thousands of potential connections, but it still favors well-studied genes and common diseases that dominate the literature.
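The basket idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentences, gene names, and disease names are made up, and the code counts pairs by brute force rather than running Apriori, FP-Growth, or Eclat, which reach the same frequent pairs far more efficiently on millions of baskets. The function name `pair_support` and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import product

# Toy "baskets": each sentence reduced to the gene and disease names it mentions.
# All entries are invented for illustration.
baskets = [
    {"genes": {"BRCA1"}, "diseases": {"breast cancer"}},
    {"genes": {"BRCA1", "TP53"}, "diseases": {"breast cancer"}},
    {"genes": {"TP53"}, "diseases": {"lung cancer"}},
    {"genes": {"SCN1A"}, "diseases": {"Dravet syndrome"}},
    {"genes": set(), "diseases": {"breast cancer"}},
]

def pair_support(baskets, min_count=1):
    """Return {(gene, disease): support} for pairs seen in at least min_count baskets.

    Support is the fraction of all baskets that contain both items.
    """
    counts = Counter()
    for basket in baskets:
        # Every gene in a sentence is paired with every disease in that sentence.
        for pair in product(sorted(basket["genes"]), sorted(basket["diseases"])):
            counts[pair] += 1
    total = len(baskets)
    return {pair: c / total for pair, c in counts.items() if c >= min_count}

all_pairs = pair_support(baskets)
frequent_pairs = pair_support(baskets, min_count=2)
```

With this toy data, only the BRCA1 and breast cancer pair survives the `min_count=2` filter, which mirrors how a minimum-support threshold favors frequently mentioned entities.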

Using biological hierarchies as a map
To go beyond simple word counting, the researchers turn to biological “maps” known as ontologies. The Gene Ontology describes what genes do and where they act in the cell, while the Disease Ontology organizes diseases into families and subtypes. In these hierarchies, specific terms such as a rare epilepsy sit under broader parents like “neurological disease.” The key idea is that if a particular gene is strongly tied to a very specific disease, and that disease belongs to a larger family, then the gene likely has some relationship to that whole family as well. The authors formalize this by creating hierarchical ontology associations, which propagate evidence up through parent terms on both the gene and disease sides, and also indirectly capture “siblings” that share a parent.
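A minimal sketch of upward propagation, assuming a tiny hand-written disease hierarchy, might look like the following. The hierarchy, the counts, and the helper names `ancestors` and `propagate` are hypothetical, and the sketch rolls evidence up on the disease side only, whereas the study propagates on both the gene and disease sides.

```python
from collections import Counter

# Hypothetical child -> parents map for a tiny disease hierarchy.
disease_parents = {
    "Dravet syndrome": ["epilepsy"],
    "epilepsy": ["neurological disease"],
    "migraine": ["neurological disease"],
    "neurological disease": [],
}

def ancestors(term, parents):
    """Return every term reachable by following parent links (term itself excluded)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(pair_counts, parents):
    """Add each (gene, disease) count to the same gene paired with every disease ancestor."""
    rolled_up = Counter(pair_counts)
    for (gene, disease), count in pair_counts.items():
        for ancestor in ancestors(disease, parents):
            rolled_up[(gene, ancestor)] += count
    return rolled_up

# Invented direct co-occurrence counts.
direct = {("SCN1A", "Dravet syndrome"): 7, ("SCN1A", "migraine"): 1}
rolled = propagate(direct, disease_parents)
```

Here the count for SCN1A and "neurological disease" combines evidence from both Dravet syndrome and migraine, which is how sibling terms that share a parent end up contributing indirectly.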
Blending direct evidence with inherited signals
Simply adding up counts from many levels of the hierarchy can distort scores, especially because very general terms like “cancer” appear extremely often. The team therefore designs a careful scoring system. They use a standard data mining measure called lift, the ratio of how often a gene and disease actually co-occur to how often they would if they appeared independently, to gauge how strongly the two are linked beyond chance, and then transform these scores to reduce skew and make them comparable. Their new Athar Semantic-Enriched Association (ASEA) score blends three ingredients: the direct gene–disease link, links between the gene and broader disease families, and links between broader gene functions and disease families. They also apply rank-based normalization so that scores behave similarly across different depths of the ontologies, allowing fair comparison and ranking.
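The scoring pipeline can be illustrated roughly as follows. The lift formula is the standard one from association rule mining; everything else is an assumption made for this sketch, since the exact ASEA formula is not given here: the log2 transform stands in for the skew-reducing transform, the blending weights are placeholders, and the function names are hypothetical.

```python
import bisect
import math

def lift(n_pair, n_gene, n_disease, n_total):
    """Standard lift: observed co-occurrence rate over the rate expected under independence.

    Algebraically equal to P(gene, disease) / (P(gene) * P(disease)).
    """
    return (n_pair * n_total) / (n_gene * n_disease)

def log_lift(value):
    """Assumed transform: log2 of lift, so 0 means chance level and positive means enriched."""
    return math.log2(value)

def blend(direct, gene_to_family, function_to_family, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three evidence levels. The weights are placeholders, not the paper's."""
    w_direct, w_family, w_function = weights
    return w_direct * direct + w_family * gene_to_family + w_function * function_to_family

def rank_normalize(scores):
    """Replace each score by the fraction of scores less than or equal to it (range (0, 1])."""
    ordered = sorted(scores.values())
    total = len(ordered)
    return {key: bisect.bisect_right(ordered, value) / total for key, value in scores.items()}

# Invented counts: 16 sentences mention both, 100 the gene, 200 the disease, out of 10,000.
example_lift = lift(n_pair=16, n_gene=100, n_disease=200, n_total=10_000)
example_log = log_lift(example_lift)
example_blend = blend(direct=example_log, gene_to_family=1.0, function_to_family=0.5)
normalized = rank_normalize({"pair_a": example_blend, "pair_b": 0.2, "pair_c": 0.7})
```

In this toy case the pair co-occurs eight times more often than independence would predict, and rank normalization turns the three raw blended scores into evenly spaced values that no longer depend on the raw scale.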

Testing the method against trusted databases
To judge whether ASEA produces biologically meaningful results, the authors compare their top-ranked associations with entries in expert-curated resources such as the Comparative Toxicogenomics Database and DisGeNET. They find that ASEA recovers more high-grade known associations than any of the classic algorithms alone, while still generating a rich set of additional candidate links. In total, ASEA identifies 185 notable gene–disease pairs. These are then grouped into four categories: well-established connections already in major databases; connections strongly backed by recent studies but not yet curated; links with only weak or scattered database support; and purely speculative associations with no current backing, which are proposed as hypotheses for future laboratory or clinical work.
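One simple way to picture this kind of check is to measure how many of the top-ranked pairs already appear in a curated set. The sketch below uses precision at k for that purpose; this metric is an assumption chosen for illustration and may differ from the evaluation the authors performed. The ranked list and the curated set are invented, and `GENE_X` and `disease_y` are placeholder names.

```python
def precision_at_k(ranked_pairs, curated, k):
    """Fraction of the top-k ranked pairs that appear in the curated set."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    return sum(1 for pair in top if pair in curated) / len(top)

# Invented ranking, best score first.
ranked = [
    ("BRCA1", "breast cancer"),
    ("GENE_X", "disease_y"),
    ("SCN1A", "Dravet syndrome"),
    ("TP53", "lung cancer"),
]

# Invented stand-in for an expert-curated database.
curated = {("BRCA1", "breast cancer"), ("SCN1A", "Dravet syndrome")}

top2_precision = precision_at_k(ranked, curated, k=2)

# Pairs absent from the curated set are the candidates for further review.
uncurated_candidates = [pair for pair in ranked if pair not in curated]
```

Pairs that fall outside the curated set are not errors by default: they are the ones that would be sorted into the literature-supported, weakly supported, or speculative categories described above.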
What this means for future medicine
For non-specialists, the crucial message is that this framework offers a smarter way to read the biomedical literature at scale. Instead of counting only obvious mentions of a gene and a disease side by side, it leverages expert knowledge about how genes and diseases are organized into families to strengthen promising but rare signals. The resulting ASEA score does not prove that a gene causes a disease, but it provides a transparent, statistically grounded shortlist of candidates for scientists and clinicians to investigate. In the long run, such ontology-aware mining could accelerate the discovery of biomarkers, inform precision medicine, and help turn the growing flood of biomedical text into actionable medical insight.
Citation: Naqash, M.A., Amin, M., Uddin, J. et al. Ontology-driven association rule mining for biomedical entity relationships: integrating hierarchical knowledge to improve gene-disease discovery. Sci Rep 16, 13072 (2026). https://doi.org/10.1038/s41598-026-42584-y
Keywords: gene–disease associations, biomedical text mining, ontologies, precision medicine, computational biology