Clear Sky Science

Comparative performance of LLMs and machine learning in predicting complications after percutaneous kyphoplasty for osteoporotic vertebral compression fractures

2026-04-01

Why this matters for people with fragile spines

As more people live into older age, painful spine fractures caused by thinning bones are becoming more common. A widely used treatment called percutaneous kyphoplasty can quickly relieve pain, but it can also lead to complications. This study asks whether modern artificial intelligence tools, including large language models similar to popular chatbots, can help doctors predict which patients are more likely to develop these complications after treatment.

Figure 1. Comparing AI tools and surgeons to foresee which spine patients may face cement leaks or new fractures after treatment.

The back problem and its common fix

Osteoporotic vertebral compression fractures occur when weakened bones in the spine collapse, often after a minor fall or even simple daily activities. Percutaneous kyphoplasty aims to stabilize these broken bones by inflating a small balloon inside the collapsed vertebra and filling the resulting space with bone cement, which usually reduces pain and restores some height to the crushed vertebra. However, cement can sometimes leak out of the bone, and new fractures can appear at other levels of the spine months later. These complications can cause serious problems, including nerve damage, lung issues, and ongoing pain, so doctors are eager for tools that can identify high-risk patients before surgery.

Old-style computer models and human judgment

Before the rise of large language models, researchers built traditional machine learning systems that learned patterns from patient records and scans. These systems can estimate the chance of cement leakage or new fractures by combining many details, such as age, bone density, fracture shape, and how the cement is distributed. At the same time, experienced spine surgeons form their own judgments after reviewing the same information. While these older computer models often perform well, they require careful training, technical expertise, and computing resources, which can limit their use in everyday hospitals.

Putting chatbots to the test

In this study, researchers gathered data from more than a thousand patients treated with kyphoplasty at a large hospital in Beijing. For each patient, they recorded standard clinical and imaging information, then asked two large language models, a set of traditional machine learning models, and two spine surgeons to predict whether bone cement would leak and whether new fractures would appear later on. The chatbots were tested in two ways. In a zero-shot setting, they were simply given the case details and asked for a prediction. In a few-shot setting, they were first shown a small set of example cases with known outcomes, to see whether learning from these examples would improve their answers.
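For readers curious about what separates the two settings in practice, here is a minimal Python sketch of how such prompts might be assembled. It is an illustration only, not the study's method: the question wording, the field names (`age`, `bone_density_t_score`), and the patient values are made up, and no real chatbot is called.

```python
from typing import Dict, List, Tuple

# Hypothetical question wording; the study's actual prompts are not given here.
QUESTION = "Will bone cement leak after kyphoplasty? Answer yes or no."


def format_case(case: Dict[str, object]) -> str:
    """Render one patient's details as a single line of 'key: value' pairs."""
    return "; ".join(f"{key}: {value}" for key, value in case.items())


def build_zero_shot_prompt(case: Dict[str, object]) -> str:
    """Zero-shot: only the new case and the question, with no worked examples."""
    return f"Patient: {format_case(case)}\n{QUESTION}"


def build_few_shot_prompt(
    case: Dict[str, object],
    examples: List[Tuple[Dict[str, object], str]],
) -> str:
    """Few-shot: example cases with known outcomes come before the new case."""
    shots = [
        f"Patient: {format_case(example)}\n{QUESTION}\nAnswer: {outcome}"
        for example, outcome in examples
    ]
    return "\n\n".join(shots + [build_zero_shot_prompt(case)])


# Made-up illustrative patients, not data from the study.
new_case = {"age": 74, "bone_density_t_score": -3.1}
examples = [
    ({"age": 68, "bone_density_t_score": -2.6}, "no"),
    ({"age": 81, "bone_density_t_score": -3.4}, "yes"),
]

zero_shot_prompt = build_zero_shot_prompt(new_case)
few_shot_prompt = build_few_shot_prompt(new_case, examples)
```

The only difference between the two prompts is the block of answered examples placed in front of the new case. The model itself is unchanged, which is why few-shot prompting is far cheaper than retraining a model.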

Figure 2. How different AI systems process patient spine data to predict safe healing versus cement leaks or future fractures.

What the computers and surgeons got right and wrong

For predicting cement leakage soon after surgery, the large language models did reasonably well. Their results were similar to those of the best traditional computer models and somewhat better than the surgeons working on their own. When it came to predicting new fractures months later, however, the chatbots struggled. Their first attempts were poor and strongly biased toward assuming that almost everyone would suffer a new fracture. Providing example cases helped somewhat, but traditional machine learning, especially a model called a support vector machine, still performed more reliably. The chatbots also failed when asked to identify specific subtypes of complications, such as exactly where the cement leaked or which vertebra would break next.
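To see why a model that assumes almost everyone will suffer a new fracture counts as a poor predictor, consider the small Python sketch below. It uses made-up toy labels, not study data, and the function name is hypothetical. It computes sensitivity (the share of real fractures that were flagged) and specificity (the share of fracture-free patients correctly cleared) for a predictor that always says "fracture".

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    actual: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = new fracture."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    # Guard against division by zero when one class is absent.
    sensitivity = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    specificity = true_neg / (true_neg + false_pos) if (true_neg + false_pos) else 0.0
    return sensitivity, specificity


# Made-up toy outcomes: 2 of 10 patients actually develop a new fracture.
actual = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# A predictor biased to the extreme: it flags every patient.
always_fracture = [1] * len(actual)

sens, spec = sensitivity_specificity(actual, always_fracture)
```

In this toy case the always-positive predictor catches every real fracture (sensitivity 1.0) but never clears a fracture-free patient (specificity 0.0), so its warnings carry no information about who is truly at risk.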

Help for doctors, but not yet a stand-alone tool

One interesting finding was that surgeons sometimes benefited from seeing the chatbots' explanations, but only in tasks where the models already performed fairly well. When the underlying predictions were weak, such as for long-term fractures, the explanations did not improve doctors' decisions. Overall, the study shows that current large language models can offer useful support for certain short-term risks after kyphoplasty, but they are not yet dependable enough to replace existing computer models or expert judgment. For now, they should be viewed as early helpers that still need fine-tuning, better training on medical data, and closer integration with imaging tools before they can safely guide real-world spine care.

Citation: Wang, T., Chen, R., Liang, M. et al. Comparative performance of LLMs and machine learning in predicting complications after percutaneous kyphoplasty for osteoporotic vertebral compression fractures. npj Digit. Med. 9, 401 (2026). https://doi.org/10.1038/s41746-026-02588-4

Keywords: osteoporotic spine fractures, percutaneous kyphoplasty, large language models, machine learning in medicine, surgical risk prediction