Clear Sky Science

Development and evaluation of an effective solubility prediction model for pharmaceuticals in organic solvents using machine learning based on eXtreme Gradient Boosting

2026-05-28

Why dissolving medicines really matters

When a pill enters your body, it must first dissolve before it can do any good. How easily an active ingredient dissolves in a liquid affects how a medicine is made, how stable it is, and how well it works. Measuring this "solubility" in many different liquids and at many temperatures is slow and costly. This study explores how machine learning can help scientists quickly estimate how well drug-like compounds dissolve in common organic liquids, using only a small amount of easy-to-obtain information.

Choosing the right liquid for crystal making

In drug manufacturing, producers often grow crystals of an active ingredient from an organic liquid. The liquid does not just control how much solid is recovered; it also shapes the size and form of the crystals, which in turn can influence how the medicine behaves. Traditionally, chemists either perform many experiments or use complex thermodynamic equations to predict solubility. These established methods can be accurate but often require many fitted parameters or detailed molecular data that are not available in early development. The authors of this work instead ask whether a carefully designed machine learning model can capture the key trends in solubility while staying grounded in simple, physically meaningful inputs.

Figure 1. How a compact machine learning model links basic drug and solvent properties to solubility across many organic liquids.

A compact data-driven model with physical insight

The team built a solubility prediction model using a popular machine learning method called eXtreme Gradient Boosting, or XGBoost. They gathered published solubility data for four drug-like molecules in nine common organic liquids, across a wide span of temperatures, yielding 224 data points. Rather than feed the algorithm arbitrary descriptors, they selected ten features that chemists already understand: properties of the solid (such as melting temperature, heat of fusion, heat capacity and a well-known solubility parameter), basic liquid properties (dielectric constant, as a measure of polarity, and boiling temperature), temperature itself, and simple encodings of the names of the solid and liquid. Because most solids dissolve better when warmed, they also built in a rule that forces the model’s predictions to rise with temperature, which keeps its behavior physically sensible.
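The temperature rule described above corresponds to what XGBoost calls a monotonic constraint. The sketch below is a minimal illustration, not the authors' code. The feature names, the hyperparameter values and the two helper functions are assumptions made for this example; only the `monotone_constraints` parameter itself belongs to XGBoost. The target is assumed to be the logarithm of solubility, so a prediction that rises in log space also rises in solubility.

```python
# Hypothetical column names: the article describes the kinds of inputs,
# not how the authors labeled them.
FEATURES = [
    "melting_temperature_K",
    "heat_of_fusion",
    "heat_capacity",
    "solubility_parameter",
    "dielectric_constant",
    "boiling_temperature_K",
    "temperature_K",
    "solute_code",
    "solvent_code",
]


def build_params(feature_names, increasing=("temperature_K",)):
    """Return XGBoost training parameters with a monotone-increasing
    constraint on every feature listed in `increasing`."""
    missing = [name for name in increasing if name not in feature_names]
    if missing:
        raise ValueError(f"unknown constrained features: {missing}")
    # +1: the prediction may not decrease as the feature grows; 0: unconstrained.
    flags = [1 if name in increasing else 0 for name in feature_names]
    return {
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "max_depth": 4,  # illustrative value, not taken from the study
        "eta": 0.1,      # illustrative value, not taken from the study
        "monotone_constraints": "(" + ",".join(str(f) for f in flags) + ")",
    }


def train(X, y, feature_names=FEATURES, rounds=200):
    """Fit a booster on rows X (one column per feature) and targets y."""
    import numpy as np
    import xgboost as xgb  # imported lazily so build_params works without it

    dtrain = xgb.DMatrix(
        np.asarray(X, dtype=float),
        label=np.asarray(y, dtype=float),
        feature_names=list(feature_names),
    )
    return xgb.train(build_params(feature_names), dtrain, num_boost_round=rounds)


params = build_params(FEATURES)
print(params["monotone_constraints"])  # (0,0,0,0,0,0,1,0,0)
```

With this setting every tree split on temperature is restricted so that, with all other inputs held fixed, a higher temperature can never lower the predicted value.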

How well the model matches real measurements

After tuning the model using cross-validation, the authors tested how closely the predictions matched measured values. They evaluated performance by comparing the logarithms of the measured and predicted solubilities, a natural choice because the solubilities span several orders of magnitude. For the four compounds used for training and testing, the model reproduced the data with very small average errors and high correlation, indicating that it can reliably describe temperature-dependent solubility across many liquid environments. Importantly, the model remained accurate even for a very poorly soluble compound, risperidone, whose behavior is notoriously hard to capture with simpler equations.
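To see why errors are compared in log space, consider the short sketch below. It is an illustration under stated assumptions: the function name is hypothetical, the numbers are made up rather than taken from the study, and base-10 logarithms are assumed because the article does not say which base the authors used.

```python
import math


def log10_errors(measured, predicted):
    """Return (mean absolute error, root-mean-square error) in log10 units."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [math.log10(p) - math.log10(m) for m, p in zip(measured, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Made-up solubilities spanning two orders of magnitude.
measured = [1e-4, 1e-3, 1e-2]
predicted = [2e-4, 1e-3, 5e-3]

mae, rmse = log10_errors(measured, predicted)
print(f"MAE  = {mae:.3f} log10 units")
print(f"RMSE = {rmse:.3f} log10 units")
```

The first and third predictions are each off by a factor of two, so each contributes the same log error of about 0.301, even though their absolute errors differ by a factor of fifty. On a linear scale the poorly soluble system would barely register.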

Figure 2. Stepwise view of inputs, machine learning model, and matching curves for measured and predicted solubility with rising temperature.

Predicting a completely new compound

The crucial question was whether the model could handle an active ingredient it had never seen. To test this, the researchers set aside all data for a fifth compound, butamben, and used those 50 measurements only after training was complete. The model’s errors were larger for this true prediction task than for the data it had seen before, but still stayed within a range comparable to typical experimental uncertainty, especially for several of the liquids tested. When compared with two widely used semi-predictive thermodynamic methods, Flory-Huggins and temperature-dependent NRTL-SAC, the XGBoost model produced smaller errors overall and performed particularly well for the most challenging systems.
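The hold-out test described above depends on splitting the data by compound rather than at random, so that no measurement of the new compound can leak into training. A minimal sketch of such a split follows. The function, the field names, the placeholder names `compound_B`, `solvent_1` and `solvent_2`, and all numeric values are hypothetical; only the compound names butamben and risperidone come from the article.

```python
def split_by_compound(rows, held_out):
    """Split rows into (train, test) so that every row of the held-out
    compound lands in test and none of them leaks into train."""
    train = [r for r in rows if r["compound"] != held_out]
    test = [r for r in rows if r["compound"] == held_out]
    if not test:
        raise ValueError(f"no rows found for {held_out!r}")
    return train, test


# Made-up rows; real ones would also carry the solid and solvent properties.
rows = [
    {"compound": "risperidone", "solvent": "solvent_1", "T_K": 298.15, "log_x": -3.2},
    {"compound": "risperidone", "solvent": "solvent_2", "T_K": 308.15, "log_x": -3.0},
    {"compound": "compound_B", "solvent": "solvent_1", "T_K": 298.15, "log_x": -1.9},
    {"compound": "butamben", "solvent": "solvent_1", "T_K": 298.15, "log_x": -0.8},
    {"compound": "butamben", "solvent": "solvent_2", "T_K": 308.15, "log_x": -0.6},
]

train_rows, test_rows = split_by_compound(rows, held_out="butamben")
print(len(train_rows), len(test_rows))  # 3 2
```

A random split would scatter butamben rows across both sets and let the model interpolate between temperatures it had already seen for that compound, which would overstate how well it handles a genuinely new ingredient.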

What this means for future drug development

For non-specialists, the key takeaway is that a relatively small, physically informed machine learning model can reliably estimate how well drug-like molecules dissolve in common organic liquids over a range of temperatures. It does this using a modest set of measurable properties, without the heavy parameter fitting often needed in traditional approaches. While the authors note that further refinement of the chosen descriptors and broader data would improve performance, the study shows that such models can already support solvent screening and process design, helping chemists narrow down promising options before carrying out detailed laboratory work.

Citation: Valavi, M., Assareh, M., Khoshsima, A. et al. Development and evaluation of an effective solubility prediction model for pharmaceuticals in organic solvents using machine learning based on eXtreme Gradient Boosting. Sci Rep 16, 16592 (2026). https://doi.org/10.1038/s41598-026-53038-w

Keywords: drug solubility, organic solvents, machine learning, XGBoost, crystallization