Clear Sky Science

qsGW quasiparticle and GW-BSE excitation energies of 133,885 molecules

Why a Giant Map of Molecules Matters

Designing better solar cells, LEDs, and other light-responsive materials often comes down to understanding how molecules behave when they absorb or release light. Doing this accurately with traditional quantum chemistry is so computationally demanding that researchers can study only relatively few molecules at a time. This article introduces a large, systematically checked dataset intended for training machine-learning models. It covers how more than a hundred thousand molecules respond when they gain or lose electrons or are excited by light, making those responses far easier to predict.

A New Playground for Molecular Discovery

The authors present QM9GWBSE, a dataset covering 133,885 small organic molecules originally collected in the popular QM9 database. For every one of these molecules, they provide high-quality information about two key aspects of electronic behavior. First, they list quasiparticle energies, which describe how tightly electrons are bound and how easily they can be removed or added. These energies are critical for understanding charge transport and redox chemistry. Second, they include excitation energies, which quantify the energy a molecule takes up when it absorbs light and an electron is promoted to a higher energy level. Together, these data form the basic ingredients needed to predict absorption spectra, color, and other optical properties that matter in technologies such as photovoltaics and light-emitting devices.
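
As a rough illustration of how quasiparticle energies relate to removing or adding an electron, the sketch below turns a highest-occupied (HOMO) and a lowest-unoccupied (LUMO) quasiparticle energy into an ionization potential, an electron affinity, and a fundamental gap. The function name and the input values are hypothetical and are not taken from QM9GWBSE. The sign convention, in which the vertical ionization potential is the negative of the HOMO quasiparticle energy, is the usual GW reading and is assumed here.

```python
def charged_excitations(homo_ev: float, lumo_ev: float) -> dict:
    """Derive charge-related quantities from two quasiparticle energies (in eV).

    Assumed convention: energies are measured relative to the vacuum level,
    so a bound electron has a negative quasiparticle energy.
    """
    return {
        # Energy needed to remove an electron from the HOMO.
        "ionization_potential": -homo_ev,
        # Energy released when an electron is added to the LUMO.
        "electron_affinity": -lumo_ev,
        # Fundamental gap: ionization potential minus electron affinity.
        "fundamental_gap": lumo_ev - homo_ev,
    }


# Hypothetical values for illustration only.
example = charged_excitations(homo_ev=-9.0, lumo_ev=1.0)
print(example)
```

A positive LUMO energy, as in this made-up example, corresponds to a negative electron affinity, meaning the extra electron is not bound.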

Figure 1.

A Careful Balance Between Accuracy and Cost

Producing such an enormous dataset with the very best quantum chemistry methods would be practically impossible: the most accurate approaches scale so steeply with system size that they are limited to much smaller test sets. Cheaper methods exist and are often used to generate large databases, but they can be unreliable, and their accuracy depends strongly on technical choices. The authors instead adopt an approach called quasiparticle self-consistent GW combined with the Bethe–Salpeter equation (qsGW-BSE). This family of methods occupies a middle ground: it is significantly more accurate than many commonly used techniques, yet still efficient enough to be applied across the entire QM9 collection. Crucially, qsGW-BSE is largely free of adjustable parameters, reducing the risk of hidden biases tied to method tuning.

What Exactly Is Stored in the Dataset

For each molecule, the dataset contains the energies of important electronic states and associated properties in a simple, standardized form. Users can access quasiparticle energy levels, the lowest several singlet–singlet and singlet–triplet excitation energies, and quantities related to how strongly each excitation interacts with light, such as transition dipole moments and oscillator strengths. The information is organized into separate archives, each holding one file per molecule, with the entries in each file ordered from lowest to highest energy. Alongside this, the authors also provide the underlying molecular structures and reference energies from a simpler density-functional calculation. These ingredients together make the dataset particularly well suited for training neural networks that map from molecular structure directly to excited-state properties.
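
To show how an excitation energy and a transition dipole moment combine into an oscillator strength, here is a minimal sketch using the standard length-gauge expression f = (2/3) · E · |μ|², with E and μ in atomic units. Whether the stored oscillator strengths in the dataset follow exactly this convention is an assumption, and the function name and inputs are hypothetical.

```python
# CODATA value of one Hartree in electronvolts.
HARTREE_IN_EV = 27.211386245988


def oscillator_strength(excitation_ev: float, dipole_au: tuple) -> float:
    """Dimensionless oscillator strength from an excitation energy (eV)
    and a transition dipole vector (atomic units, e * bohr)."""
    energy_hartree = excitation_ev / HARTREE_IN_EV
    dipole_squared = sum(component * component for component in dipole_au)
    return (2.0 / 3.0) * energy_hartree * dipole_squared


# Hypothetical inputs: a 5 eV excitation with a dipole along x.
print(oscillator_strength(5.0, (0.8, 0.0, 0.0)))

# A vanishing transition dipole gives a "dark" state with zero strength.
print(oscillator_strength(5.0, (0.0, 0.0, 0.0)))
```

Under this convention a spin-forbidden singlet–triplet excitation would carry zero oscillator strength, since its transition dipole vanishes when spin-orbit coupling is neglected.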

Figure 2.

Ensuring Reliability at Massive Scale

Because the dataset is so large, the authors rely on an automated quality-control pipeline instead of manual inspection. They encode simple but powerful physical expectations: for example, how the energy gap between filled and empty molecular orbitals should change when moving from an approximate description to the more refined qsGW treatment, and what ranges of energies are reasonable for small organic molecules. If a calculation violates these checks or shows mathematical pathologies, it is rerun with tighter numerical settings and a more flexible auxiliary basis that improves stability. In only two cases do parts of the calculation remain problematic, likely due to a genuine physical instability in those molecules; these exceptions are explicitly documented in the accompanying files.
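
The kind of automated check described above might look roughly like the following sketch. It is not the authors' pipeline: the function name, the specific rules, and the 30 eV upper bound are all assumptions chosen for illustration. In particular, the rule that the qsGW gap should be larger than the density-functional gap reflects the common tendency of density-functional orbital gaps to underestimate the fundamental gap, not a criterion stated in the article.

```python
import math


def sanity_check(dft_gap_ev, qsgw_gap_ev, excitations_ev, max_excitation_ev=30.0):
    """Return a list of problems found; an empty list means the record passes.

    All thresholds are hypothetical placeholders.
    """
    values = [dft_gap_ev, qsgw_gap_ev, *excitations_ev]
    # Mathematical pathologies: NaN or infinite numbers.
    if not all(math.isfinite(v) for v in values):
        return ["non-finite value"]

    problems = []
    # Assumed expectation: the qsGW gap opens up relative to the DFT gap.
    if qsgw_gap_ev <= dft_gap_ev:
        problems.append("qsGW gap does not exceed DFT gap")
    # Assumed plausible range for excitation energies of small molecules.
    if any(e <= 0.0 or e > max_excitation_ev for e in excitations_ev):
        problems.append("excitation energy outside plausible range")
    return problems


# Hypothetical records: the second one would be queued for a rerun.
for record in [(5.0, 10.0, [6.0, 7.5]), (5.0, 4.0, [6.0])]:
    issues = sanity_check(*record)
    print("ok" if not issues else f"rerun needed: {issues}")
```

Returning the list of reasons, instead of a bare pass/fail flag, makes it easy to record why a calculation was rerun.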

Putting the Data in Context

To demonstrate that their approach is sound, the authors compare their results to other state-of-the-art datasets. They show that the overall distributions of key quantities, such as the highest occupied electronic energy levels and the lowest excitation energies, match the shape of existing references while displaying predictable shifts that can be rationalized by differences in method and basis set. They also check how sensitive their results are to the choice of basis functions used to represent electrons, confirming that any residual basis set error is comparable to the typical theoretical uncertainty of modern GW-BSE methods. Taken together, these tests provide evidence that the large body of data is free of unphysical outliers and systematic distortions that could mislead downstream machine-learning models.
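
One simple way to express "same shape, shifted position" numerically is to compare the means and the spreads of two samples. The sketch below does this for two lists of energies. The function name and the toy numbers are hypothetical, and this is only one possible comparison, not necessarily the analysis the authors performed.

```python
import statistics


def compare_distributions(ours, reference):
    """Return (mean shift, ratio of standard deviations) between two samples.

    A ratio close to 1 with a non-zero shift suggests a similar spread
    that is displaced by a roughly constant offset.
    """
    shift = statistics.mean(ours) - statistics.mean(reference)
    spread_ratio = statistics.stdev(ours) / statistics.stdev(reference)
    return shift, spread_ratio


# Toy data for illustration only, not values from any dataset.
reference_energies = [1.0, 2.0, 3.0, 4.0]
our_energies = [x + 0.5 for x in reference_energies]

shift, spread_ratio = compare_distributions(our_energies, reference_energies)
print(f"shift = {shift:.2f} eV, spread ratio = {spread_ratio:.2f}")
```

Mean and standard deviation capture only the position and width of a distribution; comparing histograms or quantiles would give a fuller picture of its shape.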

A Foundation for Smarter Molecular Design

In essence, this work delivers a high-quality, openly available map linking molecular structures to their charged and light-induced electronic responses across more than a hundred thousand compounds. For non-specialists, the key message is that this dataset can help machine-learning models learn the "rules" of how molecules interact with light and carry charge, without requiring each new molecule to be simulated from scratch with heavy computations. As a result, chemists and materials scientists gain a powerful tool to rapidly screen vast chemical spaces for promising candidates in areas such as solar energy, optoelectronics, and photocatalysis, accelerating the path from theoretical ideas to practical materials.

Citation: Baum, D., Förster, A. & Visscher, L. qsGW quasiparticle and GW-BSE excitation energies of 133,885 molecules. Sci Data 13, 643 (2026). https://doi.org/10.1038/s41597-026-07018-4

Keywords: molecular excited states, machine learning in chemistry, GW-BSE, quantum chemistry datasets, molecular spectroscopy