Clear Sky Science

Open Molecular Crystals 2025 (OMC25) dataset and models

Why Vast Crystal Data Matters

Molecular crystals sit at the heart of many everyday technologies, from medicines and pigments to flexible electronics. Their behavior depends not just on which atoms they contain, but on how countless copies of a molecule pack together in a solid. Predicting this packing and its consequences is notoriously hard and slow, and usually demands heavy quantum-mechanical calculations. This article introduces a new open dataset, Open Molecular Crystals 2025 (OMC25), that gathers more than 27 million carefully simulated molecular crystal structures. It is designed to give modern machine learning methods enough examples to learn how these crystals behave, with the long-term goal of making crystal design faster, cheaper, and more reliable.

Figure 1.

A Giant Library of Model Crystals

The OMC25 team set out to build an expansive “training ground” for computer models that predict how molecular crystals arrange themselves and how stable those arrangements are. They assembled more than 27 million crystal structures built from about 50,000 different organic molecules. Each crystal contains only common light elements, like carbon, nitrogen, oxygen, and a few halogens, and can have up to 300 atoms in its basic repeating unit. For every structure they record not just the positions of atoms, but also the total energy of the crystal, the forces acting on each atom, and the mechanical stress in the unit cell. These labels are what allow machine learning models to connect patterns in atomic arrangements with physical behavior.
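
For readers who think in code, one labeled structure can be pictured as a small record. The sketch below is only an illustration: the class name, field names, units, and the six-component stress layout are assumptions made for this example, not the dataset's actual file format. The 300-atom limit is the one stated above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

MAX_ATOMS = 300  # upper limit on atoms per repeating unit, as stated in the text


@dataclass
class LabeledCrystal:
    """Hypothetical record for one structure and its three labels."""

    symbols: List[str]              # element symbol of each atom
    positions: List[Vec3]           # atomic positions (units assumed: angstrom)
    cell: Tuple[Vec3, Vec3, Vec3]   # three lattice vectors of the repeating unit
    energy: float                   # total energy of the cell (units assumed: eV)
    forces: List[Vec3]              # one force vector per atom
    stress: Tuple[float, ...]       # six independent stress components (assumed layout)

    def validate(self) -> None:
        """Raise ValueError if the labels do not match the structure."""
        n_atoms = len(self.symbols)
        if not 0 < n_atoms <= MAX_ATOMS:
            raise ValueError(f"expected 1 to {MAX_ATOMS} atoms, got {n_atoms}")
        if len(self.positions) != n_atoms:
            raise ValueError("need exactly one position per atom")
        if len(self.forces) != n_atoms:
            raise ValueError("need exactly one force vector per atom")
        if len(self.stress) != 6:
            raise ValueError("need six stress components")
```

The point of the check is simply that every atom carries its own force label, while energy and stress belong to the cell as a whole.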

From Random Packings to Realistic Solids

To populate this library, the authors did not simply copy known experimental crystals. Instead, they used an open-source tool to generate many different ways a molecule could pack in a crystal. They varied how many molecules sit in the basic unit cell and explored a wide range of crystal symmetries. For each candidate, they created both loosely packed and tightly packed versions to cover configurations both far from and close to realistic conditions. They then used a high-quality quantum method, one that includes the weak van der Waals attraction between molecules, to relax each structure, letting the atoms move step by step until the forces on them nearly vanish. Along these relaxation paths, they sampled many intermediate structures, capturing how a crystal changes as it settles from a rough guess into a likely physical arrangement.
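
The idea of relaxing a structure and keeping snapshots along the way can be shown with a deliberately tiny toy. The sketch below is not the authors' workflow: it swaps the quantum method for a textbook Lennard-Jones pair potential, and the crystal for two particles whose separation r plays the role of "how tightly packed" the system is. The function names, step size, and force threshold are all assumptions chosen for illustration.

```python
def lj_energy_and_force(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy and force (-dE/dr) at separation r."""
    sr6 = (sigma / r) ** 6
    energy = 4.0 * epsilon * (sr6 * sr6 - sr6)
    force = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r
    return energy, force


def relax(r0, step=0.01, fmax=1e-3, max_move=0.05, max_steps=10_000):
    """Steepest-descent relaxation that records every intermediate frame.

    Each frame is a tuple (separation, energy, force). The loop stops once
    the force magnitude drops below fmax, i.e. the force "nearly vanishes".
    """
    r = r0
    trajectory = []
    for _ in range(max_steps):
        energy, force = lj_energy_and_force(r)
        trajectory.append((r, energy, force))
        if abs(force) < fmax:
            break
        # Move along the force, capping the displacement so a very
        # compressed starting guess cannot fly apart in a single step.
        move = max(-max_move, min(max_move, step * force))
        r += move
    return trajectory


def sample_frames(trajectory, n_samples):
    """Pick roughly evenly spaced frames, always keeping first and last."""
    if n_samples >= len(trajectory):
        return list(trajectory)
    if n_samples <= 1:
        return [trajectory[-1]]
    last = len(trajectory) - 1
    indices = sorted({round(i * last / (n_samples - 1)) for i in range(n_samples)})
    return [trajectory[i] for i in indices]


if __name__ == "__main__":
    for label, start in (("loose", 1.5), ("tight", 1.0)):
        frames = sample_frames(relax(start), 5)
        print(label, [round(r, 4) for r, _, _ in frames])
```

Starting from a loose guess (r = 1.5) or a tight one (r = 1.0), both runs settle at the same separation, the potential's minimum at 2^(1/6) times sigma. The sampled frames in between are the toy counterpart of the intermediate structures kept in the dataset: configurations with nonzero forces that show a model what "not yet relaxed" looks like.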

Figure 2.

Careful Filtering and Rich Variety

Because random guesses can produce unrealistic situations, the team applied strict filters to clean the data. They removed any frames where energies, forces, or stresses were wildly out of range, or where molecules broke apart or merged in chemically unreasonable ways. They also checked that cell volumes did not jump so much that the underlying numerical settings would become unreliable. The result is a dataset that spans a huge variety of chemistries and packing styles while keeping unphysical examples to a minimum. Compared with a large experimental crystal database, OMC25 contains a broader spread of crystal symmetries and unit cell sizes, deliberately oversampling some types of arrangements to challenge and enrich machine learning models.
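
In code, filters of this kind amount to a handful of yes-or-no checks applied to every frame of a relaxation. The sketch below is a minimal illustration under stated assumptions: the field names are hypothetical, every threshold is a placeholder rather than a value used for OMC25, and comparing each frame against the first frame of its trajectory is one plausible choice, not necessarily the authors'.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    """Hypothetical per-frame summary used only for this example."""

    energy_per_atom: float  # eV per atom
    max_force: float        # largest force magnitude on any atom, eV/angstrom
    max_stress: float       # largest absolute stress component, GPa
    n_molecules: int        # number of separate bonded fragments detected
    volume: float           # unit-cell volume, cubic angstrom


@dataclass
class Limits:
    """Placeholder thresholds; NOT the values used to build OMC25."""

    energy_window: Tuple[float, float] = (-20.0, 0.0)
    max_force: float = 50.0
    max_stress: float = 100.0
    max_volume_change: float = 0.5  # allowed fractional change vs first frame


def filter_trajectory(frames: List[Frame], limits: Optional[Limits] = None) -> List[Frame]:
    """Keep only frames that pass all three kinds of check."""
    if not frames:
        return []
    limits = limits or Limits()
    reference = frames[0]
    low, high = limits.energy_window
    kept = []
    for frame in frames:
        # 1. Energies, forces, and stresses must stay within range.
        in_range = (
            low <= frame.energy_per_atom <= high
            and frame.max_force <= limits.max_force
            and frame.max_stress <= limits.max_stress
        )
        # 2. Molecules must not have broken apart or merged: the number
        #    of bonded fragments has to match the starting structure.
        intact = frame.n_molecules == reference.n_molecules
        # 3. The cell volume must not have drifted too far from the start.
        change = abs(frame.volume - reference.volume) / reference.volume
        stable_cell = change <= limits.max_volume_change
        if in_range and intact and stable_cell:
            kept.append(frame)
    return kept
```

Counting bonded fragments is a crude stand-in for a real chemical-sanity check, which would also need to confirm that the same atoms are still bonded to each other.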

Teaching Computers to Predict Crystals

To test whether OMC25 is truly useful, the authors trained several state-of-the-art machine learning models that operate directly on atomic structures. These models learn to predict energy, forces, and stress from the positions and identities of atoms. When evaluated on held-out OMC25 data, they reached very low prediction errors, showing that the dataset is consistent and informative. The team then pushed the models onto external tests, such as reproducing known crystal energies and volumes and ranking different crystal forms (polymorphs) of the same molecule. Despite being trained on data generated with one flavor of quantum method, the models performed competitively on benchmarks based on somewhat more advanced methods, and they proved especially strong at comparing relative stabilities of different crystal packings.
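
Two kinds of score sit behind these statements: how far predicted numbers are from reference numbers, and whether the model puts different crystal forms of the same molecule in the right order. The sketch below illustrates both with made-up values. The function names and the three energies are invented for the example and are not taken from the paper or its benchmarks.

```python
from itertools import combinations


def mean_absolute_error(predicted, reference):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def relative_energies(energies):
    """Energy of each polymorph measured from the most stable one."""
    lowest = min(energies)
    return [e - lowest for e in energies]


def pairwise_ranking_agreement(predicted, reference):
    """Fraction of polymorph pairs that the model orders the same way."""
    pairs = list(combinations(range(len(reference)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (predicted[i] - predicted[j]) * (reference[i] - reference[j]) > 0
    )
    return agree / len(pairs)


if __name__ == "__main__":
    # Invented energies (per molecule, arbitrary units) for three polymorphs.
    reference = [-100.0, -98.0, -99.0]
    predicted = [-103.0, -101.5, -102.0]

    print("absolute error:", mean_absolute_error(predicted, reference))
    print(
        "relative error:",
        mean_absolute_error(relative_energies(predicted), relative_energies(reference)),
    )
    print("ranking agreement:", pairwise_ranking_agreement(predicted, reference))
```

In this invented case the model is off by about 3.2 units on every absolute energy, yet the error on energies measured relative to the most stable form is only about 0.17, and all three pairs are ranked correctly. A shift that is nearly the same for every polymorph cancels when forms are compared with one another, which is one general reason a model can rank polymorphs well even when its absolute energies differ from the reference.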

What This Means for Future Materials

For non-specialists, the key message is that OMC25 offers a large, carefully curated playground where machine learning models can “practice” on realistic molecular crystals. Instead of running demanding quantum calculations from scratch for every new crystal guess, researchers can increasingly rely on fast, learned models trained on OMC25 to screen and refine structures. This could speed up the search for better drug forms, more efficient organic electronics, and improved specialty materials. While the dataset focuses on a particular family of crystals and uses one level of quantum theory, it establishes a powerful foundation. By making both the data and example models openly available, the authors aim to catalyze broader efforts to predict and design molecular crystals with the ease and speed that modern machine learning can offer.

Citation: Gharakhanyan, V., Barroso-Luque, L., Yang, Y. et al. Open Molecular Crystals 2025 (OMC25) dataset and models. Sci Data 13, 354 (2026). https://doi.org/10.1038/s41597-026-06628-2

Keywords: molecular crystals, machine learning potentials, materials database, crystal structure prediction, quantum chemistry