Clear Sky Science · en

Shifting the retinal foundation models paradigm from slices to volumes for optical coherence tomography

· Back to index

Why this matters for eye health

Eye scans are becoming a frontline tool for spotting vision-threatening diseases long before a person notices symptoms. This study explores a new way for artificial intelligence (AI) to read one of the most powerful eye scans we have—optical coherence tomography (OCT)—by treating it not as a single picture, but as a full 3D movie-like volume. That shift turns out to make AI better at recognizing major causes of blindness, potentially bringing earlier and more reliable screening to clinics around the world.

Figure 1
Figure 1.

From flat snapshots to 3D views

OCT creates a three-dimensional block of the retina, built from many thin cross-sectional slices. In everyday clinical practice, eye doctors scroll through this entire volume to look for signs of disease, such as tiny deposits in age-related macular degeneration (AMD) or subtle structural changes in glaucoma. Yet most existing AI “foundation models” for retinal images ignore this rich 3D structure. They usually look at just a single central slice, based on the assumption that it contains the most important information. The authors argue that this shortcut throws away much of what clinicians actually use, and may be one reason why AI systems still struggle to match real-world diagnostic needs.

Borrowing ideas from video analysis

Instead of treating OCT as a lone picture, the researchers treat it as if it were a short video: a sequence of slices progressing through the depth of the retina. They adopt a cutting-edge video foundation model called V-JEPA, originally trained on everyday natural videos, and adapt it to OCT. Like other modern AI systems, V-JEPA is first trained in a self-supervised way, learning to predict and fill in missing information without needing labels. This creates a powerful general “visual vocabulary” the model can later reuse. The researchers then fine-tune V-JEPA on smaller, labeled OCT datasets so it can distinguish between healthy and diseased eyes.

Testing across countries and diseases

To see whether this 3D approach truly helps, the team compared V-JEPA with three strong image-based foundation models, including two trained specifically on retinal images and one trained on natural photographs. All models used similar underlying neural network architectures, so differences in performance mainly reflect how they use the data, not how large or complex they are. The models were asked to detect two major eye diseases—glaucomatous optic neuropathy (a form of glaucoma) and AMD—using five independent OCT datasets collected in the United States, China, Israel, and Iran, covering different scanners, populations, and disease patterns.

Figure 2
Figure 2.

Clearer answers from full 3D information

Across these datasets, the video-based V-JEPA either matched or outperformed all image-based models, achieving an average area under the ROC curve of 0.94 compared with 0.90 for the best single-slice model—a statistically significant gain. The advantage was largest on some of the more challenging data, where disease and non-disease cases were harder to tell apart. Analyses of the model’s attention maps showed that V-JEPA focuses on relevant structures scattered across multiple slices, not just the central one. Error analysis revealed that single-slice models often missed cases where pathology was absent or subtle in the middle slice but present elsewhere in the volume. By effectively increasing spatial coverage and learning how slices relate to one another, the 3D model could better capture these patterns.

Balancing performance and practicality

The authors also compared V-JEPA to other ways of using 3D information, such as classic three-dimensional convolutional networks trained from scratch and a “2.5D” method that processes many slices one by one with an image model and then combines the results. V-JEPA generally outperformed these alternatives while requiring less computation than the 2.5D strategy. Importantly, its benefits held even when only modest amounts of labeled data were available, and its performance continued to improve as training data increased, whereas image-based models tended to plateau. Although analyzing volumes is slower than reading a single slice, the extra processing time remained small—on the order of hundredths of a second per scan on modern hardware.

What this means for future eye care

For non-specialists, the takeaway is straightforward: letting AI see the retina in 3D, the way clinicians do, leads to more accurate automated detection of common blinding diseases. This work suggests that future “foundation models” for eye imaging should be designed from the start to handle full OCT volumes, not just individual snapshots. Combined with broader data sharing and, eventually, integration with other clinical information, such models could help bring consistent, high-quality screening for glaucoma and macular degeneration to many more clinics and patients worldwide.

Citation: Judkiewicz, R., Berkowitz, E., Meisel, M. et al. Shifting the retinal foundation models paradigm from slices to volumes for optical coherence tomography. npj Digit. Med. 9, 314 (2026). https://doi.org/10.1038/s41746-026-02496-7

Keywords: optical coherence tomography, retinal imaging AI, age-related macular degeneration, glaucoma detection, video foundation models