Clear Sky Science
Instability and performance limits of convolutional neural networks on non-sequential medical tabular data: an empirical investigation
Why this matters for everyday medicine
Hospitals increasingly rely on artificial intelligence to help predict who has cancer, heart disease, or serious infections using spreadsheet-like medical records rather than images. This study asks a deceptively simple question with big practical consequences: are today’s popular image-based neural networks actually trustworthy when we feed them such non-image, column‑based medical data, or do they behave unpredictably in ways that could mislead doctors and patients?

Two types of brain-inspired calculators
The researchers compared two families of neural networks that mimic, in a very rough way, how brains process information. Convolutional neural networks, or CNNs, are the workhorses of modern image recognition. They scan pictures in small patches, looking for local patterns like edges or textures, then build up to more complex shapes. Multi‑layer perceptrons, or MLPs, take a simpler approach: they treat each input feature—such as age, blood pressure, or a lab value—as an independent number and learn weighted combinations of all of them at once, without assuming any particular order or neighborhood.
Putting medical tables to the test
To see how these models behave on real-world health data, the team used three well-known medical datasets that look more like spreadsheets than images. One contained laboratory and clinical features from patients with COVID‑19, used to predict who would survive. Another described microscope-based measurements of breast tumors, used to distinguish malignant from benign cases. The third captured classic heart disease risk factors from a cardiology database. Importantly, these datasets list variables side by side, but there is no natural “left‑to‑right” order that carries meaning, unlike pixels in an image.
Shuffling the columns and shaking the models
The heart of the study was a massive stress test. The authors repeatedly shuffled the order of the input columns and, at the same time, randomly changed key parts of the CNN design, such as how many small “patch readers” (kernels) it used, how wide those patches were, and how many neurons sat in its final decision-making layer. For each shuffle-and-architecture combination (1,000 permutations in all), they trained the CNN and, in parallel, a comparable MLP. Instead of focusing on a single “best” accuracy, they looked at how the performance scores spread out across all these runs, using the area under the ROC curve (AUROC) as a summary of how well each model separated the two outcome groups, such as sick from healthy patients.
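For readers who think in code, the stress test described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the hyperparameter ranges, the feature count of 10, and the helper names `sample_run` and `summarize` are all assumptions made for the example, and the model-training step is left out entirely. The `auroc` function uses the standard pairwise definition: the fraction of positive/negative pairs in which the positive case receives the higher score.

```python
import random
import statistics


def auroc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sample_run(n_features, rng):
    """Draw one column order plus one CNN design (all ranges are hypothetical)."""
    order = list(range(n_features))
    rng.shuffle(order)
    return {
        "column_order": order,
        "n_kernels": rng.choice([4, 8, 16, 32]),
        "kernel_width": rng.randint(2, min(7, n_features)),
        "dense_units": rng.choice([8, 16, 32, 64]),
    }


def summarize(aurocs):
    """Describe the spread of scores rather than only the best one."""
    return {
        "min": min(aurocs),
        "median": statistics.median(aurocs),
        "max": max(aurocs),
        "stdev": statistics.stdev(aurocs),
    }


rng = random.Random(0)
runs = [sample_run(n_features=10, rng=rng) for _ in range(1000)]
# A real experiment would train a CNN and an MLP for every entry in `runs`,
# score each on held-out patients with auroc(), and pass the scores to summarize().
```

Reporting the minimum and the standard deviation alongside the maximum is what makes occasional very poor runs visible; a single best score would hide them.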

What they found inside the black box
The results painted a sobering picture for CNNs on non-image medical tables. In some carefully chosen settings, CNNs could match or even slightly beat MLPs on peak performance—especially on the breast cancer data, which had many strong, clearly separating features. But across all shuffles and architectures, CNNs showed much wider swings in performance, with a worrying tendency toward occasional very poor runs. Their success or failure depended heavily on arbitrary choices: how the columns were ordered, how large each scanning window was, and how many filters and final-layer nodes the network used. Larger scanning windows, which mix many neighboring features together, consistently hurt both average performance and stability on these non‑sequential inputs.
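To make the role of column order and window size concrete, the short sketch below lists which features a one-dimensional scanning window would group together. It assumes a stride of one and no padding, and the feature names beyond age and blood pressure are invented for illustration; the study's actual layer settings are not given in this summary.

```python
def kernel_windows(column_order, kernel_width):
    """Feature groups seen by a 1-D kernel of the given width (stride 1, no padding)."""
    if not 1 <= kernel_width <= len(column_order):
        raise ValueError("kernel_width must be between 1 and the number of columns")
    last_start = len(column_order) - kernel_width
    return [tuple(column_order[i:i + kernel_width]) for i in range(last_start + 1)]


# Hypothetical column names; only the ordering matters for this illustration.
features = ["age", "blood_pressure", "cholesterol", "heart_rate", "glucose"]
shuffled = ["glucose", "age", "heart_rate", "cholesterol", "blood_pressure"]

for width in (2, 4):
    print("width", width, "original:", kernel_windows(features, width))
    print("width", width, "shuffled:", kernel_windows(shuffled, width))
```

With a width of 2, each window pairs two columns that happen to sit side by side, so a shuffle changes which pairs the network can examine together. With a width of 4, nearly every feature lands in every window, and the kernel blends features whose adjacency carries no meaning.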
Why simpler models often behaved better
MLPs, by contrast, were far less sensitive to column order. Because they do not rely on local neighborhoods, shuffling the features did not change what the model could, in principle, learn. When the researchers increased the number of neurons in the MLP’s hidden layer, its performance steadily improved and often surpassed that of CNNs, despite using fewer total parameters. Datasets with clearly informative features tended to yield high and stable scores for both models, but CNNs still carried a higher risk of occasional collapse. On harder datasets dominated by weaker signals, CNN performance varied wildly with architecture choices, while MLPs remained comparatively steady.
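The claim that shuffling does not change what an MLP can learn can be checked with a toy calculation. In the sketch below, which uses made-up numbers and a single hidden unit with a ReLU activation (an assumption, since the summary does not name the activation), reordering the inputs and reordering the weights in the same way gives the same output. A convolutional kernel has no such escape: after the shuffle, its window simply covers a different set of features.

```python
import random


def dense_neuron(x, weights, bias):
    """One hidden-layer unit: a weighted sum over every feature, then ReLU."""
    return max(0.0, sum(w * v for w, v in zip(weights, x)) + bias)


def window_features(order, kernel_width, start=0):
    """Original feature indices covered by one kernel position."""
    return set(order[start:start + kernel_width])


rng = random.Random(42)
n = 6
x = [rng.uniform(-1, 1) for _ in range(n)]        # one patient row (made-up values)
weights = [rng.uniform(-1, 1) for _ in range(n)]  # made-up dense-layer weights
order = [3, 0, 5, 1, 4, 2]                        # an arbitrary column shuffle

x_shuffled = [x[i] for i in order]
weights_shuffled = [weights[i] for i in order]

# Dense unit: the same weights, re-indexed, reproduce the same output
# (compared with a tolerance because float addition order differs).
before = dense_neuron(x, weights, bias=0.1)
after = dense_neuron(x_shuffled, weights_shuffled, bias=0.1)
print(abs(before - after) < 1e-12)

# Convolution: the first width-3 window now covers different features.
print(window_features(list(range(n)), 3), window_features(order, 3))
```

This shows only that the set of functions available to the MLP is unaffected by column order. Individual training runs can still differ because of random initialization.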
Take‑home message for clinical AI
For medical applications that rely on spreadsheet-like records instead of images, this study concludes that CNNs can be fragile tools. Their apparent strength on some benchmarks may reflect lucky ordering of columns and particular design decisions rather than genuinely robust learning of medical patterns. MLPs, and other methods that do not assume a meaningful spatial layout, generally offered more reliable behavior across thousands of trials. For doctors, hospital data scientists, and regulators, the lesson is clear: when building AI systems on tabular health data, it is safer to prioritize stability and transparency over chasing the highest single performance number from image‑style networks that were never designed for such inputs.
Citation: Wang, C., Elgendi, M. & Shin, H. Instability and performance limits of convolutional neural networks on non-sequential medical tabular data: an empirical investigation. Sci Rep 16, 11914 (2026). https://doi.org/10.1038/s41598-026-39875-9
Keywords: medical tabular data, convolutional neural networks, multi-layer perceptron, clinical prediction models, model stability