Clear Sky Science

A privacy preserving synthetic learner dataset for learning analytics in technology enhanced higher education

2026-03-23

Why student data without privacy risks matter

Universities now collect huge amounts of click‑by‑click information about how students learn online, from logins and video views to forum posts and quiz scores. These data could help teachers spot struggling students early and design better courses, but sharing them outside a campus is tightly limited by privacy law and ethics. This article describes a new way to unlock that value: a large, realistic but completely fake student dataset that aims to protect individuals while still supporting serious research.

Figure 1. How fake student records can mimic real learning data while protecting privacy

The idea of safe look‑alike student records

The study introduces SynEdu‑HEDL, a collection of 20,000 artificial student records built to resemble real university data without including any actual learner. Each record bundles together background information, week‑by‑week online activity over a 16‑week term, and final course results. The goal is for patterns that matter for education to survive in this invented data, such as how steady engagement relates to grades, while any trace of a real student is washed out. By releasing this dataset openly, the author hopes to give researchers a common playground for testing ideas without ever touching sensitive records.
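As a rough illustration of that three-part record, the sketch below models one synthetic learner in Python. The 16-week term length comes from the article; every class name, field name and value scale is a hypothetical stand-in, and the released dataset may well use different columns.

```python
from dataclasses import dataclass
from typing import List

WEEKS_PER_TERM = 16  # term length stated in the article


@dataclass
class WeeklyActivity:
    # Hypothetical activity fields; the released column names may differ.
    logins: int
    video_views: int
    forum_posts: int
    quiz_score: float  # assumed 0-100 scale


@dataclass
class SyntheticLearner:
    learner_id: str      # arbitrary label, not tied to any real person
    age_band: str        # static background, e.g. "18-21" (assumed format)
    major: str           # static background
    weeks: List[WeeklyActivity]
    final_grade: float   # assumed 0-100 scale
    passed: bool

    def __post_init__(self) -> None:
        # Reject records that do not cover the full term.
        if len(self.weeks) != WEEKS_PER_TERM:
            raise ValueError(
                f"expected {WEEKS_PER_TERM} weeks, got {len(self.weeks)}"
            )

    def total_logins(self) -> int:
        """Sum of logins across all weeks of the term."""
        return sum(week.logins for week in self.weeks)


def example_learner() -> SyntheticLearner:
    """Build one made-up record with identical activity every week."""
    weeks = [
        WeeklyActivity(logins=3, video_views=2, forum_posts=1, quiz_score=70.0)
        for _ in range(WEEKS_PER_TERM)
    ]
    return SyntheticLearner("S00001", "18-21", "Biology", weeks, 72.5, True)
```

Keeping the weekly activity as an ordered list, rather than as term totals, is what would let an analysis look at the shape of engagement over time.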

How the synthetic students are created

To build SynEdu‑HEDL, the researcher first worked with a large public university that already tracks rich online learning activity across hundreds of courses. After strict ethical review, the real data were cleaned, simplified and stripped of direct identifiers. A three‑part generation pipeline then produced the synthetic records. One part of the system focuses on static information like age band or major, another learns how study behaviors change over the weeks of a term, and a third makes sure that behavior and outcomes still move together sensibly. Throughout, the system adds carefully calibrated randomness so that no single person’s trail can be reconstructed, while typical learning paths remain visible.

Figure 2. How patterns in real study behavior are transformed into privacy-safe synthetic data

Keeping privacy strong while staying useful

Protecting privacy is more than removing names. The author tested SynEdu‑HEDL against a battery of simulated attacks that try either to guess whether a particular student was in the original data or to reconstruct that student’s profile. These attacks did no better than random guessing, and formal mathematical checks show that the dataset stays within a strict, formally defined limit on privacy risk. At the same time, the author compared hundreds of statistics between the real and synthetic data. Basic distributions, relationships between variables and the shapes of engagement over time all lined up closely, including rare but important patterns like sudden drops in activity before a failure.
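To make "no better than random guessing" concrete, here is a minimal sketch of one common kind of membership test, a distance-to-closest-record attack scored by AUC. It is not the attack suite used in the study, which this summary does not detail. The function names and the three-feature toy records are assumptions made purely for illustration.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def nearest_distance(record: Vector, pool: List[Vector]) -> float:
    """Euclidean distance from `record` to its closest neighbour in `pool`."""
    return min(math.dist(record, other) for other in pool)


def attack_auc(members: List[Vector], non_members: List[Vector],
               synthetic: List[Vector]) -> float:
    """AUC of a distance-to-closest-record membership attack.

    The attacker guesses "member" when a record lies unusually close to some
    synthetic record. A value near 0.5 means the attack does no better than
    random guessing; 1.0 means members are perfectly separable.
    """
    member_scores = [-nearest_distance(r, synthetic) for r in members]
    outsider_scores = [-nearest_distance(r, synthetic) for r in non_members]
    wins = 0.0
    # Pairwise comparison: how often does a member outscore a non-member?
    for m in member_scores:
        for o in outsider_scores:
            if m > o:
                wins += 1.0
            elif m == o:
                wins += 0.5
    return wins / (len(member_scores) * len(outsider_scores))


def toy_records(n: int, rng: random.Random) -> List[Vector]:
    """Hypothetical 3-feature records drawn from one shared distribution."""
    return [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    members = toy_records(200, rng)       # stand-in for training records
    non_members = toy_records(200, rng)   # stand-in for held-out records
    safe_synthetic = toy_records(200, rng)        # fresh draws, no copying
    leaky_synthetic = [list(r) for r in members]  # verbatim copies
    print("safe :", attack_auc(members, non_members, safe_synthetic))
    print("leaky:", attack_auc(members, non_members, leaky_synthetic))
```

In this toy setup the leaky generator, which simply copies its training records, gives the attacker a perfect score, while independently drawn synthetic records should leave the score close to 0.5.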

Can researchers trust results from fake data?

To see whether the synthetic records are actually useful, the study rebuilt common learning analytics tools using SynEdu‑HEDL and then tested them on real students. Early warning models trained on synthetic data were almost as accurate at identifying at‑risk students as models trained directly on real data, often within a few percentage points. Cluster analyses still found meaningful groups of learners, and models that predict grades or estimate the effect of teaching changes behaved similarly. Perhaps most striking, when models were first trained on SynEdu‑HEDL and then lightly adjusted with only a small slice of real data, their performance jumped sharply, a promising sign for colleges that cannot easily share or pool full datasets.
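The evaluation pattern described here, training on synthetic records and scoring on real ones, can be sketched in a few lines. The example below swaps in a deliberately simple one-feature threshold rule in place of the study's early warning models, and all numbers are invented toy values rather than results from the paper.

```python
from typing import List, Tuple

Dataset = Tuple[List[float], List[int]]  # (engagement values, at-risk labels)


def accuracy(threshold: float, engagement: List[float],
             at_risk: List[int]) -> float:
    """Share of learners whose flag (engagement < threshold) matches the label."""
    hits = sum(int(x < threshold) == y for x, y in zip(engagement, at_risk))
    return hits / len(at_risk)


def fit_threshold(engagement: List[float], at_risk: List[int]) -> float:
    """Pick the cut-off that best separates at-risk learners in training data."""
    values = sorted(set(engagement))
    # Candidate cut-offs: midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        raise ValueError("need at least two distinct engagement values")
    return max(candidates, key=lambda t: accuracy(t, engagement, at_risk))


def compare_training_sources(synthetic: Dataset, real_train: Dataset,
                             real_test: Dataset) -> Tuple[float, float]:
    """Return (synthetic-trained, real-trained) accuracy on the same real test set."""
    from_synthetic = fit_threshold(*synthetic)
    from_real = fit_threshold(*real_train)
    return (accuracy(from_synthetic, *real_test),
            accuracy(from_real, *real_test))


if __name__ == "__main__":
    # Toy mean weekly engagement scores; label 1 marks an at-risk learner.
    synthetic = ([1.0, 2.0, 8.0, 9.0], [1, 1, 0, 0])
    real_train = ([1.5, 3.0, 6.0, 9.5], [1, 1, 0, 0])
    real_test = ([2.0, 4.8, 7.0, 10.0], [1, 0, 0, 0])
    print(compare_training_sources(synthetic, real_train, real_test))
    # prints (0.75, 1.0)
```

The quantity of interest is the gap between the two accuracies: the smaller it is, the more a model built only on synthetic records can stand in for one built on real data.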

What this means for future learning research

For readers, the key takeaway is that we may no longer have to choose between protecting students and advancing knowledge about how they learn. SynEdu‑HEDL shows that it is possible to build a detailed, shareable stand‑in for real educational data that keeps individual students safe while still supporting serious analysis. By making this synthetic dataset and its code freely available, the work offers a practical tool for open, reproducible studies and a template for other institutions. If widely adopted and refined, such privacy‑aware synthetic data could help educators worldwide test new ideas, improve support for vulnerable students and compare approaches across campuses without exposing anyone’s personal history.

Citation: Agal, S. A privacy preserving synthetic learner dataset for learning analytics in technology enhanced higher education. Sci Rep 16, 14772 (2026). https://doi.org/10.1038/s41598-026-44990-8

Keywords: learning analytics, synthetic data, student privacy, higher education, educational data