Clear Sky Science

Open data, private learners: a de-identified student activity and performance dataset for learning analytics

Why Your Online Study Habits Matter

Every time a student logs into an online course, clicks on lecture slides, or reads a discussion post, they leave behind a trail of digital footsteps. These traces can reveal who is struggling, who is breezing through, and which teaching strategies actually help. But they are also deeply personal. This article describes a large, carefully anonymized dataset of university students’ online learning behavior that aims to unlock insights for better teaching—without exposing individual learners.

Figure 1.

From Classroom Clicks to Research Gold

The dataset comes from first-year business students at KU Leuven who took two introductory courses—Accountancy and Global Economics—over three academic years, including the COVID-19 pandemic period when much teaching moved online. The courses relied heavily on a learning management system, where students accessed readings, slides, quizzes, and discussion forums. Each interaction, such as opening a file or viewing a forum thread, was logged with a time stamp. Combined with exam results, these logs provide a rich picture of how students actually study over weeks and months, rather than just how they perform on test day.

Protecting Students While Sharing Data

Sharing this kind of information raises serious privacy concerns: raw records contain unique student identifiers, exact grades, and precise times of activity that could make it possible to re-identify individuals. To prevent this, the authors applied several layers of de-identification before publishing the dataset. Student IDs were replaced with random codes, and the link back to real identities was destroyed. Exam scores were not shared as exact numbers but grouped into broad ranges such as fail, borderline, pass, or excellent. Details about a student’s specific study program were removed, and content items in the online platform were assigned to general types like course material or assessments instead of keeping their original file names.
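
The two simplest steps described above, swapping student IDs for random codes and replacing exact scores with broad ranges, can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the field names, the 0 to 20 scale, and the cut-offs in GRADE_BANDS are assumptions made for the example.

```python
import secrets

# Hypothetical bands on an assumed 0-20 scale; the cut-offs are illustrative only.
GRADE_BANDS = [
    (0, 8, "fail"),
    (8, 10, "borderline"),
    (10, 16, "pass"),
    (16, 21, "excellent"),
]


def band_for(score):
    """Map an exact score to a coarse band label."""
    for low, high, label in GRADE_BANDS:
        if low <= score < high:
            return label
    raise ValueError(f"score out of range: {score}")


def deidentify(records):
    """Replace student IDs with random codes and exact scores with bands.

    The ID-to-code mapping exists only inside this call and is dropped on
    return, so the released rows carry no link back to the real IDs.
    """
    codes = {}
    released_rows = []
    for rec in records:
        sid = rec["student_id"]
        if sid not in codes:
            code = secrets.token_hex(8)
            while code in codes.values():  # guard against a repeated code
                code = secrets.token_hex(8)
            codes[sid] = code
        released_rows.append(
            {"student": codes[sid], "grade_band": band_for(rec["score"])}
        )
    return released_rows


# Made-up records for illustration.
records = [
    {"student_id": "A-001", "score": 7},
    {"student_id": "B-002", "score": 17},
    {"student_id": "A-001", "score": 12},
]
released = deidentify(records)
```

The same student keeps the same code across rows, so study behavior can still be followed over time, but the code itself reveals nothing about who the student is.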

Figure 2.

Blurring Details Without Losing the Story

Simply stripping names is not enough for strong privacy, so the team also adjusted how time and structure appear in the data. For example, they added a small random shift of a few seconds to each student’s time stamps. This makes it much harder to match logs to real-world events while preserving the order of actions, which is crucial for studying learning patterns. Forum posts, session identifiers, and content IDs were all renumbered randomly. The researchers then checked how anonymous the result really was using a standard measure called k-anonymity, which counts how many students share the same combination of characteristics: a dataset is k-anonymous when every such combination is shared by at least k students. In most cases, the transformed data made individuals blend into larger groups, strengthening privacy protection.
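
As a rough sketch of the two ideas in this section, the code below shifts time stamps and measures k-anonymity. It reads the time shift as one random offset per student, which is only one possible interpretation of the description, and it is not the authors' code. The function names, the 5-second limit, and the sample rows are all assumptions made for the example.

```python
import random
from collections import Counter
from datetime import datetime, timedelta


def shift_timestamps(events, max_shift_seconds=5, rng=None):
    """Shift every time stamp of a student by one random per-student offset.

    Using a single offset per student (an assumption of this sketch) keeps
    the order of that student's actions and the gaps between them intact.
    """
    if rng is None:
        rng = random.Random()
    offsets = {}
    shifted_events = []
    for student, ts in events:
        if student not in offsets:
            seconds = rng.uniform(-max_shift_seconds, max_shift_seconds)
            offsets[student] = timedelta(seconds=seconds)
        shifted_events.append((student, ts + offsets[student]))
    return shifted_events


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for the chosen characteristics (rows must be non-empty)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Made-up activity log: (student code, time of action).
events = [
    ("s1", datetime(2021, 3, 1, 9, 0, 0)),
    ("s1", datetime(2021, 3, 1, 9, 0, 42)),
    ("s2", datetime(2021, 3, 1, 9, 0, 10)),
]
shifted = shift_timestamps(events, rng=random.Random(0))

# Made-up student rows, one per student.
rows = [
    {"course": "Accountancy", "grade_band": "pass"},
    {"course": "Accountancy", "grade_band": "pass"},
    {"course": "Accountancy", "grade_band": "fail"},
    {"course": "Global Economics", "grade_band": "pass"},
    {"course": "Global Economics", "grade_band": "pass"},
]
k_fine = k_anonymity(rows, ["course", "grade_band"])  # 1: one student is unique
k_coarse = k_anonymity(rows, ["course"])              # 2: smallest course group
```

The comparison between k_fine and k_coarse shows why coarser categories help: the fewer distinct combinations exist, the larger the groups in which each student can hide.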

Does the Data Still Tell the Truth?

Of course, anonymization is only useful if the data remains realistic enough to support solid research. To test this, the authors rebuilt dozens of learning features that earlier studies had used to detect unusual study patterns and predict exam success. These features include how often students log in, how evenly they spread their study sessions over the semester, and how actively they use forums. The team compared the distribution of each feature in the original and de-identified data using statistical tests. In nearly all cases, the two versions were statistically indistinguishable, meaning the privacy steps did not distort the overall story of how students study online. Minor differences came mostly from improved categorization of content types, not from the privacy measures themselves.
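
The article does not say which statistical tests were used. As a stand-in, the sketch below compares one feature across two versions of a dataset with a two-sample Kolmogorov-Smirnov check, using the standard large-sample critical value. The function names and the sample numbers are made up for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical distribution functions of a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:  # the largest gap always occurs at an observed value
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap


def distributions_differ(a, b, alpha=0.05):
    """Large-sample KS decision rule: True if the gap exceeds the critical value."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((len(a) + len(b)) / (len(a) * len(b)))
    return ks_statistic(a, b) > threshold


# Made-up feature values, e.g. number of logins per student.
original_logins = list(range(1, 21))
deidentified_logins = list(range(1, 21))            # feature survives unchanged
distorted_logins = [x + 100 for x in original_logins]  # feature badly distorted

same = distributions_differ(original_logins, deidentified_logins)
changed = distributions_differ(original_logins, distorted_logins)
```

Here same is False and changed is True. A feature that passes this kind of check in the de-identified data behaves, in distribution, like its counterpart in the original data.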

What Researchers Can Do With It

Because the dataset covers two different courses and three years—including the major disruption of the pandemic—it can be used to examine how well findings hold up across subjects, cohorts, and changing conditions. The fine-grained time information supports process-mining studies that trace typical pathways through course materials, while rich forum records can underpin social network analyses of peer interaction. The authors also provide code for rebuilding learning features, making it easier to compare new models and methods against existing work and to explore explainable artificial intelligence in education.

Opening Doors Without Opening Identities

In everyday terms, this article shows that it is possible to learn a lot from how students click and scroll through online courses without exposing who they are. By thoughtfully masking personal details while preserving the patterns that matter, the authors offer a public resource that can help universities understand and improve learning at scale. For students, that could mean smarter support and more responsive teaching—built on data, but not at the cost of their privacy.

Citation: Tiukhova, E., Van Landuyt, D., Baesens, B. et al. Open data, private learners: a de-identified student activity and performance dataset for learning analytics. Sci Data 13, 548 (2026). https://doi.org/10.1038/s41597-026-06821-3

Keywords: learning analytics, student privacy, educational data, online learning, data anonymization