27 April 2021

Big Brother isn’t always as clever as we think


New research questions the value of digital surveillance and big data. Sometimes traditional and less privacy-invasive data can predict human behaviour much more effectively, a study conducted among university students reveals.

Digital monitoring of students’ life at DTU can predict their academic performance only to a limited extent, a new study shows. Photo: DTU Compute

We constantly leave small digital footprints when we use our smartphones, the Internet and other digital technologies. Together they form a stream of private information that not only shows where we have been and what we have been doing, but may also reveal our interests and future behaviour.

For many years now, social media and online advertisers have utilised such ‘big data’ to tap into our dreams, and digital footprints are now being used more and more extensively to predict human behaviour.

But does this digital surveillance always deliver the expected benefits? No, conclude a group of researchers from the University of Copenhagen and the Technical University of Denmark, after studying the extent to which digital surveillance of students’ study life at university can be used to predict their examination performance.

The results were not impressive.

“Using simple and less sensitive statistical data types, we were able to develop models that more accurately predicted students’ performance. This came as a big surprise to us, considering the fact that big data utilisation is becoming more and more widespread today,” says Assistant Professor Andreas Bjerre-Nielsen from the Department of Economics and the Copenhagen Center for Social Data Science (SODAS) at the University of Copenhagen.

Surveillance of students at the Technical University of Denmark

In the study, the researchers monitored and mapped the study behaviour of just over 500 students at the Technical University of Denmark using data from their mobile phones, which among other things identified their location on campus, the courses they took, which students they socialised with and their academic performance. The digital surveillance was supplemented with questionnaire surveys which painted a picture of the individual students’ personality traits.

But when this extensive set of private data was fed into sophisticated machine learning algorithms (artificial intelligence), the resulting models were able to predict students’ exam performance only to a rather limited extent.

In fact, the study shows that something as simple as the students’ exam results from primary and secondary school offered a far more precise indication of their future performance at university (see figure).

Figure: Different data types’ ability to predict students’ academic performance

Using so-called violin diagrams (illustrating uncertainty), the figure shows the extent to which the different data types are able to predict university students’ academic performance. Administrative data, which include upper secondary school examination marks, among other things, are far more precise than big data.

In the study, big data were able to predict whether students’ examination results were in the top, middle or bottom with an accuracy of around 43 per cent, which is only slightly better than random guessing, which was correct 33 per cent of the time.

Conversely, the model was correct 58 per cent of the time when based on simple data such as students’ exam results from primary and secondary school and information about their social background. Just as surprising was the fact that combining such traditional data with big data did not increase the accuracy of the model.
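To make the comparison with random guessing concrete, the following minimal sketch shows why guessing among three equally sized groups (top, middle, bottom) is right about a third of the time, and how far above that baseline the two reported accuracies sit. It is an illustration only, not the researchers’ method: the grades are synthetic, and the names `tercile_labels`, `accuracy`, `reported` and `gain_over_chance` are hypothetical. Only the figures 43 and 58 per cent come from the study as reported here.

```python
import random


def tercile_labels(scores):
    """Label each score 0 (bottom), 1 (middle) or 2 (top) by rank tercile."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * 3 // len(scores)
    return labels


def accuracy(predicted, actual):
    """Share of predictions that match the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


rng = random.Random(0)

# Synthetic grades (an assumption for illustration, not data from the study).
grades = [rng.gauss(7, 3) for _ in range(30_000)]
actual = tercile_labels(grades)

# Random guessing: pick one of the three groups uniformly at random.
guesses = [rng.randrange(3) for _ in actual]
baseline = accuracy(guesses, actual)
print(f"random guessing: {baseline:.2f}")  # close to one third

# Accuracies reported in the article, compared with the 1/3 chance level.
reported = {"big data": 0.43, "administrative data": 0.58}
gain_over_chance = {name: round(acc - 1 / 3, 2) for name, acc in reported.items()}
print(gain_over_chance)
```

Seen this way, the big data model gains roughly 10 percentage points over chance, while the model based on administrative data gains roughly 25.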

According to Professor Sune Lehmann from the Technical University of Denmark and SODAS, who is also one of the authors of the scientific article on the study, these results point to an aspect of big data utilisation that little research has focussed on: namely, whether big data always constitutes good, relevant data.

“Big behavioural data sets offer mediocre answers to a broad range of questions. But if we are only interested in answering a few, well-defined questions, using data that is relevant to the specific questions may be a better and easier solution,” Sune Lehmann says.

He draws a parallel to the world of sports:

“If I wanted to predict how fast you can run 100 metres, I could do all kinds of blood samples, muscle biopsies and strength tests, but if you have run the same distance in the past, I would probably be better off basing my predictions on your past performance.”

Questions the value of data and surveillance

According to Andreas Bjerre-Nielsen, the results are thought-provoking, because they question the value of the increasing digital surveillance and big data utilisation in society.

The ability to harvest and utilise digital data has increased considerably with the onward march of digital technologies. Today, private companies can track consumption and other behavioural patterns in great detail, the public sector can spot citizens at risk of falling ill or becoming long-term unemployed, and schools and educational institutions are able to monitor the students’ movements and study-related activities.

“Digital surveillance can be effective. For example, algorithms can prevent credit card fraud by monitoring card holders’ consumption patterns. But sometimes they fall short. And we should take this into account before introducing complex surveillance systems that violate people’s privacy,” says Andreas Bjerre-Nielsen.

“The study therefore tells us that we should determine the extent to which digital surveillance actually works and whether less sensitive data are in fact more relevant. The study demonstrates, for example, that looking at students’ past examination marks is a more effective method than surveillance if you want to know which students are most at risk of struggling academically.”

The study, ‘Task-specific information outperforms surveillance-style big data in predictive analytics’, has just been published in PNAS, one of the world’s most cited scientific journals.