Big Brother isn’t always as clever as we think
New research questions the value of digital surveillance and big data. Sometimes traditional and less privacy-invasive data can predict human behaviour much more effectively, a study conducted among university students reveals.
We constantly leave small digital footprints when we use our smartphones, the Internet and other digital technologies: a stream of private information that not only shows where we have been and what we have been doing, but may also reveal our interests and future behaviour.
For many years now, social media and online advertisers have utilised such ‘big data’ to tap into our dreams, and digital footprints are now being used more and more extensively to predict human behaviour.
But do the benefits of this digital surveillance always justify its use? No, conclude a group of researchers from the University of Copenhagen and the Technical University of Denmark, who studied the extent to which digital surveillance of students’ study life at university can be used to predict their examination performance.
The results were not impressive.
“Using simple and less sensitive statistical data types we were able to develop models that more accurately predicted students’ performance. This came as a big surprise to us, considering the fact that big data utilisation is becoming more and more widespread today,” says Assistant Professor Andreas Bjerre-Nielsen from the Department of Economics and the Copenhagen Center for Social Data Science (SODAS) at the University of Copenhagen.
Surveillance of students at the Technical University of Denmark
In the study, the researchers monitored and mapped the study behaviour of just over 500 students at the Technical University of Denmark using data from their mobile phones, which among other things identified their location on campus and which students they socialised with. This was combined with information on the courses they took and their academic performance. The digital surveillance was supplemented with questionnaire surveys which painted a picture of the individual students’ personality traits.
But when the researchers fed this extensive set of private data into sophisticated machine learning algorithms (artificial intelligence), the resulting models were able to predict students’ exam performance only to a rather limited extent.
In fact, the study shows that something as simple as the students’ exam results from primary and secondary school offered a far more precise indication of their future performance at university (see figure).
Figure: Different data types’ ability to predict students’ academic performance
In the study, big data were able to predict whether students’ examination results were in the top, middle or bottom group with an accuracy of around 43 per cent. That is only slightly better than random guessing, which was correct 33 per cent of the time.
Conversely, the model was correct 58 per cent of the time when based on simple data such as students’ exam results from primary and secondary school and information about their social background. Just as surprising was the fact that combining such traditional data with big data did not increase the accuracy of the model.
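To put these percentages side by side, the short Python sketch below works out how far each model sits above random guessing. It uses only the figures quoted above; the function names are illustrative, and treating chance as exactly one third (three equally sized groups) is an assumption based on the 33 per cent figure.

```python
# Accuracies quoted in the article, written as fractions.
CHANCE = 1 / 3        # assumption: three equally sized groups, i.e. about 33 per cent
BIG_DATA = 0.43       # surveillance-style big data
TRADITIONAL = 0.58    # past exam results and social background


def points_above_chance(accuracy, chance=CHANCE):
    """Percentage points by which a model beats random guessing."""
    return round((accuracy - chance) * 100, 1)


def share_of_gap_closed(accuracy, chance=CHANCE):
    """Share of the distance from chance to a perfect score that the model covers."""
    return round((accuracy - chance) / (1 - chance), 3)


for name, acc in [("big data", BIG_DATA), ("traditional data", TRADITIONAL)]:
    print(name, points_above_chance(acc), share_of_gap_closed(acc))
```

On these numbers the big data model is roughly 10 percentage points above chance, while the traditional data model is roughly 25 points above it.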
According to Professor Sune Lehmann from the Technical University of Denmark and SODAS, who is also one of the authors of the scientific article on the study, these results point to an aspect of big data utilisation that little research has focussed on: namely, whether big data always constitutes good, relevant data.
“Big behavioural data sets offer mediocre answers to a broad range of questions. But if we are only interested in answering a few, well-defined questions, using data that is relevant to the specific questions may be a better and easier solution,” Sune Lehmann says.
He draws a parallel to the world of sports:
“If I wanted to predict how fast you can run 100 metres, I could do all kinds of blood samples, muscle biopsies and strength tests, but if you have run the same distance in the past, I would probably be better off basing my predictions on your past performance.”
Questions the value of data and surveillance
According to Andreas Bjerre-Nielsen, the results are thought-provoking, because they question the value of the increasing digital surveillance and big data utilisation in society.
The ability to harvest and utilise digital data has increased considerably with the onward march of digital technologies. Today, private companies can track consumption and other behavioural patterns in great detail, the public sector can spot citizens at risk of falling ill or becoming long-term unemployed, and schools and educational institutions are able to monitor the students’ movements and study-related activities.
“Digital surveillance can be effective. For example, algorithms can prevent credit card fraud by monitoring card holders’ consumption patterns. But sometimes they fall short. And we should take this into account before introducing complex surveillance systems that violate people’s privacy,” says Andreas Bjerre-Nielsen.
“The study therefore tells us that we should determine to what extent digital surveillance actually works and whether less sensitive data are in fact more relevant. For instance, the study demonstrates that looking at students’ past examination marks is a more effective method than surveillance if you want to know which students are most at risk of struggling academically.”
The study, ‘Task-specific information outperforms surveillance-style big data in predictive analytics’, has just been published in PNAS, which is one of the world’s most cited scientific journals.
About the study
The results of the study have just been published in an article titled ‘Task-specific information outperforms surveillance-style big data in predictive analytics’ in the journal PNAS (Proceedings of the National Academy of Sciences of the United States of America).
Besides Andreas Bjerre-Nielsen, the study was conducted and the article co-authored by Professor Sune Lehmann from the Technical University of Denmark and SODAS/University of Copenhagen, Valentin Kassarnig from the Graz University of Technology and Prorector and Professor David Dreyer Lassen from the University of Copenhagen.
The study is based on data from 521 students from the Technical University of Denmark, who over a period of two years were given mobile phones that monitored in detail their behaviour on campus and their social networks. This was supplemented with data from questionnaire surveys.
Using algorithms and machine learning, this data was used to predict the students’ place in three performance groups (top, middle and bottom, depending on their exam results). The same analysis was conducted using more traditional, administrative data such as primary and secondary school examination results and social background variables.
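The three-group setup described above can be illustrated with a small Python sketch. It is not the researchers’ method, which relied on machine learning models. Instead it applies a deliberately naive rule: each student is predicted to land in the same third at university as in their earlier school grades. The function names and all grade values are hypothetical and exist only to show how the groups and the accuracy figure are formed.

```python
def tercile_labels(scores):
    """Label each score 'bottom', 'middle' or 'top' by rank, in three near-equal groups."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices from lowest to highest score
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = ("bottom", "middle", "top")[min(rank * 3 // n, 2)]
    return labels


def accuracy(predicted, actual):
    """Fraction of students whose predicted group matches their actual group."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy data, one entry per student (not taken from the study).
prior_grades = [4.0, 7.5, 10.2, 6.1, 11.4, 5.0, 8.8, 9.5, 3.2]
university_grades = [3.5, 8.0, 7.8, 7.2, 11.0, 4.1, 9.2, 10.1, 5.0]

actual = tercile_labels(university_grades)
predicted = tercile_labels(prior_grades)  # naive rule: same third as in earlier schooling

print(f"accuracy: {accuracy(predicted, actual):.2f}")
```

With three equally sized groups, guessing at random is right about one time in three, which is the baseline any such accuracy figure has to be compared with.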
Sune Lehmann, who is also an Affiliate Professor with the Department of Sociology, has written about the study on his personal blog: Big data vs the right data: Thoughts on a recently completed trilogy