Beware of big data – science says it's not as reliable as you think

The amount of honey produced by US bees rises and falls according to the number of people who drowned after falling out of fishing boats. Implausible? That's big data

[File photo: students at an assistive technology hackathon organised by the Al Noor Special Needs Centre in Dubai, October 2017. Antonie Robertson / The National]

All this month, students, business owners and entrepreneurs from across the Emirates have been competing to improve our lives using computers.

They are taking part in this year’s UAE Hackathon, a coding-fest with the official theme of using data “for happiness and quality of life”.

It’s a laudable aim – and potentially a profitable one. Many businesses now rely on analysis of “big data” to improve service reliability and customer satisfaction. Entertainment behemoth Netflix is even said to use big data analysis to spot the best proposals for new dramas.

This year’s Hackathon has targeted challenges that range from predicting healthcare needs to finding ways of improving fire safety.

The winners will be announced on March 3, and what they find is sure to be intriguing. But the competition comes at a time of growing concern about the reliability of big data – and the suspicion that digging into it produces fool’s gold.

Earlier this month, a leading expert in the field became the latest to warn of unreliable findings emerging from so-called machine learning, in which computers make predictions based on patterns in huge data sets.

At the annual meeting of the American Association for the Advancement of Science in Washington, Professor Genevera Allen of Rice University, Texas, said such methods are producing questionable findings in the search for better cancer treatment.

“A lot of these techniques are designed to always make a prediction,” she said. “They never come back with 'I don't know' or 'I didn't discover anything', because they aren't made to.”

From business to finance to academia, there’s growing disillusionment with the ability of big data to deliver on its promise.

If there’s one bit of statistical theory everyone knows, it’s that the more data you have, the more reliable the insights. And that’s what big data supplies in colossal quantities.

Opinion pollsters have long used samples of about 1,000 people to forecast the outcome of elections – in theory, enough to pin down the answer to within a few percentage points. Now even hotel chains and supermarkets hold data on millions of people, potentially offering far more precise insights.
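To see why a sample of 1,000 already does so well – and why millions should, on paper, do far better – it helps to run the textbook margin-of-error calculation. The short sketch below is illustrative only, and assumes the ideal case of a clean, simple random sample; it uses the standard 95 per cent formula of 1.96 × √(p(1 − p)/n).

```python
import math

# Rough 95% margin of error for a simple random sample of size n,
# using the textbook formula 1.96 * sqrt(p * (1 - p) / n).
# The worst case is p = 0.5, i.e. an evenly split question.
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 1_000_000):
    print(f"n = {n:>9,}: +/- {margin_of_error(n):.1%}")

# n =     1,000: +/- 3.1%
# n =    10,000: +/- 1.0%
# n = 1,000,000: +/- 0.1%
```

A million records promises a margin of a tenth of a percentage point – but only if the sample really is random, which is where big data so often falls down.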

But while finding patterns in big data sets is easy, making sure they’re reliable is not.

That’s because sheer quantity isn’t enough. For the statistical theory to work its magic, the data should be a clean, unbiased random sample of whatever is out there. And failure to pay attention to these “terms and conditions” can lead to disaster – as the designers of the first big data experiment found out more than 80 years ago.

In 1936 the American magazine Literary Digest attempted to predict the outcome of that year's presidential race. To do it, the organisers sent out 10 million postcards to gauge support for challenger Alfred Landon against the incumbent Franklin D Roosevelt.

More than 2 million people responded, leading to a clear prediction: Landon would beat FDR by a hefty 57 per cent to 43 per cent. And with so colossal a sample, there seemed little reason to doubt the prediction.

Sure enough, the election resulted in a landslide – but to FDR, who won with 61 per cent of the vote.

What had gone wrong? The organisers had been careful to send the cards out to a representative mix of the US population, to avoid getting a biased result. But they had made a critical mistake – returning the cards was entirely voluntary. As a result, the magazine heard back only from people motivated to reply, and Landon supporters proved much keener to do so than FDR's.

Known as responder bias, this remains a key challenge for pollsters even today. Get it wrong, and the sample is no longer random – undermining the statistical theory that makes it trustworthy.
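The Literary Digest fiasco is easy to reproduce in miniature. The simulation below is purely illustrative: it takes a two-way split of roughly 61 per cent for FDR and 39 per cent for Landon as the “true” electorate, and simply assumes – the response rates are invented, not historical – that Landon supporters were twice as likely to return their card.

```python
import random

random.seed(2019)

# Two-way split based on the article's figures: FDR ~61%, Landon ~39%.
TRUE_LANDON_SHARE = 0.39

# Invented response rates, purely for illustration: Landon supporters
# are assumed to be twice as keen to post their card back.
RESPONSE_RATE = {"Landon": 0.30, "FDR": 0.15}

POPULATION = 1_000_000
replies = {"Landon": 0, "FDR": 0}

for _ in range(POPULATION):
    voter = "Landon" if random.random() < TRUE_LANDON_SHARE else "FDR"
    if random.random() < RESPONSE_RATE[voter]:
        replies[voter] += 1

total = sum(replies.values())
print(f"{total:,} replies; Landon polls at {replies['Landon'] / total:.0%}")
# Around 56% for Landon – despite his true support being only 39%.
```

Even with a couple of hundred thousand replies, the self-selected sample puts Landon in the mid-fifties – close to the Digest's doomed 57 per cent, and nowhere near the real result.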

Bias is not the only concern. The way computers decide whether patterns are genuine or just flukes is also under intense scrutiny. Researchers are now racing to develop ways of checking the reliability of big data findings, and one approach is to test their plausibility against what is already known.

For example, computer analysis has shown that in the decade leading up to 2009, the amount of honey produced by US bees rose and fell according to the number of people who drowned after falling out of fishing boats. The “link” is strong statistically – but implausible.
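Coincidences like this are all but guaranteed once a computer trawls through enough variables measured over only a handful of years. The sketch below uses made-up random numbers in place of honey yields and drownings; it simply shows how strong a “link” pure chance can produce when a decade of annual figures is compared against 1,000 unrelated series.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)

N_YEARS = 10        # a decade of annual figures, as in the honey example
N_SERIES = 1_000    # unrelated variables the computer gets to trawl through

honey = [random.gauss(0, 1) for _ in range(N_YEARS)]   # made-up 'honey' data
best_r = max(
    abs(pearson(honey, [random.gauss(0, 1) for _ in range(N_YEARS)]))
    for _ in range(N_SERIES)
)
print(f"Strongest 'link' found by chance alone: r = {best_r:.2f}")
# Typically well above 0.8 – strong-looking, but entirely meaningless.
```

With only ten data points per series, a correlation above 0.8 routinely turns up by chance alone – which is why such “links” need a plausibility check before anyone takes them seriously.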

Many links found in big data are not so easily challenged, however. Researchers are now trying to stem the flow of nonsensical findings by getting computers to separate real insights from meaningless flukes.

Like the participants in the UAE Hackathon, they are in a race against time. We should all hope they beat the cheerleaders of big data to the finish line.

Robert Matthews is Visiting Professor of Science at Aston University, Birmingham, UK