Big Data faces up to patterns that aren't

What if the details of your life picked out by data-miners are nothing more than a statistical mirage?

Every time we go online, use our credit card or make a phone call, we reveal something about ourselves, bit by electronic bit. Extracting revelations from those bits has become a multibillion-dollar industry called data-mining, whose challenges were the focus of a major conference held in Dubai last month.

Multinationals throughout the region increasingly rely on data-mining to find out what we're doing, thinking and feeling. It helps them tailor their stock more accurately and target their adverts and junk mail more precisely.

By turning every website into a potential font of insights, data-mining has also helped keep many web services free - but at a cost to privacy many find intolerable.

Yet for big business, the allure remains irresistible. The sheer quantity of what is being called Big Data is made clear in a report by the conference organisers, the International Data Corporation (IDC).

It estimates the amount of information created and replicated last year alone was around 2 zettabytes - the equivalent of 2 billion 1,000-gigabyte computer hard disks.

Much of that information is generated by people like you and me. But, scarily, the report points out that the amount we are creating is dwarfed by the amount of information being created about us.

Small wonder many people feel uneasy about the uses to which it is being put by huge, faceless corporations. Only last week the social media company Twitter became embroiled in controversy when it emerged that it is selling tweets dating back years to data miners. The standard defence about boosting business efficiency and keeping the web free does not impress everyone, however.

Among the sceptics is the co-inventor of the web, Sir Tim Berners-Lee. "I want to know if I look up a whole lot of books about some form of cancer that that's not going to get to my insurance company and I'm going to find my insurance premium is going to go up by 5 per cent because they've figured I'm looking at those books," he told the BBC in 2008.

That sentiment contains a deeper, and potentially much more worrying, concern about data mining.

Suppose Sir Tim was checking out the latest information about cancer not for himself, but for a friend? Or that his interest had been piqued by wanting to give money to a cancer charity? How could data miners know Sir Tim's motivation, and thus reach the right conclusion?

This is one of the many challenges facing data miners, who constantly risk seeing significance in meaningless patterns - a phenomenon long recognised by psychologists and known as apophenia.

A notorious example of it dogged the US space agency Nasa for decades. In 1976, one of its Mars probes sent back a picture that appeared to show the image of an alien on the Red Planet. Known as the "Face on Mars", it provoked controversy for 25 years, until much more detailed images sent by a later probe revealed the "face" to be a rocky outcrop.

The computers that trawl through huge data-files can also fall prey to a high-tech version of apophenia.

Of course, the companies that run them go to great lengths to avoid being fooled. For example, they use "significance tests" that compare patterns found in the data to what would be expected by chance alone. These tests have been used for years by scientists, but statisticians have long warned about their unreliability.

To illustrate the dangers, some years ago two researchers at the University of Bristol, England, went through a database containing 133 different variables, looking for "statistically significant" connections between them.

The first shock - at least, for non-mathematicians - was just how many different pairings even this relatively modest number of variables allows: almost 8,800.
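The figure is easy to check for yourself: the number of distinct pairs that can be drawn from n variables is n(n-1)/2. A minimal sketch in Python, in which the variable names are purely illustrative:

```python
from math import comb

# Number of variables in the database described above
n_variables = 133

# Distinct unordered pairs: n * (n - 1) / 2 = 133 * 132 / 2
n_pairs = comb(n_variables, 2)

print(n_pairs)  # 8778, i.e. "almost 8,800"
```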

The researchers set about applying the usual tests to the various pairs, looking for "significant" correlations. They set the standard for significance pretty high, requiring that the probability of getting at least as impressive a result by chance alone be just 1 in 100.

When the researchers trawled the data, they found more than 3,000 supposedly "significant" relationships - the overwhelming majority of which were, of course, junk.
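To get a feel for how chance alone produces "significant" findings, here is a rough simulation. It does not use the Bristol database, and it is not the researchers' method: it simply fills 133 made-up variables with pure random noise and tests every pair at the 1-in-100 level. The sample size of 100 observations, the function names and the choice of test (a Pearson correlation with the standard Fisher z approximation for its p-value) are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value via the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_spurious(n_variables=133, n_obs=100, alpha=0.01, seed=0):
    """Count 'significant' pairs among variables that are pure noise."""
    rng = random.Random(seed)
    # Every variable is independent random noise (an assumption of this
    # sketch), so any "significant" pair is a false alarm by construction.
    data = [[rng.gauss(0, 1) for _ in range(n_obs)]
            for _ in range(n_variables)]
    hits = 0
    for i in range(n_variables):
        for j in range(i + 1, n_variables):
            if p_value(pearson_r(data[i], data[j]), n_obs) < alpha:
                hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious())
```

With 8,778 pairs and a 1-in-100 threshold, around 88 false alarms are to be expected from noise alone, and the simulated count should land in that neighbourhood. Real data will throw up many more "significant" pairs, because real variables are seldom independent of one another - which does not make the connections any more meaningful.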

Within the scientific community, there is growing concern that many research findings may be undermined by such high-tech apophenia - or "voodoo correlations", to use the term coined by the psychology professor Edward Vul of the University of California, San Diego.

A few years ago, he spotted a bizarre study that claimed to show a link between brain activity and the speed at which people walk.

Mystified, Prof Vul and his colleagues investigated, and found the results were based on a widely used technique for correlating brain scanning images to what people are doing. Put simply, it homes in on random data that just happen to fit whatever theory is being tested - and then claims the chances of the results being a fluke are tiny.

Such cherry-picking is guaranteed to support pretty much any barmy idea and, worryingly, Prof Vul and his colleagues found that about half the studies they reviewed relied on such methods.
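The cherry-picking effect can be sketched in a few lines of code. This is a toy illustration, not the analysis Prof Vul's team examined: the "voxels" and "behaviour" scores are random numbers, and the sizes (2,000 voxels, 20 subjects, the 10 best voxels kept) are arbitrary assumptions. The sketch picks the voxels that best match the behaviour in one set of scans, then compares their average correlation in that same set with their average in a second, independent set.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def voodoo_demo(n_voxels=2000, n_subjects=20, top_k=10, seed=1):
    """Return (circular, independent) mean |r| for cherry-picked voxels."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0, 1) for _ in range(n_subjects)]

    behaviour = noise()                          # e.g. walking speed (made up)
    scan_a = [noise() for _ in range(n_voxels)]  # data used to pick voxels
    scan_b = [noise() for _ in range(n_voxels)]  # fresh, independent data

    # Step 1: keep only the voxels that happen to fit the behaviour best.
    r_a = [abs(pearson_r(v, behaviour)) for v in scan_a]
    chosen = sorted(range(n_voxels), key=lambda i: r_a[i], reverse=True)[:top_k]

    # Step 2 (circular): report the correlation on the same data.
    circular = sum(r_a[i] for i in chosen) / top_k

    # Step 3 (honest check): test the same voxels on independent data.
    independent = sum(abs(pearson_r(scan_b[i], behaviour))
                      for i in chosen) / top_k
    return circular, independent


if __name__ == "__main__":
    print(voodoo_demo())
```

Although every number here is noise, the hand-picked voxels show a strong average correlation on the data used to select them, and a far weaker one on the fresh data. Selecting and testing on the same data is what manufactures the apparent link.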

It is tempting to think voodoo correlations lie behind many of those headline-grabbing science stories about how, say, regions of the brain "responsible for" jealousy are more active in men than women.

It is even harder to avoid thinking they also explain why so much of the output of data-mining - the unwanted adverts, the irritating junk mail - remains precisely that: junk.

Robert Matthews is visiting reader in science at Aston University, Birmingham, England.