Opinion: The true cost of data collection

By John Pethica on Jun 25, 2009 4:05PM

Personal data holders must be open and accountable.

The fall in the cost of data storage, especially flash memory, has made it practical to keep vast banks of information that can be accessed at speed. People leave an ever-growing trail of digital fingerprints, accumulating into a personal digital history.
Mobile phones track calls and locations to within a few metres; browsing history and timings can be monitored; and travel, health, and financial transaction records all sit on databases. These can be called up almost instantly for analysis and correlation, which is very handy for advertising or for efficient service delivery.
But using such data for security purposes or for commercial gain raises issues which need to be addressed if organisations are to avoid further damage to public confidence in IT systems.
Anonymity is becoming increasingly hard to maintain. Structured searches can recover identities from anonymised partial data or metadata, and the technique becomes vastly more powerful when large, multiple datasets can be searched and cross-correlated.
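To make this concrete, here is a minimal sketch of such a linkage attack. Everything in it (datasets, column names, values) is hypothetical; the point is only that an inner join on quasi-identifiers such as postcode, birth date, and sex, each harmless alone, can be enough to put names back on "anonymised" records.

```python
import pandas as pd

# Hypothetical "anonymised" health records: names stripped, but
# quasi-identifiers (postcode, birth date, sex) left intact.
health = pd.DataFrame({
    "postcode":  ["SW1A 1AA", "M1 1AE", "LS1 4AP"],
    "birthdate": ["1961-07-13", "1975-02-02", "1961-07-13"],
    "sex":       ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

# Hypothetical public dataset, e.g. an electoral register, with names.
register = pd.DataFrame({
    "name":      ["A. Smith", "B. Jones", "C. Patel"],
    "postcode":  ["SW1A 1AA", "M1 1AE", "LS1 4AP"],
    "birthdate": ["1961-07-13", "1975-02-02", "1961-07-13"],
    "sex":       ["F", "M", "F"],
})

# Cross-correlating the two re-identifies the "anonymous" patients:
# a simple inner join on the shared quasi-identifiers.
reidentified = health.merge(register, on=["postcode", "birthdate", "sex"])
print(reidentified[["name", "diagnosis"]])
```

The more datasets that are available for cross-correlation, the fewer quasi-identifiers are needed to single a person out.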
It is disingenuous to say, as some governments and companies do, that the content of messages is not monitored. Traffic analysis, of who communicates with whom and of the network structure that results, is often all that is needed to build a comprehensive surveillance picture.
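A similarly hedged sketch of why content is beside the point: given nothing but hypothetical call records (who called whom, never what was said), a few lines suffice to reconstruct the social graph and rank its members by centrality. The data and the choice of the networkx library are illustrative assumptions.

```python
import networkx as nx

# Hypothetical call records: (caller, callee) pairs only, no content.
calls = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("eve", "alice"), ("eve", "frank"),
]

g = nx.Graph(calls)

# Degree centrality: who sits at the hub of the network?
for person, score in sorted(nx.degree_centrality(g).items(),
                            key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```

Whoever tops that ranking is the hub of the network, identified without reading a single message.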
This all sounds a bit Big Brother, yet it has been generally accepted because of the conveniences that data access brings. As long as such databases provide tangible benefits that clearly outweigh the risks and disadvantages, most consumers tolerate them.
Take, for example, the vehicle insurance data used for online car tax applications. It is estimated that about one entry in every 1,000 is in error. On a database of roughly 25 million vehicles, that means problems for some 25,000 people, not least because their cars appear uninsured when caught on an automatic number plate recognition camera. But it also means the other 24,975,000 are potentially satisfied customers.
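The arithmetic behind those figures, for clarity; the 25-million record count is not stated outright but is implied by the error rate and the 25,000 affected:

```python
records = 25_000_000    # implied size of the insurance database
error_rate = 1 / 1_000  # about one entry per thousand in error

errors = records * error_rate
print(f"erroneous records: {errors:,.0f}")            # 25,000
print(f"correct records:   {records - errors:,.0f}")  # 24,975,000
```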
It is reasonable for an insurance company or bank to perform a cost-benefit analysis and conclude that it is cheaper to fix and compensate a few errors than to spend vast amounts trying to achieve a “perfect” system. If it works 999 times in 1,000, that might be OK.
It is much less reasonable for security organisations. Someone wrongly detained as a terrorist because of incorrect data will be far less forgiving than someone with a minor car insurance error. And false positives and false negatives can render data useless when the task is to find one person in 100,000; that is quite different from general customer convenience.
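The underlying base-rate problem is easy to demonstrate with assumed numbers. Suppose a screening system is 99.9 per cent accurate in both directions, and one person in 100,000 is a genuine target:

```python
population = 100_000_000  # assumed screened population
prevalence = 1 / 100_000  # one genuine target per 100,000 people
accuracy = 0.999          # assumed true-positive and true-negative rate

targets = population * prevalence
innocents = population - targets

true_positives = targets * accuracy           # targets correctly flagged
false_positives = innocents * (1 - accuracy)  # innocents wrongly flagged

flagged = true_positives + false_positives
print(f"flagged: {flagged:,.0f}")
print(f"of whom genuine: {true_positives:,.0f} "
      f"({true_positives / flagged:.2%})")
```

Under these assumptions, fewer than one flagged person in a hundred is a genuine target; the rest are the wrongly detained of the paragraph above. Accuracy good enough for customer convenience is hopeless for finding needles in haystacks.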
We need to be very clear about the purpose of data collection before aggregating it.
However brilliant your IT systems, it is impossible to eliminate human effects and errors. Wrong information might be entered, and data can be accessed or misused by insiders. Once data has leaked, all control is lost and the risk of misuse only grows. The more personal or irrevocable the data, the greater the potential harm from error or misuse.
Perhaps we should not accumulate the data at all. But where there is real cost-benefit value, stringent regulation, segregation of datasets, and genuinely serious penalties for abuse should be put in place.
Those using data need to remember they are in a privileged position. It is essential to be honest and open with customers and citizens about the purposes to which data can, and might, be put.
Secrecy does not help. Regulation must be informed by independent, open research and testing, to give a level of confidence appropriate to the sensitivity of the data.
John Pethica is chief scientist at the National Physical Laboratory, the UK’s national measurement institute