Sprawling systems teeter on IT chaos

By Duncan Graham-Rowe

The UK government is spearheading a £10 million programme aimed at finding ways to avert catastrophic failures in large IT networks.

Some systems are now so large that they cannot be tested exhaustively, making it impossible to predict how they will behave in every circumstance. Hidden flaws could lead to crashes in critical networks such as healthcare or banking systems.

The scheme has been given added urgency by the failures of power grids in the US and Italy last year. “The system failures in terms of electricity blackouts show that patterns of unexpected and negative behaviours can arise, and when they do they are often disastrous,” says the government’s chief scientist, David King. If a century-old technology like a power grid can fail, the same might easily happen to modern IT networks.

Cancelled government IT projects plus others that have run over budget – such as tax and child benefit computer systems – cost the UK £1.5 billion over six years, according to a 2003 report from the Office of Government Commerce.

Now all government departments, health services and education systems across the 25 countries of the European Union are being linked to the internet. And the UK government ultimately wants many departmental IT systems connected together in the name of “joined-up government”.

But there is a real danger that such massive interconnected systems will exhibit potentially disastrous “emergent behaviours”, says David Cliff of Hewlett-Packard’s laboratory in Bristol, UK, who along with Seth Bullock at the University of Leeds wrote the report that prompted the government to address the issue.

Unpredictable behaviour

The task of debugging computer code grows exponentially with a program’s size, Cliff says, making full testing prohibitively expensive. “Systems don’t always fail because the programmers are incompetent. It’s because they didn’t know about these unforeseen problems.”

Computer systems are traditionally built by breaking up a problem into smaller parts, and assuming these will continue to work as planned when joined together. But as the number of modules increases, the ways in which they can interact become increasingly difficult to predict.
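A toy calculation (an illustration, not from the report) shows why: with n modules there are n(n − 1)/2 possible pairwise links, so doubling the number of modules roughly quadruples the interactions a tester must consider, before even counting higher-order combinations.

```python
from math import comb

# Possible pairwise interactions between n modules: "n choose 2".
# Doubling n roughly quadruples the count a tester must consider.
for n in [10, 20, 40, 80]:
    print(n, comb(n, 2))
# 10 modules -> 45 pairs; 80 modules -> 3160 pairs
```

And this counts only pairs: interactions among three or more modules grow faster still.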

The £10 million the UK is to spend will be used to set up a national centre to study IT complexity, managed by the Engineering and Physical Sciences Research Council.

The centre will have its work cut out. The mathematics of complexity makes it impossible to explain the behaviour of a large distributed system simply in terms of the sum of its parts. Security threats like viruses and denial-of-service attacks can lead to unexpected emergent behaviours or even a crash.

Strange non-linear behaviours – in which a relatively small change in a system’s operating conditions produces a widespread effect – can result from single component failures, Cliff says. Such behaviours can always be traced after an incident – but by then it is too late.
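This kind of threshold effect can be sketched with a toy contagion model (a hypothetical illustration, not the researchers’ actual analysis): each node fails once a given fraction of its neighbours have failed, and a tiny change in that fraction decides whether one failure stays local or sweeps the whole network.

```python
# Toy cascade model: a node fails once at least `threshold` of its
# neighbours have failed. This is an illustrative assumption, not a
# model of any real grid or IT network.

def cascade(neighbours, seed, threshold):
    """Return the set of failed nodes after seeding one failure."""
    failed = {seed}
    changed = True
    while changed:
        changed = False
        for node, nbrs in neighbours.items():
            if node in failed:
                continue
            if sum(n in failed for n in nbrs) / len(nbrs) >= threshold:
                failed.add(node)
                changed = True
    return failed

# A small ring of 10 nodes, each connected to its two neighbours.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}

print(len(cascade(ring, 0, 0.5)))  # prints 10: one failure takes down all
print(len(cascade(ring, 0, 0.6)))  # prints 1: slightly higher threshold, cascade dies
```

The jump from total collapse to no cascade at all comes from a 0.1 shift in one parameter – the hallmark of the non-linear behaviour Cliff describes.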


