Big Data for Big Astronomy

Credit: alexovicsattila/iStock

By Peter Quinn

The Square Kilometre Array will generate huge amounts of data. Can computing capacity keep up?

When completed, the SKA will be capable of producing a stream of raw science data that will require processing power beyond that of the largest computing systems on Earth today. The resulting scientific data will accumulate at a rate that exceeds 1 Exabyte (one million Terabytes) per year.

These figures represent a scientific endeavour that is Google-scale, and consequently the SKA is attracting considerable international interest and excitement from within the industrial and academic communities. Much of this excitement, along with the associated challenges and opportunities, is shared with other research and commercial disciplines under the banner of “Big Data”.

Some of the most significant challenges for the SKA may be addressed by learning from other Exa-scale enterprises to ensure the affordable, effective and global achievement of SKA science goals.

The SKA as a Big Data Project

A lot has been written about the Big Data problem, but what kind of a problem is it? Is there one root issue? Are there many? Are they all technical? And if we can define the parameters of the problem, can we identify where solutions might be found?

Phase 1 of the SKA is expected to be operational in the first half of the next decade. On that time scale, what are the main components of the Big Data challenge that must be faced to build the SKA data system and to enable global SKA science?

Technology

Large science projects like the SKA, as well as the next generation of resource surveys and global information-gathering systems, are examples of sensor networks that acquire, move, process and store data. The balance between data movement, storage and processing will be different for each network but the general trend is to expand along all three of these dimensions in “Big Data Space”.

The available technologies for processing, storage and movement are not all progressing at the same rate. In particular, the ratio of storage to processing power for typical High Performance Computing (HPC) centres is going down while end users’ data volumes are expanding faster than the processing power.

Big Data therefore requires a change of direction for generic HPC technologies. In some cases the desired technology trend should be to reduce the number of computations required per byte of data stored and per byte of data communicated, which is the opposite of the current trend in the HPC industry.

In 1965, Gordon Moore first observed an emerging trend in digital electronics and predicted that available computing power would double (relative to cost) approximately every 2 years. There is every indication that Moore’s Law, based on improvements in silicon technology, will continue well into the next decade.

We expect the top machines to exceed 1 ExaFlop (10^18 calculations per second) by 2020. This implies that the SKA Phase 1 processing system, with requirements in the range of 100–200 PetaFlops (10^15 calculations per second), will be within the capabilities of HPC.

Similarly, while the collection of final science data products from the SKA Phase 1 will accumulate at more than 1 Exabyte/year, the world as a whole will have passed 10 Zettabytes (10,000 Exabytes) of online storage for videos, social media and image sharing in 2015.
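The arithmetic behind these comparisons is simple enough to check directly. The short Python sketch below restates the figures quoted above as ratios; the numbers are those already given in the text, not new SKA estimates.

    # Back-of-envelope ratios using only the figures quoted above.
    PETA = 1e15
    EXA = 1e18

    # Compute: SKA Phase 1 needs roughly 100-200 PetaFlops; top machines are
    # expected to exceed 1 ExaFlop by 2020.
    ska_phase1_flops = 200 * PETA
    exaflop_machine = 1 * EXA
    print(f"SKA Phase 1 as a fraction of a 1 ExaFlop machine: "
          f"{ska_phase1_flops / exaflop_machine:.0%}")                   # 20%

    # Storage: SKA science products accumulate at >1 Exabyte/year, against
    # roughly 10 Zettabytes (10,000 Exabytes) of global online storage.
    ska_archive_eb_per_year = 1.0
    global_online_storage_eb = 10_000.0
    print(f"One year of SKA data as a share of global online storage: "
          f"{ska_archive_eb_per_year / global_online_storage_eb:.2%}")   # 0.01%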

So the SKA by itself will not lead the race to higher performance or storage volumes. Whether the mix of performance and storage in conventional HPC centres better fits the needs of data-intensive research or of social media is one of the Big Data questions that the SKA will pose.

Cost

While the world is on track to create ExaFlop computing systems by 2020, the cost of these systems (including the buildings, power and human resources required to support them) is likely to be comparable to the cost of the SKA Phase 1 itself. Furthermore, for some time the cost of storage has been decreasing more slowly than the cost of processing.

The long-term cost of storing data at conventional HPC centres is also likely to exceed both the construction and operational budget of the SKA Phase 1. Hence, to do SKA science we will need to adopt optimised, diversified, cost-effective and limited approaches to turning raw data into scientific results within the financial limits of the SKA project.

Architecture

The plumbing of data flow – from acquisition to processing to storage and finally to end users – requires different pipes depending on whether the connection is to the “dam” or to the “bath tub”. It is important to understand where the “hot pipes” are and where there is a need to connect the sources and sinks directly. We need to ask if the “internet of things” (the sensors surrounding our daily lives) is suited to the architecture of our current internet. There will always be “dams” – sources of large data flows – that need special pipes and connections, but where do we start to tier or spread the flow to make best use of our capital and operational budgets, and deliver the best product?

One of the most interesting developments for the SKA as a Big Data project will be the structure of its data flow, processing and storage architecture, which must allow access by the global astronomical community and provide long-term data curation while recognising the finite budget available to build and operate the SKA Observatory.

This exact problem was faced by the community that designed and built the Large Hadron Collider (LHC) more than 15 years ago. The LHC community created a separate global effort outside of the LHC project proper to fund and execute the science data flow from the LHC. This effort created what is now known as The Grid, which in turn is a predecessor of today’s cloud-based approaches to storage and processing.

The diversification of the funds, jobs and responsibilities to process and store LHC data at various scales outside of the LHC facility led to a number of useful outcomes. First there was a multitude of national and regional HPC and eScience funding schemes, outside of high energy physics funding, that could be tapped to support LHC science. Second, spreading the effort and load was a very concrete way of directly engaging many large and small project partners, which is often difficult to do in a centralised international effort.

The SKA has a lot to learn from the LHC experience.

Methodology

Projects like the SKA will see an expansion of typical telescope data volumes by a factor of 100 in less than 10 years. Will our current algorithms, workflows and tools also scale up by a factor of 100? Experience suggests that even a ten-fold change of scale typically requires a new architecture and new methods.

Furthermore, are we aware of which elements of the data expansion create difficulties for current methods? Clearly, methods will fail rapidly if their computational requirements grow quickly as the number of data entries increases. However, methods may also fail if they involve intensive data movement, because the I/O rate is the slowest-improving aspect of HPC configurations.
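To make this concrete, the Python sketch below compares how much extra work a 100-fold growth in data implies for a few generic complexity classes; the baseline size and the classes themselves are illustrative assumptions rather than specific SKA pipelines.

    # Illustrative growth in algorithm cost when the data volume grows 100-fold.
    import math

    def cost_growth(work, factor=100, n_baseline=1e9):
        """Relative increase in work when the input grows by `factor`."""
        return work(n_baseline * factor) / work(n_baseline)

    for name, work in [
        ("O(N)      ", lambda n: n),
        ("O(N log N)", lambda n: n * math.log2(n)),
        ("O(N^2)    ", lambda n: n ** 2),
    ]:
        print(f"{name} work grows by a factor of {cost_growth(work):,.0f}")

    # O(N) -> 100x, O(N log N) -> ~122x, O(N^2) -> 10,000x. An algorithm that
    # merely keeps pace with the data behaves very differently from one whose
    # cost explodes quadratically - and every byte still has to be moved,
    # which stresses the slowly improving I/O rates.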

We are still largely in the dark with respect to the methodological aspects of radio astronomy that will not effectively scale by a factor of ten, let alone 100. Accordingly, this is an area of active research within the SKA community.

It is clear that some of the fundamental methods used in radio astronomy – such as the Fast Fourier Transform (FFT) – do not run well on current HPC architectures. A recent survey of more than 300 high performance machines showed that typical FFT methods achieve less than 10% of the advertised peak performance. The SKA project certainly cannot afford to buy HPC hardware that is ten times more capable than (and ten times the cost of) what we actually need.
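As an illustration of how such an efficiency figure can be estimated on any given machine, the sketch below times a large complex FFT in Python with NumPy and converts the result to an effective calculation rate using the conventional 5N log2 N operation count. The “advertised peak” value is a placeholder to be replaced with the machine’s actual specification; none of the numbers here come from the survey mentioned above.

    # A minimal sketch: time a large complex FFT and estimate the achieved
    # fraction of an assumed peak. The peak figure is a placeholder.
    import time
    import numpy as np

    N = 2 ** 24                                   # 16M-point complex transform
    x = np.random.standard_normal(N) + 1j * np.random.standard_normal(N)

    np.fft.fft(x)                                 # warm-up run
    runs = 5
    t0 = time.perf_counter()
    for _ in range(runs):
        np.fft.fft(x)
    elapsed = (time.perf_counter() - t0) / runs

    fft_flops = 5 * N * np.log2(N)                # conventional complex-FFT count
    achieved = fft_flops / elapsed
    advertised_peak = 100e9                       # assumed 100 GFLOP/s nominal peak

    print(f"Achieved: {achieved / 1e9:.1f} GFLOP/s "
          f"({achieved / advertised_peak:.0%} of the assumed peak)")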

Human Resources

Data scientists are in high demand. In 2012, more than 1% of all jobs posted on the internet were for data scientists, and the number of data science vacancies is now doubling in less than 12 months – a rate comparable to the doubling time of astronomy data itself! Conventional computer science education is supplying graduates for less than 35% of the positions being filled. Instead, data-intensive research in science and industry is providing the bulk of the first generation of data scientists.

We need to significantly increase the number of people being trained in data science to meet the technological, cost, architectural and methodological challenges of Big Data. At this time, the skills that we should be identifying, refining and imparting are spread across science and industry in an uncoordinated and unrecognised manner. Data science needs coherent recognition and cultivation within international research projects like the SKA and the broader astronomical research community.

Astronomy already does a reasonable job of educating our graduate students in high performance computing, programming methodology and data structures. As the scale and scope of modern astronomical survey science grows, we will need to again cultivate and acquire the necessary skill sets for our future radio survey scientists. This requires that these skills be recognised as an essential part of an astronomer’s training, both within astronomy and in the wider research and commercial world, as our graduates look for careers in a very competitive landscape.

Building panoramic radio surveys of the universe with the SKA facilities in South Africa and Australia will be a task that involves teams from all corners of the global astronomical community. The linkage, cooperation and commitment of the global community to the SKA science mission will be critical to the success of the SKA. As a member of the Big Data community, the SKA will share many of the complexities, challenges and costs of similar enterprises in the commercial and social world.

Previous developments undertaken to facilitate distributed research, such as the Sloan Digital Sky Survey and the LHC, are highly relevant to the challenges that the SKA community will face. Furthermore, the roll-out of globally available cloud technologies and services over the coming decade will significantly help to deliver on the potential of the SKA for major astronomical breakthroughs.

Peter Quinn is the Executive Director of the International Centre for Radio Astronomy Research and Professor at the University of Western Australia.