9 December 2016

A cloud big enough for the stars

Processing power: South Africa will soon need cloud-based servers that can handle the massive amounts of data that will be generated by our astronomy programme

The Inter-University Institute for Data Intensive Astronomy (Idia) — a collaboration between the University of Cape Town, the University of Pretoria, North-West University, and the University of the Western Cape — is piloting the African Research Cloud (ARC), with the hope of making it a continent-wide platform.

“It’s primarily set up as a platform for data-intensive research, which is research with large data sets,” says Professor Russ Taylor, founding director of the institute. “It’s about the democratisation of big data, allowing individuals, including students and researchers, to interact with the data.”

At the launch of the institute last year, Science and Technology Minister Naledi Pandor said that South Africa’s investment in and focus on big data was “not only due, but it’s crucial if South Africa is to play a significant role in the world economy in the coming decades”.

There are currently two three-year pilot projects on the ARC, funded by the Idia institutions: radio astronomy and genomics.

Big data is one of the major challenges facing modern science and business: as our equipment and techniques improve, we create more data. But we are struggling to keep up with the pace of technological development and the glut of data. For data to be useful, it needs to be processed, analysed, understood and displayed in a way that the people engaging with it can understand.

“The initiative is a first for Africa and will be a real benefit to researchers on the continent,” says Sakkie Janse van Rensburg, UCT’s executive director of ICT services. “Big data has added a new dimension to the research process across all disciplines. Before the launch of this data centre, researchers struggled to manage data-heavy information, with significant challenges when it came to storing it in a way that could be quickly accessed, analysed, visualised and shared.”

Radio astronomy — which is the driver behind the institute and ARC — is the apotheosis of this challenge. The Square Kilometre Array (SKA), which will be co-hosted by Australia and South Africa, will produce more data in one day than is currently on the entire internet.

The SKA will be the largest radio telescope in the world, with thousands of dishes and antennas. Funded by an international consortium, the SKA will seek to answer some of humanity’s most enigmatic questions: Are we alone in the universe? What happened straight after the Big Bang? How do galaxies form? What is dark matter?

South Africa’s 64-dish MeerKAT telescope, locally designed and funded, is scheduled for completion in 2017. It will form part of phase one of the massive scientific instrument. Construction of SKA phase one will begin in 2018.

Taylor, who is also an SKA research chair shared between the universities of Cape Town and the Western Cape, says: “The [ARC] will build the capacity for South African researchers to work the data from MeerKAT and to make scientific breakthroughs in South Africa.

“The MeerKAT starts producing data next year, and we need systems in place for researchers to interact with the data.”

This is why radio astronomy forms one of the two pilots currently underway on the ARC. The African Resource Cloud Astronomy Development project — otherwise known as Arcade — is a “focal point for the development and prototyping of ARC hardware and software resource deployment”, the project states. Researchers have already used this platform to train undergraduate students and to perform data-heavy research.

“While MeerKAT’s large surveys [in which astronomers survey large portions of the sky] present an unprecedented opportunity for discovery and scientific advance, astronomers will face an enormous challenge in their quest to transport, process and analyse the terabytes of information that MeerKAT will produce,” they say.

But radio astronomy is not the only area that is struggling to cope with — and train students in preparation for — this avalanche of data. Genomics, which uses next-generation sequencing to understand the smallest biological building blocks in plant, animal and human genomes, requires researchers to be able to crunch large quantities of data.

North-West University, in collaboration with the South African National Bioinformatics Institute based at the University of the Western Cape, will be using the cloud to develop bioinformatics skills for genomics.

Many researchers “have extremely limited exposure to formal training in computing at undergraduate level”, says Boeta Pretorius, NWU’s chief director of IT. Pretorius, who is also an Idia board member, says that at the moment the goal is to develop a web-based interface that researchers can use, even if they do not know how to use computer-coding language.

“Although the demonstration project has a deliberately narrow scope, we believe that it will demonstrate that cloud environments can lower the barrier to entry for researchers with limited computer training to the world of big data analytics.”

The reality is that these bioinformatics capabilities are expensive, and institutions in South Africa, let alone elsewhere on the continent, cannot each afford to fund their own solutions.

“The cloud gives researchers the ability to develop within their disciplines collaborative research environments, which share data, compute and tools and which are free of institutional ICT borders,” says UCT deputy vice-chancellor for research and internationalisation, Professor Danie Visser. “This will help researchers all over Africa accelerate and advance their research practice to the level that rivals institutions around the globe.”

Taylor says that they do not want to grow too quickly, but that there are many applications for the ARC. “We’ve identified a small number of strategic science areas to build the tools and channels to interact with big data in the cloud,” he says. These include geospatial Earth imaging and land use data.

“Of course, we have plans to roll out [the cloud] as African infrastructure in the next five to six years, but we would need significantly more resources.”