1 November 2015

Drive to decode array of big data

“Data is the new oil,” says Sean McLean, IBM South Africa’s university relations manager. “Just like crude, it is very valuable, but if it hasn’t been refined, it cannot be used.”

From smartphones and social media to financial systems and credit card data, consumers’ digital footprints continue to grow, and companies want the information hidden in this data. But current data analysis and visualisation techniques are struggling to cope with the quantity of data being generated.

This goes to the heart of the “big data problem”, a phrase bandied about in academic and industry circles. But, as experts note, it means different things to different people.

Professor Alan Christoffels, the National Research Foundation research chair in bioinformatics and health genomics, says: “For me, it’s all about the process of examining data, large quantities of it, to uncover hidden patterns, and then trying to make an informed decision.”

From insurance policies to service delivery, organisations use data to make decisions.

This problem – a growing volume of data that analytical methods cannot keep up with – is exemplified by the Square Kilometre Array (SKA). It will be the world’s largest radio telescope, with thousands of antennas across Africa and Australia. The core will be in the Northern Cape in South Africa. These antennas will collect the relatively weak radio signals coming from celestial objects to answer questions such as: “Is there other life in the universe?”, “What is dark matter?” and “How do galaxies form?” The SKA dishes will produce data at roughly 10 times the rate of global internet traffic.

Information overload: Bernie Fanaroff says the SKA is the ultimate big data machine – collecting a massive amount of radio data from celestial bodies. (Jaco Marais/Foto24/Gallo)

“The SKA is the ultimate big data machine,” says SKA South Africa director Bernie Fanaroff. “With these huge volumes of data, which you can’t look through by hand, you have to process the data before you even start analysing [it].”

Science projects producing large quantities of data, such as the SKA and the Large Hadron Collider (a giant particle accelerator in Europe), mean that “science will be done in a different way … but so will everything else. [Big data] becomes the basis for a global industry itself,” Fanaroff says. “Data scientists will become critical to a whole range of activities, from service delivery to retailing and banking to the way governments are run.”

IBM’s Francois Luus, one of South Africa’s first data scientists, says: “The central challenge in modern data science is finding effective ways of dealing with high volume and high velocity data, and this challenge is exemplified by the SKA radio astronomy project.

“When we add the use of more complicated prescriptive methods, such as machine learning and artificial intelligence, to a high volume, high velocity data science problem, then we truly have a big data cognitive challenge.”
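To make that concrete, here is a toy sketch in Python – invented numbers, not SKA or IBM code – of the basic pattern behind such pipelines: statistics are updated on the fly as samples stream past, and only rare outliers are kept, because at these volumes nothing can be stored or inspected by hand.

    import random

    # Toy illustration only: flag rare, unusually strong readings in a
    # simulated high-velocity stream. All values here are invented.
    def stream(n=100_000):
        for _ in range(n):
            if random.random() < 1e-4:
                yield random.gauss(10.0, 1.0)  # rare 'interesting' event
            else:
                yield random.gauss(0.0, 1.0)   # background noise

    count, mean, m2, flagged = 0, 0.0, 0.0, 0
    for x in stream():
        # Welford's online algorithm: running mean and variance in one pass
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        std = (m2 / count) ** 0.5
        # keep only samples more than five standard deviations from the mean
        if count > 1000 and abs(x - mean) > 5 * std:
            flagged += 1

    print(f"kept {flagged} of {count} samples")

Real radio astronomy pipelines are vastly more sophisticated, but the principle is the one Fanaroff describes: reduce the data before a human ever looks at it.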

IBM is one of the organisations partnering with the SKA project to address the computing and big data challenges that the project will face.

Solomon Assefa, the laboratory director of IBM Research Africa, says that not only is “the amount of data growing exponentially, but also the types of data sets are changing – it is not the usual structured data, but a lot of unstructured and noisy data. This requires new types of analytics, such as deep learning and machine learning, to make sense of it.”

But the shortage of data scientists in South Africa is a concern for those in astronomy and other fields. In September, three universities – Cape Town, North-West and the Western Cape – launched the Inter-university Institute for Data Intensive Astronomy.

Professor Russ Taylor, an SKA research chair and founding director of the institute, says: “The chief driver of the institute is to develop the capacity in South Africa to deal with the large data flows of the SKA, and to make sure that we have the ability in South Africa to do the science for the SKA … We want South African researchers to be leaders in the science research. [Science with big data sets] is the new way things are done. You can’t decouple the data science from the science. We need to build up the capacity.”

Speaking at the launch of the institute, Science and Technology Minister Naledi Pandor said: “A significant focus and investment in big data in South Africa is not only due, but is probably crucial if South Africa is to play a significant role in the world economy in the coming decades.”

The SKA is a driver in this regard because “it’s an IT [information technology] project of the kind that pushes the boundaries of global technology”, she said. “Big tech companies like IBM and Cisco are already involved because they know it will allow them to develop the knowledge and technologies that will keep them at the leading edge of computing.”

Taylor says, although the institute will focus on astronomy and the SKA, “we will be working to develop solutions for the big data challenges. Those solutions will have greater applications for climate change [research], genomics, financial and economic systems.”


In some cases, size definitely does count

There are many examples of companies using big data: banks and insurers making their services run more efficiently and developing new product offerings; Amazon using analytics to give you tailored recommendations; and Google using your GPS data to improve its route suggestions on Google Maps. These applications will multiply as computing capacity increases and more money is put into big data research and development. Here are three of the mega science projects that are putting the “big” into data.

The Sloan Digital Sky Survey

Although most South Africans know about the Square Kilometre Array, which will be the world’s largest radio telescope with its core in the Northern Cape, few know about the Sloan Digital Sky Survey.

This ambitious project, which began in 2000, aimed to create the most detailed 3D map of the universe yet. The survey, undertaken on a dedicated optical telescope in the United States, has collected data on hundreds of millions of celestial objects, and the telescope continues to collect 200GB of data every night – all of which has to be processed, analysed and categorised.

The Large Hadron Collider

This year, the world’s largest particle accelerator made life even more difficult for those developing the systems that process and store the data it produces.

At the Large Hadron Collider (LHC), which straddles the border of France and Switzerland and is about 100m underground, scientists smash protons together at ridiculously high energies, so that they can sift through the wreckage to understand and identify the smallest particles in the universe.

New processing capabilities will have to be developed to deal with so much data. (Fabrice Coffrini/AFP)

This year, they managed to increase the energies at which the protons collide by more than 50%, setting a new record for the highest-energy particle collisions. In the words of CERN (European Organisation for Nuclear Research): “Approximately 600-million times per second, particles collide within the LHC. Each collision generates particles that often decay in complex ways into even more particles. Electronic circuits record the passage of each particle through the detector as a series of electronic signals, and send the data to the CERN data centre for digital reconstruction … Physicists must sift through the 30 petabytes or so of data produced annually to determine if the collisions have thrown up any interesting physics.”

For some context, a petabyte is a million gigabytes. An iPhone 6 has 16 gigabytes of storage, so a single petabyte is more than 62 000 iPhones’ worth of data. And with more energy comes more data.
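That iPhone arithmetic is easy to check – Python is used here purely as a calculator, with the figures quoted above:

    # Checking the storage comparison above, using decimal units.
    PETABYTE_GB = 1_000_000    # one petabyte = a million gigabytes
    IPHONE6_GB = 16            # base-model iPhone 6 storage
    LHC_ANNUAL_PB = 30         # CERN's rough annual data output

    per_petabyte = PETABYTE_GB / IPHONE6_GB
    print(f"{per_petabyte:,.0f} iPhones per petabyte")                    # 62,500
    print(f"{LHC_ANNUAL_PB * per_petabyte:,.0f} for a year of LHC data")  # 1,875,000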

The human genome

Nothing shows the unrelenting march of technological development quite like the Human Genome Project. It began in 1990 with the aim of mapping the genome’s more than three billion base pairs – six billion nucleotides, the organic molecules that are the building blocks of DNA. It took more than a decade to sequence the part of the genome that contains the major genes. In 2015, it takes less than a day to sequence a person’s genome.
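For a sense of scale, a rough calculation – the two-bits-per-letter encoding is an assumption for illustration, and real sequencing files are far larger because every position is read many times – shows how quickly genomic data mounts up:

    # Rough lower bound on the size of one genome, assuming a minimal
    # 2-bit encoding per base (A, C, G or T) - illustrative only.
    BASES = 3_000_000_000   # ~3 billion base pairs, as above
    BITS_PER_BASE = 2       # four letters fit in two bits

    genome_gb = BASES * BITS_PER_BASE / 8 / 1_000_000_000
    print(f"one genome: ~{genome_gb:.2f}GB")  # ~0.75GB

    # A collection of a million genomes, before any medical metadata:
    print(f"a million genomes: ~{genome_gb * 1_000_000 / 1000:,.0f}TB")  # ~750TB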

But now the challenge is to compare genomes and other biological data, and this is where biobanks come in. These contain hundreds of thousands, even millions, of samples, along with donors’ medical and demographic information, allowing scientists to conduct research with the possibility of statistically significant results. This, researchers say, is the future of personalised medicine. But to achieve it, you need to crunch the data. – Sarah Wild