1 July 2016

Data rules the world

One of the first things many of us do when we’ve got to make a decision is to gather information to guide our choice. But what happens if you don’t have enough information at hand to be confident that you’re actually making a good decision? You just figure it out. Easy for people like you and me, but not if you’re a bank that’s been asked for a loan by someone with no credit record.

That’s where Professor Bhekisipho Twala, director of the Institute for Intelligent Systems and Professor of Artificial Intelligence and Statistical Sciences in the Faculty of Engineering and the Built Environment, University of Johannesburg, comes in.

Inspired by his maths teacher in high school, Eamon Molloy, Twala fell in love with the study of statistics. This led to his undergraduate degree in economics and statistics.

His first job was as a transport statistician, which is where he first encountered problems with data quality while working on transport modelling and validation. Never one to leave a problem unsolved, he realised that his only option was to do a master’s degree in computational statistics, which he followed up with his PhD in machine learning and statistical science at the Open University in Milton Keynes, United Kingdom.

Building on diverse expertise

His work in recent years draws on diverse expertise in making decisions with incomplete information, in using artificial intelligence (AI) techniques to predict outcomes, and in classification techniques. He has applied it in fields such as banking and finance, insurance, biomedicine, robotics, psychology, software engineering and, more recently, electrical and electronic engineering.

“As we continue into the 21st century, we are at the dawn of the Information Age,” Twala says. “Data and information are now as vital to an organisation’s wellbeing and future success as oxygen is to humans. Without a fresh supply of clean, unpolluted data, companies will struggle to survive.”

AI framework

He says that most of the problems that academia and industry deal with can be usefully cast in the framework of AI. This is the discipline that studies the design of agents that exhibit intelligent behaviour.

Since high-quality data is critical to success in the Information Age, he has developed strategies for dealing with the incomplete data problem in classification and prediction tasks. He applies AI and machine learning technologies in different fields to deal with uncertain knowledge.

The proposed methods estimate the limits on performance imposed by the quality of the database on which a task is defined, and involve a series of learning experiments.
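To give a flavour of what such a learning experiment can look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not Twala's published method: the synthetic two-class data, the nearest-centroid classifier, the mean imputation and every function name are hypothetical choices made for this example. It removes a growing share of values from a data set and reports how accuracy responds.

```python
import random

def make_data(n, rng):
    """Two well-separated synthetic classes in two features (illustrative assumption)."""
    rows = []
    for i in range(n):
        y = i % 2
        centre = 4.0 * y
        rows.append(([rng.uniform(centre - 0.5, centre + 0.5) for _ in range(2)], y))
    return rows

def degrade(rows, rate, rng):
    """Copy rows, replacing each value by None with probability `rate`."""
    return [([None if rng.random() < rate else v for v in x], y) for x, y in rows]

def column_means(rows):
    """Mean of the observed values per column (0.0 if a column is entirely missing)."""
    means = []
    for j in range(2):
        seen = [x[j] for x, _ in rows if x[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means

def impute(rows, means):
    """Fill every missing value with the corresponding column mean."""
    return [([means[j] if v is None else v for j, v in enumerate(x)], y) for x, y in rows]

def fit_centroids(rows):
    """Average position of each class in the training data."""
    cents = {}
    for label in (0, 1):
        pts = [x for x, y in rows if y == label]
        cents[label] = [sum(p[j] for p in pts) / len(pts) for j in range(2)]
    return cents

def predict(cents, x):
    """Assign x to the class whose centroid is closest."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def run_experiment(rate, seed=0):
    """Train and test on data degraded at the same missing rate; return accuracy."""
    rng = random.Random(seed)
    train = degrade(make_data(200, rng), rate, rng)
    test = degrade(make_data(100, rng), rate, rng)
    means = column_means(train)
    cents = fit_centroids(impute(train, means))
    hits = sum(predict(cents, x) == y for x, y in impute(test, means))
    return hits / len(test)

for rate in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"missing rate {rate:.1f}: accuracy {run_experiment(rate):.2f}")
```

On clean data this toy classifier separates the two classes perfectly. As more values go missing, accuracy typically falls, which is the kind of performance limit such experiments are meant to expose.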

Importance of data quality

The research has two goals. First, the researchers seek to demonstrate that data quality is an important component of machine learning tools and should be carefully considered when developing and using these tools.

They believe that while the importance of data quality is now understood in the business community — where researchers have equated quality decision making to earnings — in the engineering and science communities this realisation has not yet occurred.

Thus they embarked upon research into the effects of data quality upon the machine learning algorithms in an effort to demonstrate that data quality is a large factor in the outcomes of the algorithms and should be afforded more respect.

Second, after providing evidence that data quality is a large factor in the algorithms, they developed and tested some preliminary methods. These incorporated data quality assessments, thus creating more robust and useful algorithms.

They believe these modifications and methods have profound effects on the use of these machine learning algorithms in actual practice, particularly in the engineering and science communities.

Decision trees

Twala’s research merges two communities within computer science: data quality and machine learning, specifically the field of decision trees (a decision support tool that uses a tree-like graph or model of decisions and their possible consequences).

“While data quality has made great strides in gaining the respect of the business community in the past 10 years, within the machine learning/AI realm it has largely been neglected in order to focus more specifically on the learning algorithms and methods themselves,” he explains.

“Most research in these fields begins with the assumption that the data feeding the algorithms is of high quality — accurate, complete and timely. Researchers that do take data quality into account normally focus on the aspect of missing data. We start by presenting and elaborating on the theory of missing data, and use a variety of models to arrive at a collaborative prediction.”
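The idea of several models voting on incomplete records can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration rather than Twala's actual technique: the toy loan records, the mean imputation of missing values, the one-level decision trees ("stumps") and the majority vote are all hypothetical stand-ins.

```python
from collections import Counter
from statistics import mean

# Hypothetical loan records: (income, years employed, existing debt) -> repaid (1) or not (0).
# None marks a missing value.
ROWS = [
    ((55, 6, 5), 1),
    ((20, 1, 15), 0),
    ((70, None, 8), 1),
    ((25, 2, None), 0),
    ((None, 8, 4), 1),
    ((18, 0, 20), 0),
]

def column_means(rows):
    """Mean of the observed (non-missing) values in each column."""
    n_cols = len(rows[0][0])
    return [mean(x[j] for x, _ in rows if x[j] is not None) for j in range(n_cols)]

def impute(x, means):
    """Replace each missing value with its column mean."""
    return tuple(means[j] if v is None else v for j, v in enumerate(x))

def fit_stump(rows, j):
    """One-level decision tree on feature j: choose the threshold and
    direction that misclassify the fewest training rows."""
    values = sorted({x[j] for x, _ in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for above in (0, 1):
            errors = sum((above if x[j] > t else 1 - above) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x[j] > t else 1 - above

def fit_ensemble(rows):
    """Fit one stump per feature on imputed data and combine them by majority vote."""
    means = column_means(rows)
    filled = [(impute(x, means), y) for x, y in rows]
    stumps = [fit_stump(filled, j) for j in range(len(means))]

    def predict(x):
        completed = impute(x, means)
        votes = Counter(stump(completed) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict

predict = fit_ensemble(ROWS)
print(predict((60, None, 6)))   # applicant with unknown years of employment
print(predict((22, 1, None)))   # applicant with unknown existing debt
```

Each stump looks at only one feature, so a missing value weakens a single voter rather than the whole decision. That is one simple way to keep predicting when records are incomplete.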

Prof Twala’s work has benefitted, and is still benefitting, two distinct groups. Project or metrics managers, responsible for developing prediction models for software projects, now have access to a support tool that enables them to make the best use of their available data sets. This, in turn, allows them to develop more accurate prediction models. The work can be extended to other areas of business as well.

Academics in the area of empirical software engineering using data sets for research typically have restricted access to industrial project data. This is due in part to the reluctance of companies to release such data into the public domain.

Data sets are used for research into prediction, as well as many other areas to better understand software development phenomena. Results from such research will feed back into industrial practice, further benefitting the software development industry.

Discoveries in education

One of his areas of recent study has centred on making discoveries using data from education settings, and using those methods to better understand students and their learning environments.

Some of the work has helped identify factors affecting students’ academic performance. Findings have revealed that age, father/guardian’s socioeconomic status and daily study hours significantly contribute to the academic performance of graduate students.

Another area is estimating teaching effectiveness at high school level using data mining methods. It’s anticipated that the findings of these studies will give curriculum developers new insights into emerging issues on performance, as well as influence policy formulation in the Department of Basic Education.

“I love that my work lets me play in everyone else’s backyard,” says Twala. “The interdisciplinary nature of what we do means that we work with leading minds in philosophy, neuroscience, architecture and law, to name but a few.”

He says that he also loves that his work is relevant to South Africa, and that it can help solve problems from traffic management to health, from insurance to software development.