South Africa has a big problem with corruption in government supply chains. The most salient recent example is the looting of funds during the Covid-19 pandemic, specifically the procurement of personal protective equipment in the Gauteng health department. Mark Heywood correctly asserted in the Daily Maverick that unless we introduce the certainty of punishment for corrupt public officials, we will lose the fight against corruption.
The July looting and riots taught us that these events affect our daily lives. They cause job losses and food price increases and are especially hard on the youth sector. Even trade union federation Cosatu recognised back in 2017 that corruption costs us at least R27-billion and 76 000 jobs a year. That was before the pandemic.
- Now imagine a scenario in which an accounting officer in the department of health could accurately predict the likelihood that fruitless and wasteful expenditure will occur, and act to prevent it.
- Ponder for a moment the transformative power of being able to predict the likelihood of xenophobic attacks or unrest such as the now-infamous July unrest of 2021, so that they may be averted altogether.
- What if an entire government supply chain could be managed by a distributed ledger such as a blockchain, so that not a single public official is involved?
Business domains such as banking and insurance have been making these kinds of predictions with data science tools for some time now. By employing machine learning to model risks to their own businesses, they predict the likelihood of a client defaulting on a loan or instituting an insurance claim. Have you ever wondered why the bank did not want to give you a loan? There is an algorithm behind that!
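To make that concrete, here is a minimal sketch of the kind of model a lender might train, written in Python with scikit-learn on purely synthetic data; the features (income, months in arrears, loan amount) are hypothetical illustrations, not any bank's actual inputs.

```python
# Minimal sketch: predicting loan default with scikit-learn (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical features: monthly income, months in arrears, loan amount.
income = rng.normal(20_000, 5_000, n)
arrears = rng.poisson(1.0, n)
loan = rng.normal(100_000, 30_000, n)
X = np.column_stack([income, arrears, loan])

# Synthetic "ground truth": risk rises with arrears and loan size, falls with
# income. A real lender would use historical repayment records instead.
risk = 0.4 * arrears + loan / 100_000 - income / 20_000
y = (risk + rng.normal(0, 0.5, n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Probability of default for a new applicant (hypothetical values).
applicant = [[15_000, 2, 120_000]]
print("Predicted default probability:", model.predict_proba(applicant)[0, 1])
print("Held-out accuracy:", model.score(X_test, y_test))
```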
Data science is an emerging field of inquiry usually associated with buzzwords such as big data, machine learning and artificial intelligence (AI). All of these terms have their roots in classical statistics. Statistical learning is, quite simply, learning from data.
This is made possible by two conspiring realities: the cost of storing data has decreased over the years, and computational power has increased exponentially. This means that it is possible to find patterns and correlations in very large datasets (hence the term big data).
One way of understanding this ability is to say: if the data is too large for an Excel spreadsheet or a central processing unit to handle, it's potentially a job for data science.
What if we brought data science and good governance into the same room for a chat?
I think there are enormous benefits to such an approach for good governance, for evidence-based policy making and for the fight against corruption. Full disclosure: I am wearing more than one hat.
As a legislator in the Gauteng provincial legislature and a member of the standing committee on public accounts (Scopa), I often hear well-founded complaints about the ex post facto way we do oversight. The sector oversight model adopted by the South African legislative sector is a backward-looking tool. Oversight typically occurs months after irregular public expenditure has occurred. Committee recommendations do not scare corrupt officials.
As a social researcher and budding code writer (Python is fun!), this makes me wonder about the potential to use an artificial neural network or a decision tree regressor as a catalytic mechanism in the fight against corruption.
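As a rough sketch of what I have in mind, the snippet below trains a decision tree regressor to score procurement transactions for the likelihood of irregular spending; the features and the data are entirely invented for illustration, and a real model would be trained on audited procurement records.

```python
# Minimal sketch: a decision tree regressor scoring procurement transactions
# for the likelihood of fruitless and wasteful expenditure (synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 2_000

# Hypothetical features for each transaction.
contract_value = rng.gamma(2.0, 500_000, n)   # rand value of the contract
competing_bids = rng.integers(1, 10, n)       # number of bids received
deviation = rng.random(n)                     # deviation from market price (0-1)
X = np.column_stack([contract_value, competing_bids, deviation])

# Synthetic target: fraction of spend later found irregular. In practice this
# would come from auditor-general findings and internal audit reports.
y = np.clip(0.6 * deviation + 0.05 * (10 - competing_bids) / 10
            + rng.normal(0, 0.05, n), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Score a new (hypothetical) transaction before the money is spent.
new_transaction = [[1_200_000, 1, 0.8]]
print("Predicted irregular-spend fraction:", tree.predict(new_transaction)[0])
print("R^2 on held-out data:", tree.score(X_test, y_test))
```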
The specific model is not as important as a few reality checks, though.
- In the web-native world of data science, anybody can write code, train a computer model and set it free on real data. This is encouraged by the fact that free and open-source resources are now ubiquitous. Urban and Pineda adequately problematise this abundance of free information: most of it is not sufficiently rigorous to warrant deep enquiry, and few resources, if any, are aimed specifically at the policymaker.
A simple Google search reveals thousands of short-form resources in the form of “How To” video tutorials, articles, listicles and blogs, each dealing with a specific subset of the myriad elements of data science. Topics such as “Preparing data with Pandas”, “How to select features and responses” or “How to determine if my chosen algorithm is performing” all invite enticing glimpses of problem-solving.
Rarely is the world of data science systematically unpacked, referenced and peer-reviewed specifically for the policymaker, the legislator and the government official. And yet, on the periphery of applied policy making, most officials are aware of concepts such as big data, machine learning and AI. These concepts first need to be demystified before being introduced formally in the governance domain.
- The second question is whether the data exists. Let me explain. Data about government performance is everywhere, and it is abundant. In my case Scopa members are inundated with data all the time. Data sources include portfolio committee reports, the auditor general’s office, the Public Service Commission, the Financial and Fiscal Commission, the Special Investigating Unit, internal audit reports, departmental quarterly reports, and the list goes on.
Yet in the middle of all this information I very much doubt that a dataset exists that is ready for machine learning. If anyone reading this would like to rebut my assertion, I would welcome such a development. It would save my research several months!
- The third issue is reproducibility. If my team and I build a machine learning model that performs well on unseen data, we must share! It should be standard practice that not only the datasets but the actual code are made available as part of the research. This is because sometimes algorithms don’t work, machine learning models degrade over time, or we make the wrong business decisions from the data. In such cases, we may all learn from our failures just as much as from our triumphs.
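To illustrate why shared code and data matter, here is a purely illustrative sketch of how a reviewer might re-run an evaluation on newer, unseen data and watch for degradation; the simulated drift values and the retraining threshold are hypothetical assumptions.

```python
# Minimal sketch: detecting model degradation on data the model has never seen.
# The dataset is synthetic; in practice each quarter's transactions would be
# loaded from the published dataset that accompanies the shared code.
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def make_quarter(n, drift=0.0):
    """Simulate one quarter of transactions; `drift` shifts the relationship."""
    X = rng.random((n, 3))
    y = np.clip(0.6 * X[:, 2] + drift + rng.normal(0, 0.05, n), 0, 1)
    return X, y

X_old, y_old = make_quarter(2_000)           # data the model was trained on
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_old, y_old)

for quarter, drift in [("Q1", 0.0), ("Q2", 0.1), ("Q3", 0.25)]:
    X_new, y_new = make_quarter(500, drift)  # unseen data, behaviour slowly shifting
    error = mean_absolute_error(y_new, model.predict(X_new))
    flag = "  <- retrain and re-publish" if error > 0.15 else ""
    print(f"{quarter}: mean absolute error = {error:.3f}{flag}")
```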
For the policymaker, machine learning can become the tool that helps us prove or disprove our intuition about the problem we are trying to solve.
Here are a few of my recommendations for disrupting corruption with a data-driven approach:
Firstly, governments need to become data-driven learning organisations. For this to happen, much more research and experimentation is needed. Officials and domain experts in government departments are vital to the success of the undertaking because they make the inferences from the data on which decisions for, say, corruption-busting hinge.
Secondly, we need to find a place for data science in the policy life cycle. Ideally more than one place. Statistical models can help policymakers move from inputs and outputs to outcomes and impacts. This would depend on the computational efficiency available and the particular problem the machine is trying to learn from, but every stage of the policy life cycle can benefit from machine learning.
The next step is to bring together the data scientist, the statistician and the domain expert. Governance is an immensely complex undertaking. Every moment of the day, thousands upon thousands of financial transactions happen across municipal, provincial and national budget line items, often involving staggering amounts. This is all happening in an environment regulated by a myriad of complex laws and regulations. Officials often underestimate their domain knowledge and they don’t get credit for being able to assimilate all this complexity. The current consensus is that the best problem-solving teams include data scientists, statisticians and domain experts. All three have distinct roles to play in arriving at informed governance decisions.
The possibilities for machine learning to tackle corruption are very exciting. But it still requires humility from the data scientist, an understanding of the theory, and a willingness from the government official to give freely of their domain knowledge. And finally, no model can ever predict with 100% accuracy. We should be honest about that when pitching solutions to legislatures and governments alike.