Assessment results don't add up
Last week, the minister of basic education announced the results of the Annual National Assessments (ANAs) for 2014. The ANAs test all children in grades one to six and nine, using standardised tests in mathematics and languages.
The problem is that these tests are being used as evidence of “improvements” in education when the ANAs cannot show changes over time.
There is absolutely no statistical or methodological foundation to make any comparison of ANA results over time or across grades.
Any such comparison is inaccurate, misleading and irresponsible. The difficulty levels of these tests differ between years and across grades, so differences in scores may reflect test difficulty and the content covered rather than any real improvement or deterioration.
Although the department of basic education tries to make the tests comparable across years, the way it goes about doing this (with teachers and experts setting the tests) means that in reality they are not at all comparable.
And the department knows this. On page 36 of its 2014 report, it states: “Even though care is taken to develop appropriate ANA tests each year, the results may not be perfectly comparable across years as the difficulty level and composition of the tests may not be identical from year to year.” Yet it then goes on to make explicit comparisons.
You can’t have it both ways. I can say categorically that the ANA tests are not at all comparable across years or grades. Despite this cloaked admission of incomparability, the report is full of the rhetoric of comparison, with scores reported side by side for 2012, 2013 and 2014, and 24 references to “increases” or “decreases” relative to last year’s ANA. Similarly, in her speech last Thursday, the minister spoke about “consistent improvement in home language” as well as “an upward trend in performance”.
All of these statements are extremely misleading and factually incorrect. The ANAs cannot be compared across grades or years, at least not as they currently stand.
Those of us in the field of educational assessment have been saying this repeatedly for two years. Yet journalists continue to regurgitate these “increases” and “decreases” without any critical analysis, as if they must be true – but they are not. There are different ways of determining whether the quality of education is improving (primarily by using reliable international assessments over time) but the ANAs, in their current form, are not among them.
For tests to be comparable over time, one has to employ advanced statistical methods – for instance, item response theory. This involves including some common questions across tests, which allows us to compare performance on the common questions with performance on the non-common questions, both within and between tests. This makes it possible to equate the difficulty of the tests (and adjust results) after they have been written. The common questions must also be used across grades and across ANA cycles.
This is standard practice: every reliable international and national assessment that intends to compare results over time or across grades uses these methods. The ANAs do not. There are no common questions used across any of the ANAs, either from grade to grade within a single year or between ANA cycles. Using the ANA results to talk about “improvements” or “deteriorations” therefore has no methodological or statistical justification whatsoever.
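To see how common questions make equating possible, consider a minimal sketch of linear “mean-sigma” equating, a simplified stand-in for full item-response-theory linking. All the numbers below are hypothetical illustrations, not real ANA data.

```python
import statistics

def mean_sigma_equate(common_old, common_new, new_scores):
    """Place new_scores on the old test's scale, using each cohort's
    performance on the anchor questions common to both tests."""
    # How much more spread out, and how much higher, is the new cohort
    # on the identical anchor questions?
    slope = statistics.pstdev(common_old) / statistics.pstdev(common_new)
    intercept = statistics.mean(common_old) - slope * statistics.mean(common_new)
    # Linearly map every new-test score onto the old test's scale.
    return [slope * s + intercept for s in new_scores]

# Hypothetical anchor-question percentage scores from two years' cohorts:
common_2012 = [40, 50, 60, 70]
common_2014 = [50, 60, 70, 80]   # new cohort scores 10 points higher on the same items
equated = mean_sigma_equate(common_2012, common_2014, [65])
print(equated)  # → [55.0]: a raw 65% in 2014 equates to 55% on the 2012 scale
```

The point of the sketch: without anchor questions there is nothing to feed into `common_old` and `common_new`, so no adjustment of this kind is possible and raw year-on-year scores cannot be placed on a common scale.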
There is not a single educational statistician in the country or internationally who would go on record and say that the ANA results can be used to identify “improvements” or “deteriorations” over time or across grades.
Although the ANA report speaks about an “advisory committee” of “local and international experts”, it does not name them. These experts need to come forward and explain why they believe these tests are comparable over time, and if they do not believe the tests are comparable, then the report should not invoke their authority.
On this matter, no one needs to take my word for it: the changes in results are so implausible that they speak for themselves. Take grade one mathematics, for example, where the average score was 68% in 2012, plummeted to 59% in 2013 and then soared to 68% in 2014. Very strange. Or, if we look at the proportion of grade three students with “acceptable achievement” (50% or higher) in mathematics, we have the fastest improving education system in recorded human history. The results went from 36% in 2012 to 65% in 2014. These changes are, educationally speaking, impossible.
Some of the provincial results are equally ridiculous. The average score for grade four home language in Limpopo doubled in two years, from 24% in 2012 to 51% in 2014. Given that the standard deviation for grade four home language in ANA 2012 was 26.5%, this amounts to a one-standard-deviation increase in two years. For those unfamiliar with how large that is, it is the same as the gap between township schools and suburban schools (mainly former Model C schools) recorded in the 2011 prePIRLS (pre-Progress in International Reading Literacy Study) study: 0.9 standard deviations. There are clearly miracles happening in Limpopo.
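The effect-size arithmetic behind the Limpopo example is simple to check, using only the figures quoted above:

```python
# Effect size of the reported Limpopo grade 4 home-language gain,
# expressed in 2012 standard-deviation units.
gain = 51 - 24            # percentage-point increase, 2012 to 2014
sd_2012 = 26.5            # reported SD for grade 4 home language, ANA 2012
effect_size = gain / sd_2012
print(round(effect_size, 2))  # → 1.02, i.e. roughly one standard deviation
```

A gain of a full standard deviation in two years is the scale of change the article calls implausible.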
I could go on and on and talk about other ridiculous changes, such as the national grade six mathematics average (from 27% in 2012 to 43% in 2014), grade five home-language increases in the North West (from 26% to 58%) or grade three mathematics increases in Mpumalanga (from 36% to 50%) or grade four home-language increases in KwaZulu-Natal (from 38% to 58%), and so on. These are all absolutely and unequivocally impossible, and have never been seen on a large scale anywhere in the world before. Ever.
Testing can be an extremely useful way to monitor progress and influence pedagogy and curriculum coverage, but only if it is done properly. Testing regimes usually take between five and 10 years to develop before they can offer the kinds of reliability needed to make claims about “improvement” or “deterioration”.
Test results send strong signals to students and teachers about what constitutes acceptable performance and whether things are improving or not. For example, the department assumes that 50% on the ANAs represents competent performance, but there is no rational basis for treating this threshold as conceptually equivalent to “acceptable achievement”.
The overall decline in ANA achievement between grade one and grade nine is also extremely misleading, because it suggests that the problem lies higher up in the system. But all research shows that children are not acquiring foundational skills in grades one to three and that this is the root cause of underperformance in higher grades.
Testing children is a serious business that requires large teams of highly skilled professionals whose sole responsibility is to ensure the reliability and validity of the ANA results and process. This includes building a large bank of questions across grades, learning outcomes and subjects. It involves setting and moderating tests; linking and analysing test questions using item response theory; as well as reporting and disseminating results in ways that principals, teachers and parents understand. It needs intense collaboration across the curriculum and assessment branches of government and with those who develop the department’s workbooks. It requires a much longer planning, piloting and reporting cycle than the impossible time frames to which departmental officials are subject.
Let me be clear: the ANAs should not be scrapped – they are one of the most important policy interventions of the past 10 years. However, the first rule in educational assessment, as in medicine, is: “Do no harm.” Sending erroneous signals to teachers and students about “improvements” is extremely unhelpful. It makes it much more difficult to induce the changes in classroom behaviour that are central to real advances in learning outcomes.
In essence, the department needs to answer this: Are the ANA results comparable over time and across grades? If not, why are they being used as evidence for claims about “improvements” or “deteriorations” across grades or over time?
Nic Spaull is an education researcher in the economics department at Stellenbosch University. He is on Twitter @NicSpaull and his research is available at nicspaull.com/research