The department of basic education has come in for a great deal of criticism for its management of the Annual National Assessments (ANAs). Challenges are inevitable in constructing an assessment instrument and even more so when taking assessment to a national scale. Criticism is inevitable, and constructive engagement is necessary on the part of all stakeholders. A problem arises when sweeping statements about assessment practices override scientific reasoning and nuanced understanding.
Let us be clear about what can be achieved by a large-scale systemic assessment across an entire school population, and what cannot. A single assessment is at most a one-to-two-hour dipstick into current educational practices, not a comprehensive review. Professor David Andrich, the world-renowned education and assessment specialist, writes that systemic assessments should not alone, nor even primarily, guide policy decisions in education.
Annual national systemic testing has to balance two contrasting imperatives: the first is to produce a sequence of fairly comparable tests that can provide reliable information about the current proficiency profile across the country at regular intervals; the second is to design a balanced set of probes that also assess multiple levels of conceptual challenge, across the entire curriculum, as far as is possible, over a series of years.
Overcompensation in favour of strong comparability may allow aggregated test outcomes to be compared year by year. The Western Cape education department administered near-replicas of the same test for an extended period. These instruments satisfied the first comparability imperative, but by replicating a single test every year the items necessarily explore only a narrow range of the curriculum – and fail to meet the second imperative.
The South African Mathematics Olympiad tests, on the other hand, fail to satisfy the first imperative, because the Olympiad items are designed anew every cycle. But they address the second imperative by testing a broad range of curriculum content and requiring a range of mathematical skills and insights to solve the problems.
An international perspective on “teaching to the test” by Jennifer Jennings and Jonathan Bearak shows test predictability can skew our understanding of student performance. They report that in New York State the systemic mathematics tests cover the same 41% of the standards (curriculum) year on year, over a four-year period. Each subsequent test is a clone of the first test.
On the comparability imperative, this system may appear strong. Yet “score inflation” as a result of testing only a common, limited curriculum sample threatens the validity of these annual comparisons. A second inherent problem is “loss of educational validity” because teachers understandably will narrow the taught curriculum to conform to the test.
Similarly, the demand that the pupils in South Africa “Practise ANA!” leads to a distortion of the educational purpose, in that practising a test will inevitably create anxiety in some learners and boredom in others, and decrease time for the in-depth engagement with mathematics concepts that does lead to improved learning and results.
In Texas, the balanced multiple-level competence imperative is prioritised. Systemic tests are constructed afresh every year by selecting items from an extensive item bank. These selections over a four-year period cover 93% of the mathematics standards or curriculum.
Here critics may claim that the tests are not directly comparable year on year. But in this Texas scenario there is no danger of “score inflation” from any narrowing of focus. Nor is there a threat to educational validity. The best interests of the school, the teacher and the children require that the curriculum focus covers a broad range of content and mathematical processes. This approach demands a higher level of test-setting expertise and effort.
In a critique of the ANA on the Mail & Guardian’s website, Nic Spaull of the economics department at Stellenbosch University claims that there “is absolutely no statistical or methodological foundation” to make any comparison of ANA results over time or across grades. This claim fails to distinguish between distinct statistical and methodological procedures.
The comparability of grade 12 department of basic education examinations across a sequence of years is assumed on the basis of methodological procedures, which should involve experienced teachers setting the test items to more or less the same standard year on year.
Some moderation by the council for quality assurance in general and further education and training (Umalusi) occurs thereafter. This procedure is valid in principle – the moderation methodology imposed on the outcomes is required to reach statistical comparability of achievement profiles.
The same logic for comparability objectives applies across grade levels. Though the department cannot yet affirm that its ANA test-construction processes are achieving the desired objectives of grade-specific validity and between-grade reliability, attempts are being made to achieve cross-grade comparability through critical engagement on the part of teachers. This professional engagement is one of the benefits of such a process.
Spaull also asserts that having “teachers and experts setting the tests” nullifies the tests’ comparability. If the teachers and the experts are not to construct the tests and select the items, then who else should perform those tasks, and how can we know that they are competent?
One cannot simply assume that international tests are valid, reliable and beyond critique. While these highly sophisticated tests serve a particular purpose of cross-national comparisons, the over-interpretation of the results and inferences made can result in short-sighted policy decisions. The fact is that international large-scale studies are limited, simply because they cannot by their very nature be sensitive to every within-country nuance.
Svend Kreiner, a measurement specialist at Copenhagen University, warns countries against the radical change of education systems to match those countries at the top of the league. His research shows different subset analyses of the Programme for International Student Assessment data produce different country rankings. His metaphor for the change in rankings is the shifting boats in Copenhagen harbour on a stormy night.
No test can ever be context-free; in fact, every test should be appropriately context-specific. The guides, in this respect, are subject specialists and practising teachers. The process of involving skilled teachers in the setting of items, reviewing, piloting, analysis and the final construction of the instrument will serve to steep these teachers in assessment and measurement principles.
The department of basic education routinely conducts robust analyses on pilot tests, which are shared with provincial representatives, thereby equipping them with critical skills to be shared further in their home communities. Teachers should indeed be involved in various phases of the ANA, but the necessary condition is that these teachers already have, or will develop, the required subject knowledge, assessment skills and insights to meet the two contrasting imperatives.
Spaull speaks about “those of us in the field of educational assessment”, implying that there is one voice on these educational assessment matters. Such unanimity is not the case. This field is rightly immersed in philosophical, educational and measurement theory debates, which on many issues inevitably lead to conclusions of the form “It depends …”
While assessment specialists do their best to offer scientifically informed advice, there might be an interim solution good enough for a particular purpose at a particular time. The point is that all educational researchers and the general public should be circumspect about the tests pervading the educational landscape before leaping to inferences based on test results.
A cry against the ANAs recently voiced by teachers was that they felt humiliated by the publicity about the poor results, which affected their standing in the eyes of their communities. From the perspective that the professional teacher’s function is dynamic engagement with the curriculum, the pupils and assessment, this consequence of dipstick tests – this collateral damage – must be addressed. Using the ANAs to name and shame schools where the conditions are already very difficult is fundamentally unjust and counterproductive.
A way to improve mathematical and literacy competence is to support teachers’ professional practice with well-designed assessment resources, reviewed and revised in consultation with teachers, so as to permit intermittent signals of progress within a grade year. Such resources, with high educational value and “low stakes”, can inform and enrich the learning and teaching process.
With such interventions the year-end monitoring tests will then offer more precise windows into pupil needs and teacher challenges, which can feed into the next year’s teaching and testing cycle. This process supports the professional practice of the teacher and enables teacher agency, a necessary component of providing a learning environment.
The balancing of contrasting imperatives – having comparable items across year-to-year tests and ensuring the test items are selected from a broad range of items – is a goal towards which the department strives. The provision of “good education”, however, should not be undermined by the tyranny of numbers. Neither should good practice be obscured under a mountain of test papers.
Dr Caroline Long works in the Centre for Evaluation and Assessment in the Faculty of Education at the University of Pretoria.