A brilliant article in yesterday's NYTimes by Michael Winerip, titled "10 Years of Assessing Students With Scientific Exactitude," describes the ups and downs experienced by the New York school system over a decade of No Child Left Behind-inspired testing.
It starts with:
In the last decade, we have emerged from the Education Stone Age. No longer must we rely on primitive tools like teachers and principals to assess children’s academic progress. Thanks to the best education minds in Washington, Albany and Lower Manhattan, we now have finely calibrated state tests aligned with the highest academic standards. What follows is a look back at New York’s long march to a new age of accountability.
After reading the full chronological listing, I didn't know whether to laugh or cry.
***
Perhaps the single hardest (worst) part of my job is making and grading tests. I never know in advance how a particular class of students is going to do on a particular test I make up. The extremes are predictable enough: students who do extremely well would likely do well on any variant of it, and students who do extremely poorly may not do that much better on some other variant of it. But for the vast majority of students, I find there is great sensitivity to every aspect of the exam, from the choice of topics, to the subtle deviations of the questions from what has been covered exactly in the notes, the book, or the homework assignments, to the length of the exam, and even the ordering of the questions.
Even in a seemingly objective field of study like Engineering, there is a lot of subjectivity in how one grades (going beyond the obvious subjectivity inherent in the choice of questions to put on an exam). We try our best to be consistent and fair across all the students in the same class; but for the same question and answer, unless one uses shallow multiple-choice questions (the approach adopted by many standardized tests), it is certain that no two instructors would grade the same way. While there may be one way (or relatively few ways) to get the answer right, there are exponentially many combinations of errors that trip up students.

Particularly if one wishes to go down the road of offering partial credit, the art of grading requires one to differentiate between these errors and place a value judgement on them. Do you give some credit to a student who got the right numerical answer through incorrect reasoning? Do you give a student who took completely the wrong approach to the problem, but applied that approach correctly (albeit to arrive at a wrong answer), more credit than one who tried something new and original but failed with it, giving an answer even further from the correct one (if that is possible)? What if you discover upon grading that a question that seemed perfectly straightforward to you has been misinterpreted by a number of the students to mean something quite different from what you had intended? How large does that number have to be before you take the possible ambiguity in the wording into account when grading? Does it matter whether the misinterpreted question is easier or harder than the one originally intended?
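To see how much these value judgements can matter, here is a toy sketch (entirely hypothetical; the error categories, rubrics, and credit fractions are invented for illustration, not drawn from any real grading policy) showing three defensible partial-credit rubrics assigning noticeably different scores to the very same set of answers:

```python
import random

# Hypothetical illustration: three partial-credit rubrics applied to the
# same per-question outcomes. Each rubric encodes a different value
# judgement about the kinds of errors discussed above. All numbers here
# are invented for the sake of the example.

ERROR_TYPES = [
    "correct",                   # right answer, right reasoning
    "lucky_answer",              # right numerical answer, wrong reasoning
    "wrong_approach_done_well",  # wrong method, executed consistently
    "original_but_failed",       # novel attempt that misses badly
    "misread_question",          # answered a misinterpreted question
]

# Each rubric maps an outcome to the fraction of full credit it earns.
RUBRICS = {
    "strict": {
        "correct": 1.0, "lucky_answer": 0.0, "wrong_approach_done_well": 0.2,
        "original_but_failed": 0.1, "misread_question": 0.0,
    },
    "process": {
        "correct": 1.0, "lucky_answer": 0.3, "wrong_approach_done_well": 0.6,
        "original_but_failed": 0.5, "misread_question": 0.4,
    },
    "answer": {
        "correct": 1.0, "lucky_answer": 0.8, "wrong_approach_done_well": 0.1,
        "original_but_failed": 0.1, "misread_question": 0.1,
    },
}

def grade(outcomes, rubric):
    """Score (out of 100) for one student's list of per-question outcomes."""
    return 100 * sum(rubric[o] for o in outcomes) / len(outcomes)

random.seed(0)
# One simulated student's outcomes on a ten-question exam.
outcomes = [random.choice(ERROR_TYPES) for _ in range(10)]

for name, rubric in RUBRICS.items():
    print(f"{name:>7} rubric: {grade(outcomes, rubric):5.1f} / 100")
# The same set of answers typically earns a noticeably different score
# under each rubric -- grading noise before any question is even changed.
```

Every instructor who offers partial credit implicitly picks a point somewhere in this space of rubrics, and different instructors pick different points.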
Unfortunately, the politics of public K-12 education and the economics of higher education dictate that we must always have assessment and grading; testing is a necessary evil that we cannot simply wish away. Let us continue to strive to be as fair as possible in making and grading tests, but let us not pretend that test scores and GPAs are objective, noiseless measures of a student's intellectual capability (or, in the case of public schooling, of the effectiveness of a system of education).