I have a bad habit of using quotation marks a lot in my writing. I'm not sure where that particular tic comes from. But I believe in the case above, "growth" and "measured" deserve quotes.
My Algebra 1 colleagues and I developed a test to be given at the beginning and at the end of the first quarter. Another colleague in a different discipline worked with others to create four such tests - one for each quarter. I can only assume that other teachers spent the afternoon doing similar work. And at the end of the day, what had we all produced?
- Tests that will ultimately be used to evaluate us first, students second. The state of New Jersey now has a law requiring that 50% of a teacher's evaluation be based on student assessments.
- Tests that were developed in an ad hoc fashion. How confident are we, for example, that these tests measure anything worth measuring? How confident are we that a measure of student progress made this way is any better than the traditional one - that is, teachers get to know their students, assess them, and by those assessments determine whether the students understand a particular set of content?
- Tests that, depending on how the individual departments structured them, could very well be inappropriate for the task they're supposed to serve. Hypothetically, one could imagine the physics benchmark exam at my school being used simultaneously for 9th-grade physics (including "modified" physics for students with significant special needs, "regular" physics, and honors physics) as well as senior physics, AP Physics B, and AP Physics C. Or the US History 1 benchmark exam being used for modified, regular, and honors levels.
- Tests that will take time away from classroom instruction, since they will be administered during class periods. If two are given per quarter, that's eight class days lost to testing every year.
But what is happening now is that the question has taken on a new reality. There are real, practical consequences for a teacher who does not produce sufficient "growth" among their students. The answer to the question above is now based on a very simple model: a student comes in at baseline knowledge level X, the teacher does their thing, the student reaches a new knowledge level Y, and that student's growth = Y − X. Aggregated over all of a teacher's students, the teacher's ability to produce growth can also be measured. And so teachers who produce higher levels of growth receive concomitantly higher ratings.
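In code, the whole model fits in a few lines. This is only a sketch of the arithmetic described above; the function names and the sample scores are mine, invented purely for illustration:

```python
def student_growth(pre: float, post: float) -> float:
    """Growth under the naive model: post-test score minus pre-test score."""
    return post - pre

def teacher_growth_score(scores: list[tuple[float, float]]) -> float:
    """A teacher's 'ability to produce growth': the mean of their
    students' individual growth values."""
    return sum(student_growth(pre, post) for pre, post in scores) / len(scores)

# Hypothetical (pre, post) percentages for one teacher's students.
scores = [(40, 65), (70, 90), (0, 20), (55, 60)]
print(teacher_growth_score(scores))  # 17.5, the number teachers get ranked on
```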
There are two sets of problems with this. The first I've already mentioned: what is meant by student growth anyway?
Assume, for the heck of it, that there is a valid answer, and that in some fashion it can be measured. The second set of problems revolves around the translation between student growth and teacher performance. These issues have been widely discussed, and if you're not a teacher, where you fall on the spectrum of belief that performance drives growth seems largely to align with your political views. Using a crude model, an r value of 1 means that teacher performance perfectly correlates with student growth, while an r of 0 means there is no correlation. (An r of -1 is scary to contemplate.) I am of course glossing over the fact that correlation ≠ causation, but market-driven reformers tend to ignore that too.
Anyway, my rough estimate is that market-driven reformers tend to put r close to 1, the general public puts it at about 0.7, and teachers put it at about 0.5.
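For anyone who wants that r made concrete: it's the ordinary Pearson correlation coefficient between some measure of teacher performance and the growth their students show. A minimal sketch, with invented numbers that exist only to make the computation visible:

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

# Invented data: a performance measure for five teachers,
# and the mean growth of each teacher's students.
performance = [0.2, 0.5, 0.6, 0.8, 0.9]
mean_growth = [12.0, 15.0, 13.0, 22.0, 25.0]

r = correlation(performance, mean_growth)
print(round(r, 2))  # about 0.89 for these made-up numbers
```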
There are three major flaws that I see:
First, the idea that the growth of a student can be determined using this model is massively oversimplified. It might be considerably more difficult to take a student from 70% knowledge (whatever that means) to 90% than it is to go from 0% to 20%. But maybe not - maybe it's the other way around. Who knows? Either way, in this model the teacher has produced 20% knowledge growth in both cases.
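There are measures that at least try to account for this. Physics education research, for instance, often reports Hake's normalized gain, (post − pre) / (100 − pre), which scales a student's gain by the room they had left to grow. A quick sketch using the two hypothetical students above; the comparison is mine, not part of any evaluation scheme I know of:

```python
def raw_gain(pre: float, post: float) -> float:
    """Growth as the naive model computes it."""
    return post - pre

def normalized_gain(pre: float, post: float) -> float:
    """Hake's normalized gain: the fraction of possible improvement achieved."""
    return (post - pre) / (100 - pre)

# Identical under the naive model...
print(raw_gain(70, 90), raw_gain(0, 20))    # 20 20
# ...quite different once the ceiling is taken into account.
print(round(normalized_gain(70, 90), 2))    # 0.67
print(round(normalized_gain(0, 20), 2))     # 0.2
```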
Second, the idea that the student plays no role in their own growth is implicit in this model. The use of these benchmark exams for teacher evaluation gives support to that contention. If there truly were recognition that students came with different levels of motivation (either internal or evoked by the teacher), for example, then there would be no way to compare students - which would mean no way to compare teachers. What, for example, is learned by comparing two students who went from 70% knowledge to 90% knowledge, when one of them busted their behind to "get it" and the other simply was able to absorb the material presented to them and make sense of it? Which teacher did a better job? What if the first student was motivated by the teacher to put in that effort, while the second disliked the teacher but had intrinsic interest in the subject?
Third, the idea that external factors are irrelevant is also implicit. What happens to the teacher who has a student whose parents are getting a divorce? Or whose family is going through a job loss or other financial crisis? Or who has a serious illness or death in the family? That student, understandably, may not have their full attention on their academic work; their growth during that time may be less than it otherwise would be. In a data-driven evaluation model, that won't matter.
It can be claimed, perhaps, that I'm constructing a straw man: perhaps no one seriously believes that teacher performance, as determined by test scores, will be evaluated in such a simple-minded way. I'd be happy to be convinced.
(Note: As I am finishing this essay, a new post from Larry Cuban just arrived in my inbox, titled "Algorithms, Accountability, and Professional Judgement (Part 3)". I think it fits in well with this post, and if you've read this far I think you should read Dr. Cuban's post also.)