Determining how much a teacher or school contributes to student academic achievement growth is a complicated and difficult goal. Under ideal conditions, reasonable estimates can be determined in theory. But the real world is far from ideal, and the risk of classification errors is high.
A classification error occurs when a student, teacher or school is incorrectly assigned to a performance category. For instance, a school may be labeled as exceeding expectations for achievement growth, when, in reality, it only meets expectations – or vice versa.
Since many decisions, ranging from public disclosures to employee compensation, are at stake, we need to pursue the best value-added measure (VAM) available and fully explain the level of uncertainty that goes with each rating. And if the uncertainty is too great, decisions should be deferred.
Most current VAM calculations start by finding the difference between achievement test scores for a student over a two- or three-year period. This is typically expressed as a scale score difference. As an example, a fifth grader scored 221 on the Oregon Assessment of Knowledge and Skills (OAKS) Reading Assessment. As a fourth grader, the same student scored 216. And as a third grader, the student scored 211.
Over a two-year period, this student “grew” 10 scale score points. This is an estimate. Given the measurement error of the tests, the actual growth score is most likely somewhere between 7 and 13 points. The actual growth is not known with precision. Thus, the achievement status of this individual student is not easily confirmed.
From a VAM point-of-view, the good news is that individual measurement errors cancel out when scores from multiple students are aggregated. If we find the average gain for a classroom of 30 students, we can get a pretty good estimate of the group’s actual average achievement growth – a pretty good estimate, but not a perfect measurement. With a sample of around 30 students, the measurement error may still be around 1 scale score point or so.
If our sample student’s fifth grade classroom has an average two-year scale score gain of 8.9 points, then the true average is most likely somewhere between 7.9 and 9.9 points. As you can imagine, this uncertainty can cause problems in drawing conclusions about effectiveness at the classroom level.
If the school is large enough to have three fifth-grade classrooms of 30 students each, then measurement error, while still present, becomes only marginally important when all 90 gain scores are averaged together. Thus, for a relatively large elementary school, average growth calculations can be reasonably accurate.
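The way error shrinks with group size can be sketched numerically. The snippet below is a minimal illustration, assuming that student-level errors are independent, so that the error of an average falls with the square root of the number of students. It starts from the roughly 1-point error quoted above for a classroom of 30. The function name and the independence assumption are illustrative and are not taken from any particular VAM.

```python
import math

def error_of_mean(error_at_ref, n_ref, n):
    """Scale the error of a group average from n_ref students to n students.

    Assumes independent student-level errors, so the error of the mean
    is proportional to 1 / sqrt(group size).
    """
    return error_at_ref * math.sqrt(n_ref / n)

# Starting point from the text: about 1 scale score point for 30 students.
classroom_error = 1.0
school_error = error_of_mean(classroom_error, n_ref=30, n=90)

print(f"30 students: +/- {classroom_error:.2f} points")
print(f"90 students: +/- {school_error:.2f} points")
```

Under these assumptions, the error for 90 students comes out at about 0.58 points, a little over half the classroom figure.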
But wait, there’s more. Going back to our sample student’s original two-year growth score of 10 scale score points, let’s assume that the calculated gain of 10 points is precisely correct. Is this a “good” gain or not? Normally we would answer this question with a comparison of some kind, most likely with other classmates, all other fifth graders in a school, or with fifth graders in other schools in the district, or perhaps even with all fifth graders in the state.
If we make these kinds of comparisons between individual growth and average scores for various other groups, our conclusions will almost certainly be wrong unless our sample student is absolutely “average” within the various groups. Since students are not homogeneous on many important dimensions, we need to adjust growth scores in some way so that we can fairly compare individuals and groups (classroom, school, district and state) with each other. This is not as easy as it sounds.
To Recap
Individual growth scores and various levels of group aggregation include measurement errors that confound interpretation of actual gains. Once group sizes reach about 90 students, measurement error becomes marginally important, so group comparisons are reasonable. However, raw gain score averages are only meaningful relative to some normative frame-of-reference. Average score comparisons are almost certainly inaccurate unless differences in comparison groups are systematically controlled.
Consider This Example
School A has 90 fifth graders with an average two-year reading growth of 8 scale score points. School B has 90 fifth graders with an average two-year reading growth of 12 scale score points. Is School B more effective than School A? This is impossible to determine without more information that makes it possible to fairly compare the two student populations.
How do we establish fair comparisons?
This is a significant challenge. The current approach in most VAM algorithms is to use various measures of difference already available to equate scores for various groups. Typically, some or all of the following are used: race/ethnicity, age, gender, economically disadvantaged status, special education status, LEP status, TAG status, repeated grade status, skipped grade status, mobility, and attendance.
Clearly there are differences among students on these dimensions. But how do we apply these to effectively equate growth scores? This is largely unknown. While there may be various correlations between these dimensions and achievement, there are no demonstrated causal links and the interactions between them are uncertain. So a VAM based on these kinds of variables is likely to contain unknown, but potentially substantial, sources of error, and therefore to yield classification errors.
Unfortunately, most VAM algorithms in use today are derived from data that are available, but of relatively low utility, rather than the data that are needed, but not on hand. Sometimes convenience trumps validity.
Cognitive Ability
Fortunately, there is one variable that can be used very effectively to correct for group differences. This variable’s connection to achievement is well understood and its measurement technology is mature. This variable is cognitive ability. A better VAM would be one that relies primarily on the analysis of cognitive ability to correct group growth averages. None of the complexities associated with the other proxies are needed, and the resulting comparisons would be much stronger.
Gathering the data is relatively easy to accomplish, but not without some time and expense. As already noted, the measurement of cognitive ability is well developed with a variety of valid and reliable assessments available. Administration of these assessments takes about one hour and the scoring infrastructure already exists. Excellent statistical guidance also already exists for equating cognitive ability and achievement growth.
Using cognitive ability data to produce school ability profiles can substantially improve VAM accuracy while reducing algorithm complexity, all at a relatively low level of effort. What’s not to like?
Back to our Example
You will recall that School A had an average reading growth of 8 scale score points while School B had an average reading growth of 12 points. Suppose that, after collecting cognitive ability data, we find that School A’s average ability is at the 40th percentile and School B’s average ability is at the 60th percentile. To continue the hypothetical, when we apply a cognitive ability correction to the growth measures, both schools end up with average reading growth of 10 scale score points.
So the two schools are judged the same in their effectiveness in promoting reading even though their raw scores are significantly different. If we suppose that the corrected state average reading growth is 8.5 scale score points, we can go on to say that both schools were more successful in promoting reading achievement than was typical in the state as a whole. VAM mission accomplished.
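The hypothetical correction can be made concrete with a small sketch. It assumes a simple linear adjustment toward the 50th percentile, with a slope of 0.2 scale score points per percentile point, chosen only because it reproduces the numbers in the example. A real ability correction would have to be estimated from data. The function name, the linear form, and the slope are all assumptions made for illustration.

```python
def adjust_growth(raw_growth, ability_percentile, slope=0.2, reference=50):
    """Adjust raw growth to what it would be at the reference ability level.

    Hypothetical linear correction: growth is assumed to rise by `slope`
    scale score points for each percentile point of average ability.
    """
    return raw_growth - slope * (ability_percentile - reference)

schools = {
    "School A": {"raw_growth": 8, "ability_percentile": 40},
    "School B": {"raw_growth": 12, "ability_percentile": 60},
}
STATE_AVERAGE = 8.5  # corrected state average from the example

adjusted = {name: adjust_growth(**data) for name, data in schools.items()}
for name, value in adjusted.items():
    status = "above" if value > STATE_AVERAGE else "at or below"
    print(f"{name}: adjusted growth {value:.1f}, {status} the state average")
```

With these assumed values, both schools land on an adjusted growth of 10 points and both sit above the 8.5-point state average, matching the example.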
A better VAM is possible and within reach. Of course, some proof of concept testing is needed – but it would be well worth the effort for the potential of significantly improved information and better decisions.
If you have an interest in this topic I recommend Douglas Harris’ recent book, Value-Added Measures in Education: What Every Educator Needs to Know, Harvard Education Press, 2011.





One of the major problems is the narrowness of using reading tests as the way to assess education in general. Who really cares if you can come up with a decent value-added system when the basis for it is so narrow? Using opportunities for students as your measuring stick for education makes a lot more sense. How good is the overall education a child is receiving? That is the real test. A reading test, which isn’t really that accurate anyway, is a pretty poor measurement. In fact, it hides a lot of the negative aspects of education, because those aspects are not seen as important given all the emphasis placed on the tests. The whole testing and accountability thing is doing more harm than good. Now people want to pay teachers based on a formula which purports to tell how effective they have been, even though most research shows more money won’t make you a better teacher anyway. Hopefully someday people will begin to realize the folly of all this and of spending so much time, energy, and money on what actually matters very little. (Well, they already have in Finland, but maybe someday they will here also.)