The elusive evidence of educator-prep program effectiveness…exists!

“You can find evidence to support whatever you want.” It’s a common refrain heard in policy circles – but what happens when the evidence appears to say nothing at all?

Such is the apparent dilemma around evaluating the effectiveness of educator-preparation programs. In a recent story for Chalkbeat, reporter and evidence-parser extraordinaire Matt Barnum examined new research from Paul von Hippel (UT-Austin) and Laura Bellows (Duke) on educator-preparation program effectiveness. Using their academic microscopes, these researchers reviewed accountability policies and related data in six states – NY, LA, MO, WA, TX, and FL – to see whether programs meaningfully varied in their performance.

Their conclusion? No, not really. After analyzing data related to the value-added impact on student learning of teachers trained by particular programs, von Hippel and Bellows concluded that “it appears differences between [educator-preparation programs] are rarely detectible, and that [even] if they could be detected they would usually be too small to support effective policy decisions.”

If you’re in the business of advocating for using data to inform program improvement and educator-preparation policy, this finding might seem like a giant bummer. After all, if we can’t detect meaningful differences across programs, then what’s the point in gathering the data in the first place? And we certainly shouldn’t be holding programs accountable based on these data, right?

Perhaps. But as is so often the case, a more nuanced understanding of the interrelationship between research, policy, and practice may lead to a more nuanced conclusion.

Let’s begin by adding to our evidentiary picture, which is more promising than Barnum’s Chalkbeat story suggests. Recently a team of researchers led by Kevin Bastian at UNC-Chapel Hill examined the impact of educator-preparation programs on the subsequent evaluation rankings of the teachers they prepared. Among other things they found that:

  • Teachers prepared by different programs had significantly different evaluation rankings;
  • These differences were a function of program effectiveness, not just selection into the programs (though this mattered too); and
  • Teacher-evaluation ratings provided evidence of program performance that was distinct from value-added results.

This research builds off a study of Tennessee’s educator-preparation programs released last year by Matt Ronfeldt and Shanyce Campbell from the University of Michigan. After looking at observational data for teachers throughout Tennessee, these researchers found “significant and meaningful differences” across preparation programs. How significant? “[G]raduates from top-quartile [educator-preparation programs] performed as though they had an additional year of initial teaching experience when compared with graduates from bottom-quartile [educator-preparation programs].”

This more comprehensive look at the research suggests two implications for policy, at least to my reading. First, states should be wary of overly emphasizing the role of value-added data. But second, states should use teacher-evaluation and classroom-observation scores to inform decisions related to educator-preparation program effectiveness.

What might such a system look like? A few years ago, Douglas Harris of Tulane wrote a fascinating – and in my opinion underappreciated – essay on how to improve states’ teacher-evaluation policies. Harris argued that instead of arbitrarily assigning weights to various components of teacher evaluation – such as making value-added results 30 percent or 40 percent or 50 percent of a teacher’s total – states should use this data as an initial “diagnostic screen.” In such a system, “value-added would only serve to trigger a closer look at a teacher’s performance, but the actual decisions would be based on classroom observations by experts.”

A similar approach could be used to evaluate the performance of educator-preparation programs. For example, states could use value-added measures as the first screen solely to identify truly exemplary or struggling programs; as even von Hippel and Bellows admit, “it may be possible to single out one or two [programs] per state that are truly better or worse than average – perhaps even substantially.” States could then use teacher-evaluation or classroom-observation scores of recent program graduates to further identify high- and low-performing programs. Importantly, this part of the process would have to take into account the characteristics of the K-12 schools into which programs send their graduates, and even incentivize programs to steer graduates into high-need schools.

After these initial screens, states could then use a variety of methods to further evaluate program performance. For example, states might use employer and beginning-teacher survey data, retention data, and program site visits. All of this could culminate in states identifying programs in need of additional support, and finding ways to support building program capacity.

Whether states follow this approach or not, leaders in the field of educator preparation – including academic researchers – face a moment of truth. If we do not gather more actionable data, policymakers will either craft accountability policies based on questionable statistics, or follow the lead of states such as Utah and Arizona that are de-professionalizing teaching entirely. The leaders of Deans for Impact are working hard to prevent both sorts of policies from spreading further, but we need more researchers to conduct high-quality empirical research to inform policy.

Gathering useful evidence is hard. So too is developing evidence-informed policies. But sometimes that which is hardest is most worth doing.
