Category Archives: Evaluation Quality

July 16, 2014 · 1:22 pm

New European Standard for Social Impact Measurement

Evaluation has truly become a global movement. The number of evaluators and evaluation associations around the world is growing, and they are becoming more interconnected. What affects evaluation in one part of the world increasingly affects how it is practiced in another.

That is why the European standard for social impact measurement, announced just a few weeks ago, is important for evaluators in the US.

According to the published report and its accompanying press release, the immediate purpose of the standard is to help social enterprises access EU financial support, especially in relation to the European Social Entrepreneurship Funds (EuSEFs) and the Programme for Employment and Social Innovation (EaSI).

But as László Andor, EU Commissioner for Employment, Social Affairs and Inclusion, pointed out, there is a larger purpose:

The new standard…sets the groundwork for social impact measurement in Europe. It also contributes to the work of the Taskforce on Social Impact Investment set up by the G7 to develop a set of general guidelines for impact measurement to be used by social impact investors globally.

That is big, and it has the potential to affect evaluation around the world.

What is impact measurement?

For evaluators in the US, the term impact measurement may be unfamiliar. It has greater currency in Europe and, of late, in Canada. Defining the term precisely is difficult because, as an area of practice, impact measurement is evolving quickly.

Around the world, there is a growing demand for evaluations that incorporate information about impact, values, and value. It is coming from government agencies, philanthropic foundations, and private investors who want to increase their social impact by allocating their public or private funds more efficiently.

Sometimes these funders are called impact investors. In some contexts, the label signals a commitment to grant making that incorporates the tools and techniques of financial investors. In others, it signals a commitment by private investors to a double bottom line—a social return on their investment for others and a financial return for themselves.

These funders want to know if people are better off in ways that they and other stakeholders believe are important. Moreover, they want to know whether those impacts are large enough and important enough to warrant the funds being spent to produce them. In other words, did the program add value?

Impact measurement may engage a wide range of stakeholders to define the outcomes of interest, but the overarching definition of success—that the program adds value—is typically driven by funders. Value may be assessed with quantitative, qualitative, or mixed methods, but almost all of the impact measurement work that I have seen has framed value in quantitative terms.

Is impact measurement the same as evaluation?

I consider impact measurement a specialized practice within evaluation. Others do not. Geographic and disciplinary boundaries have tended to isolate those who identify themselves as evaluators from those who conduct impact measurement—often referred to as impact analysts. These two groups are beginning to connect, like evaluators of every kind around the world.

I like to think of impact analysts and evaluators as twins who were separated at birth and then, as adults, accidentally bump into each other at the local coffee shop. They are delighted and confused, but mostly delighted. They have a great deal to talk about.

How is impact measurement different from impact evaluation?

There is more than one approach to impact evaluation. There is what we might call traditional impact evaluation: randomized controlled trials and quasi-experiments as described by Shadish, Cook, and Campbell. There are also many recently developed alternatives, including contribution analysis, evaluation of collective impact, and others.

Impact measurement differs from traditional and alternative impact evaluation in a number of ways, among them:

how impacts are estimated and
a strong emphasis on valuation.

I discuss both in more detail below. Briefly, impacts are frequently estimated by adjusting outcomes for a pre-established set of potential biases, usually without reference to a comparison or control group. Valuation estimates the importance of impacts to stakeholders—the domain of human values—and expresses it in monetary units.

These two features are woven into the European standard and have the potential to become standard practices elsewhere, including the US. If they were to be incorporated into US practice, it would represent a substantial change in how we conduct evaluations.

What is the new European standard?

The standard creates a common process for conducting impact measurement, not a common set of impacts or indicators. The five-step process presented in the report is surprisingly similar to Tyler’s seven-step evaluation procedure, which he developed in the 1930s as he directed the evaluation of the Eight-Year Study across 30 schools. For its time, Tyler’s work was novel and the scale impressive.

Tyler’s evaluation procedure developed in the 1930s and the new European standard process: déjà vu all over again?

Tyler’s first two steps were formulating and classifying objectives (what do programs hope to achieve and which objectives can be shared across sites to facilitate comparability and learning). Deeply rooted in the philosophy of progressive education, he and his team identified the most important stakeholders—students, parents, educators, and the larger community—and conducted much of their work collaboratively (most often with teachers and school staff).

Similarly, the first two steps of the European standard process are identifying objectives and stakeholders (what does the program hope to achieve, who benefits, and who pays). They are to be implemented collaboratively with stakeholders (funders and program staff chief among them) with an explicit commitment to serving the interests of society more broadly.

Tyler’s third and fourth steps were defining outcomes in terms of behavior and identifying how and where the behaviors could be observed. The word behavior was trendy in Tyler’s day. What he meant was developing a way to observe or quantify outcomes. This is precisely setting relevant measures, the third step of the new European standard process.

Tyler’s fifth and sixth steps were selecting, trying, proving, and improving measures as they function in the evaluation. Today we would call this piloting, validation, and implementation. The corresponding step in the standard is measure, validate and value, only the last of these falling outside the scope of Tyler’s procedure.

Tyler concluded his procedure with interpreting results, which for him included analysis, reporting, and working with stakeholders to facilitate the effective use of results. The new European standard process concludes in much the same way, with reporting results, learning from them, and using them to improve the program.

How are impacts estimated?

Traditional impact evaluation defines an impact as the difference in potential outcomes—the outcomes participants realized with the program compared to the outcomes they would have realized without the program.

It is impossible to observe both of these mutually exclusive conditions at the same time. Thus, all research designs can be thought of as hacks, some more elegant than others, that allow us to approximate one condition while observing the other.
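The potential-outcomes idea can be made concrete with a small Python sketch. The numbers are invented for illustration, and the names (`y_with`, `y_without`, `treated`) are hypothetical. In real life only one of the two outcomes is ever seen for each person; the sketch keeps both so that the true average impact can be set beside what a comparison-group design would estimate.

```python
from statistics import mean

# Hypothetical potential outcomes for six people (e.g., monthly earnings).
# In practice only one of these two values is ever observed per person.
y_with = [1200, 900, 1500, 1100, 1300, 1000]     # outcome with the program
y_without = [1000, 800, 1200, 1000, 1100, 900]   # outcome without the program

# True average impact: the mean of the person-level differences.
true_impact = mean(w - wo for w, wo in zip(y_with, y_without))

# Who took part in the program (True) and who did not (False); invented here.
treated = [True, False, True, False, True, False]

# What an evaluator actually gets to see: one outcome per person.
observed = [w if t else wo for w, wo, t in zip(y_with, y_without, treated)]

# A comparison-group design stands in for the missing condition with other people.
treated_mean = mean(o for o, t in zip(observed, treated) if t)
comparison_mean = mean(o for o, t in zip(observed, treated) if not t)
estimated_impact = treated_mean - comparison_mean

print(true_impact, estimated_impact)
```

In this toy data the participants would have earned more than the non-participants even without the program, so the estimate overshoots the true impact. How well the two groups resemble each other is what separates a more credible design from a less credible one.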

The European standard takes a similar view of impacts and describes a good research design as one that takes the following into account:

attribution, the extent to which the program, as opposed to other programs or factors, caused the outcomes;
deadweight, outcomes that, in the absence of the program, would have been realized anyway;
drop-off, the tendency of impacts to diminish over time; and
displacement, the extent to which outcomes realized by program participants prevent others from realizing those outcomes (for example, when participants of a job training program find employment, it reduces the number of open jobs and as a result may make it more difficult for non-participants to find employment).

For any given evaluation, many research designs may meet the above criteria, some with the potential to provide more credible findings than others.

However, impact analysts may not be free to choose the research design with the potential to provide the most credible results. According to the standard, the cost and complexity of the design must be proportionate to the size, scope, cost, potential risks, and potential benefits of the program being evaluated. In other words, impact analysts must make a difficult tradeoff between credibility and feasibility.

How well are analysts making the tradeoff between credibility and feasibility?

At the recent Canadian Evaluation Society Conference, my colleagues Cristina Tangonan, Anna Fagergren (not pictured), and I addressed this question. We described the potential weaknesses of research designs used in impact measurement generally and Social Return on Investment (SROI) analyses specifically. Our work is based on a review of publicly available SROI reports (to date, 107 of 156 identified reports) and theoretical work on the statistical properties of the estimates produced.

At the CES 2014 conference.

What we have found so far leads us to question whether the credibility-feasibility tradeoffs are being made in ways that adequately support the purposes of SROI analyses and other forms of impact measurement.

One design that we discussed starts with measuring the outcome realized by program participants, for example, the number of participants in a job training program who found employment, or the test scores of students who were enrolled in a new education program. Sometimes impact analysts measure the outcome as a pre-program/post-program difference; often they measure the post-program outcome level on its own.

Once the outcome measure is in hand, impact analysts adjust it for attribution, deadweight, drop-off, and displacement by subtracting some amount or percentage for each potential bias. The adjustments may be based on interviews with past participants, prior academic or policy research, or sensitivity analysis. Rarely are they based on comparison or control groups constructed for the evaluation. The resulting adjusted outcome measure is taken as the impact estimate.
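The adjustment step described above might look something like the following Python sketch. It is only an illustration: the function names, the choice to apply each adjustment as a multiplicative discount, and all of the percentages are assumptions, not a procedure taken from the standard or from any particular SROI report.

```python
def adjusted_impact(outcome, deadweight, attribution_to_others, displacement):
    """Discount a measured outcome by assumed proportions, each between 0 and 1."""
    for share in (deadweight, attribution_to_others, displacement):
        if not 0.0 <= share <= 1.0:
            raise ValueError("each adjustment must be a proportion between 0 and 1")
    return outcome * (1 - deadweight) * (1 - attribution_to_others) * (1 - displacement)


def impact_over_time(first_year_impact, drop_off, years):
    """Apply a constant yearly drop-off to the first-year impact."""
    return [first_year_impact * (1 - drop_off) ** year for year in range(years)]


# Invented numbers: 100 participants found jobs after a training program.
impact = adjusted_impact(100, deadweight=0.30, attribution_to_others=0.20, displacement=0.10)
yearly = impact_over_time(impact, drop_off=0.50, years=3)
print(round(impact, 1), [round(v, 1) for v in yearly])
```

Notice that no comparison or control group data enter the calculation. The impact estimate can only be as credible as the proportions plugged into it.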

This is an example of a high-feasibility, low-credibility design. Is it good enough for the purposes that impact analysts have in mind? Perhaps, but I’m skeptical. There is a century of systematic research on estimating impacts. Why didn’t this method, which is much more feasible than many alternatives, become a standard part of evaluation practice decades ago? I believe it is because the credibility of the design (or more accurately, the results it can produce) is considered too low for most purposes.

From what I understand, this design, and others that are similar, would meet the European standard. That leads me to question whether the new standard has set the bar too low, unduly favoring feasibility over credibility.

What is valuation?

In the US, I believe we do far less valuation than is currently being done in Europe and Canada. Valuation expresses the value (importance) of impacts in monetary units (a measure of importance).

If the outcome, for example, were earned income, then valuation would entail estimating an impact as we usually would. If the outcome were health, happiness, or well-being, valuation would be more complicated. In this case, we would need to translate non-monetary units to monetary units in a way that accurately reflects the relative value of impacts to stakeholders. No easy feat.

In some cases, valuation may help us gauge whether the monetized value of a program’s impact is large enough to matter. It is difficult to defend spending $2,000 per participant of a job training program that, on average, results in additional earned income of $1,000 per participant. Participants would be better off if we gave $2,000 to each.
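The arithmetic behind the job training example can be written out in a few lines of Python. The variable names are hypothetical, and the sketch deliberately ignores timing, discounting, and any benefits other than earned income.

```python
cost_per_participant = 2000      # dollars spent per participant (from the example)
impact_per_participant = 1000    # average additional earned income, in dollars

# Net value created per participant, and value returned per dollar spent.
net_value = impact_per_participant - cost_per_participant
value_per_dollar = impact_per_participant / cost_per_participant

print(net_value, value_per_dollar)  # -1000 0.5
```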

At other times, valuation may not be useful. For example, if one health program saves more lives than another, I don’t believe we need to value lives in dollars to judge their relative effectiveness.

Another concern is that valuation reduces the certainty of the final estimate (in monetary units) as compared to an impact estimate on its own (in its original units). That is a topic that I discussed at the CES conference, and will again at the conferences of the European Evaluation Society, Social Impact Analysts Association, and the American Evaluation Association.

There is more to this than I can hope to address here. In brief—the credibility of a valuation can never be greater than the credibility of the impact estimate upon which it is based. Call that Gargani’s Law.

If ensuring the feasibility of an evaluation results in impact estimates with low credibility (see above), we should think carefully before reducing credibility further by expressing the impact in monetary units.
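A rough error-propagation sketch in Python shows one way this can play out. It rests on assumptions made only for this illustration: that the monetized value is the impact estimate multiplied by an estimated dollar value per unit, that the two estimates are statistically independent, and that uncertainty is summarized by a coefficient of variation (standard error divided by the estimate). The function name and the numbers are hypothetical.

```python
from math import sqrt


def relative_uncertainty_of_product(cv_impact, cv_unit_value):
    """Coefficient of variation of (impact * unit value), assuming independence."""
    return sqrt(cv_impact**2 + cv_unit_value**2 + (cv_impact**2) * (cv_unit_value**2))


cv_impact = 0.40       # hypothetical: the impact estimate is uncertain by about 40%
cv_unit_value = 0.30   # hypothetical: the dollar value per unit is uncertain by about 30%

cv_monetized = relative_uncertainty_of_product(cv_impact, cv_unit_value)
print(round(cv_monetized, 3))
```

Under these assumptions the monetized estimate is never less uncertain than the impact estimate it starts from, and it matches it only when the dollar value per unit is known exactly.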

Where do we go from here?

The European standard sets out to solve a problem that is intrinsic to our profession: stakeholders with different perspectives are constantly struggling to come to agreement about what makes an evaluation good enough for the purposes they have in mind. In the case of the new standard, I fear the bar may be set too low, tipping the balance in favor of feasibility over credibility.

That is, of course, speculation. But so too is believing the balance is right or that it is tipped in the other direction. What is needed is a program of research—research on evaluation—that helps us understand whether the tradeoffs we make bear the fruit we expect.

The lack of research on evaluation is a weak link in the chain of reasoning that makes our work matter in Europe, the US, and around the world. My colleagues and I are hoping to strengthen that link a little, but we need others to join us. I hope you will.


February 13, 2013 · 2:30 pm

Measurement Is Not the Answer

Bill Gates recently summarized his yearly letter in an article for the Wall Street Journal entitled My Plan to Fix the World’s Biggest Problems…Measure Them!

As an evaluator, I was thrilled. I thought, “Someone with clout is making the case for high-quality evaluation!” I was ready to love the article.

To my great surprise, I didn’t.

The premise of the piece was simple. Organizations working to change the world should set clear goals, choose an approach, measure results, and use those measures to continually refine the approach.

At this level of generality, who could disagree? Certainly not evaluators—we make arguments like this all the time.

Yet, I must—with great disappointment—conclude that Gates failed to make the case that measurement matters. In fact, I believe he undermined it by the way he used measurements.

Gates is not unique in this respect. His Wall Street Journal article is just one instance of a widespread problem in the social sector: confusing good measures with good inference.

Measures versus Inference

The difference between measures and inferences can be subtle. Measures quantify something that is observable. The number of students who graduate from high school or estimates of the calories people consume are measures. In order to draw conclusions from measures, we make inferences. Two types of inference are of particular interest to evaluators.

(1) Inferences from measures to constructs. Constructs—unobservable aspects of humans or the world that we seek to understand—and the measures that shed light on them are not interchangeable. For example, what construct does the high school graduation rate measure? That depends. Possibly education quality, student motivation, workforce readiness, or something else that we cannot directly observe. To make an inference from measure to construct, the construct of interest must be well defined and its measure selected on the basis of evidence.

Evidence is important because, among other things, it can suggest whether many, few, or only one measure is required to understand a construct well. By using the sole measure of calories consumed, for example, we gain a poor understanding of a broad construct like health. However, we can use that single measure to gain a critical understanding of a narrower construct like risk of obesity.

(2) Inferences from measures to impacts. If high school graduation rates go up, was it the result of new policies, parental support, another reason left unconsidered, or a combination of several reasons? This sort of inference represents one of the fundamental challenges of program evaluation, and we have developed a number of strategies to address it. None is perfect, but more often than not we can identify a strategy that is good enough for a specific context and purpose.

Why do I think Gates made weak inferences from good measures? Let’s look at the three examples he offered in support of his premise that measurement is the key to solving the world’s biggest problems.

Example 1: Ethiopia

Gates described how Ethiopia became more committed to providing healthcare services in 2000 as part of the Millennium Development Goals. After that time, the country began tracking the health services it provided in new ways. As evidence that the new measurement strategy had an impact, Gates reported that child mortality decreased 60% in Ethiopia since 1990.

In this example, the inference from measure to impact is not warranted. Based on the article, the sole reason to believe that the new health measurement strategy decreased child mortality is that the former happened before the latter. Inferring causality from the sequential timing of events alone has been recognized as an inferential misstep for so long that it is best known by its Latin name, post hoc ergo propter hoc.

Even if we were willing to make causal inferences based on sequential timing alone, it would not be possible in this case—the tracking system began sometime after 2000 while the reported decrease in child mortality was measured from 1990.

Example 2: Polio

The global effort to eradicate polio has come down to three countries—Nigeria, Pakistan, and Afghanistan—where immunizing children has proven especially difficult. Gates described how new measurement strategies, such as using technology to map villages and track health workers, are making it possible to reach remote, undocumented communities in these countries.

It makes sense that these measurement strategies should be a part of the solution. But do they represent, “Another story of success driven by better measurement,” as Gates suggests?

Maybe yes, maybe no—the inference from measure to impact is again not warranted, but for different reasons.

In the prior example, Gates was looking back, claiming that actions (in the past) made an impact (in the past) because the actions preceded the impact. In this example, he made the claim that ongoing actions will lead to a future impact because the actions precede the intended impact of eradicating polio. The former was a weak inference, the latter weaker still because it incorporates speculation about the future.

Even if we are willing to trust an inference about an unrealized future in which polio has been eradicated, there is another problem. The measures Gates described are implementation measures. Inferring impact from implementation may be warranted if we have strong faith in a causal mechanism, in this case that contact with remote communities leads to immunization which in turn leads to reduction in the transmission of the disease.

We should have strong faith in the second step of this causal mechanism: vaccines work. Unfortunately, we should have doubts about the first step because many who are contacted by health workers refuse immunization. The Bulletin of the World Health Organization reported that parental refusal in some areas around Karachi has been widespread, accounting for 74% of missed immunizations there. It is believed that the reasons for the refusals were fear related to safety and the religious implications of the vaccines. New strategies for mapping and tracking cannot, on the face of it, address these concerns.

So I find it difficult to accept that polio immunization is a story of success driven by measurement. It seems more like a story in which new measures are being used in a strategic manner. That’s laudable—but quite different from what was claimed.

Example 3: Education

The final example Gates provided came from the foundation’s $45 million Measures of Effective Teaching (MET) study. As described in the article, the MET study concluded that multiple measures of teacher effectiveness can be used to improve the way administrators manage school systems and teachers provide instruction. The three measures considered in the study were standardized test scores (transformed into controversial units called value-added scores), student surveys of teacher quality, and scores provided by trained observers of classroom instruction.

The first problem with this example is the inference from measures to construct. Everyone wants more effective teachers, but not everyone defines effectiveness the same way. There are many who disagree with how the construct of teacher effectiveness was defined in the MET study—that a more effective teacher is one who promotes student learning in ways that are reflected by standardized test scores.

Even if we accept the MET study’s narrow construct of teacher effectiveness, we should question whether multiple measures are required to understand it well. As reported by the foundation, all three measures in combination explain about 52% of the variation in teacher effectiveness in math and 26% in English-language arts. Test scores alone (transformed into value-added scores) explain about 48% and 20% of the variation in math and English-language arts, respectively. The difference is trivial, making the cost of gathering additional survey and observation data difficult to justify.
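The comparison can be laid out as simple arithmetic. The short Python sketch below uses only the percentages quoted above; the variable names are hypothetical.

```python
# Share of variation explained, in percent, as reported in the text above.
all_three_measures = {"math": 52, "english_language_arts": 26}
test_scores_alone = {"math": 48, "english_language_arts": 20}

# Additional percentage points explained by adding surveys and observations.
gain = {subject: all_three_measures[subject] - test_scores_alone[subject]
        for subject in all_three_measures}

print(gain)  # {'math': 4, 'english_language_arts': 6}
```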

The second problem is inference from measures to impact. Gates presented Eagle County’s experience as evidence that teacher evaluations improve education. He stated that Eagle County’s teacher evaluation system is “likely one reason why student test scores improved in Eagle County over the past five years.” Why does he believe this is likely? He doesn’t say. I can only respond post hoc ergo propter hoc.

So What?

The old chestnut that lack of evidence is not evidence of lacking applies here. Although Gates made inferences that were not well supported by logic and evidence, it doesn’t mean he arrived at the wrong conclusions. Or the right conclusions. All we can do is shrug our shoulders.

And it doesn’t mean we should not be measuring the performance and impact of social enterprises. I believe we should.

It does mean that Gates believes in the effectiveness of potential solutions for which there is little evidence. For someone who is arguing that measurement matters, he is setting a poor example. For someone who has the power to implement solutions on an unprecedented scale, it can also be dangerous.


November 8, 2012 · 11:21 pm

Evaluation in the Post-Data Age: What Evaluators Can Learn from the 2012 Presidential Election

Stop me if you’ve heard this one before. An evaluator uses data to assess the effectiveness of a program, arrives at a well-reasoned but disappointing conclusion, and finds that the conclusion is not embraced—perhaps ignored or even rejected—by those with a stake in the program.

People—even evaluators—have difficulty accepting new information if it contradicts their beliefs, desires, or interests. It’s unavoidable. When faced with empirical evidence, however, most people will open their minds. At least that has been my experience.

During the presidential election, reluctance to embrace empirical evidence was virtually universal. I began to wonder—had we entered the post-data age?

The human race creates an astonishing amount of data—2.5 quintillion bytes of data per day. In the last two years, we created 90% of all data created throughout human history.

In that time, I suspect that we also engaged in more denial and distortion of data than in all human history.

The election was a particularly bad time for data and the people who love them—but there was a bright spot.

On election day I boarded a plane for London (after voting, of course). Although I had no access to news reports during the flight, I already knew the result—President Obama had about an 84% chance of winning reelection. When I stepped off the plane, I learned he had indeed won. No surprise.

How could I be so certain of the result when the election was hailed as too close to call? I read the FiveThirtyEight blog, that’s how. By using data (every available, well-implemented poll) and a strong statistical model, Nate Silver was able to produce a highly credible estimate of the likelihood that one or the other candidate would win.

Most importantly, the estimate did not depend on the analysts’—or anyone’s—desires regarding the outcome of the election.

Although this first-rate work was available to all, television and print news was dominated by unsophisticated analysis of poll data. How often were the results of an individual poll—one data point—presented in a provocative way and its implications debated for as long as breath and column inches could sustain?

Isn’t this the way that we interpret evaluations?

News agencies were looking for the story. The advocates for each candidate were telling their stories. Nothing wrong with that. But when stories shape the particular bits of data that are presented to the public, rather than all of the data being used to shape the story, I fear that the post-data age is already upon us.

Are evaluators expected to do the same when they are asked to tell a program’s story?

It has become acceptable to use data poorly or opportunistically while asserting that our conclusions are data driven. All the while, much stronger conclusions based on better data and data analysis are all around us.

Do evaluators promote similar behavior when we insist that all forms of evaluation can improve data-driven decision making?

The New York Times reported that on election night one commentator, with a sizable stake in the outcome, was unable to accept that actual voting data were valid because they contradicted the story he wanted to tell.

He was already living in the post-data age. Are we?


March 26, 2012 · 2:17 pm

Running Hot and Cold for Mixed Methods: Jargon, Jongar, and Code

Jargon is the name we give to big labels placed on little ideas. What should we call little labels placed on big ideas? Jongar, of course.

A good example of jongar in evaluation is the term mixed methods. I run hot and cold for mixed methods. I praise them in one breath and question them in the next, confusing those around me.

Why? Because mixed methods is jongar.

Recently, I received a number of comments through LinkedIn about my last post. A bewildered reader asked how I could write that almost every evaluation can claim to use a mixed-methods approach. It’s true: I believe that almost every evaluation can claim to be a mixed-methods evaluation, but I don’t believe that many (perhaps most) should.

Why? Because mixed methods is also jargon.

Confused? So were Abbas Tashakkori and John Creswell. In 2007, they put together a very nice editorial for the first issue of the Journal of Mixed Methods Research. In it, they discussed the difficulty they faced as editors who needed to define the term mixed methods. They wrote:

…we found it necessary to distinguish between mixed methods as a collection and analysis of two types of data (qualitative and quantitative) and mixed methods as the integration of two approaches to research (quantitative and qualitative).

By the first definition, mixed methods is jargon—almost every evaluation uses more than one type of data, so the definition attaches a special label to a trivial idea. This is the view that I expressed in my previous post.

By the second definition, which is closer to my own perspective, mixed methods is jongar—two simple words struggling to convey a complex concept.

My interpretation of the second definition is as follows:

A mixed-methods evaluation is one that establishes in advance a design that explicitly lays out a thoughtful, strategic integration of qualitative and quantitative methods to accomplish a critical purpose that either qualitative or quantitative methods alone could not.

Although I like this interpretation, it places a burden on the adjective mixed that it cannot support. In doing so, my interpretation trades one old problem—being able to distinguish mixed methods evaluations from other types of evaluation—for a number of new problems. Here are three of them:

Evaluators often amend their evaluation designs in response to unanticipated or dynamic circumstances—so what does it mean to establish a design in advance?
Integration is more than having quantitative and qualitative components in a study design—how much more and in what ways?
A mixed-methods design should be introduced when it provides a benefit that would not be realized otherwise—how do we establish the counterfactual?

These complex ideas are lurking behind simple words. That’s why the words are jongar and why the ideas they represent may be ignored.

Technical terms—especially jargon and jongar—can also be code. Code is the use of technical terms in real-world settings to convey a subtle, non-technical message, especially a controversial message.

For example, I have found that in practice funders and clients often propose mixed methods evaluations to signal—in code—that they seek an ideological compromise between qualitative and quantitative perspectives. This is common when program insiders put greater faith in qualitative methods and outsiders put greater faith in quantitative methods.

When this is the case, I believe that mixed methods provide an illusory compromise between imagined perspectives.

The compromise is illusory because mixed methods are not a middle ground between qualitative and quantitative methods, but a new method that emerges from the integration of the two. At least by the second definition of mixed methods that I prefer.

The perspectives are imagined because they concern how results based on particular methods may be incorrectly perceived or improperly used by others in the future. Rather than leap to a mixed-methods design, evaluators should discuss these imagined concerns with stakeholders in advance to determine how to best accommodate them—with or without mixed methods. In many funder-grantee-evaluator relationships, however, this sort of open dialogue may not be possible.

This is why I run hot and cold for mixed methods. I value them. I use them. Yet, I remain wary of labeling my work as such because the label can be…

jargon, in which case it communicates nothing;
jongar, in which case it cannot communicate enough; or
code, in which case it attempts to communicate through subtlety what should be communicated through open dialogue.

Too bad—the ideas underlying mixed methods are incredibly useful.

6 Comments

Filed under Commentary, Evaluation, Evaluation Quality, Program Evaluation, Research

Tagged as abbas tashakkori, code, complexity, evaluation, evaluations, jargon, john creswell, jongar, mixed methods, qualitative and quantitative methods, research, simpicity

March 12, 2012 · 10:27 am

Evaluator, Watch Your Language

As I was reading a number of evaluation reports recently, the oddity of evaluation jargon struck me. It isn’t that we have unusual technical terms—all fields do—but that we use everyday words in unusual ways. It is as if we speak in a code that only another evaluator can decipher.

I jotted down five words and phrases that we all use when we speak and write about evaluation. On the surface, their meanings seem perfectly clear. However, they can be used for good and bad. How are you using them?

(1) Suggest

As in: The data suggest that the program was effective.

Pros: Suggest is often used to avoid words such as prove and demonstrate—a softening of this is so to this seems likely. Appropriate qualification of evaluation results is desirable.

Cons: Suggest is sometimes used to inflate weak evidence. Any evaluation—strong or weak—can be said to suggest something about the effectiveness of a program. Claiming that weak evidence suggests a conclusion overstates the case.

Of special note: Data, evaluations, findings, and the like cannot suggest anything. Authors suggest, and they are responsible for their claims.

(2) Mixed Methods

As in: Our client requested a mixed-methods evaluation.

Pros: Those who focus on mixed methods have developed thoughtful ways of integrating qualitative and quantitative methods. Thoughtful is desirable.

Cons: All evaluations use some combination of qualitative and quantitative methods, so any evaluation can claim to use—thoughtfully or not—a mixed-methods approach. A request for a mixed-methods evaluation can mean that clients are seeking an elusive middle ground—a place where qualitative methods tell the program’s story in a way that insiders find convincing and quantitative methods tell the program’s story in a way that outsiders find convincing. The middle ground frequently does not exist.

(3) Know

As in: We know from the literature that teachers are the most important school-time factor influencing student achievement.

Pros: None.

Cons: The word know implies that claims to the contrary are unfounded. This shuts down discussion on topics for which there is almost always some debate. One could argue that the weight of evidence is overwhelming, the consensus in the field is X, or we hold this belief as a given. Claiming that we know, with rare exception, overstates the case.

(4) Nonetheless [we can believe the results]

As in: The evaluation has flaws, nonetheless it reaches important conclusions.

Pros: If the phrase is followed by a rationale (…because of the following reasons…), this turn of phrase might indicate something quite important.

Cons: All evaluations have flaws, and it is the duty of evaluators to bring them to the attention of readers. If the reader is then asked to ignore the flaws, without being given a reason, it is at best confusing and at worst misleading.

(5) Validated Measure

As in: We used the XYZ assessment, a previously validated measure.

Pros: None.

Cons: Validity is not a characteristic of a measure. A measure is valid for a particular group of people for a particular purpose in a particular context at a specific point in time. This means that evaluators must make the case that all of the measures that they used were appropriate in the context of the evaluation.

The Bottom Line

I am guilty of sometimes using bad language. We all are. But language matters, even in casual conversations among knowledgeable peers. Bad language leads to bad thinking, as my mother always said. So I will endeavor to watch my language and make her proud. I hope you will too.

12 Comments

Filed under Evaluation, Evaluation Quality, Program Evaluation

Tagged as evaluation, jargon, language, Program Evaluation, reporting

November 13, 2010 · 2:22 pm

The AEA Conference (So Far)

The AEA conference has been great. I have been very impressed with the presentations that I have attended so far, though I can’t claim to have seen the full breadth of what is on offer as there are roughly 700 presentations in total. Here are a few that impressed me the most.

1 Comment

Filed under AEA Conference, Evaluation Quality, Program Evaluation

Tagged as AEA Conference, evaluation, evaluations, logic models, Program Evaluation

November 11, 2010 · 2:33 am

AEA 2010 Conference Kicks Off in San Antonio

In the opening plenary of the Evaluation 2010 conference, AEA President Leslie Cooksy invited three leaders in the field (Eleanor Chelimsky, Laura Leviton, and Michael Patton) to speak on The Tensions Among Evaluation Perspectives in the Age of Obama: Influences on Evaluation Quality, Thinking and Values. They covered topics ranging from how government should use evaluation information to how Jon Stewart of the Daily Show outed himself as an evaluator during his Rally to Restore Sanity/Fear (“I think you know that the success or failure of a rally is judged by only two criteria; the intellectual coherence of the content, and its correlation to the engagement—I’m just kidding. It’s color and size. We all know it’s color and size.”).

One piece that resonated with me was Laura Leviton’s discussion of how the quality of an evaluation is related to our ability to apply its results to future programs—what is referred to as generalization. She presented a graphic that described a possible process for generalization that seemed right to me; it’s what should happen. But how it happens was not addressed, at least in the short time in which she spoke. It is no small task to gather prior research and evaluation results, translate them into a small theory of improvement (a program theory), and then adapt that theory to fit specific contexts, values, and resources. Who should be doing that work? What are the features that might make it more effective?

Stewart Donaldson and I recently co-authored a paper on that topic that will appear in New Directions for Evaluation in 2011. We argue that stakeholders are and should be doing this work, and we explore how the logic underlying traditional notions of external validity—considered by some to be outdated—can be built upon to create a relatively simple, collaborative process for predicting the future results of programs. The paper is a small step toward raising the discussion of external validity (how we judge whether a program will work in the future) to the same level as the discussion of internal validity (how we judge whether a program worked in the past), while trying to avoid the rancor that has been associated with the latter.

Good versus Eval

After another blogging hiatus, the battle between good and eval continues. Or at least my blog is coming back online as the American Evaluation Association’s Annual Conference in San Antonio (November 10-14) quickly approaches.

I remember that twenty years ago evaluation was widely considered the enemy of good because it took resources away from service delivery. Now evaluation is widely considered an essential part of service delivery, but the debate over what constitutes a good program and a good evaluation continues. I will be joining the fray when I make a presentation as part of a session entitled Improving Evaluation Quality by Improving Program Quality: A Theory-Based/Theory-Driven Perspective (Saturday, November 13, 10:00 AM, Session Number 742). My presentation is entitled The Expanding Profession: Program Evaluators as Program Designers, and I will discuss how program evaluators are increasingly being called upon to help design the programs they evaluate, and why that benefits program staff, stakeholders, and evaluators. Stewart Donaldson is my co-presenter (The Relationship between Program Design and Evaluation), and our discussants are Michael Scriven, David Fetterman, and Charles Gasper. If you know these names, you know to expect a “lively” (OK, heated) discussion.

If you are an evaluator in California, Oregon, Washington, New Mexico, Hawaii, any other place west of the Mississippi, or anywhere that is west of anything, be sure to attend the West Coast Evaluators Reception Thursday, November 11, 9:00 pm at the Zuni Grill (223 Losoya Street, San Antonio, TX 78205) co-sponsored by San Francisco Bay Area Evaluators and Claremont Graduate University. It is a conference tradition and a great way to network with colleagues.

Quality is a Joke

If you have been following my blog (Who hasn’t?), you know that I am writing on the topic of evaluation quality, the theme of the 2010 annual conference of the American Evaluation Association taking place November 10-13. It is a serious subject. Really.

But here is a joke, though perhaps only the evaluarati (you know who you are) will find it amusing.

Without looking up from his newspaper, the quantitative evaluator calmly responds, “That is an awfully strong causal claim you are making. There is anecdotal evidence to suggest that buses can kill people, but the research does not bear it out. People ride buses all the time and they are rarely killed by them. The correlation between riding buses and being killed by them is very nearly zero. I defy you to produce any credible evidence that buses pose a significant danger. It would really be an extraordinary thing if we were killed by a bus. I wouldn’t worry.”

Dismayed, the normal person starts gesticulating and shouting, “But there is a bus! A particular bus! That bus! And it is heading directly toward some particular people! Us! And I am quite certain that it will hit us, and if it hits us it will undoubtedly kill us!”

At this point the qualitative evaluator, who was observing this exchange from a safe distance, interjects, “What exactly do you mean by bus? After all, we all construct our own understanding of that very fluid concept. For some, the bus is a mere machine, for others it is what connects them to their work, their school, the ones they love. I mean, have you ever sat down and really considered the bus-ness of it all? It is quite immense, I assure you. I hope I am not being too forward, but may I be a critical friend for just a moment? I don’t think you’ve really thought this whole bus thing out. It would be a pity to go about pushing the sort of simple linear logic that connects something as conceptually complex as a bus to an outcome as one dimensional as death.”

Very dismayed, the normal person runs away screaming, the bus collides with the quantitative and qualitative evaluators, and it kills both instantly.

Very, very dismayed, the normal person begins pleading with a bystander, “I told them the bus would kill them. The bus did kill them. I feel awful.”

To which the bystander replies, “Tut tut, my good man. I am a statistician and I can tell you for a fact that with a sample size of 2 and no proper control group, how could we possibly conclude that it was the bus that did them in?”

To the extent that this is funny (I find it hilarious, but I am afraid that I may share Sir Isaac Newton’s sense of humor) it is because it plays on our stereotypes about the field. Quantitative evaluators are branded as aloof, overly logical, obsessed with causality, and too concerned with general rather than local knowledge. Qualitative evaluators, on the other hand, are suspect because they are supposedly motivated by social interaction, overly intuitive, obsessed with description, and too concerned with local knowledge. And statisticians are often looked upon as the referees in this cat-and-dog world, charged with setting up and arbitrating the rules by which evaluators in both camps must (or must not) play.

The problem with these stereotypes, like all stereotypes, is that they are inaccurate. Yet we cling to them and make judgments about evaluation quality based upon them. But what if we shift our perspective to that of the (tongue in cheek) normal person? This is not an easy thing to do if, like me, you spend most of your time inside the details of the work and the debates of the profession. Normal people want to do the right thing, feel the need to act quickly to make things right, and hope to be informed by evaluators and others who support their efforts. Sometimes normal people are responsible for programs that operate in particular local contexts, and at others they are responsible for policies that affect virtually everyone. How do we help normal people get what they want and need?

I have been arguing that we should, and when we do we have met one of my three criteria for quality—satisfaction. The key is first to acknowledge that we serve others, and then to do our best to understand their perspective. If we are weighed down by the baggage of professional stereotypes, it can prevent us from choosing well from among all the ways we can meet the needs of others. I suppose that stereotypes can be useful when they help us laugh at ourselves, but if we come to believe them, our practice can become unaccommodatingly narrow and the people we serve—normal people—will soon begin to run away (screaming) from us and the field. That is nothing to laugh at.

8 Comments

Filed under Evaluation, Evaluation Quality, Program Evaluation

Tagged as bus, evaluation, evaluations, joke, Program Evaluation, quality, satisfaction, stereotypes

March 22, 2010 · 11:45 pm

What the Hell is Quality?

In Zen and the Art of Motorcycle Maintenance, an exasperated Robert Pirsig famously asked, “What the hell is quality?” and expended a great deal of energy trying to work out an answer. As I find myself considering the meaning of quality evaluation, the theme of the upcoming 2010 Conference of the American Evaluation Association, it feels like déjà vu all over again. There are countless definitions of quality floating about (for a short list see Garvin, 1984), but arguably few if any examples of the concept being applied to modern evaluation practice. So what the hell is quality evaluation? And will I need to work out an answer for myself?

Luckily there is some agreement out there. Quality is usually thought of as an amalgam of multiple criteria, and quality is judged by comparing the characteristics of an actual product or service to those criteria.

Isn’t this exactly what evaluators are trained to do?

Yes. And judging quality in this way poses some practical problems that will be familiar to evaluators:

Who devises the criteria?
Evaluations serve many, often competing, interests. Funders, clients, direct stakeholders, and professional peers make the short list. All have something to say about what makes an evaluation high quality, but they do not have equal clout. Some are influential because they have market power (they pay for evaluation services). Others are influential because they have standing in the profession (they are considered experts or thought leaders). And as the table below illustrates, some are influential because they have both (funders) and others lack influence because they have neither (direct stakeholders). More on this in a future blog.

Who makes the comparison?
Quality criteria may be devised by one group and then used by another to judge quality. For example, funders may establish criteria and then hire independent evaluators (professional peers) who use the criteria to judge the quality of evaluations. This is what happens when evaluation proposals are reviewed and ongoing evaluations are monitored. More on this in a future blog.

How is the comparison made?
Comparisons can be made in any number of ways, but (imperfectly) we can lump them into two approaches: the explicit, cerebral, and systematic approach, and the implicit, intuitive, and inconsistent approach. Individuals tend to judge quality in the latter fashion. It is not a bad way to go about things, especially when considering everyday purchases (a pair of sneakers or a tuna fish sandwich). When considering evaluation, however, it would seem best to judge quality in the former fashion. But is it? More on this in a future blog.

So what the hell is quality? This is where I propose an answer that I hope is simple yet covers most of the relevant issues facing our profession. Quality evaluation comprises three distinct things, each important on its own but reflecting quality only in combination. They are:

Standards
When the criteria used to judge quality come from those with professional standing, the criteria describe an evaluation that meets professional standards. Standards focus on technical and nontechnical attributes of an evaluation that are under the direct control of the evaluator. Perhaps the two best examples of this are the Program Evaluation Standards and the Program Evaluations Metaevaluation Checklist.

Satisfaction
When the criteria used to judge quality come from those with market power, the criteria describe an evaluation that would satisfy paying customers. Satisfaction focuses on whether expectations—reasonable or unreasonable, documented in a contract or not—are met by the evaluator. Collectively, these expectations define the demand for evaluation in the marketplace.

Empowerment
When the criteria used to judge quality come from direct stakeholders with neither professional standing nor market power, the criteria change the power dynamic of the evaluation. Empowerment evaluation and participatory evaluation are perhaps the two best examples of evaluation approaches that look to those served by programs to help define a quality evaluation.

Standards, satisfaction, and empowerment are related, but they are not interchangeable. One can be dissatisfied with an evaluation that exceeds professional standards, or empowered by an evaluation with which funders were not satisfied. I will argue that the quality of an evaluation should be measured against all three sets of criteria. Is that feasible? Desirable? That is what I will hash out over the next few weeks.

5 Comments

Filed under Evaluation, Evaluation Quality

Tagged as empowerment, evaluation, Program Evaluation, quality, Robert Pirsig, satisfaction, standards, Zen
