It has been years since I blogged. Now is the time to start again. I want to share my thoughts about Scaling Impact, the book Rob McLean and I wrote. And I want to preview new research that I’m preparing for publication and presentation. I’m really excited about the work, which connects impact, scaling, and evaluation. And maybe, just maybe, evaluation needs a little off-axis commentary. Thanks, Chris Lysy, for nudging me.
The theme of the upcoming 2014 annual conference of the American Evaluation Association (AEA) challenges participants to consider how evaluation can contribute to a sustainable and equitable future. It’s a fantastic challenge, one that cuts to the core of why evaluation matters—its potential to promote the public good locally and globally, today and in the future.
As I prepare my presentations, I want to share some of my thoughts and encourage others to take up the challenge.
The End is Nigh(ish)
The natural and social environments in which we live have limits. Exceed them, and society puts itself at risk.
It’s a simple idea, but one that did not enter the public’s thinking until Thomas Malthus wrote about it in the late 18th century. He famously predicted that, unless something changed, the British population would soon grow too large to feed itself. As it turns out, something did change—among other things, merchants imported food—and the crisis never came to pass.
Today, Malthus is strongly—and unjustly—associated with, as Lauren F. Landsburg put it, “a pessimistic prediction of the lock-step demise of a humanity doomed to starvation via overpopulation.” This jolly point of view is sometimes referred to as Malthusianism, and applied to all forms of catastrophic environmental and social decline.
The underlying concept Malthus articulated—there are real environmental and societal limits, and real consequences for exceeding them—is not controversial. There are, however, controversial perspectives related to it, including:
- “Malthusiasm”: A passionate belief in—bordering on enthusiasm for—the inevitability of environmental and social collapse, especially in the short term.
- Denialism: An equally passionate belief that predictions of environmental and social disaster, like those made by Malthus, never come to pass.
- Self-correctionism: A belief that many small, undirected changes in individual and organizational behavior, related primarily to markets and other social structures, will naturally correct for problems in complex ways that may, at first, be difficult to notice.
- Intentionalism: A belief that intentional action at the individual, organizational, and social levels—when well planned, executed, and evaluated—can not only help avoid disaster, but produce positive benefits that serve the public good.
I reject the first two. I hope for the third. I’ve spent my life working for the fourth—and this is where evaluation can play a significant role.
From Avoiding Disaster to Promoting Sustainability
I am as much for avoiding disaster as the next guy, but—rightly or wrongly—I expect more from organized human action. Like sustainability. It’s a concept that I and others strongly believe should guide the actions of every organization. It is also a slippery concept that we have not fully defined, making it a rough guide, at best.
So, connecting ideas from various sources (and a few of my own), I’ve developed a preliminary working definition based on a set of underlying principles (in parentheses):
Actions are sustainable when they do not affect future generations adversely (futurity), social groups differentially (equity), larger social and natural systems destructively (globality), or their own objectives negatively (complexity).
I’m not fully satisfied with the definition, but so far it has helped clarify my thinking.
Why Evaluation Matters
Unfortunately, action is only weakly linked to upholding these principles, in part because there is often a lack of information about how well the principles have been (or will be) met.
That is where evaluation comes in. If we use our skills to help design the actions of commercial and social enterprises in ways that uphold these principles, we serve the public good. If we evaluate programs in ways that shed light on these principles—which would require most of us to expand our field of view—we also serve the public good.
This is why evaluation matters—because it has the potential to serve the public good—and why we need to work together to make it matter more. That would truly be evaluation for a sustainable and equitable future.
The 2014 Conference of the African Evaluation Association (AfrEA) has just opened. Organizers delayed the start of the opening ceremony, however, as they waited for the arrival of officials from the government of Cameroon. Fifteen minutes. Thirty minutes. An hour. More.
This may sound like a problem, but it wasn’t—the unofficial conference had already begun. Participants from around the world were mixing, laughing, and learning. I met evaluators from Kenya, South Africa, Sri Lanka, Europe, and America. I learned about health programs, education systems, evaluation use in government, and the development of evaluation as a profession across the continent. It was a truly delightful delay.
And it reflects the mindset I am finding here—a strong belief that commitment and community can overcome circumstance.
: : : : : : : : : : : :
During the opening ceremony, Dr. Namanga Ngongi, the former President of AGRA (the Alliance for a Green Revolution in Africa), stated that one of the greatest challenges facing development programs is finding enough qualified evaluators—those who not only have technical skills, but also the ability to help organizations increase their impact.
Where will these much-needed evaluators come from?
Historically, many evaluators have come from outside of Africa. The current push for made-in-Africa evaluations promises to change that by training more African evaluators.
Evaluators are trained in many ways, chief among them university programs, professional mentoring, practical experience, and ongoing professional development. The CLEAR initiative—Centers for Learning on Evaluation and Results—is a new approach. With centers in Anglophone and Francophone Africa, CLEAR has set out to strengthen monitoring, evaluation, performance management, and evaluation use at the country level.
While much of CLEAR’s work is face-to-face, a great many organizations have made training material available on the web. One can now piece together free resources online—webinars, documents, videos, correspondence, and even one-on-one meetings with experts—that can result in highly contextualized learning. This is what many of the African evaluators I have met are telling me they are doing.
The US, Canada, Australia, and New Zealand appear to be leading exporters of evaluation content to Africa. Claremont Graduate University, Western Michigan University, the American Evaluation Association, the Canadian Government, and BetterEvaluation are some of the better-known sources.
What’s next? Perhaps consolidators who organize online and in-person content into high-quality curricula that are convenient, coherent, and comprehensive.
: : : : : : : : : : : :
Although the supply of evaluators may be limited in many parts of Africa, the demand for evaluation continues to increase. The history of evaluation in the US, Canada, and Europe suggests that demand grows when evaluation is required as a condition of funding or by law. From what I have seen, it appears that history is repeating itself in Africa. In large part this is due to the tremendous influence that funders from outside of Africa have.
An important exception is South Africa, where government and evaluators work cooperatively to produce and use evaluations. I hope to learn more about this in the days to come.
“Tell me again why you are going to Cameroon?” my wife asked. I paused, searching for an answer. New business? Not really, although that is always welcome. Old connections? I have very few among those currently working in Africa. What should I say? How could I explain?
I decided to confess.
“Because I am curious. There is something exciting going on across Africa. The African Evaluation Association—AfrEA—is playing a critical role. I want to learn more about it. Support it. Maybe be a part of it.”
She found that perfectly reasonable. I suppose that is why I married her.
Then she asked more questions about the conference and how my work might be useful to practitioners in that part of the world. As it turns out, she was curious, too. I believe many are, especially evaluation practitioners.
It takes a certain irrational obsessiveness, however, to fly 32 hours because you are curious.
For those not yet prepared to follow their curiosity to such lengths, I will be blogging about the AfrEA Conference over the next week.
Check back here for the latest conference news from Yaoundé, Cameroon.
Are you suffering from “post-parting depression” now that the conference of the American Evaluation Association has ended? Maybe this will help—a sampling of the professionals who attended the conference, along with their thoughts on the experience. Special thanks to Anna Fagergren, who collected most of these photos and quotes.
Stefany Tobel Ramos, City Year
This is my first time here and I really enjoyed the professional development workshop Evaluation-Specific Methodology. I learned a lot and have new ideas about how to get a sense of students as a whole.
Jonathan Karanja, Independent Consultant with Nielsen, Kenya
This is my first time here and Nielsen is trying to get into the evaluation space, because that is what our clients want. The conference is a little overwhelming but I have a strategy – go to the not technically demanding, easy-to-digest sessions. Baby steps. I want to ensure that our company learns to not just apply market research techniques but to actually do evaluation.
When I attend AEA, I get to present to enthusiastic groups of evaluation professionals. It makes me feel like a rock star for a week. Then I go home and do the dishes.
I’m returning to the conference after some years away—it’s great to renew contact with acquaintances and colleagues. I am struck by the conference’s growth and the huge diversity of TIGs (topical interest groups), and I’m finding a lot of sessions of interest.
Pieta Blakely, Commonwealth Corporation
It’s my first time here and it’s a little overwhelming. I’m getting to know what I don’t know. But it’s also really exciting to see people working on youth engagement because I’m really interested in that.
I’ve been coming for many years, and I really like the two professional development workshops I took—Sampling and Empowerment Evaluation Strategies—and how they helped guide my way through the greater conference program.
John, I really like your blog. You have…how do you say it in English?…a twisted mind. I really like that.
Aske Graulund, National Board of Social Services, Denmark
Nina Middelboe, Oxford Research AS, Denmark
[nods of agreement]
No greater compliment, Carsten! And my compliments to all 3,500 professionals who participated in the conference.
Recursion is when your local bookstore opens a café inside the store in order to attract more readers, and then the café opens a bookstore inside itself to attract more coffee drinkers.
Chris Lysy at Freshspectrum.com noticed, laughed at, and illustrated the same phenomenon as it relates to my blogging (or rather lack of it) during the American Evaluation Association Conference last week.
I intended to harness the power of recursion by blogging about blogging at the conference. I reckoned that would nudge a few others to blog at the conference, which in turn would nudge me to do the same.
I ended up blogging very little during those hectic days, and none of it was about blogging at the conference. Giving up on that idea, I landed on blogging about not blogging, then not blogging about not blogging, then blogging about not blogging about not blogging, and so on.
Once Chris opened my eyes to the recursive nature of recursion, I noticed it all around me at the conference.
For example, the Research on Evaluation TIG (Topical Interest Group) discussed using evaluation methods to evaluate how we evaluate. Is that merely academic navel gazing? It isn’t. I would argue that it may be the most important area of evaluation today.
As practitioners, we conduct evaluations because we believe they can make a positive impact in the world, and we choose how to evaluate in ways we believe produce the greatest impact. Ironically, we have little evidence upon which to base our choices. We rarely measure our own impact or study how we can best achieve it.
ROE (research on evaluation, for those in the know) is setting that right. And the growing community of ROE researchers and practitioners is attempting to do so in an organized fashion. I find it quite inspiring.
A great example of ROE and the power of recursion is the work of Tom Cook and his colleagues (chief among them Will Shadish). I must confess that Tom is a hero of mine. A wonderful person who finds tremendous joy in his work and shares that joy with others. So I can’t help but smile every time I think of him using experimental and quasi-experimental methods to evaluate experimental and quasi-experimental methods.
Experiments and quasi-experiments follow the same general logic. Create two (or more) comparable groups of people (or whatever may be of interest). Provide one experience to one group and a different experience to the other. Measure outcomes of interest for each group at the end. Because the groups were comparable to begin with, any differences in outcomes are attributable to differences in the groups’ experiences.
If one group received a program and the other did not, you have a very strong method for estimating program impacts. If one group received a program designed one way and the other a program designed another way, you have a strong basis for choosing between program designs.
Experiments and quasi-experiments differ principally in how they create comparable groups. Experiments assign people to groups at random. In essence, names are pulled from a hat (in reality, computers select names at random from a list). This yields two highly comparable but artificially constructed groups.
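For the curious, here is a minimal sketch of that electronic “hat” in Python; the roster names are placeholders, not real participants.

```python
# A minimal sketch of random assignment: shuffle the roster, split it in half.
# The names are hypothetical placeholders, not real participants.
import random

roster = ["Amara", "Ben", "Chioma", "Dev", "Esi", "Farid", "Grace", "Hugo"]
random.shuffle(roster)                # the electronic "hat"
half = len(roster) // 2
treatment, control = roster[:half], roster[half:]
print("Treatment group:", treatment)
print("Control group:  ", control)
```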
Quasi-experiments typically operate by allowing people to choose experiences as they do in everyday life. This yields naturally constructed groups that are less comparable. Why are they less comparable? The groups are comprised of people who made different choices, and these choices may be associated with other factors that affect outcomes. The good news is that the groups can be made more comparable—to some degree—by using a variety of statistical methods.
Is one approach better than another? At the AEA Conference, Tom described his involvement with efforts to answer that question. One way to do that is to randomly assign people to two groups—one that will be part of an experiment and another that will be part of a quasi-experiment (often referred to as an observational study). Within the experimental group, participants are randomly assigned to either a treatment group (e.g., math training) or a control group (e.g., vocabulary training). Within the quasi-experimental group, participants choose between the same two experiences, forming treatment and comparison groups according to their preferences.
Program impact estimates are compared for the experimental and quasi-experimental groups. Differences at this level are attributable to the evaluation method and can indicate whether one method is biased with respect to the other. So far, there seems to be pretty good agreement between the methods (when implemented well—no small achievement), but much work remains to be done.
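To see why such comparisons are informative, here is a toy simulation in Python. It is a sketch under invented assumptions (a true effect of 5 points, and an underlying “ability” that drives both outcomes and, in the choice arm, the decision to participate); it is not Tom’s study, only an illustration of how self-selection can bias a comparison-group estimate.

```python
# A toy simulation of the within-study comparison design described above.
# Every number here is invented for illustration; this is not the actual study.
import random
from statistics import mean

random.seed(42)
TRUE_EFFECT = 5.0  # the real impact of the hypothetical "math training"

def outcome(ability, treated):
    """An outcome driven by ability, treatment, and a little noise."""
    return ability + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)

# Each participant has an underlying ability that, in the choice arm,
# also drives the decision to participate (a confounder).
people = [random.gauss(50, 10) for _ in range(20000)]
random.shuffle(people)
experiment_arm, choice_arm = people[:10000], people[10000:]

# Experimental arm: random assignment to treatment or control.
exp_treated = [outcome(a, True) for a in experiment_arm[:5000]]
exp_control = [outcome(a, False) for a in experiment_arm[5000:]]

# Choice arm: higher-ability people are more likely to opt in, so the
# self-selected treatment and comparison groups are not comparable.
quasi_treated, quasi_control = [], []
for a in choice_arm:
    if random.random() < (0.8 if a > 50 else 0.2):
        quasi_treated.append(outcome(a, True))
    else:
        quasi_control.append(outcome(a, False))

print(f"True effect:                 {TRUE_EFFECT:.1f}")
print(f"Experimental estimate:       {mean(exp_treated) - mean(exp_control):.1f}")
print(f"Quasi-experimental estimate: {mean(quasi_treated) - mean(quasi_control):.1f}")
```

In this sketch, the experimental estimate lands near the true effect, while the self-selected comparison overstates it, because ability is confounded with the choice to participate. The statistical adjustments mentioned above attempt to remove exactly that kind of bias.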
Perhaps the most important form of recursion at the AEA Conference is membership. AEA is comprised of members who manage themselves by forming groups of members who manage themselves by forming groups of members who manage themselves. The board of AEA, TIGs, local affiliates, task forces, working groups, volunteer committees, and conference sessions are all organized by and comprised of groups of members who manage themselves. That is the power of recursion—3,500 strangers coming together to create a community dedicated to making the world a better place. And what a joy to watch them pull it off.
Rodney Hopson, Professor, George Mason University (Past President of AEA)
I’m plotting. I’m always plotting. That’s how you make change in the world. You find the opportunities, great people to work with, and make things happen.
Tina Christie, Professor, UCLA
I’ve just finished three years on the AEA board with Rodney. The chance to connect with colleagues like Rodney—work with them, debate with them, laugh with them—is something I look forward to each year. It quickly starts to feel like family.
It’s true—I am addicted to conferences. While I read about evaluation, write about evaluation, and do evaluations in my day-to-day professional life, it’s not enough. To truly connect to the field and its swelling ranks of practitioners, researchers, and supporters, I need to attend conferences. Compulsively. Enthusiastically. Constantly.
Over the past few months, I was honored to be the keynote speaker at the Canadian Evaluation Society conference in Toronto and the Danish Evaluation Society in Kolding. Over the past two years I have been from Helsinki to Honolulu to speak, present, and give workshops. The figure below shows some of that travel (conferences indicated with darker circles, upcoming travel with dashed lines).
But today is special—it’s the first day of the American Evaluation Association conference in Washington, DC. If conferences were cities, this one would be New York—big, vibrant, and international.
EvalBlog has been quiet this summer. Time to make a little digital noise.
Bill Gates recently summarized his yearly letter in an article for the Wall Street Journal entitled My Plan to Fix the World’s Biggest Problems…Measure Them!
As an evaluator, I was thrilled. I thought, “Someone with clout is making the case for high-quality evaluation!” I was ready to love the article.
To my great surprise, I didn’t.
The premise of the piece was simple. Organizations working to change the world should set clear goals, choose an approach, measure results, and use those measures to continually refine the approach.
At this level of generality, who could disagree? Certainly not evaluators—we make arguments like this all the time.
Yet, I must—with great disappointment—conclude that Gates failed to make the case that measurement matters. In fact, I believe he undermined it by the way he used measurements.
Gates is not unique in this respect. His Wall Street Journal article is just one instance of a widespread problem in the social sector—mistaking good measures for good inference.
Measures versus Inference
The difference between measures and inferences can be subtle. Measures quantify something that is observable. The number of students who graduate from high school or estimates of the calories people consume are measures. In order to draw conclusions from measures, we make inferences. Two types of inference are of particular interest to evaluators.
(1) Inferences from measures to constructs. Constructs—unobservable aspects of humans or the world that we seek to understand—and the measures that shed light on them are not interchangeable. For example, what construct does the high school graduation rate measure? That depends. Possibly education quality, student motivation, workforce readiness, or something else that we cannot directly observe. To make an inference from measure to construct, the construct of interest must be well defined and its measure selected on the basis of evidence.
Evidence is important because, among other things, it can suggest whether many, few, or only one measure is required to understand a construct well. By using the sole measure of calories consumed, for example, we gain a poor understanding of a broad construct like health. However, we can use that single measure to gain a critical understanding of a narrower construct like risk of obesity.
(2) Inferences from measures to impacts. If high school graduation rates go up, was it the result of new policies, parental support, another reason left unconsidered, or a combination of several reasons? This sort of inference represents one of the fundamental challenges of program evaluation, and we have developed a number of strategies to address it. None is perfect, but more often than not we can identify a strategy that is good enough for a specific context and purpose.
Why do I think Gates made weak inferences from good measures? Let’s look at the three examples he offered in support of his premise that measurement is the key to solving the world’s biggest problems.
Example 1: Ethiopia
Gates described how Ethiopia became more committed to providing healthcare services in 2000 as part of the Millennium Development Goals. After that time, the country began tracking the health services it provided in new ways. As evidence that the new measurement strategy had an impact, Gates reported that child mortality in Ethiopia has decreased by 60% since 1990.
In this example, the inference from measure to impact is not warranted. Based on the article, the sole reason to believe that the new health measurement strategy decreased child mortality is that the former happened before the latter. Inferring causality from the sequential timing of events alone has been recognized as an inferential misstep for so long that it is best known by its Latin name, post hoc ergo propter hoc.
Even if we were willing to make causal inferences based on sequential timing alone, it would not be possible in this case—the tracking system began sometime after 2000 while the reported decrease in child mortality was measured from 1990.
Example 2: Polio
The global effort to eradicate polio has come down to three countries—Nigeria, Pakistan, and Afghanistan—where immunizing children has proven especially difficult. Gates described how new measurement strategies, such as using technology to map villages and track health workers, are making it possible to reach remote, undocumented communities in these countries.
It makes sense that these measurement strategies should be a part of the solution. But do they represent “another story of success driven by better measurement,” as Gates suggests?
Maybe yes, maybe no—the inference from measure to impact is again not warranted, but for different reasons.
In the prior example, Gates was looking back, claiming that actions (in the past) made an impact (in the past) because the actions preceded the impact. In this example, he claimed that ongoing actions will lead to a future impact because the actions precede the intended impact of eradicating polio. The former was a weak inference, the latter weaker still because it incorporates speculation about the future.
Even if we are willing to trust an inference about an unrealized future in which polio has been eradicated, there is another problem. The measures Gates described are implementation measures. Inferring impact from implementation may be warranted if we have strong faith in a causal mechanism, in this case that contact with remote communities leads to immunization, which in turn leads to a reduction in the transmission of the disease.
We should have strong faith in the second step of this causal mechanism—vaccines work. Unfortunately, we should have doubts about the first step because many who are contacted by health workers refuse immunization. The Bulletin of the World Health Organization reported that parental refusal in some areas around Karachi has been widespread, accounting for 74% of missed immunizations there. The refusals are believed to stem from fears about the safety and the religious implications of the vaccines. New strategies for mapping and tracking cannot, on the face of it, address these concerns.
So I find it difficult to accept that polio immunization is a story of success driven by measurement. It seems more like a story in which new measures are being used in a strategic manner. That’s laudable—but quite different from what was claimed.
Example 3: Education
The final example Gates provided came from the foundation’s $45 million Measures of Effective Teaching (MET) study. As described in the article, the MET study concluded that multiple measures of teacher effectiveness can be used to improve the way administrators manage school systems and teachers provide instruction. The three measures considered in the study were standardized test scores (transformed into controversial units called value-added scores), student surveys of teacher quality, and scores provided by trained observers of classroom instruction.
The first problem with this example is the inference from measures to construct. Everyone wants more effective teachers, but not everyone defines effectiveness the same way. There are many who disagree with how the construct of teacher effectiveness was defined in the MET study—that a more effective teacher is one who promotes student learning in ways that are reflected by standardized test scores.
Even if we accept the MET study’s narrow construct of teacher effectiveness, we should question whether multiple measures are required to understand it well. As reported by the foundation, all three measures in combination explain about 52% of the variation in teacher effectiveness in math and 26% in English-language arts. Test scores alone (transformed into value-added scores) explain about 48% and 20% of the variation in math and English-language arts, respectively. The difference is trivial, making the cost of gathering additional survey and observation data difficult to justify.
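For concreteness, the increment is easy to compute from the figures cited above; this is only a back-of-the-envelope check using the reported R-squared values, nothing more.

```python
# A back-of-the-envelope check using only the R-squared values cited above.
r2_all_measures = {"math": 0.52, "English-language arts": 0.26}
r2_tests_alone = {"math": 0.48, "English-language arts": 0.20}

for subject in r2_all_measures:
    increment = r2_all_measures[subject] - r2_tests_alone[subject]
    print(f"{subject}: surveys and observations add "
          f"{increment:.0%} of variance explained")
# math: 4%; English-language arts: 6% (percentage points, to be precise)
```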
The second problem is inference from measures to impact. Gates presented Eagle County’s experience as evidence that teacher evaluations improve education. He stated that Eagle County’s teacher evaluation system is “likely one reason why student test scores improved in Eagle County over the past five years.” Why does he believe this is likely? He doesn’t say. I can only respond post hoc ergo propter hoc.
The old chestnut that absence of evidence is not evidence of absence applies here. Although Gates made inferences that were not well supported by logic and evidence, it doesn’t mean he arrived at the wrong conclusions. Or the right conclusions. All we can do is shrug our shoulders.
And it doesn’t mean we should not be measuring the performance and impact of social enterprises. I believe we should.
It does mean that Gates believes in the effectiveness of potential solutions for which there is little evidence. For someone who is arguing that measurement matters, he is setting a poor example. For someone who has the power to implement solutions on an unprecedented scale, it can also be dangerous.