
Measurement Is Not the Answer


Bill Gates recently summarized his yearly letter in an article for the Wall Street Journal entitled My Plan to Fix the World’s Biggest Problems…Measure Them!

As an evaluator, I was thrilled. I thought, “Someone with clout is making the case for high-quality evaluation!” I was ready to love the article.

To my great surprise, I didn’t.

The premise of the piece was simple. Organizations working to change the world should set clear goals, choose an approach, measure results, and use those measures to continually refine the approach.

At this level of generality, who could disagree? Certainly not evaluators—we make arguments like this all the time.

Yet, I must—with great disappointment—conclude that Gates failed to make the case that measurement matters. In fact, I believe he undermined it by the way he used measurements.

Gates is not unique in this respect. His Wall Street Journal article is just one instance of a widespread problem in the social sector—confusing good measures with good inference.

Measures versus Inference

The difference between measures and inferences can be subtle. Measures quantify something that is observable. The number of students who graduate from high school or estimates of the calories people consume are measures. In order to draw conclusions from measures, we make inferences.  Two types of inference are of particular interest to evaluators.

(1) Inferences from measures to constructs. Constructs—unobservable aspects of humans or the world that we seek to understand—and the measures that shed light on them are not interchangeable. For example, what construct does the high school graduation rate measure? That depends. Possibly education quality, student motivation, workforce readiness, or something else that we cannot directly observe. To make an inference from measure to construct, the construct of interest must be well defined and its measure selected on the basis of evidence.

Evidence is important because, among other things, it can suggest whether many, few, or only one measure is required to understand a construct well. By using the sole measure of calories consumed, for example, we gain a poor understanding of a broad construct like health. However, we can use that single measure to gain a critical understanding of a narrower construct like risk of obesity.

(2) Inferences from measures to impacts. If high school graduation rates go up, was it the result of new policies, parental support, another reason left unconsidered, or a combination of several reasons? This sort of inference represents one of the fundamental challenges of program evaluation, and we have developed a number of strategies to address it. None is perfect, but more often than not we can identify a strategy that is good enough for a specific context and purpose.

Why do I think Gates made weak inferences from good measures? Let’s look at the three examples he offered in support of his premise that measurement is the key to solving the world’s biggest problems.

Example 1: Ethiopia

Gates described how Ethiopia became more committed to providing healthcare services in 2000 as part of the Millennium Development Goals. After that time, the country began tracking the health services it provided in new ways. As evidence that the new measurement strategy had an impact, Gates reported that child mortality in Ethiopia had fallen by 60% since 1990.

In this example, the inference from measure to impact is not warranted. Based on the article, the sole reason to believe that the new health measurement strategy decreased child mortality is that the former happened before the latter. Inferring causality from the sequential timing of events alone has been recognized as an inferential misstep for so long that it is best known by its Latin name, post hoc ergo propter hoc.

Even if we were willing to make causal inferences based on sequential timing alone, it would not be possible in this case—the tracking system began sometime after 2000 while the reported decrease in child mortality was measured from 1990.

Example 2: Polio

The global effort to eradicate polio has come down to three countries—Nigeria, Pakistan, and Afghanistan—where immunizing children has proven especially difficult. Gates described how new measurement strategies, such as using technology to map villages and track health workers, are making it possible to reach remote, undocumented communities in these countries.

It makes sense that these measurement strategies should be part of the solution. But do they represent “another story of success driven by better measurement,” as Gates suggests?

Maybe yes, maybe no—the inference from measure to impact is again not warranted, but for different reasons.

In the prior example, Gates was looking back, claiming that actions (in the past) made an impact (in the past) because the actions preceded the impact. In this example, he made the claim that ongoing actions will lead to a future impact because those actions precede the intended impact of eradicating polio. The former was a weak inference; the latter is weaker still because it incorporates speculation about the future.

Even if we are willing to trust an inference about an unrealized future in which polio has been eradicated, there is another problem. The measures Gates described are implementation measures. Inferring impact from implementation may be warranted if we have strong faith in a causal mechanism, in this case that contact with remote communities leads to immunization, which in turn leads to a reduction in the transmission of the disease.

We should have strong faith in the second step of this causal mechanism—vaccines work. Unfortunately, we should have doubts about the first step because many who are contacted by health workers refuse immunization. The Bulletin of the World Health Organization reported that parental refusal in some areas around Karachi has been widespread, accounting for 74% of missed immunizations there. The refusals are believed to stem from fears about the safety of the vaccines and about their religious implications. New strategies for mapping and tracking cannot, on the face of it, address these concerns.

So I find it difficult to accept that polio immunization is a story of success driven by measurement. It seems more like a story in which new measures are being used in a strategic manner. That’s laudable—but quite different from what was claimed.

Example 3: Education

The final example Gates provided came from the foundation’s $45 million Measures of Effective Teaching (MET) study. As described in the article, the MET study concluded that multiple measures of teacher effectiveness can be used to improve the way administrators manage school systems and teachers provide instruction. The three measures considered in the study were standardized test scores (transformed into controversial units called value-added scores), student surveys of teacher quality, and scores provided by trained observers of classroom instruction.

The first problem with this example is the inference from measures to construct. Everyone wants more effective teachers, but not everyone defines effectiveness the same way. There are many who disagree with how the construct of teacher effectiveness was defined in the MET study—that a more effective teacher is one who promotes student learning in ways that are reflected by standardized test scores.

Even if we accept the MET study’s narrow construct of teacher effectiveness, we should question whether multiple measures are required to understand it well.  As reported by the foundation, all three measures in combination explain about 52% of the variation in teacher effectiveness in math and 26% in English-language arts.  Test scores alone (transformed into value-added scores) explain about 48% and 20% of the variation in math and English-language arts, respectively.  The difference is trivial, making the cost of gathering additional survey and observation data difficult to justify.
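To make “trivial” concrete, here is a quick sketch (in Python, using only the rounded percentages reported above) of the incremental variance explained by the additional measures:

```python
# Incremental variance explained by adding surveys and observations
# to value-added scores alone, using the rounded figures reported above.
all_three = {"math": 0.52, "English-language arts": 0.26}
tests_only = {"math": 0.48, "English-language arts": 0.20}

for subject in all_three:
    gain = all_three[subject] - tests_only[subject]
    print(f"{subject}: roughly {gain:.0%} of additional variance explained")
```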

The second problem is inference from measures to impact. Gates presented Eagle County’s experience as evidence that teacher evaluations improve education. He stated that Eagle County’s teacher evaluation system is “likely one reason why student test scores improved in Eagle County over the past five years.” Why does he believe this is likely? He doesn’t say. I can only respond post hoc ergo propter hoc.

So What?

The old chestnut that absence of evidence is not evidence of absence applies here.  Although Gates made inferences that were not well supported by logic and evidence, it doesn’t mean he arrived at the wrong conclusions.  Or the right conclusions.  All we can do is shrug our shoulders.

And it doesn’t mean we should not be measuring the performance and impact of social enterprises. I believe we should.

It does mean that Gates believes in the effectiveness of potential solutions for which there is little evidence. For someone who is arguing that measurement matters, he is setting a poor example. For someone who has the power to implement solutions on an unprecedented scale, it can also be dangerous.


Evaluation in the Post-Data Age: What Evaluators Can Learn from the 2012 Presidential Election

Stop me if you’ve heard this one before.  An evaluator uses data to assess the effectiveness of a program, arrives at a well-reasoned but disappointing conclusion, and finds that the conclusion is not embraced—perhaps ignored or even rejected—by those with a stake in the program.

People—even evaluators—have difficulty accepting new information if it contradicts their beliefs, desires, or interests.  It’s unavoidable.  When faced with empirical evidence, however, most people will open their minds.  At least that has been my experience.

During the presidential election, reluctance to embrace empirical evidence was virtually universal.  I began to wonder—had we entered the post-data age?

The human race creates an astonishing amount of data—2.5 quintillion bytes every day.  In the last two years alone, we created 90% of all the data ever created in human history.

In that time, I suspect that we also engaged in more denial and distortion of data than in all human history.

The election was a particularly bad time for data and the people who love them—but there was a bright spot.

On election day I boarded a plane for London (after voting, of course).  Although I had no access to news reports during the flight, I already knew the result—President Obama had about an 84% chance of winning reelection.  When I stepped off the plane, I learned he had indeed won.  No surprise.

How could I be so certain of the result when the election was hailed as too close to call?  I read the FiveThirtyEight blog, that’s how.  By using data—every available, well-implemented poll—and a strong statistical model, Nate Silver was able to produce a highly credible estimate of the likelihood that one or the other candidate would win.

Most importantly, the estimate did not depend on the analysts’—or anyone’s—desires regarding the outcome of the election.

Although this first-rate work was available to all, television and print news was dominated by unsophisticated analysis of poll data.  How often were the results of an individual poll—one data point—presented in a provocative way and its implications debated for as long as breath and column inches could sustain?

Isn’t this the way that we interpret evaluations?

News agencies were looking for the story.  The advocates for each candidate were telling their stories.  Nothing wrong with that.  But when stories shape the particular bits of data that are presented to the public, rather than all of the data being used to shape the story, I fear that the post-data age is already upon us.

Are evaluators expected to do the same when they are asked to tell a program’s story?

It has become acceptable to use data poorly or opportunistically while asserting that our conclusions are data-driven.  All the while, much stronger conclusions based on better data and better analysis are all around us.

Do evaluators promote similar behavior when we insist that all forms of evaluation can improve data-driven decision making?

The New York Times reported that on election night one commentator, with a sizable stake in the outcome, was unable to accept that actual voting data were valid because they contradicted the story he wanted to tell.

He was already living in the post-data age.  Are we?


Conference Blog: Evaluation 2012 (Part 1)—Complexity

I have a great fondness for the American Evaluation Association and its Annual Conference.  At this year’s conference—Evaluation 2012—roughly 3,000 evaluators from around the world came together to share their work, rekindle old friendships, and establish new ones.  I was pleased and honored to be a part of it.

As I moved from session to session, I would ask those I met my favorite question—What have you learned that you will use in your practice?

Their answers—lists, connections, reflections—were filled with insights and surprises.  They helped me understand the wide range of ideas being discussed at the conference and how those ideas are likely to emerge in practice.

In the spirit of that question, I would like to share some thoughts about a few ideas that were thick in the air, starting with this post on complexity.

Complexity: The Undefined Elephant in the Room

The theme of the conference was Evaluation in Complex Ecologies: Relationships, Responsibilities, Relevance.  Not surprisingly, the concept of complexity received a great deal of attention.

Like many bits of evaluation jargon, it has a variety of legitimate formal and informal definitions.  Consequently, evaluators use the term in different ways at different times, which led a number of presenters to make statements that I found difficult to parse.

Here are a few that I jotted down:

“That’s not complex, it’s complicated.”

“A few simple rules can give rise to tremendous complexity.”

“Complexity can lead to startling simplicity.”

“A system can be simple and complicated at the same time.”

“Complexity can lead to highly stable systems or highly unstable systems.”

“Much of time people use the term complexity wrong.”

We are, indeed, a profession divided by a common language.

Why can’t we agree on a definition for complexity?

First, no other discipline has.  Perhaps that is too strong a statement—small sub-disciplines have developed common understandings of the term, but across those small groups there is little agreement.

Second, we cannot decide if complexity, simplicity, and complicatedness, however defined, are:

(A) Mutually exclusive

(B) Distinct but associated

(C) Inclusive and dependent

(D) All of the above

From what I can tell, the answer is (D).  That doesn’t help much, does it?

Third, we conflate the entities that we label as complex, complicated, or simple.  Over the past week, I heard the term complexity used to describe:

  • real-world structures such as social, environmental, and physical systems;
  • cognitive structures that we use to reason about real-world structures;
  • representations that we use to describe and communicate our cognitive structures;
  • computer models that we use to reveal the behavior of a system that is governed by a mathematically formal interpretation of our representations;
  • behaviors exhibited by real-world structures, cognitive structures, and computer models;
  • strategies that we develop to change the real world in a positive way;
  • human actions undertaken to implement change strategies; and
  • evaluations of our actions and strategies.

When we neglect to specify which entities we are discussing, or treat these entities as interchangeable, clarity is lost.

Where does this get us?

I hope it encourages us to do the following when we invoke the concept of complexity: define what we mean and identify what we are describing.  If we do that, we don’t need to agree—and we will be better understood.


Conference Blog: The American Evaluation Association Conference About to Kick Off

It’s been a busy few months for me.  I have been leading workshops, making presentations, attending conferences, and working in Honolulu, Helsinki, London, Tallinn (Estonia), and Claremont.  I met some amazing people and learned a great deal about how evaluation is being practiced around the world.  More about this in later posts.

This morning, I am in Minneapolis for the Annual Conference of the American Evaluation Association, which begins today. While I am here, I will be reporting on the latest trends, techniques, and opportunities in evaluation.

Today will be interesting.  I lead a half-day workshop on program design with Stewart Donaldson. Then I chair a panel discussion on the future of evaluation (a topic that, to my surprise, has mushroomed from a previous EvalBlog post  into a number of conference presentations and a website).

Off to the conference–more later.


Conference Blog: Catapult Labs 2012

Did you miss the Catapult Labs conference on May 19?  Then you missed something extraordinary.

But don’t worry, you can get the recap here.

The event was sponsored by Catapult Design, a nonprofit firm in San Francisco that uses the process and products of design to alleviate poverty in marginalized communities.  Their work spans the worlds of development, mechanical engineering, ethnography, product design, and evaluation.

That is really, really cool.

I find them remarkable and their approach refreshing.  Even more so because they are not alone.  The conference was very well attended by diverse professionals—from government, the nonprofit sector, the for-profit sector, and design—all doing similar work.

The day was divided into three sets of three concurrent sessions, each presented as hands-on labs.  So, sadly, I could attend only one third of what was on offer.  My apologies to those who presented and are not included here.

I started the day by attending Democratizing Design: Co-creating With Your Users presented by Catapult’s Heather Fleming.  It provided an overview of techniques designers use to include stakeholders in the design process.

Evaluators go to great lengths to include stakeholders.  We have broad, well-established approaches such as empowerment evaluation and participatory evaluation.  But the techniques designers use are largely unknown to evaluators.  I believe there is a great deal we can learn from designers in this area.

An example is games.  Heather organized a game in which we used beans as money.  Players chose which crops to plant, each with its own associated cost, risk profile, and potential return.  The expected payoff varied by gender, which was arbitrarily assigned to players.  After a few rounds the problem was clear—higher costs, lower returns, and greater risks for women increased their chances of financial ruin, and this had negative consequences for communities.

I believe that evaluators could put games to good use.  Describing a social problem as a game requires stakeholders to express their cause-and-effect assumptions about the problem.  Playing with a group allows others to understand those assumptions intimately, comment upon them, and offer suggestions about how to solve the problem within the rules of the game (or perhaps change the rules to make the problem solvable).
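As an illustration of the idea, here is a minimal sketch of how such a game might be written down; the crops, costs, payoffs, and penalties are invented for this example rather than taken from Heather’s workshop.

```python
# A toy version of the bean game: plant crops over several rounds and see
# whether the player ends in ruin. All numbers are invented for illustration.
import random

# Invented crops: (cost to plant, payoff if the crop succeeds, probability of success)
CROPS = {
    "maize":  (2, 5, 0.7),
    "coffee": (4, 10, 0.5),
}

def play(beans, gender, rounds=5, seed=1):
    """Play several planting rounds; return the player's final beans (0 means ruin)."""
    rng = random.Random(seed)
    for _ in range(rounds):
        crop = rng.choice(list(CROPS))
        cost, payoff, p_success = CROPS[crop]
        if gender == "woman":   # the disadvantage assumed by the game's rules
            cost += 1
            payoff -= 2
            p_success -= 0.1
        if beans < cost:
            return 0            # financial ruin: cannot afford to plant
        beans -= cost
        if rng.random() < p_success:
            beans += payoff
    return beans

print("man:  ", play(beans=10, gender="man"))
print("woman:", play(beans=10, gender="woman"))
```

Writing the rules down this way forces the cause-and-effect assumptions (the costs, the risks, the gender penalty) into the open, which is exactly what makes the exercise useful for stakeholders.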

I have never met a group of people who were more sincere in their pursuit of positive change.  And honest in their struggle to evaluate their impact.  I believe that impact evaluation is an area where evaluators have something valuable to share with designers.

That was the purpose of my workshop Measuring Social Impact: How to Integrate Evaluation & Design.  I presented a number of techniques and tools we use at Gargani + Company to design and evaluate programs.  They are part of a more comprehensive program design approach that Stewart Donaldson and I will be sharing this summer and fall in workshops and publications (details to follow).

The hands-on format of the lab made for a great experience.  I was able to watch participants work through the real-world design problems that I posed.  And I was encouraged by how quickly they were able to use the tools and techniques I presented to find creative solutions.

That made my task of providing feedback on their designs a joy.  We shared a common conceptual framework and were able to speak a common language.  Given the abstract nature of social impact, I was very impressed with that—and their designs—after less than 90 minutes of interaction.

I wrapped up the conference by attending Three Cups, Rosa Parks, and the Polar Bear: Telling Stories that Work presented by Melanie Moore Kubo and Michaela Leslie-Rule from See Change.  They use stories as a vehicle for conducting (primarily) qualitative evaluations.  They call it story science.  A nifty idea.

I liked this session for two reasons.  First, Melanie and Michaela are expressive storytellers, so it was great fun listening to them speak.  Second, they posed a simple question—Is this story true?—that turns out to be amazingly complex.

We summarize, simplify, and translate meaning all the time.  Those of us who undertake (primarily) quantitative evaluations agonize over this because our standards for interpreting evidence are relatively clear but our standards for judging the quality of evidence are not.

For example, imagine that we perform a t-test to estimate a program’s impact.  The t-test indicates that the impact is positive, meaningfully large, and statistically significant.  We know how to interpret this result and what story we should tell—there is strong evidence that the program is effective.
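For readers who want to see the mechanics, here is a minimal sketch of that kind of impact estimate; the groups, score distributions, and sample sizes are entirely made up.

```python
# A made-up example of estimating a program's impact with a t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
program = rng.normal(loc=75, scale=10, size=100)     # hypothetical outcome scores, program group
comparison = rng.normal(loc=68, scale=10, size=100)  # hypothetical outcome scores, comparison group

impact = program.mean() - comparison.mean()          # estimated impact: difference in means
t_stat, p_value = stats.ttest_ind(program, comparison)

print(f"Estimated impact: {impact:.1f} points (t = {t_stat:.2f}, p = {p_value:.4f})")
```

With these invented numbers the comparison comes out positive, large, and statistically significant, which is the tidy story described above; the harder questions, raised next, are about whether the data feeding the calculation deserve that trust.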

But what if the outcome measure was not well aligned with the program’s activities? Or there were many cases with missing data?  Would our story still be true?  There is little consensus on where to draw the line between truth and fiction when quantitative evidence is flawed.

As Melanie and Michaela pointed out, it is critical that we strive to tell stories that are true, but equally important to understand and communicate our standards for truth.  Amen to that.

The icing on the cake was the conference evaluation.  Perhaps the best conference evaluation I have come across.

Everyone received four post-it notes, each a different color.  As a group, we were given a question to answer on a post-it of a particular color, and only a minute to answer the question.  Immediately afterward, the post-its were collected and displayed for all to view, as one would view art in a gallery.

Evaluation as art—I like that.  Immediate.  Intimate.  Transparent.

Gosh, I like designers.


Conference Blog: The Wharton “Creating Lasting Change” Conference

How can corporations promote the greater good?  Can they do good and be profitable?  How well can we measure the good they are doing?

These were some of the questions explored at a recent Wharton School Conference entitled Creating Lasting Change: From Social Entrepreneurship to Sustainability in Retail.  I provide a brief recap of the event.  Then I discuss why I believe program evaluators, program designers, and corporations have a great deal to learn from each other.

The Location

The conference took place at Wharton’s stunning new San Francisco campus.  By stunning I mean drop-dead gorgeous.

An Unusual and Effective Conference

The conference was jointly organized by three entities within the Wharton School—the Jay H. Baker Retailing Center, the Initiative for Global Environmental Leadership, and the Wharton Program for Social Impact.

When I first read this I scratched my head.  A conference that combined the interests of any two made sense to me.  Combining the interests of all three seemed like a stretch.  I found—much to my delight—that the conference worked very well because of its two-panel structure.

Panel 1 addressed the social and environmental impact of new ventures; Panel 2 addressed the impact of large, established corporations.  This offered an opportunity to compare and contrast new with old, small with large, and risk takers with the risk averse.

Fascinating and enlightening.  I explain why after I describe the panels.

Panel 1: Social Entrepreneurship/Innovation

The first panel considered how entrepreneurs and venture capitalists can promote positive environmental and social change.

  • Andrew D’Souza, Chief Revenue Officer at Top Hat Monocle, discussed how his company developed web-based clickers for classrooms and online homework tools that are designed to promote learning—a social benefit that can be directly monetized.
  • Mike Young, Director of Technology Development at Innova Dynamics, described how his company’s social mission drives their development and commercialization of “disruptive advanced materials technologies for a sustainable future.”
  • Amy Errett, Partner at the venture capital firm Maveron, emphasized the firm’s belief that businesses focusing on a social mission tend to achieve financial success.
  • Susie Lee, Principal at TBL Capital, outlined her firm’s patient capital approach, which favors companies that balance their pursuit of social, environmental, and financial objectives.
  • Raghavan Anand, Chief Financial Officer at One Million Lights, moderated the panel.

Panel 2: Sustainability/CSR in the Retail Industry

The second panel discussed how large, established companies impact society and the natural world, and what it means for a corporation to act responsibly.

Christy Consler, Vice President of Sustainability at Safeway Inc., made the case that the large grocer (roughly 1,700 stores and 180,000 employees) needs to focus on sustainable, socially responsible operations to ensure that it has dependable sources for its product—food—as the world population swells by 2 billion over the next 35 years.

Lori Duvall, Director of Operational Sustainability at eBay Inc., summarized eBay’s sustainability efforts, which include solar power installations, reusable packaging, and community engagement.

Paul Dillinger, Senior Director-Global Design at Levi Strauss & Co., made an excellent presentation on the social and environmental consequences—positive and negative—of the fashion industry, and how the company is working to make a positive impact.

Shauna Sadowski, Director of Sustainability at Annie’s (you know, the company that makes the cute organic, bunny-shaped mac and cheese), discussed how bringing natural foods to the marketplace motivates sustainable, community-centered operations.

Barbara Kahn moderated.  She wins the prize for having the longest title—the Patty & Jay H. Baker Professor, Professor of Marketing; Director, Jay H. Baker Retailing Center—and from what I could tell, she deserves every bit of the title.

Measuring Social Impact

I was thrilled to find corporations, new and old, concerned with making the world a better place.  Business in general, and Wharton in particular, have certainly changed in the 20 years since I earned my MBA.

The unifying theme of the panels was impact.  Inevitably, that discussion turned from how corporations were working to make social and environmental impacts to how they were measuring impacts.  When it did, the word evaluation was largely absent, being replaced by metrics, measures, assessments, and indicators.  Evaluation, as a field and a discipline, appears to be largely unknown to the corporate world.

Echoing what I heard at the Harvard Social Enterprise Conference (day 1 and day 2), impact measurement was characterized as nascent, difficult, and elusive.  Everyone wants to do it; no one knows how.

I find this perplexing.  Is the innovation, operational efficiency, and entrepreneurial spirit of American corporations insufficient to crack the nut of impact measurement?

Without a doubt, measuring impact is difficult—but not for the reasons one might expect.  Perhaps the greatest challenge is defining what one means by impact.  This venerable concept has become a buzzword, signifying both more and less than it should for different people in different settings.  Clarifying what we mean simplifies the task of measurement considerably.  In this setting, two meanings dominated the discussion.

One was the intended benefit of a product or service.  Top Hat Monocle’s products are intended to increase learning.  Annie’s foods are intended to promote health.  Evaluators are familiar with this type of impact and how to measure it.  Difficult?  Yes.  It poses practical and technical challenges, to be sure.  Nascent and elusive?  No.  Evaluators have a wide range of tools and techniques that we use regularly to estimate impacts of this type.

The other dominant meaning was the consequences of operations.  Evaluators are probably less familiar with this type of impact.

Consider Levi’s.  In the past, 42 liters of fresh water were required to produce one pair of Levi’s jeans.  According to Paul Dillinger, the company has since produced about 13 million pairs using a more water-efficient process, reducing the total water required for these jeans from roughly 546 million liters to 374 million liters—an estimated savings of 172 million liters.

Is that a lot?  The Institute of Medicine estimates that one person requires about 1,000 liters of drinking water per year (2.2 to 3 liters per day, making a variety of assumptions)—so Levi’s saved enough drinking water for about 172,000 people for one year.  Not bad.
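For anyone who wants to retrace the arithmetic, here is the back-of-envelope calculation implied by the figures above; the inputs are simply the rounded numbers reported in the talk and the Institute of Medicine estimate.

```python
# Back-of-envelope arithmetic behind the water figures above.
liters_per_pair_before = 42          # liters of fresh water per pair, old process
pairs_produced = 13_000_000          # pairs produced with the new process

water_before = liters_per_pair_before * pairs_produced   # ~546 million liters
water_after = 374_000_000                                # reported total with the new process
water_saved = water_before - water_after                 # ~172 million liters

liters_per_person_per_year = 1_000   # rough drinking-water requirement per person per year
print(f"saved: drinking water for ~{water_saved / liters_per_person_per_year:,.0f} people for a year")
print(f"still used: the equivalent for ~{water_after / liters_per_person_per_year:,.0f} people for a year")
```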

But operational impact is more complex than that.  Levi’s still used the equivalent of a year’s drinking water for 374,000 people in places where potable water may be in short supply.  The water that was saved cannot easily be moved to where it may be needed more for drinking, irrigation, or sanitation.  If the water that is used for the production of jeans is not handled properly, it may contaminate larger supplies of fresh water, resulting in a net loss of potable water.  And the availability of more fresh water in a region can change behavior in ways that negate the savings, such as attracting new industries that depend on water or inducing wasteful water consumption practices.

Is it difficult to measure operational impact?  Yes.  Even estimating something as tangible as water use is challenging.  Elusive?  No.  We can produce impact estimates, although they may be rough.  Nascent?  Yes and no.  Measuring operational impact depends on modeling systems, testing assumptions, and gauging human behavior.  Evaluators have a long history of doing these things, although not in combination for the purpose of measuring operational impact.

It seems to me that evaluators and corporations could learn a great deal from each other.  It is a shame these two worlds are so widely separated.

Designing Corporate Social Responsibility Programs

With all the attention given to estimating the value of corporate social responsibility programs, the values underlying them were not fully explored.  Yet the varied and often conflicting values of shareholders and stakeholders pose the most significant challenge facing those designing these programs.

Why do I say that?  Because it has been that way for over 100 years.

The concept of corporate social responsibility has deep roots.  In 1909, William Tolman wrote about a trend he observed in manufacturing.  Many industrialists, by his estimation, were taking steps to improve the working conditions, pay, health, and communities of their employees.  He noted that these unprompted actions had various motives—a feeling that workers were owed the improvements, unqualified altruism, or the belief that the efforts would lead to greater profits.

Tolman placed a great deal of faith in the last motive.  Too much faith.  Twentieth-century industrial development was not characterized by rational, profit-maximizing companies competing to improve the lot of stakeholders in order to increase the wealth of shareholders.  On the contrary, making the world a better place typically entailed tradeoffs that shareholders found unacceptable.

So these early efforts failed.  The primary reason was that their designs did not align the values of shareholders and stakeholders.

Can the values of shareholders and stakeholders be more closely aligned today?  I believe they can be.  The founders of many new ventures, like Top Hat Monocle and Innova Dynamics, bring different values to their enterprises.  For them, Tolman’s nobler motives—believing that people deserve a better life and a desire to do something decent in the world—are the cornerstones of their company cultures.  Even in more established organizations—Safeway and Levi’s—there appears to be a cultural shift taking place.  And many venture capital firms are willing to take a patient capital approach, waiting longer and accepting lower returns, if it means they can promote a greater social good.

This is change for the better.  But I wonder if we, like Tolman, are putting too much faith in win-win scenarios in which we imagine shareholders profit and stakeholders benefit.

It is tempting to conclude that corporate social responsibility programs are win-win.  The most visible examples, like those presented at this conference, are.  What lies outside of our field of view, however, are the majority of rational, profit-seeking corporations that are not adopting similar programs.  Are we to conclude that these enterprises are not as rational as they should be? Or have we yet to design corporate responsibility programs that resolve the shareholder-stakeholder tradeoffs that most companies face?

Again, there seems to be a great deal that program designers, who are experienced at balancing competing values, and corporations can learn from each other…if only the two worlds met.


Running Hot and Cold for Mixed Methods: Jargon, Jongar, and Code

Jargon is the name we give to big labels placed on little ideas. What should we call little labels placed on big ideas? Jongar, of course.

A good example of jongar in evaluation is the term mixed methods. I run hot and cold for mixed methods. I praise them in one breath and question them in the next, confusing those around me.

Why? Because mixed methods is jongar.

Recently, I received a number of comments through LinkedIn about my last post. A bewildered reader asked how I could write that almost every evaluation can claim to use a mixed-methods approach. It’s true, I believe that almost every evaluation can claim to be a mixed-methods evaluation, but I don’t believe that many—perhaps most—should.

Why? Because mixed methods is also jargon.

Confused? So were Abbas Tashakkori and John Creswell. In 2007, they put together a very nice editorial for the first issue of the Journal of Mixed Methods Research. In it, they discussed the difficulty they faced as editors who needed to define the term mixed methods. They wrote:

…we found it necessary to distinguish between mixed methods as a collection and analysis of two types of data (qualitative and quantitative) and mixed methods as the integration of two approaches to research (quantitative and qualitative).

By the first definition, mixed methods is jargon—almost every evaluation uses more than one type of data, so the definition attaches a special label to a trivial idea. This is the view that I expressed in my previous post.

By the second definition, which is closer to my own perspective, mixed methods is jongar—two simple words struggling to convey a complex concept.

My interpretation of the second definition is as follows:

A mixed-methods evaluation is one that establishes in advance a design that explicitly lays out a thoughtful, strategic integration of qualitative and quantitative methods to accomplish a critical purpose that either qualitative or quantitative methods alone could not.

Although I like this interpretation, it places a burden on the adjective mixed that it cannot support. In doing so, my interpretation trades one old problem—being able to distinguish mixed methods evaluations from other types of evaluation—for a number of new problems. Here are three of them:

  • Evaluators often amend their evaluation designs in response to unanticipated or dynamic circumstances—so what does it mean to establish a design in advance?
  • Integration is more than having quantitative and qualitative components in a study design—how much more and in what ways?
  • A mixed-methods design should be introduced when it provides a benefit that would not be realized otherwise—how do we establish the counterfactual?

These complex ideas are lurking behind simple words. That’s why the words are jongar and why the ideas they represent may be ignored.

Technical terms—especially jargon and jongar—can also be code. Code is the use of technical terms in real-world settings to convey a subtle, non-technical message, especially a controversial message.

For example, I have found that in practice funders and clients often propose mixed methods evaluations to signal—in code—that they seek an ideological compromise between qualitative and quantitative perspectives. This is common when program insiders put greater faith in qualitative methods and outsiders put greater faith in quantitative methods.

When this is the case, I believe that mixed methods provide an illusory compromise between imagined perspectives.

The compromise is illusory because mixed methods are not a middle ground between qualitative and quantitative methods, but a new method that emerges from the integration of the two. At least by the second definition of mixed methods that I prefer.

The perspectives are imagined because they concern how results based on particular methods may be incorrectly perceived or improperly used by others in the future. Rather than leap to a mixed-methods design, evaluators should discuss these imagined concerns with stakeholders in advance to determine how to best accommodate them—with or without mixed methods. In many funder-grantee-evaluator relationships, however, this sort of open dialogue may not be possible.

This is why I run hot and cold for mixed methods. I value them. I use them. Yet, I remain wary of labeling my work as such because the label can be…

  • jargon, in which case it communicates nothing;
  • jongar, in which case it cannot communicate enough; or
  • code, in which case it attempts to communicate through subtlety what should be communicated through open dialogue.

Too bad—the ideas underlying mixed methods are incredibly useful.


Evaluator, Watch Your Language

As I was reading a number of evaluation reports recently, the oddity of evaluation jargon struck me.  It isn’t that we have unusual technical terms—all fields do—but that we use everyday words in unusual ways.  It is as if we speak in a code that only another evaluator can decipher.

I jotted down five words and phrases that we all use when we speak and write about evaluation.  On the surface, their meanings seem perfectly clear.  However, they can be used for good and bad.  How are you using them?

(1) Suggest

As in: The data suggest that the program was effective.

Pros: Suggest is often used to avoid words such as prove and demonstrate—a softening of “this is so” to “this seems likely.”  Appropriate qualification of evaluation results is desirable.

Cons: Suggest is sometimes used to inflate weak evidence.  Any evaluation—strong or weak—can be said to suggest something about the effectiveness of a program. Claiming that weak evidence suggests a conclusion overstates the case.

Of special note:  Data, evaluations, findings, and the like cannot suggest anything.  Authors suggest, and they are responsible for their claims.

(2) Mixed Methods

As in: Our client requested a mixed-methods evaluation.

Pros: Those who focus on mixed methods have developed thoughtful ways of integrating qualitative and quantitative methods.  Thoughtful is desirable.

Cons: All evaluations use some combination of qualitative and quantitative methods, so any evaluation can claim to use—thoughtfully or not—a mixed-methods approach.  A request for a mixed-methods evaluation can mean that clients are seeking an elusive middle ground—a place where qualitative methods tell the program’s story in a way that insiders find convincing and quantitative methods tell the program’s story in a way that outsiders find convincing.  The middle ground frequently does not exist.

(3) Know

As in: We know from the literature that teachers are the most important school-time factor influencing student achievement.

Pros: None.

Cons: The word know implies that claims to the contrary are unfounded.  This shuts down discussion on topics for which there is almost always some debate.  One could argue that the weight of evidence is overwhelming, the consensus in the field is X, or we hold this belief as a given.  Claiming that we know, with rare exception, overstates the case.

(4) Nonetheless [we can believe the results]

As in: The evaluation has flaws, nonetheless it reaches important conclusions.

Pros: If the phrase is followed by a rationale (…because of the following reasons…), it might indicate something quite important.

Cons: All evaluations have flaws, and it is the duty of evaluators to bring them to the attention of readers.  If the reader is then asked to ignore the flaws, without being given a reason, it is at best confusing and at worst misleading.

(5) Validated Measure

As in: We used the XYZ assessment, a previously validated measure.

Pros: None

Cons: Validity is not a characteristic of a measure. A measure is valid for a particular group of people for a particular purpose in a particular context at a specific point in time.  This means that evaluators must make the case that all of the measures that they used were appropriate in the context of the evaluation.

The Bottom Line

I am guilty of sometimes using bad language.  We all are.  But language matters, even in casual conversations among knowledgeable peers.  Bad language leads to bad thinking, as my mother always said.  So I will endeavor to watch my language and make her proud.  I hope you will too.


Conference Blog: The Harvard Social Enterprise Conference (Day 2)

What follows is a second series of short posts written while I attended the Social Enterprise Conference (#SECON12).  The conference (February 25-26) was presented by the Harvard Business School and the Harvard Kennedy School.

I spent much of the day attending a session entitled ActionStorm: A Workshop on Designing Actionable Innovations.  Suzi Sosa, Executive Director of the Dell Social Innovation Challenge, did a great job of introducing a design thinking process for those developing new social enterprises.  I plan to blog more about design thinking in a future post.

The approach presented in the workshop combined basic design thinking activities (mind mapping, logic modeling, and empathy mapping) that I believe can be of value to program evaluators as well as program designers.

I wonder, however, how well these methods fit the world of grant-funded programs.  Increasingly, the guidelines for grant proposals put forth by funding agencies specify the core elements of a program’s design.  It is common for funders to specify the minimum number of contact hours, desired length of  service, and required service delivery methods.  When this is the case, designers may have little latitude to innovate, closing off opportunities to improve quality and efficiency.

Evaluation moment #4: Suzi constantly challenged us to specify how, where, and why we would measure the impact of the social enterprises we were discussing.  It was nice to see someone advocating for evaluation “baked into” program designs.  The participants were receptive, but they seemed somewhat daunted by the challenge of measuring impact.

Next, I attended Taking Education Digital: The Impact of Sharing Knowledge.  Chris Dede (Harvard Graduate School of Education) moderated.  I have always found his writing insightful and thought-provoking, and he did not disappoint today.  He provided a clear, compelling call for using technology to transform education.

His line of reasoning, as I understand it, is this: the educational system, as currently structured, lacks the capacity to meet federal and state mandates to increase (1) the quality of education delivered to students and (2) desired high school and college graduation rates.  Technology can play a transformational role by increasing the quality and capacity of the educational system.

Steve Carson followed by describing his work with MIT OpenCourseWare, which illustrated very nicely the distinction between innovation and transformation.  MIT OpenCourseWare was, at first, a humble idea–use the web to make it easier for MIT students and faculty to share learning-related materials.  Useful, but not innovative (as the word is typically used).

It turned out that the OpenCourseWare materials were being used by a much larger, more diverse group of formal and informal learners for wonderful, unanticipated educational purposes.  So, without intending to, MIT had created a technology with none of the trappings of innovation yet tremendous potential to be transformational.

The moral of the story: social impact can be achieved in unexpected ways, and in cultures that value innovation, the most unexpected way is to do something unexceptional exceptionally well.

Next, Chris Sprague (OpenStudy) discussed his social learning startup.  OpenStudy connects students to each other–so far 150,000 from 170 countries–in ways that promote learning.  Think of it as a worldwide study hall.

Social anything is hot in the tech world, but this is more than Facebook dressed in a scholar’s robe.  The intent is to create meaningful interactions around learning, tap expertise, and spark discussions that build understanding.  Think about how much you can learn about a subject simply by having a cup of coffee with an expert.  Imagine how much more you could learn if you were connected to more experts and did not need to sit next to them in a cafe to communicate.

The Pitch for Change took place in the afternoon.  It was the culmination of a process in which young social entrepreneurs gave “elevator pitches” describing new ventures.  Those with the best pitches were selected to move on to the next round, where they made another pitch.

To my eyes, the final round combined the most harrowing elements of job interviews and Roman gladiatorial games–one person enters the arena, fights for survival for three minutes, and then looks to the crowd for thumbs up or down.  Of course, they don’t use thumbs–that would be too BC (before connectivity).  Instead, they use smartphones to vote via the web.

At the end, the winners were given big checks (literally, the checks were big; the dollar amounts, not so much).

But winners receive more than a little seed capital.  The top two winners are fast-tracked to the semifinal round of the 2013 Echoing Green Fellowship, the top four winners are fast-tracked to the semifinal round of the 2012 Dell Social Challenge, and the project that best makes use of technology to solve a social or environmental problem wins the Dell Technology Award.  Not bad for a few minutes in the arena.

Afterward, Dr. Judith Rodin, President of the Rockefeller Foundation, made the afternoon keynote speech, which focused on innovation.  She is a very good speaker and the audience was eager to hear about the virtues of new ideas.  It went over well.

Evaluation moment #5: Dr. Rodin made the case for measuring social impact.  She described it as essential to traditional philanthropy and to more recent efforts around social impact investing.  She noted that Rockefeller is developing its capacity in this area; however, evaluation remains a tough nut to crack.

The last session of the day was fantastic–and not just because an evaluator was on the panel.  It was entitled If at First You Don’t Succeed: The Importance of Prototyping and Iteration in Poverty Alleviation.  Prototyping is not just a subject of interest for me, it is a way of life.

Mike North (ReAllocate) discussed how he leverages volunteers–individuals and corporations–to prototype useful, innovative products.  In particular, he described his ongoing efforts to prototype an affordable corrective brace for children in developing countries who are born with clubfoot.  You can learn more about it in this video.

Timothy Prestero (Design that Matters) walked us through the process he used to prototype the Firefly.  About 60% of newborns in developing countries suffer from jaundice, and about 10% of these go on to suffer brain damage or another disability.  The treatment is simple–exposure to blue light.  Firefly is the light source.

What is so hard about designing a lamp that shines a blue light?  Human behavior.

For example, hospital workers often put more than one baby in the same phototherapy device, which promotes infectious disease.  Consequently, Firefly needed to be designed in such a way that only one baby could be treated at a time.  It also needed to be inexpensive in order to address the root cause of the problem behavior–too few devices in hospitals.  Understanding these behaviors, and designing with them in mind, requires lengthy prototyping.

Molly Kinder described her work at Development Innovation Ventures (DIV), a part of USAID.  DIV provides financial and other support to innovative projects selected through a competitive process.  In many ways, it looks more like a new-style venture fund than part of a government agency.  And DIV rigorously evaluates the impact of the projects it supports.

Evaluation moment #6: Wow, here is a new-style funder routinely doing high-quality evaluations—including but not limited to randomized control trials—in order to scale projects strategically.

Shawn Powers, from the Jameel Poverty Action Lab at MIT (J-PAL), talked about J-PAL’s efforts to conduct randomized trials in developing countries.  Not surprisingly, I am a big fan of J-PAL, which is dedicated to finding effective ways of improving the lives of the poor and bringing them to scale.

Looking back on the day:  The tight connection between design and evaluation was a prominent theme.  While exploring the theme, the discussion often turned to how evaluation can help social enterprises scale up.  It seems to me that we first need to scale up rigorous evaluation of social enterprises.  The J-PAL model is a good one, but it isn’t possible for academic institutions to scale up fast enough or large enough to meet the need.  So what do we do?


Conference Blog: The Harvard Social Enterprise Conference (Day 1)

What follows is a series of short posts written while I attended the Social Enterprise Conference (#SECON12).  The conference (February 25-26) was presented by the Harvard Business School and the Harvard Kennedy School.

What is a social enterprise?

The concept of a social enterprise is messy.  By various definitions, it can include:

  • a for-profit company that seeks to benefit society;
  • a nonprofit organization that uses business-like methods;
  • a foundation that employs market investing principles; and
  • a government agency that leverages the work of private-sector partners.

The concept of a social enterprise is disruptive. It blurs the lines separating organizations that do good for stakeholders, do well for shareholders, and do right by constituents.

The concept of a social enterprise is inspiring.  It can foster flexible, creative solutions to our most pressing problems.

The concept of a social enterprise is dangerous.  It can attach the patina of altruism to organizations motivated solely by profits.

The concept of a social enterprise is catching fire.  The evaluation community needs to learn how it fits into this increasingly common type of organization.

The conference started with a young entrepreneurs keynote panel that was moderated by Daniel Epstein (Unreasonable Institute).

Kavita Shukla of Fenugreen discussed the product she invented.  Amazing.  It is a piece of paper permeated with organic, biodegradable herbs.  So what?  It keeps produce fresh 2-4 times longer.  The potential social and financial impact of the product—especially in parts of the world where food is in short supply and refrigeration scarce—is tremendous. Watch a TED talk about it here.

Next, Taylor Conroy (Destroy Normal Consulting) discussed his fundraising platform that allows people to raise $10,000 in three hours for projects like building schools in developing countries.  Sound crazy?  Check it out here and decide for yourself.

Finally, Lauren Bush (FEED Projects) discussed how she has used the sale of FEED bags and other fashion items to provide over 60 million meals for children in need around the world.

Evaluation moment #1: The panelists were asked how they measured the social impact of their enterprises.  Disappointingly, they do not seem to be doing so in a systematic way beyond counting units of service provided or number of products sold—a focus on outputs, not outcomes.

The first session I attended had the provocative title Social Enterprise: Myth or Reality?: Measuring Social Impact and Attracting Capital. Jim Bildner did an outstanding job as moderator.  Panelists included Kimberlee Cornett (Kresge Foundation), Clara Miller (F. B. Heron Foundation), Margaret McKenna (Harvard Kennedy School), and David Wood (Hauser Center for Nonprofit Organizations).

The discussion addressed three questions.

Q: What is social enterprise?

A: It apparently can be anything, but it should be something that is more precisely defined.

Q: How are foundations and financial investors getting involved?

A: By making loans and taking equity stakes in social enterprises.  That promotes social impact through the enterprise and generates more cash to invest in other social enterprises.

Evaluation moment #2: Q: How can the social impact of enterprises be measured?

A: It isn’t.  One panelist suggested that measuring social impact is such a tough nut to crack that, if someone could figure out how, it would make for a fantastic new social enterprise.  I was both shocked and flattered, given I have been doing just that for decades.  Why were there no evaluators on this panel?


Ami Dalal and Jo-Ann Tan of Acumen Fund conducted a “bootcamp” on the approach their firm uses to make social investments.  They focused on methods of due diligence and valuation (that is, how they attach a dollar value to a social enterprise).

I found their approach to measuring the economic impact of their investments very interesting—perhaps evaluators would benefit from learning more about it.  There are details at their website.

Evaluation moment #3

When the topic of measuring the social impact of their investments came up, the presenters provided the most direct answer I have heard so far.  They always measure outputs—those are easy to measure and can indicate if something is going wrong.  In some cases they also measure outcomes (impacts) using randomized control trials.  Given the cost, they do this infrequently.

Looking back on the day

A social enterprise that measures social impact but does not measure financial success would be considered ridiculous.  Yet a social enterprise that measures financial success but does not measure social impact is not.  Why?
