Forecasting Results in a Pay for Success World

Published Wednesday, February 7, 2018 | by Christine Kidd

How much do you think your model impacts the people you serve? How confident are you in that estimate? If you’re a service provider getting involved in Pay for Success (PFS), it’s likely because you’re confident that what you’re doing works. PFS may force you to go a step further and start setting expectations about how much it works.

In addition to resolving some big questions discussed in our earlier post, PFS project partners have to reach agreement on a payment schedule that links provider performance to government repaying investors. The schedule and associated performance have to meet a few requirements.

Performance needs to be significant enough to justify government payment, yet conservative enough to make investors feel reasonably confident they’ll get repaid. Expectations also need to be realistic enough that a provider can be reasonably confident in their ability to deliver on results.

All these parties need to find alignment, even as the project might face additional dimensions of risk including the uncertainty that comes with variables like scaling to new cities or adjusting the program model to accommodate an evaluation. 

This post focuses on the challenges providers face in deciding what’s realistic. At the Center for Employment Opportunities (CEO) we’ve grappled with this question in PFS and performance-based contracting negotiations and hope our experience can help to inform the field.

The Difficulty of Predicting Results

What do RCTs tell us and what do they not tell us?

If it were just a matter of predicting how many of CEO’s participants would obtain unsubsidized employment, for example, this might not be so challenging. At CEO we track data like this as part of our routine performance management. But in many PFS projects government wants to know about the value-add of an intervention, meaning the difference the program is making compared to a control group. That's what evaluators call the impact question, or establishing causation instead of correlation.

Many of us have been paying attention and evaluating our programs, often using Randomized Controlled Trials (RCTs) or quasi-experimental designs. So for those of us with past RCTs, why not take the results of the earlier study and commit to achieving them again? You did an RCT already, so you can replicate that impact, right? After all, RCTs are the “Gold Standard,” right?

Well, yes - RCTs are the “Gold Standard” for establishing causality or “internal validity” because the only difference between the treatment and control group should be the intervention being studied. RCTs can provide key insights on the marginal impact of your program and can provide compelling evidence of your success.

So what’s the holdup?

Evaluators would remind us that, while rigorous in establishing causality, RCTs’ weakness is their “external validity,” or ability to predict whether impacts would persist in a different setting, for a different population, or at a different point in time.

Like other policy areas, the field of reentry is affected by local and state-level variables like “ban the box”, mandatory minimum sentences, and regulations that bar people with convictions from sectors of the economy. All of these change across time and location, affecting the success of organizations like CEO and the people we serve.

Impact estimates and confidence intervals

But even if you were replicating a study in exactly the same way, evaluations only give you an impact estimate. CEO’s MDRC study, for example, estimated a 16% reduction in the number of days that the treatment group spent behind bars. When a provider like CEO achieves statistically significant results, it’s a big deal. In the strictest sense, however, statistical significance means that the observed difference would have been unlikely to arise by chance if the program truly had zero impact. If you’ve heard the term “rejecting the null hypothesis,” that’s what it means - we get to reject the idea that we have zero impact.

In addition to an impact estimate (warning - stats ahead), evaluators usually offer us a 95% confidence interval - a range of likely impacts as demonstrated by the data. The technical definition is that if they were to repeat the same study 100 times, about 95 of the resulting confidence intervals would contain the true impact - conversely, there’s a 5 in 100 chance that an interval misses the true impact entirely. For example, a study might say that treatment group members in a weight loss program lost 15 pounds more on average than a control group, with a 95% confidence interval ranging from 5 to 25 pounds lost. Since the interval does not include zero, we can be fairly confident that the intervention helps people lose weight, but if we repeated the study the impact estimate might fall closer to 5 pounds lost or closer to 25 pounds lost. While we should still be really excited and impressed any time a program shows statistically significant results, as a field we need to remember the range, not just the average, even though it can be easier to focus on a specific number.
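The arithmetic behind that interval is simple enough to sketch. The numbers below reuse the hypothetical weight-loss example; the standard error is an assumption chosen to reproduce the roughly 5-to-25-pound interval, not a value from any real study:

```python
from statistics import NormalDist

# Illustrative numbers from the weight-loss example above: a 15-pound
# average treatment effect with an assumed standard error of 5.1 pounds.
impact_estimate = 15.0
standard_error = 5.1

# For a 95% interval, use the z-value that leaves 2.5% in each tail (~1.96).
z = NormalDist().inv_cdf(0.975)

low = impact_estimate - z * standard_error
high = impact_estimate + z * standard_error
print(f"95% CI: ({low:.1f}, {high:.1f})")   # roughly (5.0, 25.0)

# The interval excludes zero, so the result is statistically significant
# at the 5% level -- but the plausible impact still spans 5 to 25 pounds.
print("statistically significant:", low > 0 or high < 0)
```

The same estimate with a larger standard error would produce an interval that crosses zero, which is exactly the "not statistically significant" case described above.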

Control group experience & changing contexts

Another variable that can change across times and locations is the control group’s experience. As David Butler, CEO’s evaluation consultant and former evaluator for MDRC, describes it, “You could have the same population, same policies, same design and a perfect replication of the intervention. But if the control services are very different than in the initial evaluation setting, all bets are off...if there are other programs in the community that offer very similar services to you that is not a good place for an RCT.” In the criminal justice field, for example, Cognitive Behavioral Therapy (CBT) has expanded dramatically over the past decade. If a CBT evaluation were done in the 1990s, it would be unlikely that control group members were getting CBT elsewhere; if conducted today, control group members would be much more likely to access similar interventions in other settings. In that case the evaluation will tell you how effective a CBT program is relative to other CBT programs; it will not tell you whether CBT is effective as a practice unless control group members are somehow excluded from receiving CBT.

The challenge of guessing impact estimates

At CEO we’ve completed three evaluations that were either experimental or quasi-experimental: an RCT by MDRC, and matched comparison studies by Harder & Co and New York State. Each contained strong signs of CEO's efficacy, but the impact estimates (how much CEO’s model works) differed across all three. CEO’s MDRC evaluation showed positive results across a range of measures and subgroups - an 88 bed-day (30%) reduction over 3 years for high-risk individuals, a 22% bed-day decrease for recently released individuals, and other impacts that varied by metric and subgroup.[1] A matched comparison study of our San Diego site showed an 11 bed-day (21%) reduction, but without a uniform observation window.[2] For some subgroups and measures in these studies, the data went in the opposite direction, although without statistical significance. To add to the complication, Harder & Co looked only at jail days, while MDRC looked at a more comprehensive measure, and MDRC was, by most accounts, a more rigorous and comprehensive evaluation than the matched comparison studies. CEO’s evaluation data makes us confident that the model works, but it is really hard to know exactly how much it will work from study to study. The dimensions of local context, time, and geography make predicting results with precision harder still. When designing a payment schedule, PFS negotiations may push you to put a stake in the ground even in the face of this uncertainty.

So what’s a service provider to do?

Even with all of this uncertainty -- and this only addresses one risk area within PFS negotiations -- the benefits of PFS remain persuasive. Recognizing this, we offer the following recommendations to providers.

  • Negotiate multiple paths to victory. There is no perfect metric to evaluate a program; in the reentry and employment field where CEO works there are dozens of performance indicators to choose from, each with benefits and drawbacks. Data can behave unpredictably; some metrics have higher variation than others, making it more difficult to power an evaluation and anticipate results. Rather than choosing only one or two metrics to drive repayment, select a range of success indicators that complement one another and that you feel more fully encompass your organization’s value. Regardless of payment metrics, ensure that an evaluation will provide data broken out by subgroups to allow you to maximize your understanding of where and for whom your program had impact.
  • Hire an evaluation consultant. This post is only scratching the surface of the statistical complexities that come into play when evaluating a program. CEO has a part time evaluation consultant who was part of our MDRC evaluation team. He knows CEO well and has spent his career designing and implementing evaluations. Can’t afford one? Get one on your board or convene an evaluation advisory committee. Our colleagues at Roca have created an evaluation advisory board to provide guidance. While the project evaluator may be able to answer questions that arise, coaching a provider through the design process is unlikely to be in their scope of work and they certainly are not tasked with warning you of the risks of an evaluation. Having in-house capacity can make you a stronger partner at the negotiating table.
  • Avoid payment thresholds. Some PFS projects have payment thresholds where governments say that they won’t repay investors for anything below a certain impact level. Instead, try to negotiate a structure where payments start with any impact above zero and increase with greater levels of impact.
  • Insist on pilot periods. Stakeholders from many sides of the table are seeing the value of a test period before the pressures of an evaluation kick in. These test periods serve as a time to practice the evaluation protocol and test assumptions like referral numbers and program attrition. While pilot periods won’t tell you whether you’re on track to achieve impact, they can flag other potential issues which would have downstream effects or affect the power of an evaluation.
  • Evaluate established programs or sites. If you’re using PFS to evaluate a new office or new program, the clearest advice we’ve heard is “don’t evaluate til you’re proud”; don’t let the evaluation kick in until you’re confident that you’re delivering the intervention as intended.[3] While CEO has a defined model and experience with replication, we know that our new offices usually take 12-24 months to hit their stride. Along with the technical challenges of opening a new office, things like relationships with parole and the local employer community just take time to build and have a significant impact on performance.
  • Use conservative language in PR and communications. Does your evaluation design require someone else - a parole office, doctor, social worker - to act in a certain way to get the right people to your door? If so, the evaluation will take into account more than just what happens behind your doors. Describe the intervention as a partnership rather than just your program.
  • Be willing to walk away. Talk with your decision makers regularly and do your best to make sure everyone understands the benefits and tradeoffs of the evaluation options in front of you. As much as possible, identify non-negotiables and communicate them to your project partners, but avoid ultimatums - PFS negotiations are long and new information emerges throughout which may change your thinking. Above all, remember that if you do walk away, you will continue to deliver an excellent program and can find other resources to scale and evaluate it. 
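The first recommendation above notes that high-variance metrics make it harder to power an evaluation. A standard back-of-the-envelope sample-size formula for comparing two group means shows why; the effect size and standard deviations below are invented purely for illustration:

```python
import math
from statistics import NormalDist

def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sample comparison of means.

    Uses the standard normal-approximation formula:
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Hypothetical: detecting the same 15 bed-day reduction with a low- vs. a
# high-variance outcome metric (both standard deviations are assumptions).
print(n_per_arm(effect=15, sd=20))   # 28 participants per arm
print(n_per_arm(effect=15, sd=40))   # 112 per arm -- 4x the sample
```

Doubling the outcome's standard deviation quadruples the required sample, which is why a noisier payment metric can quietly make the same evaluation far more expensive to power.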

CEO remains committed to the role of evaluation in our work. Each of our evaluations has expanded our understanding of our participants, our model, and the world of reentry and workforce development. In the process we have become savvier about evaluation choices and humbled by the complexity of the work; we hope these lessons learned can help others in the field to engage with their evidence in a more sophisticated way and use it to advance their mission.

Christine Kidd is the Director of Program Innovation at the Center for Employment Opportunities (CEO). CEO helps men and women coming home from incarceration to find and keep jobs. CEO has been involved in two PFS projects and multiple performance-based contracting efforts.

This blog was supported through funding awarded in 2014 by the Corporation for National and Community Service Social Innovation Fund.

The Corporation for National and Community Service is the federal agency for volunteering, service, and civic engagement. The agency engages millions of Americans in citizen service through its AmeriCorps, Senior Corps, and Volunteer Generation Fund programs, and leads the nation's volunteering and service efforts.

The Social Innovation Fund (SIF) was a program of the Corporation for National and Community Service that received funding from 2010 to 2016. Using public and private resources to find and grow community-based nonprofits with evidence of results, SIF intermediaries received funding to award subgrants that focus on overcoming challenges in economic opportunity, healthy futures, and youth development. Although CNCS made its last SIF intermediary awards in fiscal year 2016, SIF intermediaries will continue to administer their subgrant programs until their federal funding is exhausted.


[1], page 113


[2], page 5. I use “Uniform Observation Window” to mean that every participant in a study was observed for the same amount of time. In re-entry, this might mean observing everyone for 2 years post-release. Alternatively, some studies observe everyone who interacted with a system during a certain time period; for example, they might get administrative data from 2010-2012 but some people in that data might only be in the community for the last month of that observation period while others would have all three years of observation time.
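One common way to compare outcomes across unequal observation windows is to rate-adjust them - for example, bed days per year observed rather than raw bed days. A minimal sketch, with invented participant records:

```python
# Rate-adjusting an outcome when observation windows differ in length.
# Each (bed_days, days_observed) pair is an invented participant record.
participants = [
    (30, 1095),   # observed for 3 years
    (10, 365),    # observed for 1 year
    (5, 30),      # observed for only 1 month
]

for bed_days, days_observed in participants:
    rate = bed_days / days_observed * 365   # bed days per year observed
    print(f"{rate:.1f} bed days per year observed")
```

On raw counts the first participant looks worst, but per year observed the first two are identical; the short-window participant shows how partial observation can distort comparisons if windows aren't equalized or rate-adjusted.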


[3] All credit to Professor Larry Bailis from Brandeis University’s Heller School for Social Policy and Management, who said this to me at a meeting.