Program Evaluation Glossary
For more on understanding program evaluation research, see Early Childhood Program Evaluations: A Decision-Maker’s Guide.
- Effect Sizes
- "Intent to Treat" Impacts
- Quasi-Experimental Design
- Randomized, Controlled Trial/Experimental Study
- Regression Discontinuity Design (RDD)
- Statistical Significance
- "Treatment on the Treated" Impacts
- Weighted Data Analysis
Effect Sizes
Increasingly, program evaluators express impacts as "effect sizes," which are a statistical means for comparing outcomes that may otherwise be difficult to compare. For example, the scales of the SAT test and the IQ test are completely different, so it’s difficult to compare one program that raises SAT test scores by 20 points, and another that raises IQ scores by 5 points. "Effect sizes" provide the solution. By subtracting the outcomes of the control group from the outcomes of the treatment group, we get an effect (e.g., raising SAT scores by 20 points). By dividing that effect by the study’s "standard deviation" (which indicates how widely dispersed the results are from the mean), we get an effect size -- a fraction that indicates how large the effects are in comparison to the scale of results.
The SAT test, for example, is scaled with a standard deviation of 100, so a program that boosted SAT scores by 10 points would have an effect size of 0.1, or one-tenth of a standard deviation, which is considered very small. IQ tests are typically scaled with a standard deviation of 15, so a program that boosted IQ scores by 10 points would have an effect size of 0.67, or two-thirds of a standard deviation, which is much larger. Generally speaking, the larger the effect size, the better. Conventional guidelines consider effect sizes of at least 0.8 as "large"; 0.3 to 0.8 as "moderate"; and less than 0.3 as "small." The best studies translate effect sizes into practical information. For example, effects on a standardized measure of achievement might be translated into the fraction of a school year by which the program group exceeds the control group. Effect sizes on grade retention can be translated into percentages of children held back a grade.
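The arithmetic behind the SAT and IQ comparison can be sketched in a few lines of Python. The function name and the group means below are hypothetical, chosen only so that each program shows a 10-point gain; the standard deviations of 100 and 15 come from the text.

```python
def effect_size(treatment_mean, control_mean, standard_deviation):
    """Standardized effect: (treatment mean - control mean) / standard deviation."""
    if standard_deviation <= 0:
        raise ValueError("standard deviation must be positive")
    return (treatment_mean - control_mean) / standard_deviation

# Hypothetical group means, each producing a 10-point gain.
sat_effect = effect_size(510, 500, 100)  # SAT scale, standard deviation of 100
iq_effect = effect_size(110, 100, 15)    # IQ scale, standard deviation of 15

print(round(sat_effect, 2), round(iq_effect, 2))  # 0.1 0.67
```

The same 10-point gain yields very different effect sizes because each is measured against the spread of its own scale.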
"Intent to Treat" Impacts
When evaluating interventions in which substantial numbers of children or families fail to take up any of the offered services, there is an important technical detail that must be addressed. Should program effects be considered for only those who receive the services or for all families who are offered the program, regardless of whether they participate? Effects assessed across all children or families offered program services, regardless of whether they actually used them, are called "intent to treat" (or ITT) impacts. They answer the vital policy question about the effects of the program on all families that are offered services. Suppose, however, that services are highly effective for those who participate, but only a small fraction of the targeted children or families actually use them. The intent to treat impact estimates will show that the overall impact on targeted families is small and will point to implementation or program take-up as a key problem in program design.
For example, in programs designed to promote residential mobility among public housing residents, between one-quarter and one-half of the families that are offered financial assistance and mobility counseling fail to take advantage of the offer. Thus, an evaluation of the impact of the program across all families offered the chance to move would show much smaller effects than an evaluation focused only on those families that actually moved in conjunction with the program (see "Treatment on the Treated" Impacts). However, ITT effects take into consideration the reality that not all families will use a service offered to them, and may point out problems with the program’s accessibility to target families.
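As a rough illustration of how an ITT impact is computed, the sketch below averages outcomes over every family offered the program, whether or not the family participated, and subtracts the control group average. All outcome values, field names, and the function name are invented for this example.

```python
from statistics import mean

# Hypothetical families offered the program; half never took up the services.
offered = [
    {"participated": True, "outcome": 60},
    {"participated": True, "outcome": 62},
    {"participated": False, "outcome": 50},
    {"participated": False, "outcome": 52},
]
# Hypothetical control group outcomes.
control_outcomes = [50, 52, 49, 53]

def itt_impact(offered_group, control):
    """Mean outcome of everyone offered the program minus the control mean."""
    return mean(f["outcome"] for f in offered_group) - mean(control)

take_up_rate = mean(1 if f["participated"] else 0 for f in offered)
impact = itt_impact(offered, control_outcomes)

print(take_up_rate, impact)  # 0.5 5
```

In this invented data the participants score about 10 points above the control average, but because only half of the offered families participated, the ITT impact is 5 points.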
Quasi-Experimental Design
A quasi-experimental design is one in which the comparison group is formed by some method other than random assignment (see Randomized, Controlled Trial/Experimental Study). Although random assignment of children or parents to program and comparison groups is the "gold standard" for program evaluation, sometimes this is not possible. In some circumstances, a randomized controlled trial is neither practical nor ethical. For example, if access to services is a legal entitlement, denying program services to some children would be a violation of the law. In such cases, alternative ways of constructing "no treatment" groups are needed, and it is essential that the children and families in the comparison group be as similar to the program group as possible.
The strengths of quasi-experimental designs are highly variable, with an approach called Regression Discontinuity Design (RDD) considered by experts to be the strongest alternative to random assignment. Evaluations that select comparison groups in other ways should probably be assumed guilty of bias until proven otherwise. Countless studies have shown how difficult it is to create comparison groups that are similar, absent an RCT design or close approximation. Especially important indicators of treatment/comparison group comparability are assessments of test scores, behaviors and other outcomes of interest for both groups of children taken just prior to the point of program entry. Demonstrating that the program group and comparison group children or parents were initially similar on characteristics that the program was intending to affect is vital for trusting that differences emerging after the beginning of the program can be attributed to the program itself. Evaluations that do not compare and discuss pre-service characteristics of program and comparison-group children should be viewed with skepticism.
Randomized, Controlled Trial/Experimental Study
The ideal method for assessing program effects is an experimental study referred to as a randomized, controlled trial (RCT). In an RCT, children who are eligible to participate in a program are entered into a "lottery" where they either win the chance to receive services or are assigned to a comparison (control) group. Parents or program administrators have no say in who is selected in this lottery. When done correctly, this process creates two groups of children who would be similar if not for the intervention. Any post-program differences in achievement, behavior, or other outcomes of interest between the two groups can thus be attributed to the program with a high degree of confidence.
It is possible for an RCT to be flawed, and result in a comparison group that is not comparable to program participants. Examples of how this may occur include problems implementing the lottery process, too few children in the program and comparison groups, and too many children or families dropping out of the study after random assignment has occurred. For this reason, even an RCT study should demonstrate that the comparison group used was similar to the treatment group before the study began.
Regression Discontinuity Design (RDD)
Regression Discontinuity Designs are considered by experts to be the strongest alternative to random assignment. In an RDD, assignment to either the control or the intervention group is defined by a cut-off point along some measurable continuum, such as age. For example, some Pre-K evaluations have taken advantage of strict birthday cut-off dates for program eligibility. In some states, children who are 4 years old as of September 1 are eligible for enrollment in Pre-K, while those who turn 4 after September 1 must wait a year to attend. The key comparison is then between children with birthdays that just make or just miss the cutoff. These children presumably differ only in the fact that the older children attend Pre-K in the given year while the younger ones do not. Comparing kindergarten entry achievement scores for children who have completed a year in Pre-K with the scores measured at the same time for children who just missed the birthday cutoff can be a strong assessment of program impacts.
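The birthday comparison can be sketched as a difference in average scores among children born within a narrow window on either side of the cutoff. This is a deliberate simplification: a full RDD analysis would typically also fit regression lines on each side of the cutoff to adjust for age. The function name, the field names, the 30-day window, and all scores below are hypothetical.

```python
from statistics import mean

def rdd_local_difference(children, bandwidth_days):
    """Mean score just inside the cutoff minus mean score just outside it.

    days_from_cutoff <= 0: birthday on or before the cutoff (attended Pre-K).
    days_from_cutoff > 0: birthday after the cutoff (had to wait a year).
    """
    made_cutoff = [c["score"] for c in children
                   if -bandwidth_days <= c["days_from_cutoff"] <= 0]
    missed_cutoff = [c["score"] for c in children
                     if 0 < c["days_from_cutoff"] <= bandwidth_days]
    if not made_cutoff or not missed_cutoff:
        raise ValueError("no children within the bandwidth on one side of the cutoff")
    return mean(made_cutoff) - mean(missed_cutoff)

# Hypothetical scores, all measured at the same point in time.
children = [
    {"days_from_cutoff": -200, "score": 120},  # too far from the cutoff; ignored
    {"days_from_cutoff": -20, "score": 105},
    {"days_from_cutoff": -5, "score": 103},
    {"days_from_cutoff": 3, "score": 96},
    {"days_from_cutoff": 25, "score": 98},
    {"days_from_cutoff": 150, "score": 90},    # too far from the cutoff; ignored
]

estimate = rdd_local_difference(children, bandwidth_days=30)
print(estimate)  # 7
```

Restricting the comparison to children near the cutoff is what makes the two groups plausibly similar apart from Pre-K attendance.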
Statistical Significance
Impacts are usually accompanied by a statement regarding their statistical significance. This indicates how much confidence we have that the measured impact is real and not just something that appeared by chance. An impact that is statistically significant at the 5% level, a common standard, is one large enough that, if the program truly had no effect, we would expect to see an impact of that size by chance alone in fewer than 5 of every 100 evaluation trials. That makes it a good bet that the impact is real.
As the number of children or families in the treatment and control groups increases, smaller effect sizes can reach statistical significance, simply because a larger sample means a lower probability of a chance finding. Typically, evaluations involving fewer than 100 children require very large effect sizes to be judged statistically significant, while evaluations based on several thousand children are much more likely to find small effects statistically significant. All other things being equal, bigger studies are better. Even in large studies, however, small effect sizes imply that the program is not likely to change outcomes very much, so policymakers should consider carefully the cost required to achieve small benefits.
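To see how sample size drives significance, the sketch below computes an approximate two-sided p-value for the same effect size of 0.2 in a small study and a large one. It assumes equal group sizes, equal and known standard deviations, and a normal approximation; the function name and the sample sizes are illustrative, not taken from any particular evaluation.

```python
import math

def two_sided_p_value(effect_size, n_per_group):
    """Approximate two-sided p-value for a standardized difference in means.

    Assumes two equal-sized groups with equal, known standard deviations and
    uses a normal approximation; a simplification for illustration only.
    """
    if n_per_group <= 0:
        raise ValueError("n_per_group must be positive")
    # The standard error of the standardized difference is sqrt(2 / n).
    z = effect_size * math.sqrt(n_per_group / 2)
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

small_study = two_sided_p_value(0.2, 40)    # 80 children in total
large_study = two_sided_p_value(0.2, 2000)  # 4,000 children in total

print(f"{small_study:.3f}", f"{large_study:.2g}")
```

Under these assumptions the small study's p-value is well above 0.05, while the large study's is far below it, even though the effect size is identical.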
"Treatment on the Treated" Impacts
In evaluations of interventions in which substantial numbers of children or families fail to take up any of the offered services, there is an important technical detail that must be addressed. Should program effects be considered for only those who receive the services or for all families who are offered the program, regardless of whether they participate? (See "Intent to Treat" Impacts.) Under certain circumstances, it is possible to isolate program impacts on the subset of families that actually use the services and compare them to families that did not use similar services. These are sometimes called "treatment on the treated" (or TOT) impacts, and amount to scaling up intent-to-treat estimates to account for incomplete program take-up. Treatment-on-the-treated estimates address important policy questions about program impacts on the children or families who actually use the services. If program take-up is not a concern and you want to concentrate on how a program affects children or families who participate in it, then treatment-on-the-treated estimates are most relevant. When comparing across studies, it is important to compare like with like: ITT with ITT impacts, or TOT with TOT impacts.
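One common way to perform this scaling is to divide the ITT impact by the take-up rate. The sketch below assumes that the offer has no effect on families that decline it and that no control group families receive the services; the function name and the numbers are hypothetical.

```python
def treatment_on_treated(itt_impact, take_up_rate):
    """Scale an ITT impact up by the share of offered families that participated.

    Assumes no effect on families that declined the offer and no control
    group families receiving the services (simplifying assumptions).
    """
    if not 0 < take_up_rate <= 1:
        raise ValueError("take_up_rate must be greater than 0 and at most 1")
    return itt_impact / take_up_rate

# Hypothetical: a 2-point ITT impact with half of offered families participating.
tot = treatment_on_treated(2.0, 0.5)
print(tot)  # 4.0
```

With full take-up the two estimates coincide, which is why the distinction matters most for programs that many offered families do not use.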
Weighted Data Analysis
It can be difficult to compare results of a study over time because some of the people who participated at first may no longer be available for follow-up. Weighted data analysis is a statistical method of accounting for the changes caused by this type of loss, known as attrition.
Attrition can be problematic if the people who attrit have different characteristics than the people who remain. For example, if the original participants were a nationally representative sample of all income ranges, but half of the low-income families had left the study at the time of follow-up, results would no longer be nationally representative. Attrition can also be problematic if the people who attrit from the treatment group have different characteristics than people who attrit from the control group. Such a situation could potentially bias the results since conclusions, particularly in randomized trials, assume that the treatment group and control group are equivalent to each other on all characteristics at the time of assignment as well as at the time of follow-ups.
Weighted data analysis uses a statistical formula to determine who in the remaining sample has characteristics (e.g., age, income, gender, race) similar to or different from those who have attrited. Then, the people in the remaining sample are counted more or fewer times in analyses (i.e., weighted) in order to generate a hypothetical sample that has characteristics similar to the original sample.
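A very simple version of this reweighting can be sketched with one weight per group, equal to the group's original count divided by its remaining count. Actual evaluations generally derive weights from a statistical model using several characteristics at once; the single income grouping, the function names, and all values below are hypothetical.

```python
from collections import Counter

def attrition_weights(original_groups, remaining_groups):
    """Weight per group: original count divided by remaining count."""
    original_counts = Counter(original_groups)
    remaining_counts = Counter(remaining_groups)
    return {g: original_counts[g] / remaining_counts[g] for g in remaining_counts}

def weighted_mean(records, weights):
    """Average outcome, counting each record according to its group weight."""
    total = sum(weights[r["group"]] * r["outcome"] for r in records)
    return total / sum(weights[r["group"]] for r in records)

# Hypothetical original sample: 4 low-income and 4 high-income families.
original = ["low"] * 4 + ["high"] * 4
# At follow-up, half of the low-income families have left the study.
remaining = [
    {"group": "low", "outcome": 40}, {"group": "low", "outcome": 44},
    {"group": "high", "outcome": 60}, {"group": "high", "outcome": 62},
    {"group": "high", "outcome": 64}, {"group": "high", "outcome": 66},
]

weights = attrition_weights(original, [r["group"] for r in remaining])
adjusted = weighted_mean(remaining, weights)
print(weights, adjusted)  # {'low': 2.0, 'high': 1.0} 52.5
```

Each remaining low-income family counts twice, so the adjusted average of 52.5 reflects the original income mix, whereas the unweighted average of 56 overrepresents the high-income families.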