
Rethinking Evaluation: Challenges and learnings of setting the baseline for Schools2030, a multi-country, contextualised programme

Schools2030's Global Evaluation Partner, Khulisa, share their learnings from conducting a complex, multi-country, multi-year baseline study
Margaret Roper, MERL Director, Khulisa Management Services
04 May 2026

Schools2030 is not a typical single educational system intervention; it is a 10-year, 10-country longitudinal action research programme working with over 1,000 schools at the pre-primary, primary and secondary levels to achieve Sustainable Development Goal 4: “Quality Education for All”. Driven by the three-step model of Assess, Innovate, and Showcase, the programme empowers over 50,000 educators to lead school-level design thinking and evidence generation.

As the programme moves into its second phase, focusing on systemic educational support, the global evaluation partner, Khulisa Management Services, together with the Aga Khan Foundation, is reflecting on what it takes to measure impact across some of the world’s most diverse educational landscapes.

This complex study utilises a baseline assessment to establish a foundation for measuring student outcomes, teacher innovation, and the effectiveness of local educational solutions. Because the programme operates in diverse environments, the evaluators must employ a flexible, agile methodology that balances standardised metrics with unique country contexts. Ultimately, the project seeks to influence global education policy by fostering a bottom-up approach where teachers design their own classroom improvements. This long-term research effort aims to provide rigorous data on how human-centred design can transform learning environments worldwide.

The Contextual and Complexity Challenge: Why Traditional Evaluation Models Fall Short

The most significant challenge identified is that Schools2030 cannot be evaluated in a “traditional” way. In standard social science, evaluators typically try to eliminate as many differences as possible between intervention and control samples to isolate single variables that significantly contribute to impact. However, Schools2030 embraces contextualisation as a strength.

Contextualisation within Schools2030 operates at numerous levels. For example, each educational system and ecosystem is unique, and therefore implementation varies across countries; language differences mean literacy assessment constructs will differ; and cultural norms influence how human-centred design is engaged with and applied in practice. From an evaluation point of view, this poses challenges for identifying relationships between data points and for finding commonalities or trends to set baseline data for global and country-based action. Consequently, instead of measuring standardised global indicators, the evaluation used context-specific constructs aggregated to the country level. As the baseline was conducted before the global assessment tools were finalised, this was the best approach within the timeframe. These global assessment tools will significantly support ongoing monitoring and the development of the evidence base.

The key challenges faced during the design and implementation of the baseline included:

  • A “Bottom-Up” Innovation Model: Unlike top-down programmes with standardised interventions, Schools2030 empowers teachers to design their own solutions using HCD. This means the “intervention” looks different in every school and classroom, requiring an evaluation flexible enough to account for “niche ingredients” while still measuring global indicators such as literacy and numeracy.
  • Geographic and Structural Diversity: The evaluation tracks over 1,000 schools across countries ranging from Afghanistan and Kenya to Portugal and Kyrgyzstan. Each has unique educational systems, varying school-year calendars, diverse curricula, and different ministerial structures and priorities. Setting a fixed calendar date for baseline data collection was not possible; thus, data collection was staggered over a 12-month period.
  • Establishment of Counterfactuals: Creating a robust “difference-in-difference” design required identifying comparison schools (those not receiving the intervention), which had to be sampled differently across countries to ensure relevance. In the long term, a success indicator for the programme is that it influences all teachers, creating challenges for having a clean counterfactual group for the impact study.
  • The Scale of Data Collection: The baseline phase alone required a massive effort, collecting over 18,000 learner assessments, 1,584 teacher surveys, 520 classroom observations, and over 900 school leader surveys across three distinct age cohorts (5-, 10-, and 15-year-olds). This effort required an adaptive management approach and a significant amount of time and effort to clean and analyse at the country level, and then to use evaluative expertise to compile a global report that reflected the country contexts and the global programme.
  • Research vs Evaluative Opportunity: When it comes to large, complex, and contextualised programmes, don’t collect a baseline for research purposes only; collect data that shows how context influences outcomes and impact. Researchers may argue for gathering only the hard baseline data, but quality impact evaluations also need programmatic baseline data. Without it, the baseline evidence against which evaluators ‘value’ the results is lost.
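The “difference-in-difference” design mentioned above can be sketched in a few lines. This is a minimal illustration only: the function name and all scores are hypothetical, and the actual Schools2030 analysis uses regression-based estimates across many schools, cohorts, and contexts rather than simple group means.

```python
# Minimal sketch of a difference-in-differences (DiD) estimate.
# All learner scores below are hypothetical, for illustration only.

def did_estimate(treat_baseline, treat_endline, comp_baseline, comp_endline):
    """DiD = (change in intervention schools) - (change in comparison schools).

    Each argument is a list of learner scores for one group at one time point;
    group means stand in here for the regression estimates used in practice.
    """
    mean = lambda scores: sum(scores) / len(scores)
    treatment_change = mean(treat_endline) - mean(treat_baseline)
    comparison_change = mean(comp_endline) - mean(comp_baseline)
    return treatment_change - comparison_change

# Hypothetical literacy scores (0-100) for a handful of learners per group.
effect = did_estimate(
    treat_baseline=[40, 45, 50],  # intervention schools at baseline
    treat_endline=[55, 60, 65],   # intervention schools at endline
    comp_baseline=[42, 46, 50],   # comparison schools at baseline
    comp_endline=[48, 52, 56],    # comparison schools at endline
)
print(effect)  # intervention gained 15 points, comparison gained 6: DiD = 9.0
```

The intuition is that the comparison schools’ change estimates what would have happened without the programme, so subtracting it isolates the programme’s contribution; this is exactly why a clean counterfactual group matters, and why a programme that eventually influences all teachers complicates the design.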

Key Learnings: What the Baseline Phase Taught Us

The evaluation has already yielded critical insights into how global education programmes can foster resilient systems:

1. Teachers are the “Main Denominator”: Learning shows that teachers are the primary drivers of the programme’s success. The evaluation did not view learning as a didactic “transfer of knowledge”; rather, it explored how teacher innovation empowers and equips learners as teachers address the challenges they face in classroom teaching and assessment. Synthesising results for each country and age cohort illustrates variations in practice – there is no single trend or ‘right way of teaching’; rather, the results show, and celebrate, the diversity across education systems.

2. Data Feedback Loops Improve Practice. A major learning from 2024 was the value of “immediate feedback loops”. In countries like Kyrgyzstan and Kenya, teachers now receive strengths-based feedback via SMS or WhatsApp immediately after classroom observations, allowing them to iterate on their teaching methods in real time rather than waiting for an end-of-year report. These dynamic learning loops (or lack thereof) need to be documented going forward, as they are likely to influence and contribute to countries’ outcomes.

3. Contextual Relevance is More Sustainable. The baseline highlights that by incubating education innovations at the classroom level, practices become more inclusive and sustainable. For example, in Afghanistan, using assessment data to identify specific gaps—such as 90% of students needing help with simple letter writing—allowed teachers to brainstorm targeted, inclusive solutions immediately. Once again, these dynamic adaptations need to be tracked to illustrate similarities and differences in practice, validate how innovative practice influences learning outcomes, and document the diverse change pathways possible in different contexts.

4. The Importance of “Space for Difference”. Evaluators learned that the goal should not be to eliminate differences between countries but to find a combination of qualitative and quantitative tools that allow for both global comparison and local nuance. This includes using subject-based assessments alongside evaluations of socio-emotional skills, such as problem-solving and self-awareness. Teachers sharing these differences globally has the potential to both scale innovation and influence teacher engagement in education-sector dialogues.

Critical Considerations for Future Evaluations

Because quantitative data may not always show significant differences in transformational learning processes across diverse contexts, the evaluation relies heavily on a strong qualitative approach to determine what works, for whom, why and in what context.

In addition, the impact evaluation will need to account for “recursive causality,” in which repeated implementation steps and cycles of influence and change across the different relationships within each school and country system achieve results.

Looking Ahead: From Schools to Systems

By moving from a purely theoretical approach to a practical, “not fearing failure” mindset, the Schools2030 baseline and impact evaluation can create a blueprint for understanding “what works” in global education across diverse contexts.

Now that the programme has entered Phase 2 (2024–2030), the focus is shifting from co-designing tools and initiating innovative teaching practices to strengthening pathways for education system transformation. The learning is clear: embrace complexity rather than avoid it!


Explore the baseline findings for each country on their dedicated country page, and the global synthesised report here.
