Are Ofsted Gradings Reliable?

Ross Morrison McGill founded @TeacherToolkit in 2010, and today, he is one of the 'most followed educators' on social media in the world. In 2015, he was nominated as one of the '500 Most Influential People in Britain' by The Sunday Times as a result of...

When Ofsted conducts research on its own inspection process, are the conclusions reliable?

If we assume lesson gradings are unreliable, why would more complex school inspections be robust? As Ofsted reinspects and grades schools, any talk of a ‘drop in standards’ or a ‘win-win for everybody’ misses the point…

The chief inspector Amanda Spielman, appointed on an initial five-year term which was due to finish next month, has had her role extended until 31 December 2023. This week, she has taken a little bit of flak for stating the obvious: halving the number of ‘outstanding’ schools to one in 10 is a “more realistic starting point for the system.”

My current perspective

Considering some schools across our education system have not been inspected for more than 10 years, and we have seen several education inspection frameworks during this time, it is no surprise that, as the goalposts shift, the standards Ofsted finds in some of our schools will have changed.

The key question for me is, can we trust these overall judgements, regardless of inspection framework?

Allow me to set out my stall from the start:

  1. I believe we need to keep Ofsted, despite experiencing 9 inspections since 1997
  2. I believe all schools should be inspected for safeguarding reasons on an annual basis
  3. I do not believe the current inspection framework is sufficiently equipped to cater for different contexts
  4. I do not believe anybody can visit a school and reliably evaluate its effectiveness in a one-day inspection
  5. I do not believe the overall judgements are reliable, valid or helpful for the general public, nor that they lead to raising standards of education.

Quoted in Stephen Tierney’s new book: “Ofsted’s system is designed to downgrade schools serving the greatest number of vulnerable children and young people; it chooses what to measure, how to measure, and who will measure. The inspectorate appears to be making conclusions about schools’ effectiveness while being blind to huge differences in pupil intake.”

The reliability of short inspections

A relatively new piece of research by Terry Pearson, A review of Ofsted’s test of the reliability of short (one-day) inspections (2018), unpicks the reliability and validity of inspection outcomes. Pearson references a study of double inspections, conducted by Ofsted itself, noting how frequently two inspectors agreed “whether the school remained good or whether further evidence was needed to enable a judgement to be achieved. The test found that in 22 instances out of 24 inspectors agreed on their overarching decision.”

This means that in 2 of the 24 cases, or roughly 8 per cent, the inspectors did not reach the same decision, and that is something we should be worried about!

Put simply, Ofsted used a sample of 24 schools to justify the claim that its short inspection process is reliable, and that process is then applied to ~25,000 schools, regardless of context (e.g. primary, secondary etc.).
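To give a feel for how much uncertainty sits behind a sample of 24, here is a rough sketch of my own (it is not a calculation that Ofsted or Pearson report). It takes the 22-out-of-24 figure quoted above and estimates a 95 per cent Wilson score interval for the true agreement rate. The function name `wilson_interval` is hypothetical, and the sketch assumes the 24 double inspections behave like an independent random sample of all short inspections, which the source does not confirm.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Figures quoted from Ofsted's test: inspectors agreed in 22 of 24 double inspections
AGREED = 22
PAIRS = 24

agreement_rate = AGREED / PAIRS
low, high = wilson_interval(AGREED, PAIRS)

print(f"Observed agreement: {agreement_rate:.1%}")
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

Under those assumptions, the interval runs from roughly 74 to 98 per cent. In other words, a sample this small is consistent with inspectors disagreeing in anything from about one in fifty to about one in four short inspections.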

Unintended consequences

Researching, testing and interpreting your own work requires a great deal of “fidelity” from the author. Pearson wrote (pg 14), “the validity of the results is dependent on the extent to which the inspections were carried out in the way intended…”

Pearson unpicks the test conditions where breaches were made and cites a comment from the education select committee (House of Commons, 1999): “The committee noted that in thirty-three per cent of cases, the pairs of inspectors awarded different grades after observing the same lesson and in three per cent of these cases the difference was two grades.”

Judgements were considered to be not very reliable – even 20+ years ago.

In this detailed 42-page analysis of Ofsted’s research, Pearson provides a compelling case for questioning the reliability of short inspections. I’ve only skimmed the surface of this analysis and list just its recommendations here:


  1. A comprehensive base of existing knowledge about the complex process of expert judgement is needed
  2. A contextualised indicator of the trustworthiness of Ofsted inspection judgements is needed, one that factors in different environments
  3. Inspector expertise must match subjects and schools
  4. Sample schools will know they are taking part in Ofsted research. Therefore, emotions and behaviour will change under observation (e.g. the Hawthorne effect) and will skew Ofsted’s findings
  5. Future Ofsted tests need to make sure that infringements of the test protocol are kept to a minimum
  6. The procedures used to search the data accumulated during each test need to be made explicit, and,
  7. Ensure sufficient and relevant data or evidence is provided to support inferences and claims derived…

It’s worth noting that at present, Ofsted does not have the capacity to inspect more schools, and its annual funding has diminished over the years. Equally, the inspector appointment process shuns thousands of school leaders simply because they choose to work in a different context.

If the current 70 per cent of school leaders who are inspectors – that (apparently) work in our better schools – want a better inspection system, they should ask themselves if they are happy to be working in the business of reliability, or unintentional bias.

We need to reach a place where Ofsted’s methods suit each stakeholder, sustain education standards and are a reliable benchmark of what is happening in our schools. Personally, I don’t believe a one-word judgement cuts it…

3 thoughts on “Are Ofsted Gradings Reliable?”

  1. Or perhaps they’re as reliable as exam grades, which – according to the evidence given to the Education Select Committee on 2 September 2020 by Ofqual’s then Acting Chief Regulator Dame Glenys Stacey – are “reliable to one grade either way”.

    What use is that?

    Perhaps all exam certificates should declare, in BIG LETTERS at the bottom

    OFQUAL WARNING: The grades on this certificate are reliable only to one grade either way.
