OfSTED Reliability



Ross Morrison McGill founded @TeacherToolkit in 2010, and today, he is one of the 'most followed educators' on social media in the world. In 2015, he was nominated as one of the '500 Most Influential People in Britain' by The Sunday Times as a result of...

How reliable are OfSTED inspections and the overall assessments of the inspection teams?

On 7th March, the Department for Education published a study, shared in an article by Amanda Spielman, the new Chief Inspector of OfSTED, that tests the reliability of OfSTED’s short school inspections.

In this blog, I provide you with a summary of the research. It is good to see OfSTED publishing this research. After all, OfSTED went ahead with removing ‘lesson gradings’ from inspections after a pilot study in June 2014, but failed to make the research public. “This study was … designed to answer a single question: were the decisions about whether short inspections should or should not convert to full inspections being made consistently by different inspectors?”

Ground Breaking!

OfSTED claim to be “breaking new ground” as they have “not looked at the whole inspection process from start to finish”. They also say that this study was “surprisingly difficult”. The complexities included:

* getting the balance right between the live inspection and the study goal
* identifying ways to minimise bias and cross-contamination of inspector evidence gathering and thinking
* ensuring that inspectors and participating schools were fully prepared for simultaneous parallel inspections
* achieving a large enough sample of participating schools.

“Most importantly for everyone who is inspected, the study provides a welcome positive view of inspector consistency in the particular context studied. Of the 24 short inspections in our sample, inspectors agreed on the outcome in 22 cases. This indicates a high rate of agreement (92%) between these inspectors about the conversion decision.”

Well, that’s what OfSTED say.

My concern is, what about the 8% of schools who are left with an incorrect judgement? How long do they have to continue working with an incorrect assessment? The study is not without its flaws as OfSTED (thankfully) admit: “the findings cannot be extrapolated across other types of inspections or all types of institution. For instance, the study looked only at short inspections of primary schools in a certain size range and it had a relatively small sample.”
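To put that “relatively small sample” in perspective, here is a rough sketch of the arithmetic behind the 92% figure, with an approximate 95% Wilson score interval around it. Only the 22 and 24 come from the study; the choice of the Wilson method, the function name and the 1.96 multiplier are illustrative assumptions, not something OfSTED used or reported.

```python
import math


def wilson_interval(agreed, total, z=1.96):
    """Return (low, high) of the Wilson score interval for agreed/total.

    z=1.96 corresponds to an approximate 95% interval (an assumption
    made for illustration; the study does not report an interval).
    """
    p = agreed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


agreed, total = 22, 24  # figures reported in the study
rate = agreed / total
low, high = wilson_interval(agreed, total)

print(f"Agreement rate: {rate:.0%}")
print(f"Approximate 95% interval: {low:.0%} to {high:.0%}")
```

Under these assumptions the interval runs from roughly 74% to 98%, which is one way of seeing why a sample of 24 inspections cannot settle the question on its own.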


Figure 1: page 19 / Evidence gathering methodology

Figure 2: page 20 / Examples of the conversion process

Figure 3: page 25 / Outcomes from inspections (Autumn 2015)

Figure 4: page 27 / Outcomes from inspections (Summer 2016)

There is an interesting section on ‘triangulation of evidence’ on page 36.

“… the start of the short inspection stemmed from an assumption that assessing their views on the school’s strength and weaknesses was both desirable and important, at least for collaboration purposes. However, while the initial discussion with leaders appears to be an important forum for establishing robust inspection trails, the potential for some leaders to offer an unreliable narrative of their school’s standards, means that this mechanism alone does not lead to reliability.”


So, perhaps school leaders are also to blame for the unreliability of the process and the overall outcomes? It is critical that OfSTED “judgements are as reliable as possible”, and the inspectorate’s get-out-of-jail clause here is the medical analogy quoted below:

“… many kinds of clinical testing give both false positive results (where someone doesn’t actually have the condition, but appears to) and false negatives (where someone has the condition but is not picked up by the test). Perfectly reliable tests are the exception, not the rule … Turning back to education and social care, we know that inspection is a process based on human judgement to interpret and complement available data. We know a great deal about human judgement, and can work to minimise the impact of the limitations resulting from the various kinds of bias in human judgement, but we are unlikely ever to reach a position where perfect consistency can be guaranteed.”

If ‘reliability’ cannot be guaranteed, then we still need further modification of the system. What if incorrect assessments are made of schools with less than desirable outcomes? What about the reliability of inspections of schools with a higher proportion of disadvantaged students? Is there too much at stake for these vulnerable schools? And who would want to work in a challenging school if inspection decisions are unreliable? I am in favour of Progress 8 as a tool to gauge the progress of all students, but it is not without its flaws: it is a process that can still damage a school’s reputation and teachers’ careers.

As Tom Sherrington says, “let’s not continue to judge and rank schools using crude single figure numbers without knowing about their profile and their context.”

Spielman ends with this:

“… we need to study the inspection judgements themselves, as well as the decisions around the conversion of short inspections … we will continue to work with outside academics and other experts, as well as those at the receiving end of inspection, to help shape the approach we take. It is really valuable to have the right level of challenge in this kind of work, as well as specialist expertise.”

Well, Amanda, I’m just waiting for you to call.


You can download the report here.

6 thoughts on “OfSTED Reliability”

  1. Since qualifying as an inspector I have always maintained OFSTED is basically corrupt as there is no transparency or appeal procedure. Schools, teachers and pupils are immensely complex and there will never be a system that will be able to define whether a school is successful. Take something much simpler. Which is the best car? No one can give a satisfactory answer acceptable to everyone. Likewise with education. Grammar school anyone?
