The Effect of Observation Length on Teaching Quality



Ross Morrison McGill founded @TeacherToolkit in 2010, and today, he is one of the 'most followed educators' on social media in the world. In 2015, he was nominated as one of the '500 Most Influential People in Britain' by The Sunday Times as a result of...

How long should a lesson observation be to determine teaching quality?

In a paper published by Mashburn et al. (2013), researchers studied the effect of different observation procedures on score reliability and validity, conducting an experiment that manipulated the length of observation…

In this 23-page paper, The Effect of Observation Length and Presentation Order on the Reliability and Validity of an Observational Measure of Teaching Quality, the researchers remind us that “teaching observations have a long history of use in education research for purposes of identifying characteristics of classroom settings that are associated with student learning.”

Ten minutes or 30?

I’ve used all the usual formats: five-minute drop-ins, 10 to 20-minute learning walks, and formal observations for appraisal lasting anywhere between 30 and 60 minutes.

What difference did the timeframe make to teaching and learning? I have no idea, even after tens of thousands of observations. That’s quite scandalous really! I suppose many of the above timeframes have evolved as a result of accountability, the Ofsted framework and teaching union guidance, all designed to protect and improve the individual teacher.

One reference cited in the paper is something English educators should know: the Measures of Effective Teaching Project (2012), one of the key sources that helped the English profession abandon lesson gradings.

Selecting an appropriate length of observation affects not only validity but also reliability.

A key question posed in the paper, worth serious consideration if you are a school leader, is: “How can I best allocate resources so as to minimize costs while producing reliable scores that permit valid inferences about teaching quality?”


In this study, eight trained raters were randomly assigned “to rate 40-minute videotaped lessons either in one 40-minute occasion, two sequential 20-minute occasions, four sequential 10-minute occasions, or two nonsequential 20-minute occasions.”
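To make the four rating procedures concrete, here is a minimal Python sketch of how one 40-minute lesson is cut up under each of them. The segment lengths come from the quoted design; the names `PROCEDURES` and `segments` are hypothetical, used only for illustration, and the sketch shows how the footage is divided, not the order in which raters actually saw it.

```python
# Hypothetical illustration: segment lengths (in minutes) for each rating
# procedure described in the quoted design. Names are not from the paper.
PROCEDURES = {
    "one 40-minute occasion": [40],
    "two sequential 20-minute occasions": [20, 20],
    "four sequential 10-minute occasions": [10, 10, 10, 10],
    "two nonsequential 20-minute occasions": [20, 20],
}


def segments(lengths):
    """Return (start, end) minute marks when a lesson is cut into these lengths."""
    bounds = []
    start = 0
    for length in lengths:
        bounds.append((start, start + length))
        start += length
    return bounds


for name, lengths in PROCEDURES.items():
    # Every procedure covers the same 40 minutes; only the number and
    # length of rating occasions differ.
    print(f"{name}: {len(lengths)} rating(s), {sum(lengths)} minutes, {segments(lengths)}")
```

Every procedure covers the same 40 minutes of footage, so what changes for the rater is the number and length of rating occasions, not the total viewing time.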

The goal? To compare the reliability and predictive validity of a teaching observation measure.

Eight schools were involved, with 47 teachers in the treatment condition and 43 teachers in the control condition. Those teachers videotaped 40-minute classroom lessons on multiple days throughout the academic year and submitted the videotapes to researchers during predetermined windows of time.

Other sources…

I’ve been reading a little more on observational research in classrooms. Ofsted recently published research on six models that provide a clear focus for those observing lessons, but with one disclaimer, to measure teaching quality. We definitely know that judging lesson observation is harder than you think, no matter how experienced you are.

If education ministers, CEOs and policymakers wish to improve educational standards, let this piece of research be your benchmark. It highlights why low-stakes peer evaluation can make a difference to student test scores.

I’ve also returned to Unseen Observations as another model, as well as observing lessons as a researcher, not as a classroom observer. Another paper describes methods for analysing classroom ‘talk’, comparing strengths and weaknesses. Unless we learn a little more about what type of observation leads to [any specific outcome], it is tricky to determine what works best and in what context.


The research reports that “although there are no appreciable differences in the financial costs of implementing the four different operational procedures under study, there were notable differences in some aspects of the reliability and validity of scores related to length and/or order.”

Lessons rated in the shortest and most frequent manner (4 x 10-minute segments) produced the lowest coefficient. This suggests that 10 minutes may not be enough time for raters to form a judgment about the quality of teaching.

This could be the end of short learning walks or ‘drop-ins’ as we know them…

The results indicated that two 20-minute observations produced “the most desirable effect on score reliability and validity. This suggests that 20-minute occasions may be sufficient time for a rater to observe true characteristics of teaching quality assessed.”

I guess the key question still to unpick is: what characteristics are we assessing?

The researchers conclude: Given the growing importance of teaching observations, further research is needed to understand the tradeoffs between reliability and validity.

Download the paper.

