Exam Reliability – Cronbach’s Alpha

What is Cronbach’s Alpha?

Cronbach’s Alpha provides a measure of the internal consistency of a survey, test or exam.

Within an exam context it is used to evaluate how reliably the exam is measuring candidates’ ability, assuming that the content of the exam paper is consistently testing knowledge about a single subject matter/theme.

How is Cronbach’s Alpha expressed?

Cronbach’s Alpha is expressed as a number up to 1 – generally this is between 0 and 1, but in extreme cases can be negative.

The closer the number is to 1, the more reliably the exam is measuring the candidates’ ability in that subject field. As a general guide a Cronbach’s Alpha:

• > 0.8 indicates strong reliability
• 0.6 – 0.8 indicates good reliability
• < 0.6 indicates candidates are not consistently scoring according to their ability.
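
As a rough illustration, the guide above can be written as a small Python function. The name `interpret_alpha` is made up for this sketch, and treating a value of exactly 0.8 as 'good' rather than 'strong' is an assumption, since the guide does not say which band the boundary falls in.

```python
def interpret_alpha(alpha: float) -> str:
    """Map a Cronbach's Alpha value to the rough bands in the guide above."""
    if alpha > 1:
        raise ValueError("Cronbach's Alpha cannot be greater than 1")
    if alpha > 0.8:
        return "strong reliability"
    if alpha >= 0.6:  # assumption: exactly 0.8 and exactly 0.6 count as 'good'
        return "good reliability"
    # Anything below 0.6, including the rare negative values
    return "candidates are not consistently scoring according to their ability"


print(interpret_alpha(0.85))  # strong reliability
print(interpret_alpha(0.7))   # good reliability
```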

How is it calculated? (The nerdy bit)

At its heart, Cronbach’s Alpha is (1 – Si/St) where Si is the sum of item mark variances and St is the total mark variance.
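
For reference, the expression above is a simplified form. The usual textbook formula also includes a factor for the number of items. It is written here with k for the number of items, σ_i² for the variance of marks on item i and σ_T² for the variance of total marks (these symbols are introduced only for this note and are not used elsewhere in this blog):

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_T^{2}}\right)
```

In the notation used in this blog, Si is the sum of the σ_i² terms and St is σ_T². For a 10-item exam the extra factor is 10/9, so it scales the simplified value by about 11% without changing the overall picture.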

A (very simple) example to explain the concept…

In this example a test has 10 unrelated items (questions), each worth up to 10 marks. The mean mark for each item is 5, with candidates scoring between 3 and 7 on each item, so the mean score for the whole exam will be 50 (5 × 10).

If the items are all totally unrelated to each other then candidates are likely to do about as well as each other, as there is no single theme (or factor) that allows one group of candidates to thrive over the rest. In this case the exam scores will all be similar, perhaps only ranging from a minimum of 43 to a maximum of 57.

In this case each item mark variance will be 4 (squared difference from the mean – i.e. 2²), so the sum of the item mark variances (Si) would be 40. The total mark variance (St) will be 7² so 49 as scores vary between 43 and 57 with a mean of 50.

Therefore, Cronbach’s Alpha will be 1 – (40/49) = 0.18, which is very low and reflects the fact that the items were independent. This means the exam should be considered unreliable, because there is no common ‘theme’ that these items are testing.

Now consider a similar exam except this time all the items are perfectly related, i.e. consistently testing knowledge around a central theme. Instead of candidates’ marks being all over the place, it might be reasonable to see good candidates score an average of 7 for each item and poor candidates score an average of 3.

While the sum of the individual mark variances (Si) would remain at 40, the total mark variance would be very different. On average good candidates scored 70 on the test and poor candidates scored 30, so total mark variance (St) would be 20² so 400.

Cronbach’s Alpha would therefore be 1 – (40/400) = 0.9 which is high and indicates that the exam was reliable in determining good and poor candidates with reference to the central theme.

As such, if the relationship between the way items are answered is consistent according to the group (in this case higher vs lower performing candidates), then the Cronbach’s Alpha will be higher.

In practice Cronbach’s Alpha can be tricky to calculate by hand (particularly on longer exam papers) and your exam software (such as Maxexam) should do that for you.
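
For readers who would like to check the arithmetic themselves, here is a minimal Python sketch of the calculation described above. The function name `cronbach_alpha`, its `simplified` flag and the two made-up data sets are assumptions for illustration. It is not how any particular exam software works. Population variances are used; because the same choice is applied to both Si and St, using sample variances instead would give the same ratio.

```python
from itertools import product
from statistics import pvariance


def cronbach_alpha(scores, simplified=False):
    """Cronbach's Alpha for a table of marks.

    scores: one row per candidate, one column per item.
    simplified=True returns 1 - Si/St as in the text above;
    otherwise the usual k / (k - 1) factor is applied as well.
    """
    k = len(scores[0])  # number of items
    # Si: sum of the variances of the marks on each item
    s_i = sum(pvariance([row[i] for row in scores]) for i in range(k))
    # St: variance of the candidates' total marks
    s_t = pvariance([sum(row) for row in scores])
    ratio = 1 - s_i / s_t
    return ratio if simplified else k / (k - 1) * ratio


# Perfectly related items: the good candidate scores 7 on every item, the poor one 3.
related = [[7] * 10, [3] * 10]

# Unrelated items: every possible mix of 3s and 7s across 10 items,
# so a mark on one item says nothing about the mark on another.
unrelated = [list(row) for row in product([3, 7], repeat=10)]

related_alpha = cronbach_alpha(related, simplified=True)
unrelated_alpha = cronbach_alpha(unrelated, simplified=True)
print(round(related_alpha, 2))    # 0.9
print(round(unrelated_alpha, 2))  # 0.0
```

The related case reproduces the 0.9 from the worked example. The unrelated case comes out at 0 rather than the rough 0.18 above because this made-up data set is exactly uncorrelated; either way the value is very low.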

How is Cronbach’s Alpha used?

The throwaway answer to this would be ‘often incorrectly’, to determine exam reliability as a whole! Cronbach’s Alpha is only one measure of whether an exam is doing what it should – which in the case of summative medical or dental exams is determining whether candidates meet the standards required to be safe to practise.

We believe that Cronbach’s Alpha is often misunderstood, and this can lead to exams potentially being dismissed as being unreliable, when actually the way the exam itself is put together means you should expect Cronbach’s Alpha to be lower.

What influences the Alpha?

Cronbach’s Alpha is influenced by the items in the exam, the structure of the exam and the number of candidates.

Cronbach’s Alpha was originally developed for use in surveys, specifically for questionnaires where the respondents would typically answer agree/disagree type questions. The alpha was then used to qualify how good, or consistent, the questionnaire was. Much of the information that can be found about Cronbach’s Alpha even now is referring to surveys, so can be misleading if you are using it in exams.

When used for exams, Cronbach’s Alpha is a good measure of consistency when items are around a single subject matter/theme where better candidates would be expected to consistently get more and harder items right. If an exam covers a broad range of topics or unrelated topics it is less likely for individual candidates to be experts in all areas so you would expect a lower Cronbach’s Alpha.

Additionally, if an exam contains essential knowledge items where most candidates would be likely to get items right then your Cronbach’s Alpha would be expected to be lower. This is demonstrated in the illustration below.

In addition to the items themselves, the number of items and the number of candidates will influence Cronbach’s Alpha. A greater number of items is usually associated with a higher Alpha, and a greater number of candidates with a more stable one, as individual rogue items or candidates will have less effect on it.

If candidate numbers are low (e.g. for a resit) then Cronbach’s Alpha should not be used as it will be unreliable.

Illustrations of why Cronbach’s Alpha should not be used on its own to determine exam reliability.

An exam containing some essential knowledge items.

If an exam contains essential knowledge items you would expect both good and poor candidates to get those items right. This will mean that both the individual mark variance and total mark variance will be lower than in an exam focused on ranking items, bringing down Cronbach’s Alpha. This does not however mean that the exam is poor or ‘unreliable’ – it is simply lower because of the fact the exam is testing essential knowledge.
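
The effect can be sketched in Python with two made-up candidates. The marks, the helper name `simple_alpha` and the choice of a 50/50 split between ranking and essential knowledge items are all assumptions for illustration, and the helper uses the simplified 1 – Si/St form from earlier in this blog.

```python
from statistics import pvariance


def simple_alpha(scores):
    """1 - Si/St, the simplified form used earlier in this blog."""
    k = len(scores[0])  # number of items
    s_i = sum(pvariance([row[i] for row in scores]) for i in range(k))
    s_t = pvariance([sum(row) for row in scores])
    return 1 - s_i / s_t


# Ten ranking items: the good candidate scores 7 on each, the poor candidate 3.
ranking_only = [[7] * 10, [3] * 10]

# Five ranking items plus five essential knowledge items
# on which both candidates score 9.
with_essential = [[7] * 5 + [9] * 5, [3] * 5 + [9] * 5]

ranking_alpha = simple_alpha(ranking_only)
essential_alpha = simple_alpha(with_essential)
print(round(ranking_alpha, 2))    # 0.9
print(round(essential_alpha, 2))  # 0.8
```

Both exams separate the good and poor candidate perfectly, yet the second one has the lower Alpha simply because half of its items are ones everybody gets right.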

An exam testing a wide knowledge base.

If a medical exam was testing two quite different areas of medicine such as paediatrics and geriatrics the graph for the percentage of candidates who got each item right might look something like this:

However, if you were to then look at candidates who specialised in paediatric medicine vs those who specialised in geriatric medicine (assuming the exam was split 50/50) the graph might look like this:

In this example Cronbach’s Alpha would be low, as the likelihood of getting items right is driven more by a candidate’s speciality than their ability.
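
A small Python sketch with four made-up candidates shows the same effect numerically. The marks and the helper name `simple_alpha` are assumptions for illustration, and the helper uses the simplified 1 – Si/St form from earlier in this blog.

```python
from statistics import pvariance


def simple_alpha(scores):
    """1 - Si/St, the simplified form used earlier in this blog."""
    k = len(scores[0])  # number of items
    s_i = sum(pvariance([row[i] for row in scores]) for i in range(k))
    s_t = pvariance([sum(row) for row in scores])
    return 1 - s_i / s_t


# Each row: five paediatric item marks followed by five geriatric item marks.
candidates = [
    [8] * 5 + [4] * 5,  # stronger paediatric specialist
    [6] * 5 + [2] * 5,  # weaker paediatric specialist
    [4] * 5 + [8] * 5,  # stronger geriatric specialist
    [2] * 5 + [6] * 5,  # weaker geriatric specialist
]

whole_exam = simple_alpha(candidates)
paediatric_half = simple_alpha([row[:5] for row in candidates])
print(round(whole_exam, 2))       # 0.5
print(round(paediatric_half, 2))  # 0.8
```

Taken on its own, the paediatric half ranks these candidates consistently, but combining it with the geriatric half pulls the Alpha for the whole exam down.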

Of course, in real-life exams you wouldn’t expect a single exam to examine two such contrasting aspects of medicine. However, it does illustrate why you would expect a lower Cronbach’s Alpha for a broader exam, such as a final medical or dental exam, than you would for a narrower-topic exam.

Cronbach’s Alpha and Individual items

Cronbach’s Alpha can be used to determine whether an individual item is adding to or subtracting from the exam’s reliability – i.e. is it a good or bad item to confirm a candidate’s ability. This will be further covered in our next blog.

Conclusion – Is Cronbach’s Alpha a good measure of exam reliability?

In general yes, but it can depend on the exam, so care is needed when applying it.

If the exam is based around a single theme where candidates would be expected to answer items consistently according to their ability, then it can give you a good steer (ideally alongside other measures) about whether the exam is reliably testing ability. This also relies on the exam having enough candidates for them to be likely to act reliably as a group, and enough items to properly test knowledge.

However, if the exam is covering broad subject areas where candidates could be expected to be better in some areas than others, then a lower Cronbach’s Alpha might be expected and this does not mean the exam is unreliable in and of itself. Additionally, if the exam is testing essential knowledge where all candidates would be expected to do well, then the difference between higher and lower performing candidates will be less and you would expect a lower Alpha.

Practical Application – Cronbach’s Alpha and determining grading boundaries.

Cronbach’s Alpha is often used to calculate the standard error of measurement. This is explained further within our Exam Reliability blog and is used to indicate the level to which an exam may underestimate or overestimate a candidate’s knowledge.

The standard error of measurement is used together with the exam cut-off (hopefully determined by using standard setting methods such as Angoff) to determine the pass mark of a just-passing candidate. Where organisations undertake grading and boundary reviews, these are usually for candidates within one standard error of measurement of the cut-off mark. Again, this is discussed more in our Exam Reliability blog.
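
As a hedged sketch of how this might look in practice, the commonly quoted formula is the standard deviation of total marks multiplied by the square root of (1 – reliability). Using Cronbach’s Alpha as the reliability figure is the assumption here, and the function name, standard deviation, Alpha and cut-off below are all hypothetical values chosen for illustration.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, alpha):
    """SD of total marks * sqrt(1 - reliability), with alpha as the reliability."""
    if not 0 <= alpha <= 1:
        raise ValueError("this sketch expects an alpha between 0 and 1")
    return score_sd * sqrt(1 - alpha)


# Hypothetical figures for illustration only.
sem = standard_error_of_measurement(score_sd=8.0, alpha=0.84)
cut_off = 60.0
review_band = (cut_off - sem, cut_off + sem)

print(round(sem, 1))                                  # 3.2
print(tuple(round(mark, 1) for mark in review_band))  # (56.8, 63.2)
```

With these made-up figures, candidates scoring between roughly 56.8 and 63.2 would fall within the range usually considered in a boundary review.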

We hope this blog has helped to clarify what Cronbach’s Alpha measures and how it can effectively be used. If you do have any questions please do not hesitate to contact us on +44 (0)117 428 0550 or you could fill in our contact form here.