Introduction to Exam Reliability

Standard setting methods combine with exam reliability statistics to create a reliable pass mark.

For an exam to be reliable it needs to measure candidates’ performance accurately and consistently according to the objectives of the exam.

Some exams, such as ranking exams, seek to understand how a candidate performs in comparison to their peers, i.e. who the better and worse candidates within a cohort are.

Others, such as final year medical exams, determine who is competent in the subject, e.g. which final year medical students have the appropriate skills, knowledge and character to be safely entrusted with their patients’ welfare.

Exam Reliability Statistics

Exam reliability statistics such as Cronbach’s Alpha, which Maxexam uses, help examiners to understand how reliably the exam is measuring different candidates’ ability. In other words, is the combination of questions asked accurately determining which candidates have the required knowledge? These statistics all look at how candidates perform on each question, and whether the candidates who do better overall also do better on each question.

Exam reliability is typically expressed as a number in the range of 0 to 1, with reliability being higher the closer the number is to 1. Cronbach’s Alpha (Maxexam’s measure of exam reliability) will be further explained in our next blog. A reliability closer to 1 (perhaps 0.6 or above) suggests that better candidates are consistently more likely to answer questions correctly.
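As a rough illustration of what such a statistic looks at, the sketch below computes Cronbach’s Alpha from a table of marks using the standard textbook formula. It is not a description of how Maxexam calculates it; the function name `cronbach_alpha` and the example marks are assumptions made up for this illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a table of marks.

    scores: one row per candidate, one column per question.
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two questions")
    # Variance of each question's marks across all candidates.
    item_variances = [
        pvariance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Variance of the candidates' total marks.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("all candidates scored the same total; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical marks: 5 candidates, 4 questions, 1 = correct, 0 = incorrect.
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(marks), 3))  # prints 0.8
```

In this made-up data the candidates with higher totals get the harder questions right as well as the easier ones, which is the pattern that pushes the value towards 1.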

The ‘acceptable’ exam reliability score should depend on the type of exam. An exam covering a broader range of topics will by its nature be likely to have a lower reliability score than one focused on a narrower topic. This is because it is much harder for one candidate to be expert across a range of topics. The content of the exam therefore needs to be considered when interpreting exam reliability statistics.

In addition to the questions in the paper itself, exam reliability will also be affected by other factors including the length of the exam and the number of candidates, with longer exams and a greater number of candidates improving the reliability of exams at measuring candidates’ performance.
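The effect of exam length can be illustrated with the Spearman-Brown prediction formula from classical test theory. The article does not name this formula, so treat the sketch as one common way of estimating the effect. It assumes any added questions are of similar quality to the existing ones, and it says nothing about the number of candidates. The function name and the figures are hypothetical.

```python
def predicted_reliability(current_reliability, length_factor):
    """Spearman-Brown prediction of reliability after changing exam length.

    length_factor: new length divided by current length (2.0 = twice as long).
    Assumes the added questions behave like the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    r, k = current_reliability, length_factor
    return (k * r) / (1 + (k - 1) * r)


# Hypothetical exam with reliability 0.6, doubled in length.
print(round(predicted_reliability(0.6, 2.0), 2))  # prints 0.75
```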

Standard Error of Measure

The standard error of measure (often called the standard error of measurement) is calculated from the exam reliability and the assessment’s standard deviation, and is expressed as a mark or a percentage. It measures the extent to which the questions within an exam may underestimate or overestimate a candidate’s ability. In effect, it creates a range around a measured mark within which the candidate’s “true” mark is likely to lie.

For example, if an exam has a standard error of measure of 3.4%, then the ‘true mark’ of a candidate whose measured mark was 60% is likely to lie between 56.6% and 63.4%. A band of one standard error either side is not a guarantee: if measurement error is roughly normally distributed, it contains the true mark about two times in three. It is worth mentioning here that it would be impossible to develop an exam that is 100% ‘perfect’ at measuring ability, as among other things it depends on the candidate’s performance on the day.
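The calculation can be sketched with the usual classical test theory formula: the standard deviation multiplied by the square root of (1 minus the reliability). The standard deviation of 8.5% and the reliability of 0.84 below are assumed figures, chosen only because they reproduce the 3.4% in the example; the function names are likewise illustrative.

```python
import math


def standard_error_of_measure(std_dev, reliability):
    """Standard deviation of marks times sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability is expected to be between 0 and 1")
    return std_dev * math.sqrt(1 - reliability)


def true_mark_band(measured_mark, sem, n_errors=1.0):
    """Range of n_errors standard errors either side of the measured mark."""
    return (measured_mark - n_errors * sem, measured_mark + n_errors * sem)


# Assumed figures: standard deviation 8.5%, reliability 0.84.
sem = standard_error_of_measure(std_dev=8.5, reliability=0.84)
low, high = true_mark_band(60.0, sem)
print(f"SEM = {sem:.1f}%, band = {low:.1f}% to {high:.1f}%")
# prints: SEM = 3.4%, band = 56.6% to 63.4%
```

Note how a higher reliability or a smaller spread of marks both shrink the band.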

In a written exam there will usually be only one standard error of measure, but it is worth noting that in clinical exams, such as OSCEs, there may be a standard error of measure for each scenario (station) as well as one for the exam as a whole. This is because the grading of individual scenarios can be just as important for determining the final exam result.

Exam Standard Setting and Exam Reliability

Exam reliability is used alongside standard setting to determine the pass mark for an exam.

Exam standard setting methods such as Angoff, Ebel and Cohen are used to define the level of achievement or proficiency required of a ‘just passing candidate’ and the cut-off mark corresponding to that level. The cut-off mark is therefore the lowest mark that a candidate would need to achieve in order to pass if the exam were a perfect measure.
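To make the idea of a cut-off mark concrete, here is a minimal sketch of a basic Angoff calculation. It assumes each judge estimates, for every question, the probability that a just passing candidate would answer it correctly; the cut-off is then the average of the judges’ expected totals. Real panels often add discussion rounds and other refinements that are not shown, and the function name and ratings are invented for illustration.

```python
from statistics import mean


def angoff_cut_off(ratings, marks_per_question=1.0):
    """Cut-off as a percentage of the maximum mark.

    ratings: one row per judge, one probability (0 to 1) per question,
    each being the estimated chance that a just passing candidate
    answers that question correctly.
    """
    n_questions = len(ratings[0])
    if any(len(judge) != n_questions for judge in ratings):
        raise ValueError("every judge must rate every question")
    # Expected total mark of a just passing candidate, per judge.
    judge_totals = [
        sum(p * marks_per_question for p in judge) for judge in ratings
    ]
    max_marks = n_questions * marks_per_question
    return 100 * mean(judge_totals) / max_marks


# Hypothetical panel: 2 judges rating 4 questions.
panel = [
    [0.50, 0.50, 0.75, 0.25],
    [0.75, 0.50, 0.50, 0.75],
]
print(angoff_cut_off(panel))  # prints 56.25
```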

However, we know that exams do not measure a candidate’s ability perfectly, so the standard error of measure is used to adjust the cut-off mark to take this into consideration. In final medical exams, for example, where it is so important that the candidate has the right level of competency, the actual pass mark is often set at one standard error of measure above the standard setting cut-off mark. This is illustrated below.
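In arithmetic terms the adjustment is a simple addition. The sketch below assumes a hypothetical cut-off of 55% and reuses the 3.4% standard error of measure from the earlier example; the function and parameter names are illustrative only.

```python
def adjusted_pass_mark(cut_off, sem, n_errors=1.0):
    """Pass mark set n_errors standard errors above the cut-off mark."""
    return cut_off + n_errors * sem


# Hypothetical figures: cut-off 55%, standard error of measure 3.4%.
pass_mark = adjusted_pass_mark(cut_off=55.0, sem=3.4)
print(f"Pass mark = {pass_mark:.1f}%")  # prints: Pass mark = 58.4%
```

Setting `n_errors` to 0 gives an organisation that passes everyone at or above the unadjusted cut-off mark.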

Grading and Boundary Reviews

While some examining organisations will take a firm view that a candidate only passes if they have achieved the pass mark, others will undertake grading and boundary reviews if candidates are within a standard error of measure of the standard setting cut-off mark. Depending on the view taken by the organisation, they may review anyone within a standard error of measure above or below the cut-off mark, potentially taking other information, such as coursework, into consideration.
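Selecting candidates for such a review amounts to a simple filter on marks. The sketch below is one possible way to express it, with hypothetical candidate IDs, marks and parameter names; whether the band extends above, below or both sides of the cut-off is left as a choice for the organisation.

```python
def boundary_review_candidates(marks, cut_off, sem,
                               include_below=True, include_above=True):
    """IDs of candidates whose mark is within one SEM of the cut-off.

    marks: mapping of candidate ID to percentage mark.
    """
    lower = cut_off - sem if include_below else cut_off
    upper = cut_off + sem if include_above else cut_off
    return sorted(cid for cid, mark in marks.items() if lower <= mark <= upper)


# Hypothetical cohort, cut-off 55%, standard error of measure 3.4%.
cohort = {"C001": 48.0, "C002": 52.0, "C003": 56.0, "C004": 59.0}
print(boundary_review_candidates(cohort, cut_off=55.0, sem=3.4))
# prints: ['C002', 'C003']
```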