Determining the Standard of Difficulty of an Examination/Test Paper
Fred B Shaw, Evaluation Consultant
INTRODUCTION
It is a well-accepted fact that teachers are able to rank students fairly accurately, but are not able to determine the absolute achievement level very well. The problem appears to be that we do not yet know enough about learning and cognitive functioning to describe achievement levels with any degree of accuracy. The human brain functions in different ways and styles. It is so versatile that people are able to use a variety of solution algorithms for any task set. In the text of this paper, the words 'test' and 'examination' are used interchangeably.
BACKGROUND
The passing score in a test is not a natural node in the score distribution of a test, but represents a deliberate judgement on the part of the test designer. An easy test may have a relatively high passing score, while a difficult test may have a relatively low passing score. The passing score in itself does not reveal anything about the real standard of the test. The overall difficulty or standard is determined by the difficulty of each task and the number of tasks set in the 'question paper' (Note 1). The methods that will be discussed in this paper are designed to offer a solution to the problem of establishing test standards.
Many procedures have been tried in an attempt to determine the passing score of a test objectively, but little practical success has been reported to date. Basically there are two approaches:
- to include all the desired tasks in the test paper and thereafter to determine an expected passing score by analysing the difficulty of the tasks set in the paper; or
- to manipulate the contents of a question paper so that it contains the right balance of tasks to correspond with the traditional pass-fail cut-off score.
Methods that will be briefly dealt with are:
- The Angoff method
- Bloom's Taxonomy
A third alternative will then be proposed that may provide a comparative solution for the problem.
THE ANGOFF METHOD
The purpose of the Angoff method is to determine a suitable cut-off score which is appropriate to the tasks set in the test.
A panel of judges is asked to analyze the paper in terms of probabilities. Each judge is required to consider each question as a whole and to decide on the probability that a borderline test taker will respond to the question correctly. Probabilities are expressed in terms of numbers between 0.00 and 1.00. Judges who have difficulty in arriving at a decision are asked to consider the number from a hundred borderline students who would probably respond correctly to the task.
After all the judges have been allowed to discuss their reaction to each question and have also been allowed to change their minds, the probabilities for each question given by each judge are averaged. Thereafter, the probabilities for all the questions are added. The sum determines the expected passing score for the test (Note 2). This is illustrated in Table 1.
There are two disadvantages to this method, viz. it requires the opinion of a panel of judges; and in practice, it is difficult to determine the probabilities of the success of borderline candidates.
Table 1: Example of the application of Angoff's method for determining the expected passing score of a test
| Question | Probability of Correct Response |
| 1 | .90 |
| 2 | .85 |
| 3 | .40 |
| 5 | .20 |
| 6 | .30 |
| 7 | .70 |
| 8 | .75 |
| 9 | .80 |
| 10 | .55 |
| SUM (expected passing score) | 6.50 |
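The averaging and summing steps of the Angoff method can be sketched in a few lines of Python. This is a minimal illustration only: the ratings are invented (three hypothetical judges, four one-mark questions) and are not the values of Table 1, and the function name `angoff_passing_score` is an assumption made for this sketch.

```python
from statistics import mean


def angoff_passing_score(ratings):
    """Return (per-question mean probabilities, expected passing score).

    ratings[j][q] is judge j's estimate (0.00 to 1.00) of the probability
    that a borderline test taker answers question q correctly.
    """
    n_questions = len(ratings[0])
    for row in ratings:
        if len(row) != n_questions:
            raise ValueError("every judge must rate every question")
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError("probabilities must lie between 0.00 and 1.00")

    # Step 1: average the judges' probabilities for each question.
    per_question = [mean(column) for column in zip(*ratings)]
    # Step 2: add the averaged probabilities over all questions.
    return per_question, sum(per_question)


# Hypothetical ratings: 3 judges x 4 one-mark questions (not from Table 1).
ratings = [
    [0.90, 0.60, 0.40, 0.20],
    [0.80, 0.70, 0.50, 0.30],
    [0.70, 0.50, 0.30, 0.10],
]
per_question, passing_score = angoff_passing_score(ratings)
print(f"Expected passing score: {passing_score:.2f} out of {len(per_question)}")
```

With these invented ratings the averaged probabilities are about 0.8, 0.6, 0.4 and 0.2, so the expected passing score is about 2 marks out of 4.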
THE USE OF BLOOM'S TAXONOMY
In 1956 a theoretical framework which could be used to facilitate "the exchange of information about ... curricular developments of evaluation devices" was proposed by a committee of college and university examiners (Note 3).
The Committee recognised three domains of behaviour, viz. the cognitive, the affective and the psychomotor. The major purpose of the taxonomy was to foster better communication among educators (Note 4). The "use of the taxonomy as an aid in developing a precise definition and classification of such vaguely defined terms as 'thinking' and 'problem-solving' would enable a group of schools to discern the similarities and differences among the goals of their different instructional programs" (Note 5).
The taxonomy was designed to be 'a classification of the student behaviours which represent the intended outcomes of the educational process ... the ways in which individuals are to act, think or feel as a result of participating in some unit of instruction.... The emphasis in the Handbook is on obtaining evidence on the extent to which desired and intended behaviours have been learned by the students' (Note 6).
A taxonomy is simply a classification scheme of things within a larger system according to their similarities and differences (Note 7). The taxonomy contains six major classes which are shown in Table 2.
Table 2. The main categories of Bloom's Taxonomy
| 1.00 | Knowledge |
| 2.00 | Comprehension |
| 3.00 | Application |
| 4.00 | Analysis |
| 5.00 | Synthesis |
| 6.00 | Evaluation |
Bloom writes, 'As we have defined them, the objectives in one class are likely to make use of and be built on the behaviours found in the preceding class in this list.... Our attempt to arrange educational behaviours from simple to complex was based on the idea that a particularly simple behaviour may become integrated with other simple behaviours to form a more complex behaviour.... Problems requiring knowledge of principles and concepts are correctly answered more frequently than problems requiring both knowledge of the principle and some ability to apply it in new situations. Problems requiring analysis and synthesis are more difficult than problems requiring comprehension.... Our evidence on this is not entirely satisfactory, but there is an unmistakable trend pointing toward a hierarchy of classes of behaviour which is in accordance with our present tentative classification of these behaviours' (Note 8).
In general, teachers and practitioners have found difficulty in using and applying this 'hierarchical' taxonomy in test practice. Questions are not always readily classified according to this scheme. Three possible reasons for this difficulty are:
1. The taxonomy was not strictly designed for testing purposes, but for curriculum purposes. It is intended to be more descriptive than evaluative.
2. The taxonomy is designed more particularly to classify educational and curriculum goals. If these goals are not specifically mentioned in the curriculum, it is not practical to attempt to analyze and evaluate associated test papers in terms thereof.
3. The hierarchical structure of the taxonomy is tentative. The difference in difficulty between tasks related to the upper levels of the taxonomy has not been proved.
Ormell found the taxonomy to be a "disappointingly blunt instrument" (Note 9). The major criticism is that the taxonomy does not allow for the development of 'imaginative understanding', which Ormell believes to be an essential ingredient for mature learning and understanding (Note 10). Essay writing in history, for example, encompasses a task which represents a combination of all levels of the taxonomy, from the stating of essential facts to evaluation (Note 11). Thus it is hypothesised that 'understanding' involves a compound of awareness-of-what-is-the-case and imagination (Note 12). Imaginative assimilation, connectivity, understanding, and projective applicability are regarded as essential products of education which are not taken into account by the taxonomy.
In order to make the taxonomy more user-friendly, many persons have, however, adapted it into three levels, viz. knowledge, understanding and the higher cognitive skills. Such a classification may be useful, but remains essentially inadequate.
THE LEARNING MOTIVATION MODEL AND GAGNE'S DOMAINS OF LEARNING
The learning motivation model is a three-tier model for learning and teaching:
- Level 1, which comprises the formation of associations;
- Level 2, which builds on the base of the associations from which concepts and principles emerge; and
- Level 3, or the level of creative self-direction.
The latter type of learning provides the learner with understanding and an opportunity to initiate his own creative activities independently (Note 13). It actually fulfils the educative requirement of imaginative understanding, which Ormell found missing from Bloom's taxonomy.
Gagne identified five types of human learning in terms of the necessary conditions and outcomes of learning processes, viz.
- verbal information, which consists of names or labels, single propositions or facts, and collections of facts organised in discourse;
- intellectual skills or procedures, which are rule-governed;
- psycho-motor skills, which require the execution and co-ordination of muscular movement;
- attitudes, which establish preferences in the individual's choice of action; and
- cognitive strategies, which are internally organised skills that govern the individual's behaviour in learning, remembering, and thinking (Note 14).
Cognitive strategies are used in the solution of novel problems and explore imaginative understanding. As such they underpin creativity, which is an essential part of education and must not be neglected in assessment procedures. The three cognitive operational levels (verbal information, intellectual skills and cognitive strategies) are not static, but dynamic. Once a problem has been solved and the procedure has been learned, the solution process reverts to a matter of applying intellectual skills. Over-learned intellectual skills tend to become automatic responses which operate at the level of verbal information.
A MODEL FOR CLASSIFYING TASK DIFFICULTY
Combining these two models and extending them to multi-dimensional applications produces a general four-tiered hierarchical model of cognitive operation which can be applied to classify task difficulty. The four basic functional levels are:
- the level of recall, which includes the use of 'verbal knowledge';
- the level of application, in which intellectual skills and algorithmic paths are chosen and applied;
- the heuristic level, in which cognitive strategies are principally used; and
- the multi-dimensional level, in which cognitive operations from two or more dimensions are applied.
The latter two levels require imaginative understanding to produce vision, possibly the 'Aha' phenomenon, and provide new connections between prior knowledge and new information.
The model is more fully described in Table 3. The categories of complexity are hierarchical, while three parallel exemplars are provided in the table.
The usefulness of this model is that it is not dependent upon curriculum content or objectives and can be readily applied to any examination task.
In applying the model, one needs to realise that learning is a dynamic process and that, like most people, students are in a state of continuous learning. Continuous exposure to test tasks down-grades the level of the tasks. What at first may be a novel task at the heuristic level becomes, with more practice, a rule-governed operation, an intellectual skill. For this reason examinations and tests become obsolete once they become public knowledge. Under these conditions students might even learn the right answers or procedures without understanding the process, thereby circumventing the operation of the model.
For teachers the tasks are almost always at either the first or the second level, because of their wider experience and continual exposure to a particular field of knowledge. Thus, in judging the functioning level of a set of tasks contained in a test, one has to ask oneself how the average learner will perceive the task.
The allocation of tasks to the different categories is always dependent on the opportunity to learn. If an examination is administered to two different groups who have had different learning experiences, the allocation of tasks to the different categories will need re-evaluation. Tasks outside the general experience of one group can only be allocated to the highest category, because although the task may involve simply describing a concept, if the concept is outside the normal experience of the candidates in question, they would have to construct such a description for themselves in order to supply a correct response.
APPLYING THE FUNCTIONAL-LEVEL MODEL
The model is readily applied in the design of a test or in comparing the standard of difficulty of two tests. Each task is allocated to one of the categories in the model. The number of tasks per category is then counted to obtain a task profile for the test. A typical task profile for a three-hour examination is shown in Figure 1. The four hierarchical groups of tasks are clearly shown in the figure.
Figure 1. Task profile for an examination
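A task profile of this kind is simply a count of tasks per category. The sketch below assumes that the four levels of the model are encoded as an ordered enumeration and that each task has already been allocated to a level by a judge. The names `Level` and `task_profile` and the sample allocation are hypothetical; they are not the data behind Figure 1.

```python
from collections import Counter
from enum import IntEnum


class Level(IntEnum):
    """The four functional levels of the model, ordered from simple to complex."""
    RECALL = 1
    APPLICATION = 2
    HEURISTIC = 3
    MULTI_DIMENSIONAL = 4


def task_profile(allocations):
    """Count how many tasks fall in each level, including empty levels."""
    counts = Counter(allocations.values())
    return {level: counts.get(level, 0) for level in Level}


# Hypothetical allocation of six tasks to levels (illustration only).
allocations = {
    "Q1": Level.RECALL,
    "Q2": Level.RECALL,
    "Q3": Level.APPLICATION,
    "Q4": Level.APPLICATION,
    "Q5": Level.APPLICATION,
    "Q6": Level.HEURISTIC,
}
profile = task_profile(allocations)
for level, count in profile.items():
    # Print a simple text bar chart of the profile.
    print(f"{level.name:<18}{'#' * count} ({count})")
```

Keeping empty levels in the result makes it easy to see at a glance which categories a paper does not exercise at all.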
For long questions, for example, essays, the allocation of tasks to categories will depend upon the structure of the marking scheme or memorandum.
If this method is used in compiling an examination, tasks within an examination paper may be exchanged with other suitable tasks until a desired balance is obtained.
When the method is used to compare the standard of two examinations, the examination structure will be readily observed and the standard of the examination at any score will be transparent. The task difficulty profile for four examinations is shown in Figure 2. It is clear from this that it is difficult to speak of an examination standard in general. One can only speak of an examination standard at an expected score.
Figure 2. Profile of different examination standards
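One way to make 'the standard at an expected score' concrete is to ask which level a candidate must reach to obtain that score. The sketch below rests on a strong simplifying assumption that is not stated in the paper: marks are earned strictly from the easiest level upward. The function name `level_required` and both mark distributions are hypothetical.

```python
LEVELS = ["recall", "application", "heuristic", "multi-dimensional"]


def level_required(marks_per_level, score):
    """Return the lowest level a candidate must reach to obtain `score` marks.

    Assumption (not from the paper): marks are earned strictly from the
    easiest level upward, so the cumulative mark total decides the level.
    """
    if score < 0 or score > sum(marks_per_level):
        raise ValueError("score must lie between 0 and the paper total")
    cumulative = 0
    for name, marks in zip(LEVELS, marks_per_level):
        cumulative += marks
        if score <= cumulative:
            return name
    return LEVELS[-1]  # not reached when the range check above passes


# Two hypothetical 100-mark papers: marks available at each of the four levels.
paper_a = [50, 30, 15, 5]
paper_b = [30, 30, 30, 10]

for score in (50, 80):
    print(score, level_required(paper_a, score), level_required(paper_b, score))
```

Under this assumption a score of 50 can be reached on recall alone in the first paper but requires application-level tasks in the second, which illustrates why the same score can represent different standards.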
It needs to be remembered that each category is not fixed and that two persons may allocate tasks to two adjacent categories. This variance in the allocation of tasks is bound to balance out when spread over the variety of all the test tasks, and a fair estimate of the standard of the test will be obtained in all probability.
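Whether the differences between two judges really do balance out can be checked with a simple comparison. The sketch below is an illustrative check, not a procedure taken from the paper. It assumes levels are coded 1 to 4, and the function name `compare_allocations` and both sets of allocations are invented.

```python
def compare_allocations(judge_a, judge_b):
    """Summarise how two judges' level allocations (coded 1 to 4) differ.

    Returns the share of tasks given identical levels, the share placed
    within one adjacent level, and the mean signed difference (b minus a).
    """
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must rate the same non-empty task list")
    diffs = [b - a for a, b in zip(judge_a, judge_b)]
    n = len(diffs)
    exact = sum(d == 0 for d in diffs) / n
    adjacent = sum(abs(d) <= 1 for d in diffs) / n
    mean_shift = sum(diffs) / n
    return exact, adjacent, mean_shift


# Hypothetical allocations of eight tasks by two judges.
judge_a = [1, 1, 2, 2, 3, 3, 4, 2]
judge_b = [1, 2, 2, 1, 3, 4, 3, 2]
exact, adjacent, mean_shift = compare_allocations(judge_a, judge_b)
print(f"exact: {exact:.2f}, within one level: {adjacent:.2f}, mean shift: {mean_shift:+.2f}")
```

In this invented case the judges agree exactly on half of the tasks, never differ by more than one level, and the upward and downward differences cancel, giving a mean shift of zero.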
EXAMPLES OF ALLOCATION OF TASKS TO DIFFERENT CATEGORIES
The following is a selection of Std 7 end-of-year Mathematics questions: (diagram)
Each of these tasks can be allocated to one of the described categories. The number shown in brackets is a measure of the difficulty of the item on a Rasch scale. In the case of multiple-choice items, which provide additional information in the distracters, candidates may use other algorithms to solve the task correctly. This possibility has to be taken into account when allocating the items to the different categories discussed.
For example, in Q6, the correct answer can be obtained by simply testing each of the distracters for the conditions stated in the problem. It is, therefore, not surprising that this item is associated with the lowest difficulty.
Q8 has the highest difficulty. None of the distracters provides any hint of the correct solution. The item is a true problem that can best be categorised as one that fits the heuristic level for Std 7 candidates.
Readers are left to allocate each of these tasks to one of the four hierarchical categories and compare their allocation with the empirical evidence.
CONCLUDING REMARKS
The scheme is based on didactic and learning principles. While it assumes that there is curriculum alignment, it also ensures that students are extended in the examination to think and apply their learning in the best possible way. Without such extension, mature learning cannot be evaluated.
The above scheme is an attempt to estimate the standard of examination through levels of item functioning. The scheme is not entirely foolproof, but on the other hand it offers a possible solution to this problem for which no other practical solution exists. It offers a means whereby the difficulty of two papers can be compared before the papers are written. Its application, however, requires a fair amount of time to implement.
From the above, it is clear that it is not possible to speak of the standard of an examination paper. As explained previously, depending upon the distribution of items in the different difficulty categories, the sum of scores derived from these categories sets a different standard at different scores. Thus one can only refer to the distribution of standards in an examination. The model described is designed to cultivate suitable distributions of standards for examinations.
NOTES
1. The word 'task' refers to the action or response required from the examinee to a stimulus set in the question paper. A task may include an answer to a question or the carrying out of an instruction given by the examiner.
2. Livingston SA & Zieky MJ. 1982. Passing Scores. Princeton: ETS.
3. Bloom BS (Ed), Engelhart MD, Furst EJ, Hill WH & Krathwohl DR. 1956. Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain. New York: David McKay Co. Inc., p. 1.
4. Ibid. p.6.
5. Ibid. p.10.
6. Ibid. p. 12f.
7. Collins Cobuild English Language Dictionary. London: Collins.
8. Op. cit. p. 18f.
9. Ormell CP. 1974. Bloom's Taxonomy and the Objectives of Education. Educational Research, 17(1), November 1974, p. 3.
10. Ibid. p. 9.
11. Ibid. p. 7.
12. Ibid. p. 15.
13. Wilson JAR, Robeck MC & Michael WB. 1974. Psychological Foundations of Learning and Teaching. New York: McGraw-Hill.
14. Gagne RM. 1977. The Conditions of Learning. New York: Holt, Rinehart & Winston. p. 182ff.
15. Adapted from Shaw FB. 1983. A didactic study of the problem of the establishment of standards in educational evaluation. Doctoral thesis. Pretoria: University of South Africa.