A Valid and Reliable Way of Assessing Oral Language Proficiency: Is It Really Possible?

 A A Roux, University of South Africa


For the meaningful development of oral language proficiency at school, and for the reliable selection of people to fill positions in which oral language proficiency plays an important role, it is essential that a valid, reliable and practical method be available for the evaluation of oral language proficiency. This paper introduces a method which meets the above-mentioned requirements, which has already been implemented in practice and which can serve as a model for the development of similar methods.

 INTRODUCTION

I have a friend with an English speaking background, who attended a double medium primary school, an English medium secondary school, and an Afrikaans medium Teacher Training College, and who started his teaching career in a predominantly Afrikaans medium secondary school. Eventually he was bilingual to such an extent that Afrikaans speaking people could hardly believe that he was in fact English speaking. As a result, he referred to himself as a "detribalized Englishman". I shall refer to my friend again a bit later on.

Without the ability to communicate it would not be possible for mankind to develop and to expand his knowledge. What is meant in the first instance when referring to communication is spoken communication. One cannot disagree with this statement if one keeps in mind that the tradition of oral communication might be considered the oldest and that the culture, history and tradition of many nations were passed on orally from generation to generation. Although many forms of human communication exist today, speech has established itself as the most important form of communication. The importance of speech is reaffirmed every day by the increasing role played by the electronic media. The old truth that the pen is mightier than the sword can be extended today by saying that the spoken word can achieve more than the gun. Both sides can gain from negotiations. The battlefield results in losses only. The spoken word as medium of communication can promote the transfer of viewpoints, especially when it is done in an effective way. What could be observed of the negotiations and debates that preceded the first democratic election in South Africa stressed the importance of successful oral communication but also made one very much aware of the consequences of a lack thereof. It underlined the fact that no authority or employer can afford to put a person in a position which demands oral language proficiency if he cannot express himself successfully by means of the spoken word. In South Africa, where more than one language claims equal treatment, the situation is not made easier. The person who can claim oral language proficiency in more than one language can therefore consider himself to be in a very privileged position. That brings me back to my friend, the detribalized Englishman.

To be considered "detribalized" in this respect could be very advantageous. It is however not the aim of this paper to elaborate on the advantages of being orally bilingual. I think that is self-evident. The purpose is to discuss a method to evaluate a person's oral language proficiency in a valid and reliable way.

THE EVALUATION OF ORAL LANGUAGE PROFICIENCY

The evaluation of oral language proficiency does not apply to the teaching situation only. The demands in more than one career are of such a nature that potential employees, or employees in line for promotion, have to prove their ability to communicate orally in a certain language. Evaluation in this respect is not without problems. Hughes (1989) formulates it as follows:

The accurate measurement of oral ability is not easy. It takes on considerable time and effort to obtain valid and reliable results.

Because of the time and subjectivity involved no one can doubt the fact that it is very difficult to evaluate oral language proficiency in a valid and reliable way. It therefore comes as no surprise that a number of people involved in the assessment of language proficiency reject the idea of evaluating or examining oral language proficiency. Others are of the opinion that participation in a spoken language is more valuable than the assessment thereof.

One can therefore ask: Why is it necessary to evaluate oral language proficiency? The answer is simple: When the child has learnt the basics of the language involved, enrichment and reinforcement of the system of language should follow at school level in order to communicate successfully (Myers, 1987). That is achieved by teaching, and teaching is very difficult without evaluation. As Backlund (1985) puts it:

Assessing is a necessary and useful part of instruction.

Evaluation guides teaching. No valid judgement of a pupil's language proficiency at the end of an academic year or at the end of his school career is possible if his oral language proficiency has not been evaluated. According to Bachman and Clark (1987) the purpose of evaluation is to determine not only whether the aims of teaching have been achieved but also whether those of a career have been. If an acceptable level of oral language proficiency is considered a prerequisite for success in a certain career, no judgement in that regard is possible without evaluation when a person has to be appointed or promoted. Therefore, in order to have meaningful development of oral language proficiency at school (which includes teaching and assessment) and for the valid and reliable selection of people to fill positions in which oral language proficiency plays a vital role, it is important to have a valid, reliable and practical method of evaluation available.

ORAL LANGUAGE PROFICIENCY

Before I discuss the important issues of validity and reliability it is necessary to get a clear picture of oral language proficiency as such. Snyder et al. (1987) refer to proficiency as performing with expert correctness and facility. According to Spolsky (Trengove, 1976) and Levine and Haus (1987) oral language proficiency enables the speaker to function in a natural language situation or an authentic situation determined by a specific context (Staab, 1986). The many different views on oral language ability that can be found in the literature all boil down to the following, namely that it can be regarded as the relative ability of a human being to function in an acceptable way by means of the language concerned in a specific communicative situation.

This definition has some implications. Firstly, it implies that a specific level of oral language proficiency in one language does not necessarily apply to another language too. Secondly, a person whose oral language proficiency is at a certain level in one situation is not necessarily at the same level in another situation. One cannot, for example, expect a typist to perform at the same level as a commanding officer in this regard. It is important to keep that in mind when the issue of validity and reliability of oral language proficiency comes under discussion. Thirdly, it should be kept in mind that a person who is capable of putting his thoughts in writing is not necessarily capable of expressing them orally too.

 VALIDITY AND RELIABILITY

In the discussion of validity and reliability it is necessary to take note of the present situation concerning the evaluation of oral language proficiency. At school level the evaluation of oral language proficiency leaves much to be desired as far as these aspects are concerned. A mark awarded for oral language proficiency forms part of the annual promotion mark and final mark for the Senior Certificate Examination of the language subjects. Each of these marks is supposed to be the result of cumulative evaluation. However, these marks are in general the result of subjective evaluation. Many, if not most, language teachers are not trained or experienced in evaluating oral language proficiency. According to Vorster (1980) even the most experienced and trained listener is only capable of a highly impressionistic judgement of a person's oral language proficiency. Objective and reliable evaluation of oral language proficiency cannot be based on that. In order to be valid, continual evaluation should be based on language utterances representing the language spectrum as a whole. From the beginning to the end of a school year, with all the changing circumstances and moods, it is almost impossible to maintain the same standard of evaluation in the awarding of marks.

In practice there is usually only one person responsible for evaluation in spite of the fact that subjective judgement plays such an important role.

Against this background one may ask whether a teacher will feel obliged to award a low mark for oral language proficiency to a standard ten pupil who is a possible candidate for a distinction.

In the private and public sectors evaluation is usually done by means of an interview or by asking the person concerned to talk about a specific topic. Both these methods can be regarded as unreliable because they are based on subjective judgement.

Reliability refers to the value that can be attached to the mark awarded as a result of evaluation. According to Guilford and Fruchter (1978) it is the measurement rather than the measuring instrument that is said to have the property of reliability. Bachman (1990) elaborated on that by saying that:

reliability thus has to do with the consistency of measures across different times ... rather than other characteristics of the measurement context.

According to Hannah and Oosthuizen (1984) the most important factors which determine reliability and which are also applicable to the evaluation of oral language proficiency, are:

The last two might be considered the most important. As far as the range of a person's ability is concerned, no reliable indication of his oral language proficiency can be expected if he has no or very little knowledge of the topic he has to talk about.

Because subjective judgement can be considered a crucial factor where the evaluation of oral language ability is concerned, it is of vital importance that evaluation be conducted in a way that will ensure that the marks awarded are as reliable as possible.

If reliability is considered to be important the only solution is that the evaluation of oral language proficiency should meet the following requirements:

If the evaluation does not meet these requirements, the marks allocated will be a mere representation of the evaluator's impressions of the person's ability and not an accurate, objective and fair assessment.

It is generally accepted that reliability is a prerequisite for validity. A method of evaluation that does not produce reliable results in fact does not evaluate anything and cannot be considered valid (Fulcher, 1987; Bachman, 1990) because validity refers to the extent to which a method of evaluation might be considered successful (Hannah and Oosthuizen, 1984). To be considered valid, the evaluation mark which is awarded should give a meaningful indication of a person's oral language proficiency. It also implies that the language utterance which is expected should be in the person's field of experience and that it should be related to the purpose for which a specific level of oral language proficiency is expected of him.

Validity might be influenced by factors relevant to the method by which oral language utterances are obtained. Reading proficiency might be regarded as a factor if the material by means of which the person has to be stimulated is in written form. If the person has to listen first, listening proficiency might be a factor. Interpretation might be a factor if pictures are used to stimulate the speaker. It is almost impossible to exclude all these factors totally. However, efforts should be made to minimize their influences.

To summarize one could therefore say that in order to be valid and reliable the evaluation of oral language proficiency should meet the following requirements:

A STANDARDIZED METHOD FOR EVALUATION

A few years ago the then South African Defence Force (SADF) requested the Human Sciences Research Council (HSRC) to develop standardized instruments (tests) for the evaluation of oral language proficiency in Afrikaans and English. These would be used to evaluate SADF members in line for promotion. Not only did the test have to be valid and reliable, but it also had to be as useful as possible.

Only a small number of standardized tests for the evaluation of oral language proficiency, which could possibly be used as models to develop a new test, existed at that stage and they mainly used the interview method in one or other form. The reliability of this method was already under suspicion and it was decided to look for an alternative method.

In view of the fact that both speaking and writing are considered to be productive modes and that the same problems as far as evaluation is concerned are experienced in both cases, research that had been done on writing proficiency proved to be most valuable. Essays seemed to be more or less generally accepted as the best way to evaluate writing proficiency. It was therefore decided that speeches would form the best base for the evaluation of oral language proficiency.

Harris (1969) and Ebel (1979) gave the following guidelines to improve the reliability of the evaluation of writing proficiency:

With these guidelines in mind and with the intention to develop instruments by means of which the most reliable evaluation possible could be done, it was decided to base it on a method developed by Gosling (1966) for the Commonwealth Secondary Scholarships Examination in Australia. According to this method scale models are used as criteria to evaluate essays. This implies that for each topic there should be one or more selected examples representative of a specific point on an evaluating scale. Before they start evaluating the essays on a certain topic, the evaluators have to familiarize themselves with the relevant scale models. What this method means in practice is that there should be a number of topics available which could be used alternately. For each topic there should be scale models available, each representing a specific point on the scale that will be used; for example a 5-point scale for practical purposes. If necessary the points could be converted to a percentage when evaluation has been completed. According to research this method has already made it possible to obtain correlations as high as 0,90 between evaluators. This method has already been used by the Educational Testing Service (ETS) in three different tests (Mullis, 1984). According to Godshalk, Swineford and Coffman (1966) an interrater reliability as high as 0,92 is possible with this method when as many as five evaluators are involved.

This method is not without disadvantages. A considerable number of topics should be available.

For each topic scale models have to be selected. In order to obtain them, a substantial number of presentations will have to be evaluated. This method, however, seems to present the best alternative model on which a test for a valid and reliable evaluation of oral language proficiency could be based.

This method of evaluation made the use of tape recorders a necessity. According to the literature on this topic there cannot be any doubt about the advantages of using a tape recorder for this purpose. The following are worth mentioning:

To ensure that the evaluation would be as reliable as possible, it was decided to use both global and analytical methods of evaluation. Evaluation would be done according to a 5-point scale and criteria would be determined for each scale point. Five demarcated and defined categories would be used for analytical evaluation. More than five would be impractical and difficult to handle.

PLANNING

Twenty-four topics, Afrikaans as well as English, on which any SADF member would be able to talk easily were selected by a panel of experts. These twenty-four topics were divided into four groups or so-called test forms. In each test form of six topics there was an equal number of narrative, argumentative and descriptive topics. Each topic and its descriptive details were typed on a separate card.

The procedures for administering and evaluating and the procedures according to which allocated marks should be handled were the same for the Afrikaans and English tests. The discussion from now on, and the results that will be presented, will be applicable to the Afrikaans test only. Details about the English test could be obtained from Dr J C Chamberlain of the HSRC who was the chief project leader.

 ADMINISTERING THE TEST

The test was not applied to a representative sample of SADF members because that proved to be impractical and almost impossible. The aim was, however, to involve at least people from representative subgroups of the population concerned.

Two test forms each were applied to two groups of testees, of respectively 108 Afrikaans and 100 English speaking people in the one and 98 Afrikaans and 74 English speaking people in the other group. In view of the fact that no distinction is made between first and second language in a post school situation like the SADF, the same topics were used for Afrikaans and English speakers. Each person had to talk on twelve topics, firstly on six from one test form and after a break on six from another form. The tests were administered by a team of trained persons at different locations in South Africa in rooms specially prepared for this purpose.

Each testee was first informed in a standard way about what was expected of him/her. His name was then recorded. The first one of the six topic cards was then given to him. He was then allowed two minutes to prepare, after which the tape recorder was switched on, the number of the topic was recorded and the testee was allowed two and a half minutes to talk. The procedure was repeated for the next five topics and, after a break, for the other six topics.

 EVALUATION OF THE SPEECHES

The evaluation was done under supervision at a central venue in Pretoria by a team of experts in the service of the SADF. Each speech was evaluated independently by between five and ten evaluators. Each evaluator had to complete the evaluation of all the speeches on one topic before he could start evaluating the speeches on the next topic. Each evaluator had to write his name, the topic number and the identifying information that appeared on the tape, as well as the marks allocated, on a specially designed mark sheet. Each evaluator had to acquaint himself thoroughly with the explanation of the criteria for evaluation before he could start evaluating. He then had to listen to a speech and allocate a global mark. He had to listen to the same speech a second time, evaluate it analytically and allocate a mark for each category.

 DATA PROCESSING

Computation of the results was done separately for Afrikaans and English speakers. The following were computed for each topic: mean and standard deviation for each evaluator, minimum and maximum marks allocated and correlation coefficients. As a result of these computations, evaluators whose marks deviated drastically were eliminated in order to increase the interrater reliability and further computations were done. In no instance were the marks of fewer than five evaluators taken into account in the final computation. The following conclusions could be made:

SELECTION OF TOPICS

Topics were selected by means of the intercorrelations between allocated marks, means and standard deviations which were computed for Afrikaans and English speakers separately. It was not necessary to eliminate any topic, therefore all twenty-four topics could be included in the test.
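The computations named under DATA PROCESSING and used again for topic selection (mean and standard deviation per evaluator, minimum and maximum marks, and correlation coefficients between evaluators) can be illustrated with a minimal sketch. The paper says only that evaluators whose marks "deviated drastically" were eliminated and that never fewer than five were retained. The elimination criterion used here (an evaluator's average correlation with the others falling below a threshold of 0.5), the function names and the data layout are all assumptions made for illustration, not the procedure the HSRC actually followed.

```python
from itertools import combinations
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of marks.

    Each list must contain some variation, otherwise the denominator is zero.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def summarize(marks):
    """Mean, standard deviation, minimum and maximum per evaluator for one topic."""
    return {
        name: {"mean": mean(m), "sd": stdev(m), "min": min(m), "max": max(m)}
        for name, m in marks.items()
    }


def mean_correlation_with_others(marks):
    """Average correlation of each evaluator with every other evaluator."""
    per_evaluator = {name: [] for name in marks}
    for a, b in combinations(marks, 2):
        r = pearson(marks[a], marks[b])
        per_evaluator[a].append(r)
        per_evaluator[b].append(r)
    return {name: mean(rs) for name, rs in per_evaluator.items()}


def drop_deviating(marks, threshold=0.5, min_evaluators=5):
    """Remove the least-agreeing evaluators one at a time.

    ASSUMPTION: "deviating drastically" is modelled as a mean correlation
    with the other evaluators below `threshold`. The floor of five
    evaluators comes from the paper; the threshold does not.
    """
    kept = dict(marks)
    while len(kept) > min_evaluators:
        agreement = mean_correlation_with_others(kept)
        worst = min(agreement, key=agreement.get)
        if agreement[worst] >= threshold:
            break
        del kept[worst]
    return kept
```

Here `marks` maps an evaluator's name to the list of marks he allocated to the same speeches, in the same order, on one topic. Because the loop stops as soon as five evaluators remain, a poorly agreeing evaluator is kept when removing him would breach the minimum stated in the paper.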

SCALE MODELS AND PRACTICE EXAMPLES

In order to obtain scale models for the highest and lowest points on the evaluating scale it was necessary to combine the marks of the Afrikaans and English speakers in each of the two groups. It was a logical step in the light of the bilingual framework of the SADF. Marks already allocated were not influenced by that. This resulted in larger correlation coefficients and a higher interrater reliability.

By using the standard deviation for the allocated marks for each topic, scale models and practice examples were selected. The following guidelines were used:

Three scale models, representative of a high, average and low point on the scale were selected for each topic to serve as a guideline for the required standard of evaluation. One or more practice examples for each topic, representative of each point on the scale were also selected.

In practice the evaluator will have to evaluate these practice examples until he has reached the required standard before he may start evaluating new speeches on that specific topic. A panel of evaluators of the SADF evaluated the selected scale models and practice examples once more to verify the mark which each one is supposed to represent. The scale models and practice examples for each topic were recorded on side A and side B of a separate tape respectively. The mark allocated for each model or example was also recorded a few seconds after the relevant speech.

THE TEST FORMS

Eight test forms with three topics each (narrative, descriptive and argumentative) were finalized, four test forms respectively from the two groups of topics applied to the same testees. The eight test forms were compiled in such a way that their means and standard deviations were as equal as possible and that they correlated very well. In all cases the reliability, which was computed for the means by using the standard error of the difference between means, overlapped. The correlation coefficients might be considered very high and the reliability, as based on that, acceptable. That made the alternate use of the eight test forms in future possible.

STANDARD PROCEDURES FOR ADMINISTERING AND EVALUATION

To ensure that the test will be applied and evaluated according to standard procedures, every step, from the introductory words up to the criteria for evaluation, was included in two manuals to be used by the persons responsible for administering and evaluating the test respectively. Those responsible for administering the test will have to record the speeches on tape. Two persons will be responsible for evaluating the speeches without consultation with each other. In cases where there is a difference of at least one mark between the evaluators a moderator will have to give the final decision. In view of the results obtained, evaluators are requested to take note of the categories for analytical evaluation, but rather to use the scale for global evaluation.
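The moderation rule described above can be expressed as a short sketch. The paper states only that a moderator decides when the two evaluators differ by at least one mark. How the two marks are combined when they differ by less than that is not stated, so averaging them is an assumption here, as are the function and parameter names.

```python
def final_mark(mark_a, mark_b, moderator_mark=None):
    """Combine two independent global marks for one speech.

    From the paper: a difference of at least one mark is referred to a
    moderator, whose decision is final.
    ASSUMPTION: marks closer than one mark apart are averaged.
    """
    if abs(mark_a - mark_b) >= 1:
        if moderator_mark is None:
            raise ValueError("evaluators differ by at least one mark: "
                             "a moderator's decision is required")
        return moderator_mark
    return (mark_a + mark_b) / 2
```

For example, `final_mark(3, 3)` gives 3.0, whereas `final_mark(3, 4)` raises an error until a moderator's mark is supplied, as in `final_mark(3, 4, moderator_mark=4)`.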

The evaluator has to study the written guidelines, listen to the scale models for the topic which has to be evaluated and evaluate the relevant practice examples until his evaluation is according to standard. After that has been done he may start evaluating the speeches on that specific topic.

THE TEST IN PRACTICE

Form A of the test was applied together with an unstandardized test for oral language proficiency. The data of 1560 testees were used. A reliability coefficient of 0,92 (KR21) was obtained for the standardized test compared to one of 0,8 for the unstandardized test. Based on percentages the standard deviations for the two tests were 14,00 and 11,9 respectively, which indicates a wider distribution of marks in the case of the standardized test.
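For readers unfamiliar with the coefficient reported above, the Kuder-Richardson formula 21 estimates reliability from only the number of items k, the mean total score M and the variance of the total scores s²: KR21 = (k / (k - 1)) * (1 - M(k - M) / (k * s²)). The formula was formulated for dichotomously scored items, and the paper does not state how it was applied to marks on a 5-point scale or what value of k was used. The sketch below therefore shows the standard formula only, with purely illustrative inputs that are not data from the study.

```python
def kr21(k, mean_score, variance):
    """Kuder-Richardson formula 21.

    k          -- number of items in the test
    mean_score -- mean of the total scores (on a 0..k scale)
    variance   -- variance of the total scores
    """
    if k < 2:
        raise ValueError("KR-21 needs at least two items")
    if variance <= 0:
        raise ValueError("variance of total scores must be positive")
    return (k / (k - 1)) * (1 - (mean_score * (k - mean_score)) / (k * variance))


# Illustrative values only (NOT taken from the paper):
# 20 items, mean total score 12, variance 16.
example = kr21(20, 12, 16)
```

With these made-up inputs the coefficient is 14/19, roughly 0.74. A larger variance for the same mean and test length yields a higher coefficient, which fits the paper's observation that the standardized test showed both a wider spread of marks and a higher reliability.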

Form A was also applied together with Form B to 150 testees. A reliability coefficient of 0,94 (KR21) was obtained for Form B. In the case of both forms the evaluators correlated 0,9 and higher.

 CONCLUSIONS

This test can be regarded as a reliable and valid method for the evaluation of oral language proficiency. It can also serve as a model for the development of similar evaluation methods.

BIBLIOGRAPHY

Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L.F. & Clark, J.L.D. 1987. The measurement of foreign/second language proficiency, in The annals of the American Academy of Political and Social Science, edited by R D Lambert and A W Heston, vol. 490, March 1987: 20-33.

Backlund, P. 1985. Essential speaking and listening skills for elementary school students. Communication Education, vol. 34, July: 185-195.

Ebel, R.L. 1979. Essentials of educational measurement. New Jersey: Prentice Hall.

Fulcher, G. 1988. Lexis and reality in oral evaluation. ERIC Document Reproduction Service No. ERIC RIE 298759.

Godshalk, F.I., Swineford, F. & Coffman, W.E. 1966. The measurement of writing ability. New York: College Entrance Examination Board.

Gosling, G.W.H. 1966. Marking English compositions. Victoria: Australian Council for Educational Research.

Guilford, J.P. & Fruchter, B. 1978. Fundamental statistics in Psychology and Education. New York: McGraw-Hill.

Hannah, C. & Oosthuizen, W.L. 1984. Evalueringsprosedures in die onderwys. Pretoria: Mathematicae.

Harris, D.P. 1969. Testing English as a second language. New York: McGraw-Hill.

Hughes, A. 1989. Testing for language teachers. Cambridge: Cambridge University Press.

Levine, M.G. & Haus, G.J. 1987. The accuracy of teacher judgement of the oral proficiency of high school foreign language students. Foreign language annals, 20(1): 45-50.

Meredith, V.H. & Williams, P.L. 1984. Issues in direct writing assessment: Problem identification and control. Educational measurement: Issues and practice, 3(1): 11-15, 35.

Mullis, I.V.S. 1984. Scoring direct writing assessments: What are the alternatives? Educational measurement: Issues and practice, 3(1): 16-18.

Myers, P.I. 1987. Assessing the oral language development and intervention needs of students. Austin, Texas: PRO-ED.

Snyder, B., Long, D.R., Kealey, J.R. & Marckel, B. 1987. Building proficiency: Activities for the four skills, in Proficiency policy and professionalism in foreign language education. ERIC Reproduction Service no. ED 285419.

Staab, C.F. 1986. Eliciting the language function of forecasting/reasoning in elementary school classrooms. The Alberta journal of educational research, xxxii(2): 109-126.

Trengove, W.E. 1976. Oral proficiency in English as a second language: The designing of scholastic tests. Unpublished D Ed thesis, Stellenbosch: University of Stellenbosch.

Vorster, J.A. 1980. Handleiding vir die Toets vir Mondelinge Taalproduksie (TMT), Pretoria: Human Sciences Research Council.
