An Analysis of a Writing Test in a Chinese Senior High School


倪庆荣

Abstract: This paper analyzes the design and implementation of a routine writing test in a senior high school. In line with theories of test evaluation, an effective writing test should seek a balance among the six qualities of test usefulness, which will be discussed in this paper; thereafter, pedagogical implications will be drawn, with an emphasis on how a test affects teachers in changing their teaching approaches and motivates learners to achieve higher goals in their writing proficiency.

Key words: evaluation of the writing test; English writing proficiency; design and development of a test

[CLC Number] G63

[Document Code] A

[Article ID] 1006-2831(2013)05-0089-7 doi: 10.3969/j.issn.1006-2831.2013.02.025

Test purpose: The purpose of this test is to assess the students' ability to communicate ideas in written English and to ensure that the second-year students in Jin Ling High School have reached the English writing proficiency levels specified in the new English curriculum (Ministry of Education of the PRC, 2005).

Length and administration: It is a final achievement test administered in the school by the English teachers. Approximately 600 second-year students took the test at the end of the semester, in July 2010. The whole test took 90 minutes to complete, of which the writing section was allotted approximately 20 minutes.

Scores: The writing section carries 25 points within the whole test paper, and the score for this item is also reported separately.

Test developers: The raters are seven experienced teachers who have interpreted many scoring scales in similar situations. Sample writing tasks were collected from three experienced English teachers. The final test, including the scoring scale, was piloted by four other teachers and was finalized after discussion and negotiation.

1. General description

In line with the course, the test contains one topical task which is relevant to the syllabus of the NMET (National Matriculation English Test). The test is not very challenging, since this type of task has been used frequently in recent years, especially in formative and mock exams guided by the NMET syllabus. However, embedding pictures in the writing task is a tentative practice of only the last five years. Therefore, how much impact this test method has on teaching and learning has not been fully investigated.

Test task: The content of the target writing should include the following aspects: an effective description of the designated pictures; an accurate account of the main points about the advantages and disadvantages of intercity high-speed trains; and the writer's personal point of view on the issue.

Score interpretation and the rating scales of the test are given in detail in the Appendix.

2. Evaluation of the writing test

The following evaluation of this test is based on Bachman and Palmer's view of test usefulness. They maintained that "the most important consideration in designing and developing a language test is the use for which it is intended, so that the most important quality of a test is its usefulness" (1996: 17). They define test usefulness by means of six qualities: reliability, construct validity, authenticity, interactiveness, impact and practicality. The author will review the writing test from these six aspects in turn.

2.1 Reliability

Reliability can be defined as "consistency of measurement across different characteristics or facets of a testing situation" (Weigle, 2002: 49). The reliability of this test is examined mainly in terms of its stability, together with other factors such as the raters in the complex rating process and the variables of the test task.

First of all, let's look at the stability reliability of this test. According to Popham, stability refers to "consistency of test results over time" (2008: 30), so the degree of similarity between two different sets of scores can be reflected by a correlation coefficient. In the statistical report based on the students' scores on this test, the ranking order of the average scores of the 15 classes was compared with the results of three previous exams over three months. The two sets of results showed a correlation coefficient of 0.9, which indicates a strong relationship between the relative test performances on different occasions.
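To make this stability check concrete, here is a minimal Python sketch of such a rank-order comparison. The paper reports only the final coefficient of about 0.9, so the class averages below are hypothetical, and a single previous exam stands in for the three actually used.

```python
# A minimal sketch of the stability check described above.
# All class-average scores are hypothetical; the paper reports
# only the resulting coefficient (about 0.9).
from scipy.stats import spearmanr

# Hypothetical mean writing scores (out of 25) for the 15 classes
# on this test and on one exam taken three months earlier.
current_exam  = [18.2, 17.5, 19.1, 16.8, 17.9, 18.6, 16.2, 17.1,
                 18.9, 15.8, 17.4, 18.0, 16.5, 19.3, 17.7]
previous_exam = [17.8, 17.2, 18.7, 16.5, 18.1, 18.3, 16.0, 17.4,
                 18.5, 15.5, 17.0, 17.9, 16.9, 19.0, 17.3]

# Spearman's rho compares the two ranking orders; a value near 1
# indicates stable relative performance across occasions.
rho, p_value = spearmanr(current_exam, previous_exam)
print(f"rank correlation: {rho:.2f} (p = {p_value:.4f})")
```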

However, we should also consider the raters as an unstable element when measuring reliability. In this test, since all the compositions, on a large scale, are rated by seven different raters, it is hard for every test-taker to receive the same mark each time from different raters (Bachman & Palmer, 1996). Therefore, inter-rater reliability remains uncertain. Another unavoidable risk to reliability lies in the scoring procedure, which the author will discuss in detail in the next section on construct validity.
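One common way to make inter-rater reliability less vague is to have all seven raters score the same small set of anchor essays and correlate their scores pairwise. The paper reports no such data, so the scores in the sketch below are invented purely to illustrate the procedure.

```python
# A hedged sketch of a pairwise inter-rater consistency check.
# Rows: the seven raters; columns: five shared anchor essays,
# each scored out of 25. All numbers are hypothetical.
import numpy as np

scores = np.array([
    [20, 15, 18, 12, 22],
    [19, 14, 17, 13, 21],
    [21, 16, 19, 12, 23],
    [18, 13, 16, 11, 20],
    [20, 15, 18, 14, 22],
    [17, 12, 15, 10, 19],
    [19, 15, 17, 12, 21],
])

# np.corrcoef correlates the rows; the mean of the off-diagonal
# (upper-triangle) values is a rough index of how consistently
# the raters rank the same essays.
corr = np.corrcoef(scores)
pairwise = corr[np.triu_indices(len(scores), k=1)]
print(f"mean inter-rater correlation: {pairwise.mean():.2f}")
```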

Another problematic area relates to the variables embedded in the writing task itself, which should also be taken into account by the test developers (Weigle, 2002). The relevant aspects of this test are: the topic, the expected response mode of the discourse, and the number of writing samples to be provided by test-takers. First, the topic is chosen from a context quite familiar to the candidates. Intercity high-speed trains have been a hot issue among citizens, and even uninterested students can pick up some information from the public or from TV programmes. Furthermore, the teachers, who often act as raters, have emphasized similar tasks in their course-based teaching activities. So the test developers expect the test to be reliable, since all the candidates appear to have overall schemata for the test topic. However, only one essay sample is elicited by the writing task, and test-takers are required to write it in a persuasive and communicative mode. In this case, students who are better at identifying and analyzing the main points of the test content have an advantage in performing the task, whereas individuals who are less proficient in this respect may receive a negative result because of the genre. Consequently, the ranking order of scores might change if the target writing genre or mode were changed. Therefore, it is difficult to generalize from the test scores to a consistent estimate of the same individual's English proficiency level.

To sum up, this test shows some consideration of three aspects of reliability: the stability of the test, potential risks in the test raters, and the test task itself. However, test developers and test takers can hardly achieve their expected goals in test reliability because of its complex nature: reliability is influenced by various factors determined not only by the test itself but also by the test developers, test takers, test raters and even a wide range of social and contextual involvements. The next quality, construct validity, will be considered together with reliability in measuring the overall usefulness of this test.

2.2 Construct validity

Construct validity refers to "the meaningfulness and appropriateness of the interpretations that we make on the basis of test scores" (Bachman & Palmer, 1996: 21). Weigle (2002) describes construct validation as the process of determining to what extent a test actually measures what it is intended to measure.

Based on this concept, the scoring scale in this test presents an explicit statement about the expected performance of language skills in writing. However, it is often noted that scales can hardly achieve high validity, since a scale cannot describe performance adequately (Lumley, 2002). Not all the responses elicited by this test task can be matched to the scoring scale; similarly, some abstract points in the scale are not precisely related to the sample writings.

Besides the inconsistency mentioned above, there remains another problem: during the rating procedure, the test raters were seven different teachers. As mentioned in the previous section, this poses threats to both reliability and construct validity. Though the teachers share a common understanding and interpretation of the same scoring scale (Lumley, 2002), they still inevitably applied their personal frames for judging an essay, especially when the test text and the criteria did not match explicitly (Vaughan, 1991). Such frames may have led to unconscious bias or preferences which threaten construct validity. For example, one of the raters reported preferring logical organization in the writing samples; consequently, she cared less about the accuracy of spelling and grammar. A high mark from her therefore could not indicate a high proficiency level in all aspects of language ability. According to Neil, two of the raters in fact rated more severely than the others. This may lead to unreliability, because a candidate would not receive the same score from different raters.
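A severity check like the one implied here can be as simple as comparing each rater's mean awarded score with the group mean; raters whose averages fall well below it are candidates for retraining or score moderation. All figures in this sketch are hypothetical.

```python
# A minimal sketch of a rater-severity check. The mean score each
# rater awarded across the same pool of essays is hypothetical.
rater_means = {
    "rater 1": 17.8, "rater 2": 17.5, "rater 3": 18.1, "rater 4": 17.9,
    "rater 5": 17.6, "rater 6": 15.9, "rater 7": 16.1,
}

grand_mean = sum(rater_means.values()) / len(rater_means)

# Flag raters whose averages sit more than one point below the
# group mean; the one-point threshold is an assumption.
for name, mean in sorted(rater_means.items()):
    if mean < grand_mean - 1.0:
        print(f"{name}: mean {mean:.1f} -- rates more severely than average")
```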

Finally, the author should point out that it is obviously helpful if teachers often clarify or discuss the test criteria with the students. Some candidates who received high marks in this test stated that their teachers often highlighted the scoring scales related to writing tasks. They all considered this practice beneficial, because they were motivated by the explicit language targets and focused more on the relevant language skills in their learning.

To sum up, safeguarding construct validity initially requires training the raters in a strict and systematic way. Though the scoring scale for this test has long been used by the raters, they should still be trained from new perspectives to recognize each potential threat to construct validity. On the other hand, as Weigle (2002) maintained, the consistency between each specific rating scale and each individual's score should provide effective feedback for teachers. They should adjust their scoring scale to different situations and improve the validity of the scoring procedure as well. That is to say, we should realize that test validation is an "on-going process" (Bachman & Palmer, 1996: 22), since the interpretation of scores can never be proved definitively valid.

2.3 Authenticity

Bachman & Palmer (1996: 23) defined authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language use (TLU) task".

With this concept in mind, the topic of this writing test was quite close to real life. The persuasive mode is often used for discussion in classrooms, and the communicative language use there is quite similar to the test task. However, in a classroom discussion or a real-life conversation, students do not have designated pictures to describe, and the sentence structures carrying the main points could be more casual or personal. Teachers found that some students who received a good mark in this writing test nevertheless could not express themselves fluently in real communication in or out of the classroom.

In this area, the teachers tried to emphasize the choice of relatively communicative test tasks, because they intended to build some consistency between their teaching activities and the test tasks, which could motivate the candidates to participate in learning activities more actively. This point is also asserted in the formative assessment criteria of the new English curriculum. However, in this achievement test many linguistic items specified in the course have to be included, and those linguistic requirements cannot meet an authentic expectation of the real world (Spence-Brown, 2001). Generally speaking, full authenticity cannot be a practical goal of this test.

2.4 Interactiveness

Bachman & Palmer (1996: 25) defined interactiveness as "the extent and type of involvement of the test takers' individual characteristics in accomplishing a test task". It reveals an inner relationship between the performance of a test-taker and the test task itself. In this area, the author will review interactiveness in terms of the test takers' language knowledge, strategic competence, topical knowledge and affective schemata.

In the first place, the test provides the candidates an opportunity to demonstrate their comprehensive competence through writing tasks: organizing their essays, writing accurate and fluent sentences, and identifying and analyzing main points of view.

Another strength of this test is its function in motivating students to learn more autonomously. The new English curriculum in China asserts that assessment should help students become more aware of learning strategies. An overall awareness of writing strategies can be developed through the analysis of sample writings, which indicates that this test task encourages an active use of various writing strategies.

Another advantage related to interactiveness is the candidates' topical knowledge. As discussed before, the topic of the test is set in a situation in which the candidates have covered relevant topics before. This reduced their anxiety during the writing process, which could have helped them perform the writing task better.

However, there remains a problematic area related to possible negative affect. Some students reported that they had no interest in the topic, so when they found that the topic was familiar to their peers while they themselves lacked this schematic advantage, they felt anxious and panicky. As Kunnan (2004) asserted in his framework of fairness, this topic may have introduced a slight bias against some candidates. It should also be noted that the pressure of testing time may have caused anxiety for candidates who were not proficient writers.

2.5 Impact

Impact of tests can be defined as "the effect that tests have on individuals (particularly test takers and teachers) and on large systems, from a particular educational system to the society at large" (Weigle, 2002: 53-54). In line with this definition, the term washback is often used in this area to discuss the direct impact on test takers and teachers in particular (Bachman & Palmer, 1996).

Similar to most achievement tests in the Chinese testing situation, this test has a strong washback effect on both students and teachers.

The first significant impact is to motivate the students to participate in learning activities more actively and autonomously. It encourages them to reflect on their strengths and weaknesses in English writing competence and to improve their strategies in English writing.

Interestingly, this perspective is also expressed in Brown and Hudson's (1998) concept of alternative assessment, which often presents classroom tasks involving "meaningful instructional activities" (1998: 654-655). Therefore, potential pedagogical changes, both in writing strategies and in the guiding of classroom activities, could also be adjusted or improved with new techniques (Qi, 2005).

But there remains negative washback in this test. As Qi (2005) maintained, too strong an orientation to tests weakens the intended washback effect to a degree. For example, the writing task in this test is not challenging for students at a high English proficiency level. They complained that they did not have an opportunity to display their large vocabulary and rich knowledge of arguments. Possibly, in the following period, their intrinsic motivation to read beyond the test task may be restricted.

Thus, teachers and test-takers should look carefully at the washback of each specific test and try to achieve more positive effects from tests for future teaching and learning activities.

2.6 Practicality

We can define practicality as the relationship between the resources required for the design, administration and scoring of the test and the resources available for these activities (Bachman & Palmer, 1996). Thus, in determining the practicality of a designated test we should take into consideration three factors: human resources, material resources, and the time needed for designing tasks, administering the test and scoring. This test is relatively easy to operate because it is a common achievement test, which has been administered many times in the academic year. The teachers are experienced in this type of design and scoring process. The students have ready access to the relevant materials, which come both from the textbook and from other reading materials available online or on their selective reading lists.

As construct validity requires, data should be collected from a large number of writing samples to analyze reliability and to measure the specific language skills reflected in a given test. However, it is very difficult to achieve practicality in this area.

On the whole, one of the most difficult aspects of this test is the time-consuming nature of writing tasks. The teachers therefore find it hard to judge a student's continuous development of knowledge through this test, since such tests cannot be held frequently over time.

3. Conclusion

China has a large population of senior high school students who have to take numerous tests before the ultimate NMET. Since the NMET, with its distinct Chinese characteristics, cannot be changed at present, the relative fairness of testing as a competitive selection function has been widely accepted in China for many years (Cheng, 2008). Teachers and test developers call for greater consistency of tests with the syllabus and curriculum, and strive for higher quality in each test. However, the selection function of the NMET has largely constrained the design and development of a test. It is hard to interpret or measure how a routine writing test affects teachers in changing their teaching approaches, or whether the test motivates learners to achieve higher goals in their writing proficiency. Furthermore, teachers' overall language use abilities and teaching methods should be reinforced in the Chinese situation. Pursuing this further, teachers should have the ability to select suitable teaching materials in preparation for assessments, and flexible testing methods should be negotiated to meet students' interests and various learning situations. An effective writing test should build up students' confidence and generate sufficient feedback about students' strengths and characteristics in their learning process; similarly, teachers' strengths and weaknesses in teaching methods should also be reflected by the test and the testing method. As a whole, a writing test should try to balance the six qualities of usefulness mentioned above, or at the very least it should not stray far from its particular EFL learning and teaching situation.

References:

Bachman, L. F. & A. S. Palmer. Language Testing in Practice[M]. Oxford: Oxford University Press, 1996.

Brown, J. D. and T. Hudson. The alternatives in language assessment[J]. TESOL Quarterly, 1998(32): 653–675.

Hamayan, E. V. Approaches to alternative assessment[J]. Annual Review of Applied Linguistics, 1995: 21-26.

Hughes, A. Testing for Language Teachers[M]. Cambridge: Cambridge University Press, 1989.

Lumley, T. Assessment criteria in a large-scale writing test: what do they really mean to the writers?[J]. Language Testing, 2002(3): 246-276.

Weigle, S. C. Assessing Writing[M]. Cambridge: Cambridge University Press, 2002.

Appendix 1: The Test Paper

Part 5 Writing (weighting: 25%; score: 25) Time: 20 minutes

On July 1st, China opened the intercity high-speed train between Shanghai and Nanjing.

Please describe the following pictures; state the arguments from the public on the advantages and disadvantages of the intercity high-speed train (each argument should contain at least two main points of view); your personal view must be included at the end. The beginning has been given and is not included in the word count. The word limit is approximately 150 words.

Appendix 2: The Rating Scale

The rating criteria for this test are based on the New English Curriculum and its consistent requirements for assessment. They are described as follows:

⑴ Firstly, scores are assigned on the basis of five grade bands, each of which is expected to indicate differences in test-takers' language competence. According to the test criteria, an initial score within a specific band should be given based on an overview of content, organization and language. Thereafter, the raters should note the specific requirements of each band and make adjustments before the final score is determined.

⑵ Secondly, word count should be taken into consideration: marks should be reduced by 1 or 2 points if the number of words is below 130 or above 170 (a small sketch of this rule follows the list).

⑶ Thirdly, the writing should include proper topic sentences and cohesion between paragraphs. Proper written expression and coherence of the text should be noted.

⑷ Furthermore, appropriate organization, together with the accuracy and fluency of vocabulary and grammatical structures, should be examined.

⑸ Finally, accuracy of spelling and punctuation should be reflected in the marks. The deduction should depend on how much negative impact the errors have on the communicative function. In the same way, if the handwriting obstructs the conveying of meaning to the audience or raters, the grade can be scaled down to a lower band.

⑹ In addition, it should be noted that both American and British spellings are accepted.
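As a small illustration of the word-count rule in item ⑵ above, the sketch below deducts marks outside the 130-170 band. The scale says "1 or 2 points" without fixing which, so the size of the deduction is left as a parameter and the default of 1 is an assumption.

```python
# A minimal sketch of the word-count rule in item (2) of the scale.
# Whether 1 or 2 points are deducted is the rater's choice, so it
# is passed in as a parameter; the default of 1 is an assumption.
def word_count_penalty(word_count: int, deduction: int = 1) -> int:
    """Points to deduct for an essay of `word_count` words."""
    if word_count < 130 or word_count > 170:
        return deduction
    return 0

print(word_count_penalty(155))               # 0 -- within the 130-170 band
print(word_count_penalty(120))               # 1 -- below the 130-word floor
print(word_count_penalty(180, deduction=2))  # 2 -- above the ceiling
```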
