Session Information
09 SES 03 B, Assessing Spelling and Written Composition
Paper Session
Contribution
Assessments are associated with several possible sources of measurement error, and recognizing these is central to a measurement approach to assessment. As Eckes (2011) puts it, “the score a rater awards to an examinee is the result of a complex interplay between bottom-up (performance-driven) processes and top-down (theory-driven or knowledge-driven) processes” (p. 33). He argues that how raters map performances onto the scoring criteria and the rating scale categories is crucial in the rating process.
In the case of teacher-mediated assessments of performance (Lane & Stone, 2006), for example writing proficiency tests (Weigle, 2002), many facets have to be taken into account. An examinee’s chances of getting a high score on a writing task will depend on his or her proficiency, the difficulty of the task, and the characteristics of the raters. “Moreover, the nature of the rating scale itself is an issue” (Eckes, 2011, p. 2).
The purpose of this paper is to investigate the role of rating scales in teacher-mediated assessment of writing proficiency at upper secondary level. The analysis is based on data from a study (Sjöberg, 2012) that examines different aspects of rater variability in teacher-mediated assessments. The focus of this paper is the five rating scales used by the teachers. The question is whether different rating procedures, holistic and analytic, generate the same scores on the essays. The aim is to compare how the rating scales correlate and overlap.
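To illustrate the intended comparison, the following minimal sketch (in Python; the scale labels and score values are hypothetical, not data from the study) computes pairwise rank correlations between a holistic scale and four analytic scales applied to the same essays:

import pandas as pd

# Hypothetical scores assigned by five rating scales to the same six essays;
# the scale labels and values are illustrative only, not data from the study.
scores = pd.DataFrame({
    "holistic":     [4, 3, 5, 2, 4, 3],
    "content":      [4, 3, 4, 2, 5, 3],
    "organisation": [3, 3, 5, 2, 4, 4],
    "vocabulary":   [4, 2, 5, 3, 4, 3],
    "grammar":      [3, 3, 4, 2, 4, 3],
})

# The pairwise Spearman correlation matrix shows how far the scales
# rank the essays in the same order, i.e. how much they overlap.
print(scores.corr(method="spearman").round(2))

High off-diagonal values would indicate that the holistic and analytic procedures order the essays in much the same way; low values would indicate that the scales capture different aspects of the performances.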
The study is conducted within a third-language assessment context in the Swedish educational landscape, where teachers are perceived as trusted professionals. They are responsible for the examination of their own students (Wikström, 2006). National tests are primarily assessed by the students’ own teachers. Anonymizing essays is recommended but not mandatory. Teachers of third languages are provided with an item bank developed by the National Agency for Education (NAE). The grades on these external tests and item bank tests influence the course mark, which is used for admission to higher education. Given this responsibility for high-stakes grading, teachers in Sweden have to be qualified raters. However, they work without any specific rater training or interrater reliability controls such as those adopted by awarding bodies in, for example, England.
In this study, the concept of interrater reliability is based on Stemler’s (2004) understanding. He notes that most research treats interrater reliability as a single, universal concept, which he argues is imprecise. Stemler instead provides a three-class categorization: consensus, consistency, and measurement estimates. Consensus estimates assume that raters should be able to come to exact agreement about the use of scoring rubrics, whereas consistency estimates assume that it is not necessary for raters to have shared meanings, so long as each rater is consistent in his or her own classifications. Measurement estimates, in turn, are used to determine the amount of shared variance. He suggests, for example, Cohen’s kappa for consensus estimates; the Pearson correlation coefficient or the Spearman rank-order coefficient for consistency estimates; and generalizability theory, the many-facet Rasch model, or the factor-analytic technique of principal components analysis for measurement estimates (Meadows & Billington, 2005).
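To make the three classes of estimates concrete, the sketch below (in Python, using scipy and scikit-learn; the two rater score vectors are hypothetical, not data from the study) computes one statistic of each of the first two kinds for a pair of raters scoring the same essays. Measurement estimates such as generalizability theory require dedicated programs (see Mushquash & O’Connor, 2006).

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two raters on ten essays, e.g. on a 1-5 scale;
# illustrative values only, not data from the study.
rater_a = np.array([3, 4, 2, 5, 3, 4, 1, 2, 5, 4])
rater_b = np.array([3, 5, 2, 4, 3, 4, 2, 2, 5, 5])

# Consensus estimate: do the raters agree on the exact score category?
kappa = cohen_kappa_score(rater_a, rater_b)

# Consistency estimates: do the raters order the essays in the same way,
# even if their absolute scores differ?
r, _ = pearsonr(rater_a, rater_b)
rho, _ = spearmanr(rater_a, rater_b)

print(f"Cohen's kappa (consensus):  {kappa:.2f}")
print(f"Pearson r (consistency):    {r:.2f}")
print(f"Spearman rho (consistency): {rho:.2f}")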
Method
Expected Outcomes
References
Bramley, T., & Oates, T. (2011). Rank ordering and paired comparisons: The way Cambridge Assessment is using them in operational and experimental work. Research Matters, 11, 32.
Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational Measurement: Issues and Practice, 14(4), 9–12.
Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt am Main: Peter Lang.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education/Praeger.
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. AQA Research Report.
Mushquash, C., & O’Connor, B. P. (2006). SPSS, SAS, and MATLAB programs for generalizability theory analyses. Behavior Research Methods, 38(3), 542–547.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in Language Testing, Vol. 3, pp. 74–91). Cambridge: Cambridge University Press.
Sjöberg, R. (2012). Teachers’ judgements of national L3 writing tests in Sweden: Rater variability, severity and leniency. Paper presented at the AEA-Europe 13th Annual Conference, Berlin, 8–10 November.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4).
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Wikström, C. (2006). Education and assessment in Sweden. Assessment in Education, 13(1), 113–128.