Session Information
09 SES 03 B, Assessing Spelling and Written Composition
Paper Session
Contribution
Assessments are associated with several possible sources of measurement error, and recognizing these is central to a measurement approach to assessment. As Eckes (2011) puts it, “the score a rater awards to an examinee is the result of a complex interplay between bottom-up (performance-driven) processes and top-down (theory-driven or knowledge-driven) processes” (p. 33). He argues that how raters map performances onto the scoring criteria and the rating scale categories is crucial in the rating process.
In the case of teacher-mediated assessments of performance (Lane & Stone, 2006), for example writing proficiency tests (Weigle, 2002), many facets have to be taken into account. An examinee’s chances of getting a high score on a writing task will depend on his or her proficiency, the difficulty of the task, and the characteristics of the raters. “Moreover, the nature of the rating scale itself is an issue” (Eckes, 2011, p. 2).
The purpose of this paper is to investigate the role of rating scales in teacher-mediated assessment of writing proficiency at upper secondary level. The analysis is based on data from a study (Sjöberg, 2012) that examines different aspects of rater variability in teacher-mediated assessments. The focus of this paper is the five rating scales used by the teachers. The question is whether different rating procedures, holistic and analytic, generate the same scores on the essays. The aim is to compare how the rating scales correlate and overlap.
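To illustrate the intended comparison, the following minimal sketch (in Python; the scale labels and score values are hypothetical, not data from the study) computes pairwise rank correlations between a holistic scale and four analytic scales applied to the same essays:

import pandas as pd

# Hypothetical scores assigned by five rating scales to the same six essays;
# the scale labels and values are illustrative only, not data from the study.
scores = pd.DataFrame({
    "holistic":     [4, 3, 5, 2, 4, 3],
    "content":      [4, 3, 4, 2, 5, 3],
    "organisation": [3, 3, 5, 2, 4, 4],
    "vocabulary":   [4, 2, 5, 3, 4, 3],
    "grammar":      [3, 3, 4, 2, 4, 3],
})

# The pairwise Spearman correlation matrix shows how far the scales
# rank the essays in the same order, i.e. how much they overlap.
print(scores.corr(method="spearman").round(2))

High off-diagonal values would indicate that the holistic and analytic procedures order the essays in much the same way; low values would indicate that the scales capture different aspects of the performances.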
The study is conducted within a third-language assessment context in the Swedish educational landscape, where teachers are perceived as trusted professionals. They are responsible for the examination of their own students (Wikström, 2006). National tests are primarily assessed by the students’ own teachers. Anonymizing essays is recommended but not mandatory. Teachers of third languages are provided with an item bank developed by the National Agency for Education (NAE). The grades on these external tests and item bank tests influence the course mark, which is used for admission to higher education. Given this responsibility for high-stakes grading, teachers in Sweden have to be qualified raters. However, they work without any specific rater training or interrater reliability controls such as those adopted by awarding bodies in, for example, England.
In this study, the concept of interrater reliability is based on Stemler’s (2004) understanding. He notes that most research treats interrater reliability as a single, universal concept, which he argues is imprecise. Stemler instead provides a three-class categorization: consensus, consistency, and measurement estimates. Consensus estimates assume that raters should be able to come to exact agreement about the use of scoring rubrics, whereas consistency estimates assume that it is not necessary for raters to have shared meanings, so long as each rater is consistent in his or her own classifications. Measurement estimates, in turn, are used to determine the amount of shared variance. He suggests, for example, Cohen’s kappa for consensus estimates; the Pearson correlation coefficient or the Spearman rank-order coefficient for consistency estimates; and generalizability theory, the many-facet Rasch model, or the factor-analytic technique of principal components analysis for measurement estimates (Meadows & Billington, 2005).
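To make the three classes of estimates concrete, the sketch below (in Python, using scipy and scikit-learn; the two rater score vectors are hypothetical, not data from the study) computes one statistic of each of the first two kinds for a pair of raters scoring the same essays. Measurement estimates such as generalizability theory require dedicated programs (see Mushquash & O’Connor, 2006).

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two raters on ten essays, e.g. on a 1-5 scale;
# illustrative values only, not data from the study.
rater_a = np.array([3, 4, 2, 5, 3, 4, 1, 2, 5, 4])
rater_b = np.array([3, 5, 2, 4, 3, 4, 2, 2, 5, 5])

# Consensus estimate: do the raters agree on the exact score category?
kappa = cohen_kappa_score(rater_a, rater_b)

# Consistency estimates: do the raters order the essays in the same way,
# even if their absolute scores differ?
r, _ = pearsonr(rater_a, rater_b)
rho, _ = spearmanr(rater_a, rater_b)

print(f"Cohen's kappa (consensus):  {kappa:.2f}")
print(f"Pearson r (consistency):    {r:.2f}")
print(f"Spearman rho (consistency): {rho:.2f}")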
Method
Expected Outcomes
References
Bramley, T., & Oates, T. (2011). Rank ordering and paired comparisons: The way Cambridge Assessment is using them in operational and experimental work. Research Matters, 11, 32.
Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational Measurement: Issues and Practice, 14(4), 9–12.
Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt am Main: Peter Lang.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education/Praeger.
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. AQA Research Report.
Mushquash, C., & O’Connor, B. P. (2006). SPSS, SAS, and MATLAB programs for generalizability theory analyses. Behavior Research Methods, 38(3), 542–547.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in Language Testing, Vol. 3, pp. 74–91). Cambridge: Cambridge University Press.
Sjöberg, R. (2012). Teachers’ judgements of national L3 writing tests in Sweden: Rater variability, severity and leniency. Paper presented at the AEA-Europe 13th Annual Conference, Berlin, 8–10 November.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4).
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Wikström, C. (2006). Education and assessment in Sweden. Assessment in Education, 13(1), 113–128.