Objectivity means that the grade is independent of the assessor. A distinction is made between:
• Implementation objectivity: To what extent is the way a performance review is carried out independent of the teacher? Would other teachers design the exam in the same way (tasks, time allowed, assistance, aids ...)?
• Evaluation objectivity: To what extent is the evaluation independent of the respective teacher? Would other teachers arrive at the same evaluation result (procedure for correction, determination of correct and incorrect answers, evaluation of sub-steps ...)?
• Interpretation objectivity: To what extent is the interpretation of the results independent of the respective teacher? Would other teachers use similar assessment guidelines and give the same grade for the same performance?
Reliability means that a grade reflects a performance accurately and dependably, without excessive distortion from measurement errors. For example, is the same performance always rated the same at different times?
Validity means that a grade really reflects what it is supposed to reflect in terms of content. With regard to school performance reviews: Does the test for which the grade is awarded really primarily measure the subject-specific competence that is to be measured? A distinction is made between four aspects of validity:
• Content validity: Does the checked content match the content to be measured? Does the exam measure competencies that the students were actually able to acquire in class?
• Prognostic validity: Can correct conclusions be drawn about future performance and learning outcomes from the examination results?
• Concurrent validity (agreement validity): Do the results obtained with different tests agree? This can be called into question, for example, if the grades for oral and written examinations in the same subject are far apart.
• Construct validity: Does the examination take into account, in all its areas and at all levels, the theoretical models (e.g. competency models) of the performance to be examined that are common in the subject-specific discussion?
Procedures for carrying out, evaluating and interpreting examinations should be made explicit, documented and coordinated among the teaching staff.
Joint training for the teaching staff helps ensure that these procedures are consistently applied.
Evaluation and assessment of examination performances should always be carried out in clearly separated steps.
Comparative work or standardized tests (e.g. VERA) offer the opportunity to review and, if necessary, improve your own examination and assessment practice.
Exams are more reliable the more tasks they include.
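The link between test length and reliability can be illustrated with the Spearman-Brown formula from classical test theory. The source does not name this formula, so using it here is an assumption, and the starting reliability of 0.50 is a hypothetical value chosen only for illustration. The formula also assumes that the added tasks are comparable in quality to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by `length_factor`.

    Spearman-Brown formula: r_new = k * r / (1 + (k - 1) * r),
    where k is the ratio of new to old test length.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical starting point: a short exam with reliability 0.50.
base_reliability = 0.50
for factor in (1, 2, 4):
    predicted = spearman_brown(base_reliability, factor)
    print(f"{factor}x as many tasks -> predicted reliability {predicted:.2f}")
```

Under these assumptions, doubling the number of comparable tasks raises the predicted reliability from 0.50 to about 0.67, and quadrupling it raises the value to 0.80.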
Possible measurement errors should be known and reduced as much as possible or at least taken into account when making decisions (see also measurement errors in the award of grades and countermeasures).
Improving validity
In non-language subjects (e.g. general subjects), the language proficiency requirements in performance reviews should deliberately be kept low, and under no circumstances should linguistic performance be included in the assessment.
At least some exams should be designed so that memory performance and working speed play only a minor role.
It is important to make exams as anxiety-free as possible.
It is helpful to occasionally use school performance tests as a control.
One should keep in mind possible disruptive factors and distortions (see also factors influencing the award of grades).
One should minimize so-called errors of judgment (see also errors of judgment in the award of grades and countermeasures).
Examinations should be closely tied to the preceding and subsequent lessons.
In order to improve concurrent validity (agreement between different examinations), examination situations (e.g. in front of the class, alone), examination forms (written, oral, practical) and task forms (e.g. open, closed) should be as varied as possible.
Requirements in the examination should reflect the previous lesson proportionally in terms of content and form (e.g. form of tasks, answer format): What took up a lot of space in the lesson should also take up a lot of space in the examination.
In order to ensure sufficient prognostic validity, the importance of the examination content for the future learning process must be taken into account: What will be taken up again and again in the future and assumed as a basis and what is less important?
Results on objectivity
Objectivity is given when measurements are independent of who is performing them. This is only partially the case with grades. For example, in one study, 73 teachers were asked to rate the same essay on a scale from 0 to 100 (Brimi, 2011). The ratings fluctuated between 50 and 96 points.
Similar results can be found for the subject of mathematics; here, too, performance assessment is no more objective (Ingenkamp & Lissmann, 2005). Still, in many studies in which multiple teachers graded the same achievement, most of the ratings are fairly close.
The degree of evaluation agreement can generally be expressed using a coefficient with a range from 0 “no agreement” to 1 “perfect agreement”.
For school grades, values in the range from 0.35 to 0.85 are achieved (for comparison: the values for intelligence tests are around 0.95 to 0.99; Sacher, 2014).
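How such an agreement coefficient might be computed for two teachers grading the same papers can be sketched as follows. The source does not say which coefficient the cited values are based on, so both measures below (simple percent agreement and Cohen's kappa) are assumptions chosen for illustration, and the grade lists are hypothetical data. Unlike the 0 to 1 range described above, Cohen's kappa can fall below 0 when agreement is worse than chance.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which both raters gave exactly the same grade."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of items")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of both raters' marginal grade frequencies.
    expected = sum(
        counts_a[g] * counts_b[g] for g in set(rater_a) | set(rater_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always gave the same single grade
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades (1 = best, 6 = worst) from two teachers for ten papers.
teacher_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
teacher_b = [1, 2, 3, 3, 3, 4, 4, 4, 5, 5]

print(f"percent agreement: {percent_agreement(teacher_a, teacher_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(teacher_a, teacher_b):.3f}")
```

For this made-up data the two teachers agree on 7 of 10 papers, and the chance-corrected coefficient is lower (0.625), which shows why raw agreement rates tend to look more favorable than corrected ones.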
Teachers vary greatly in the accuracy and reliability of their grading (Brookhart et al., 2016). This applies in particular to the evaluation of average performance, while particularly good or particularly poor performance is graded more reliably overall (Sacher, 2014).
Reliability can be expressed on a scale from 0 “not at all reliable” to 1 “perfectly reliable”. The reliability of the evaluation of written performance lies between 0.50 and 0.80; that of oral performance is significantly lower (below 0.50; for comparison: the values for intelligence tests are around 0.80 to 0.95; Sacher, 2014). Overall, grade averages (e.g. the mean certificate grade or the average of all subject grades in a year) are more reliable than individual grades.
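What these reliability values mean for a single grade can be sketched with the standard error of measurement from classical test theory, SEM = SD * sqrt(1 - reliability). The source does not use this formula, and the standard deviation of 1.0 grade points is a hypothetical value, so the numbers below are illustrative rather than empirical.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if sd < 0 or not 0.0 <= reliability <= 1.0:
        raise ValueError("sd must be >= 0 and reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


ASSUMED_SD = 1.0  # hypothetical spread of grades, in grade points

# Reliability bounds for written performance as reported in the text.
for label, reliability in [("written, lower bound", 0.50),
                           ("written, upper bound", 0.80)]:
    sem = standard_error_of_measurement(ASSUMED_SD, reliability)
    print(f"{label}: reliability {reliability:.2f} -> SEM {sem:.2f} grade points")
```

Under the assumed spread, even the upper reliability bound for written work leaves an uncertainty of roughly 0.45 grade points around an individual grade, which is one way to see why averages over several grades are more dependable.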
Results on validity
Teacher performance reviews are not an accurate reflection of student performance. On average, the correlation between teachers' assessments and the performance actually achieved is 0.63 on a scale from -1 ("inverse relationship") through 0 ("no relationship at all") to 1 ("perfect relationship") (Südkamp, Kaiser & Möller, 2012). This means that only about 40% of the differences in the teachers' assessments can be explained by actual differences in performance among the students and that, to a considerable extent, factors other than performance influence the award of grades (see also factors influencing the award of grades).
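The figure of about 40% follows from squaring the correlation, on the usual reading of a squared correlation coefficient as the proportion of shared variance. The symbol r is introduced here only for illustration:

```latex
\[
  r = 0.63 \quad\Longrightarrow\quad r^{2} = 0.63^{2} = 0.3969 \approx 0.40
\]
```

The remaining share, roughly 60%, is the part of the variation in teachers' assessments that is not accounted for by measured performance.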
The prognostic validity of grades (the correctness of predictions of future performance based on current grades) is also quite modest overall. A satisfactory prognostic validity was determined for the school career recommendation after the fourth school year, because the teacher recommendation has proven itself for the majority of students (Scharenberg, Gröhlich, Guill & Bos, 2010). School grades, however, are only partially suitable for predicting future academic or professional performance.
The correlation between the high school diploma grade and grades at university averages 0.34 (Bachelor) and 0.25 (Master) (Trapmann, Hell, Weigand & Schuler, 2007). Findings for the English-speaking world are very similar (Geiser & Santelices, 2007).
The relationship between school grades and professional performance is lower and lies between 0.16 and 0.30 (Gasser, 2014; Roth, BeVier, Switzer & Schippmann, 1996).
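To make these correlations easier to interpret, they can be converted into the proportion of shared variance by squaring them. The sketch below applies that conversion to the values reported in this section. The dictionary labels are hypothetical names introduced for illustration, and treating r squared as "explained variance" is the standard but simplifying reading.

```python
# Correlations as reported in the text; labels are illustrative only.
reported_correlations = {
    "school-leaving grade vs. Bachelor grades": 0.34,
    "school-leaving grade vs. Master grades": 0.25,
    "school grades vs. professional performance (lower bound)": 0.16,
    "school grades vs. professional performance (upper bound)": 0.30,
}


def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


for label, r in reported_correlations.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.1%}")
```

On this reading, school grades account for roughly 12% of the variation in Bachelor grades and for between about 3% and 9% of the variation in professional performance.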