Common Classroom Assessments: Reliability and Validity

A Brief Introduction to Validity, Reliability, and Data

Data in raw form, such as item printouts or completed surveys, is not information until it is analyzed appropriately. Information that can inform and empower our focus on student learning is derived from the analysis and discussion of valid data.

What does reliable and valid mean when analyzing student data?

Think of a measuring device, like the scale in a doctor's office. If a drastically different weight appears each time the same patient steps on the scale, the scale is said to be unreliable. In testing, think of reliability as consistency: a reliable test yields consistent scores.
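The scale analogy can be made concrete with a small sketch. The numbers below are hypothetical readings invented for illustration; the spread (standard deviation) of repeated measurements of the same thing is one simple index of consistency.

```python
import statistics

# Hypothetical repeated weighings of the same patient (in pounds).
# A reliable scale gives nearly the same reading each time;
# an unreliable one does not.
reliable_scale = [150.1, 149.9, 150.0, 150.2, 149.8]
unreliable_scale = [143.0, 158.5, 151.2, 146.8, 160.3]

# Smaller spread across repeated readings means a more reliable instrument.
print(statistics.stdev(reliable_scale))    # small spread
print(statistics.stdev(unreliable_scale))  # large spread
```

The same idea underlies test reliability: if the same student, with the same knowledge, would earn wildly different scores on repeated administrations, the test is inconsistent.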

"Validity" comes in three basic forms: criterion-related validity, construct-related validity, and content-related validity. The type of validity we concentrate on in classroom testing is construct and content-related validity.

Construct-related validity concerns whether an assessment actually performs its intended purpose. If an assessment is designed to indicate a student's skill and knowledge, then do students with a more solid mastery of the assessment-related content score better than those without it?

Content-related validity indicates whether a test's items actually represent the content the test intends to measure. A social studies test that uses long, complex passages within its question items will measure reading comprehension along with students' social studies skills and knowledge, even though reading comprehension is not among the inferences we want to draw.

Validity in Our Classroom Common Assessments

A common assessment will have “validity” if, and only if:
  1. it measures what it is intended to measure
  2. the analysis and conclusions made on the basis of student scores are appropriate and accurate
  3. it serves as a sound predictor of students' performance on external, standards-based assessments

Specific to our common assessments having reliability and validity, consider the following:

  • Each question on a common assessment should correlate to a single, specific curricular indicator.

  • On a particular assessment, there should be multiple questions addressing an indicator.

  • Questions addressing the same indicator should be distributed throughout the assessment rather than clustered together, with items generally progressing from easier at the beginning to more difficult later on.

  • When asked why a question exists on a common assessment (for example, “What are we testing with this question?”), every teacher using the assessment should give the same unambiguous answer.

  • When scoring a common assessment, teacher-specific grading practices need to be suspended. For example, “If a student doesn’t put their name on their test, I automatically deduct 5 points from their score”. Applying such a practice on a common assessment invalidates the data by undermining construct validity: the deduction measures compliance, not the skill or knowledge being assessed.

  • The common assessment should be left intact and be identical for all students in the grade level and/or department. A teacher who decides, “I don’t like this wording, so I won’t be using questions 15-20; I have my own questions for that part of the assessment”, makes valid analysis of the data impossible, because scores are no longer comparable across classrooms.

  • For each open-response item on a common assessment:
    • a scoring rubric should be developed
    • the rubric should clearly outline the criteria for each point value awarded
    • collaborative practice should be established between teachers using the rubric
    • a student’s score on any open-response item ideally will not vary depending on which teacher is scoring their assessment.

Resources Worth Investigating

McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction (2nd ed.).
Boston: Allyn & Bacon.

Popham, W.J. (2000). Modern educational measurement: Practical guidelines for educational leaders
(3rd ed.). Boston: Allyn & Bacon.
