This chapter is about the misinterpretation of test results. We discuss disregard of a test's purpose, misunderstanding of what a test actually measures, the imprecision and uncertainty inherent in that measure, the choice of the wrong metric, over-attention to central tendency at the expense of spread, and more. We look closely at the mismeasurement of growth, which is central to evaluating student-level, school-level, and district-level progress. You'll learn about the myth of Lake Wobegon and the history of misunderstood norms on national tests.
This leaderboard compares the rate at which Morgan Hill's class of 2018 enrolled in college with the rates of 15 similar districts. The students we're examining (the cohort) are those who were enrolled as freshmen four years earlier. (Sadly, as of September 2022, this is the most current data available in California.)
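The arithmetic behind such a leaderboard is simple: divide each district's college enrollees by its cohort size. Here is a minimal Python sketch of that calculation using pandas; the district counts below are invented for illustration, not actual California data.

```python
# A minimal sketch of the leaderboard's arithmetic, using made-up
# cohort counts (the numbers below are NOT actual district data).
import pandas as pd

# Each row: a district's class-of-2018 cohort (students enrolled as
# freshmen four years earlier) and how many later enrolled in college.
cohorts = pd.DataFrame({
    "district": ["Morgan Hill", "Gilroy", "Comparison District C"],
    "cohort_size": [950, 880, 1020],          # hypothetical
    "enrolled_in_college": [590, 500, 640],   # hypothetical
})

# College-going rate = college enrollees / cohort size, as a percent.
cohorts["college_rate_pct"] = (
    100 * cohorts["enrolled_in_college"] / cohorts["cohort_size"]
).round(1)

# Sort for a leaderboard-style ranking.
print(cohorts.sort_values("college_rate_pct", ascending=False))
```

Note that the denominator is the freshman cohort, not the graduating class, so students who left before graduating still count against the rate.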
But play with the evidence. Want to compare Morgan Hill students' college choices with those of the neighboring district, Gilroy? Mouse over the “Gilroy” label on the left side of the left-most panel. Gilroy's graduating class of 2018 made different choices, with fewer going to four-year colleges. Now change the subgroup from “all students” to “girls,” then switch to “boys.” Quite a difference.
You’ll see 2nd graders’ reading scores on two tests, given within weeks of each other in the fall of 2019. But you can change that. Go to the settings in the left-most panel. Click on 3rd graders for both the “x” and “y” axes. Now try the bottom setting and select EL students only. What questions come to mind? I can’t help but wonder why some EL students are still classified as EL when their reading scores are far ahead of those of their grade-level peers, sometimes on both tests.
Now mouse over the field itself. Blue dots are students getting regular, tier 1 instruction. Tan/gold dots are students getting tier 2 support in reading. The horizontal axis reflects students’ scores on the Fountas & Pinnell (F&P) benchmark assessment, largely a test of reading accuracy and speed, called a running record. The vertical axis reflects students’ scores on the Measures of Academic Progress (MAP), an interim assessment from the Northwest Evaluation Association (NWEA). It is mainly a measure of reading comprehension.
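To make that encoding concrete, here is a minimal Python sketch that rebuilds the chart's structure from randomly generated scores. The scale ranges, the noise, and the 20 percent tier 2 share are all assumptions for illustration, not the district's data; only the axes and the blue/gold color scheme come from the chart itself.

```python
# A sketch of the scatterplot's encoding, using synthetic scores.
# F&P levels, MAP scores, and the tier 2 share are all invented here.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
n = 120
fp_level = rng.integers(1, 20, size=n)                        # hypothetical F&P scale
map_score = 150 + 2.5 * fp_level + rng.normal(0, 12, size=n)  # loosely related, noisy
tier2 = rng.random(n) < 0.20                                  # assume 20% get tier 2 support

fig, ax = plt.subplots()
ax.scatter(fp_level[~tier2], map_score[~tier2],
           color="tab:blue", label="Tier 1 (regular instruction)")
ax.scatter(fp_level[tier2], map_score[tier2],
           color="goldenrod", label="Tier 2 (reading support)")
ax.set_xlabel("F&P benchmark level (reading accuracy and speed)")
ax.set_ylabel("MAP score (reading comprehension)")
ax.legend()
plt.show()
```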
You can see how much the F&P results disagree with the MAP results by noting the height of the column of dots at any point on the F&P scale. The columns in the middle of the F&P scale are tallest, revealing that the two tests produce very different results for the majority of students in the middle, closest to grade level (according to F&P).
Figure 2.6: Second graders’ F&P benchmark scores plotted against their MAP reading scores, fall 2019.
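One way to put a number on those tall columns: group students by F&P level and measure the spread of their MAP scores at each level. A sketch, again with synthetic scores rather than real ones:

```python
# Quantify the "tall columns": at each F&P level, how spread out are
# students' MAP scores? Synthetic data; real scores would come from
# your assessment exports.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 120
fp_level = rng.integers(1, 20, size=n)
map_score = 150 + 2.5 * fp_level + rng.normal(0, 12, size=n)

scores = pd.DataFrame({"fp_level": fp_level, "map_score": map_score})

# A tall column of dots at an F&P level corresponds to a large standard
# deviation of MAP scores at that level: that is where the two tests
# disagree most about the same students.
spread = scores.groupby("fp_level")["map_score"].agg(["count", "std"])
print(spread.round(1))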