Wednesday, April 24, 2013

Standardized Testing: Do It Right or Not At All

There's been a lot of wailing and moaning about high-stakes testing in the schools lately. An article by Valerie Strauss in the Washington Post caught my attention because it criticizes a company I used to work for. Strauss catalogs a long litany of delays, errors committed, and fines paid by Pearson for problems with the administration of tests in schools. Another recent article discusses Pearson's role in jacking up the price of the GED. I even know the president of GED Testing Service; I worked with him 20 years ago.

I didn't work with high-stakes testing for kids; I worked for the companies that grew out of Control Data's PLATO division. They provided computerized certification exams for IT professionals who supported software products from companies like Microsoft, Novell, Oracle, etc. They also provided FAA pilot and mechanic exams, stock broker compliance testing, insurance and real estate sales exams, and so on. When I retired a dozen years ago, the business was quickly morphing into computerized delivery of certification testing for medical professionals and was about to enter the SAT/ACT market.

So I have a bit of inside knowledge about the testing business. I was on the software end of things, and wrote the code that delivered and scored computerized tests. I worked with a lot of exam developers and customers to get their exams into our delivery system and the results out of the back end. But the real work in testing is on the front end: the development of the exams themselves.

There were basically two kinds of customers. The first kind was the testing professional, who insisted on doing things the right way. That involves writing a large bank of test questions ("items") and then trialing the items' performance over several rounds of exams given to a large number (hundreds, if not thousands) of target candidates who span the expected range of knowledge of the subject matter. The quality of each item is then determined statistically by how well it predicts the ability level of the candidate (which has to be assessed separately).

A good item is one that someone with a firm grasp of the subject material gets right and someone who doesn't know the material gets wrong. A bad item has no correlation with subject-matter expertise, and a terrible one has a negative correlation. Bad items contain factual errors, or are poorly written, unclear, misleading, or "trick" questions.
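
One standard way to quantify that correlation is the point-biserial: correlate each item's 0/1 result with candidates' overall scores. Here's a minimal sketch in Python of the flavor of the statistic -- the pilot data is invented for illustration, and this isn't any vendor's actual analysis pipeline:

    import numpy as np

    def point_biserial(item_correct, total_scores):
        # Correlation between a 0/1 item result and each candidate's
        # total score. Strongly positive = knowledgeable candidates tend
        # to get the item right; near zero or negative = a bad item.
        item = np.asarray(item_correct, dtype=float)
        total = np.asarray(total_scores, dtype=float)
        rest = total - item  # exclude the item itself from the total
        return np.corrcoef(item, rest)[0, 1]

    # Invented pilot data: 1 = answered correctly, 0 = missed.
    item_results = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
    totals = [48, 45, 20, 41, 25, 18, 44, 39, 22, 47]  # out of 50
    print(f"discrimination: {point_biserial(item_results, totals):.2f}")

Items whose discrimination hovers near zero, or goes negative, get culled from the bank.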

Another consideration in writing an exam is the number of "forms" you deliver: in many testing regimes people take the exams on different days, so you have to write many different forms of the exam to avoid exposing all the items to the public at once. This is a serious concern because there are well-organized cheating rings in which each person who has just taken an exam does a memory dump of the few questions he was assigned to remember. With a relatively small crew you can completely reconstruct the exam: within a day your test -- and all the answers -- can be out on the Internet.

When you have multiple forms of an exam, it's critical that the forms be equivalent. That is, each form has to be statistically balanced to have the same degree of difficulty, even though not all items on the form are the same. Otherwise the test wouldn't be fair to all takers.
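
As a toy illustration (with invented numbers; real equating is considerably more sophisticated than comparing averages), you can at least sanity-check that two forms have comparable overall difficulty from the items' pilot statistics:

    # Item "p-values" are the proportion of pilot candidates who got
    # each item right (higher = easier). Numbers invented.
    form_a = [0.82, 0.75, 0.64, 0.58, 0.91, 0.70]
    form_b = [0.80, 0.77, 0.62, 0.60, 0.89, 0.71]

    mean_a = sum(form_a) / len(form_a)
    mean_b = sum(form_b) / len(form_b)
    print(f"Form A mean difficulty: {mean_a:.3f}")
    print(f"Form B mean difficulty: {mean_b:.3f}")
    # A large gap means one day's test-takers get an easier exam.
    assert abs(mean_a - mean_b) < 0.02, "forms are not equivalent"

Real equating adjusts the score scales statistically rather than just eyeballing averages, but the fairness requirement is the same.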

This means that if you want to give several alternate forms of a 50-question test to millions of kids across a state or a country, you're going to wind up writing thousands of items, many of which will be discarded because they do not accurately predict ability level.

This is extremely expensive and time-consuming. And it's a never-ending process because of the exposure problem and constantly changing curricula. Companies like Pearson manage item banks with millions of items whose statistical performance is monitored and which are aged out over time.

That brings us to the second kind of customer: the average guy. The average guy thinks you can just jot down some questions and be done with it. That's probably true for teachers who know the kids in their class, where quizzes plus class participation plus daily homework provide a complete picture for the teacher to assign a grade. But you can't write a standardized test that way.

The problem with developing good exams, in my experience, is that people just don't want to pay for it. Their eyes glaze over as you explain that it'll require subject matter experts writing thousands of items; months of testing and retesting the items' performance (you can't change a word of an item -- or even its formatting -- without affecting its stats); analysis of the statistics; and careful construction of equivalent forms. And then things like content balancing (making sure subdisciplines of the exam subject matter aren't under- or overrepresented on a particular form) make the exam developer's job that much harder.
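
To give a flavor of what content balancing adds, here's a deliberately simplified sketch that fills per-topic quotas while steering toward a target difficulty -- the data is invented, and real form assembly is a constrained optimization with many more rules:

    # Invented item bank: each item has a subdiscipline and a pilot
    # difficulty ("p" = proportion of candidates who got it right).
    items = [
        {"id": 1, "topic": "algebra",  "p": 0.70},
        {"id": 2, "topic": "algebra",  "p": 0.55},
        {"id": 3, "topic": "geometry", "p": 0.68},
        {"id": 4, "topic": "geometry", "p": 0.90},
        {"id": 5, "topic": "algebra",  "p": 0.73},
        {"id": 6, "topic": "geometry", "p": 0.71},
    ]
    quota = {"algebra": 2, "geometry": 2}  # content-balancing targets
    target_p = 0.70                        # desired average difficulty

    form = []
    for topic, n in quota.items():
        pool = [it for it in items if it["topic"] == topic]
        pool.sort(key=lambda it: abs(it["p"] - target_p))  # closest first
        form.extend(pool[:n])

    for it in form:
        print(it["id"], it["topic"], it["p"])

Even this toy version hints at why you need a deep bank: every quota you add multiplies the number of usable items required.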

Not surprisingly, school districts are particularly concerned about costs and schedules. They never have enough money, and by the time the legislature appropriates it, the company that's supposed to develop and deliver the test may not have the time to do it right.

In my experience, corporate customers constantly changed their minds and added new requirements, but the schedule never changed. With state-wide tests and requirements coming from dozens of school districts, administrators and meddlesome politicians, the software developer in me would imagine the deadline at the end of the school year to be an all-consuming bottomless pit.

Thus, I'm sure that many of the problems Valerie Strauss cited with Pearson's performance are due to changes their customers demanded at the last minute, or to customers skipping necessary quality-control steps they didn't want to pay for or couldn't fit into the schedule. From personal experience I'm absolutely certain that many of Pearson's alleged problems are really the fault of politicians, school boards, state education commissions and educators themselves.

I'm equally certain that many of the problems are due to sales guys who promised things Pearson didn't have and management who agreed to schedules their technical people told them outright were impossible -- not to mention hardware problems, mistakes in coding and data entry, faulty statistical analysis, mismatches between items and their statistics and/or answer keys, and simple cut/paste errors in item text.

Given the constantly shifting educational priorities and curricula, perennially tight school budgets, and incessant political bickering, I don't see how we'll ever be able to do large-scale standardized testing right, especially not with every state and local jurisdiction trying to reinvent the wheel on its own, and everyone insisting that we do it several times a year.

If we can't spend the time and the money to do standardized testing right, we shouldn't do it at all. With all due respect to my former colleagues, I think we should take the money out of the hands of companies like Pearson and put it back into the schools where it'll do the most good.

1 comment:

Juris Imprudent said...

Education is a responsibility that belongs to the states, not the feds. NCLB was a poorly conceived concept from the start.

No wonder that the two biggest advocates were W and the drunken senator from Massachusetts.