TEACHERS FORUM

“Only teachers really know what to teach and how to test it.”

An Interview with James Dean Brown

Professor Brown is from the Department of Second Language Studies, University of Hawai’i at Manoa. He is the author of 16 books and 12 monographs on testing and assessment, as well as 12 published and institutional test batteries, and is a leading scholar in this field in America. He has presented all over the world and has received numerous awards for his outstanding work. Professor Brown came at the invitation of the Russian Ministry of Education to work with Russian colleagues from January 19 to 30, 2004.
This interview was conducted by Alyona Gromoushkina after the seminar Professor Brown held at MSU on January 26.

A test writer takes on great responsibility when making a test. Sometimes the students’ lives depend on passing the test. What is it like to be a test writer? How do you feel?

Being a test writer makes me very careful. It makes me take many steps to do the job as well as I possibly can. I don’t just write a series of items and give them as a test. Instead, I write items, then get feedback from people about the quality of the items, and I revise them, and then I get more feedback, and I revise them again, and I make them as good as I possibly can. Then I pilot them with a group of students similar to the group I am designing the test for, in such a way that I can analyze the test item by item, and even option by option, and see what happened during the piloting. Who answered correctly in the upper group, the middle group, and the bottom group? Who, in the upper, middle, and bottom groups, picked which options? And so forth.
Then I select the items that are going to do the job that I need done. If it’s a standardized test, say for proficiency testing or placement purposes, then I’m looking for items that spread people out; so I select those items that the upper students answer correctly and the bottom students answer incorrectly. Those are the items that spread people out. Then I get rid of the items that aren’t working, that aren’t spreading people out, and create a new revised version of the test. That way my test and its quality are maximally improved before I ever use the test to make decisions that will affect students’ lives in important ways.
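The selection step described above can be sketched in code. This is only a minimal illustration, not Professor Brown's actual procedure: the function names, the one-third split into upper and bottom groups, the 0.3 cut-off, and the pilot data are all assumptions made for the example.

```python
# Hypothetical pilot data: one row per student, one column per item (1 = correct).
pilot_responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
]


def item_statistics(responses, group_fraction=1 / 3):
    """Return facility and upper-minus-lower discrimination for each item."""
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    stats = []
    for i in range(n_items):
        # Facility: proportion of all students answering the item correctly.
        facility = sum(row[i] for row in responses) / n_students
        p_upper = sum(row[i] for row in upper) / k
        p_lower = sum(row[i] for row in lower) / k
        stats.append(
            {"item": i, "facility": facility, "discrimination": p_upper - p_lower}
        )
    return stats


def select_items(stats, min_discrimination=0.3):
    """Keep the items that spread students out (assumed threshold)."""
    return [s["item"] for s in stats if s["discrimination"] >= min_discrimination]


stats = item_statistics(pilot_responses)
kept = select_items(stats)
print(kept)
```

In this made-up data set, the item that everyone answers correctly and the item that only a bottom-group student answers correctly are both dropped, because neither separates the upper group from the bottom group.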
One other important point should be made here: in the US, we always try to use multiple sources of information in our decision-making processes. For example, a university admission decision might be based in part on a test score, but other information will also be used – information like the student’s grade point average in high school, the quality of the school, the types and number of extracurricular activities the student participated in, the number of advanced placement courses the student took, the quality of the student’s statement of purpose, and so forth. A single test score would never be used to make such an important decision in America because we know that multiple observations are more reliable than any single observation.

How do you select students for piloting? Do you know them beforehand?

I am looking for students who are like the ones I’m going to test for decision-making purposes. Let me describe one way this can be done. Let’s say we need to administer twelve different tests a year. In such a situation we can put experimental items on each test. Say, for every randomly selected thousand students, we put in five experimental items that are not scored – they are just in there, and the students don’t know which ones they are. So if one hundred thousand people are taking the test, that’s five items for every thousand, times 100 (the number of thousands), for a total of 500 items that are piloted. We can then do item analysis and determine exactly what the difficulty level of each item is and the degree to which it spreads the students out. That is, we can record those statistical characteristics along with what each item is designed to test and “deposit” these 500 items into an item bank from which we can draw the items we want, in terms of what they are testing and how they are working statistically, for future versions of the test. And then, whenever we need to, we can build a test. So if we want a test that has X, Y, and Z structures, we go to our computer and pick out the items that test those structures and are working well statistically, and then we build the test just the way we want it. Naturally, we also monitor the results of the test afterwards, when we actually administer the test for decision-making purposes, to see what actually happened statistically. And we find that, yes, it worked well, but this item wasn’t quite right and that one could have been better, so we learn from the experience, and start building the next test.
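The arithmetic and the item-bank idea in this answer can be sketched as follows. It is a rough illustration under stated assumptions: the `BankedItem` record, the function names, the 0.3 discrimination threshold, and the bank contents are hypothetical and are not taken from any real testing program.

```python
from dataclasses import dataclass


@dataclass
class BankedItem:
    item_id: str
    structure: str         # what the item is designed to test
    facility: float        # proportion of pilot students answering correctly
    discrimination: float  # how well the item spreads students out


def piloted_item_count(test_takers, group_size=1000, items_per_group=5):
    """Five unscored items per thousand test takers, as in the example."""
    return (test_takers // group_size) * items_per_group


def build_test(bank, structures, per_structure, min_discrimination=0.3):
    """Draw the best-discriminating banked items for each wanted structure."""
    chosen = []
    for structure in structures:
        candidates = [
            item
            for item in bank
            if item.structure == structure
            and item.discrimination >= min_discrimination
        ]
        candidates.sort(key=lambda item: item.discrimination, reverse=True)
        chosen.extend(candidates[:per_structure])
    return chosen


# Hypothetical bank contents, for illustration only.
bank = [
    BankedItem("A1", "articles", 0.55, 0.45),
    BankedItem("A2", "articles", 0.90, 0.10),
    BankedItem("P1", "past tense", 0.60, 0.50),
    BankedItem("P2", "past tense", 0.40, 0.35),
    BankedItem("C1", "conditionals", 0.30, 0.40),
]

total_piloted = piloted_item_count(100_000)
new_test = build_test(bank, ["articles", "past tense"], per_structure=1)
print(total_piloted, [item.item_id for item in new_test])
```

Storing the statistics next to a label for what each item tests is what makes it possible to assemble a new form by content and by statistical behaviour at the same time.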

So there is much work after the test… not only to write the test, but scoring and analyzing…

Actually writing a test is really just the beginning. Though unfortunately, it’s the last step in some countries. That’s a mistake. There should be careful piloting of a relatively large number of items and item-by-item analysis of the pilot results, followed by production of a shorter, more efficient final version of the test. I’ve seen tests that were designed without all those steps and they created a real mess: answer keys that made no sense, two or three possible answers, items that no native speaker could ever answer … a real mess. The danger is that such tests are used to make important decisions about children’s lives, young people’s futures, and on the basis of what? On the basis of items that can only be described as a mess.

There are three key concepts: testing, evaluation and assessment. What is the difference?

People use these terms in different ways. I’ll explain my specific way of using them. I use “testing” only for measurements that we use to determine students’ knowledge about language or their language abilities, whether criterion-referenced or norm-referenced measurements.
I use “evaluation” only when I’m referring to program evaluation: the quality of the teaching and/or the program in general. We use program evaluation to answer questions like: are we meeting the needs of the students; are the objectives adequate; are the test measurements effective; is the teaching good? And … around and around we go. That’s evaluation to me.
“Assessment” is a broader term than “testing” to me. It involves testing, of course. But it also involves other aspects of grading, so it would be all of the information that teachers use: attendance records, class participation, all of that.

What is your favourite kind of test?

I like cloze tests. Cloze tests are very complex. So, they fascinate me. I’ve been spending much of my last twenty-five years trying to figure out what they do, and in the process, I’ve published 15 or so articles on the topic.

You use the term “assessment literacy”. What does it mean?

It’s an important issue to me. In our master’s degree program every student takes a language testing course, because we find that “assessment literacy” – knowing what assessment is all about – is important from two perspectives. First, teachers need to be able to design good tests for their students for diagnostic, progress, and achievement purposes. Second, they need to understand that in our country there are many forms of testing that they need to be able to think about in rational ways. Many myths are circulating in our culture about testing: for example, many Americans believe that Americans are getting dumber because average SAT scores have declined over the years. But in reality, this trend may simply be reflecting the fact that opportunities have opened up for more people in our country to get access to higher education. Such a trend might therefore mean that Americans are actually getting more educated and smarter. Teachers need assessment literacy so they can think clearly about issues like these.

Do you think that future teachers should study the basics of testing, or should there be specialists who do the job?

Essentially, I think both are important. There should be specialists, of course. It takes quite a bit of learning to become a specialist. It is hard work, which includes learning a lot about linguistics, statistics, psychology, etc., to do the job. But at the same time, I think all teachers should have at least a basic knowledge of testing so they can do a responsible job of designing and administering their classroom tests and so they can be assessment literate.

Testing is not the only way to check knowledge, is it? In your articles you suggest other methods such as portfolios, conferences, and self-assessment.

I think these are very rich and very important pedagogical tools. In fact, they are forms of testing; I prefer to think of them as part of what we do as language testers. You can do good portfolio assessments or bad ones, and the same goes for conferences and self-assessments. In terms of large-scale testing they are probably too logistically cumbersome to be useful, but in terms of pedagogical criterion-referenced testing they are very, very useful and provide a rich source of information about students. The students are brought into the process. In fact, the process, including the language points to be assessed and the criteria to be used in assessing those language points, can and should be negotiated with the students; they need to understand what’s going on. I think such strategies are useful.
However, there is one aspect of the literature on this topic that I disagree with: I don’t think of these sorts of measures as alternatives to language testing; rather I believe they are alternatives within language testing.

It’s a good thing when teachers have a right to choose. But can you imagine if teachers in the USA were ordered one day to have a kind of Unified exam? What would they do?

I think teachers in America would react forcefully to any such order coming from the federal government. The National Education Association is large and powerful in Washington. Virtually all teachers and professors belong to this organization. The NEA would lobby and fight for the teachers’ point of view in Washington. Ultimately, it is an issue of academic freedom, which is a crucial freedom for the teachers in my country. If somebody comes to my class and tries to tell me how to teach or test, it is my professional duty to fight. Bureaucrats know nothing about teaching. Only teachers really know what to teach and how to test it.

What do you think about the idea of combining school final exams and University entrance exams?

It’s not a good idea. That’s combining two different, very different in my view, varieties of tests: I mean norm-referenced and criterion-referenced tests. A test like this should focus on the university entrance decision, which would mean that it would have to be norm-referenced so that it would create a normal distribution.
The test can certainly be based on material covered in high school, but only those items that spread students out should be included. Such a strategy will work out well for the university entrance decision, but would be antithetical to testing achievement (that is, what the students actually know) in any theoretically reasonable way.

What can our teachers do in this situation? They find themselves in a tight corner; most of them see it as a “no-go” situation. Some of them say: “OK, I just have to do my best to coach my students for this imposed examination. That’s all I can do.”

That really worries me because it means that they are teaching to the test instead of teaching English. If the exam is constructed properly and is working well, it should be the case that, if students learn English, their scores will go up, regardless of how they studied.

If you were involved in the present situation in Russia with this new system of testing, what would you recommend doing first?

There are certain steps that have to be decided. And they haven’t been decided yet.
I think professional Russian English teachers have to decide, as a group, what their testing approach will be. Probably, in the process, people need to be trained. It is possible to send people to the UK or the United States to become specialists in language testing. There should be no hurry in such an important process. Take your time. A big rush can only result in disaster.
In one country where I work a lot, for example, the decision to implement a communicative curriculum in the high schools was made in 1993 and they still haven’t managed to implement it because teachers were not adequately trained to do so. Similarly, some of them would like to implement a single national examination, but they have encountered so much resistance from the universities, each of which has its own entrance exams, that they have not been able to implement this policy either.
No such decisions can be responsibly implemented overnight. Perhaps in Russia’s case, the big universities and the professional organizations representing the teachers should come together and decide how they want to proceed with this entrance examination policy. They would then need to approach all the other universities and the bureaucrats, and convince them that a unified exam that does such-and-such would be in the best interests of all involved.
The main question is probably how to deal with a language teaching situation where the grammar-translation approach and the communicative approach may flourish side by side. Perhaps a university entrance test should include bits of both traditions. I’ve noticed that the details of the language are important to the Russian teachers I have met – every Russian I talk to seems to want to talk about the details of vocabulary or grammar. That’s important, but the communicative aspects of language may be important too. Maybe the national test should reflect a balance. But that is a political decision, one that language teaching professionals in Russia will have to make for themselves, as a profession.
Then, with all such political decisions made, the steps in test design are relatively easy: get item writers to create items (a lot of items); pilot those items in some secure manner; item-analyze the results; create tests that are norm-referenced, reliable, and valid; train teachers to interpret the scores; and train university admission officers to correctly interpret the scores in combination with multiple other sources of information. Ultimately such a process must respect all the participants (especially the teachers) enough to send teams out to the main cities to inform teachers of what the test is going to be and how it will work; then, the teachers can in turn tell their students what to expect. To me these are the normal steps that should be taken, one at a time, so the test will function well for all involved and not create a huge disaster.