“Only teachers really know
what to teach and how to test it.”
An Interview with James Dean Brown
Professor Brown is from the Department of Second Language
Studies, University of Hawai’i at Manoa. He is the author of 16 books and 12 monographs
on testing and assessment. He is the highest level scholar in this field in America, the
author of 12 published and institutional test batteries. He has presented all over the
world and has received numerous awards for his outstanding activities. Professor Brown
came at the invitation of the Russian Ministry of Education to work with Russian
colleagues on January 19-30, 2004.
This interview was conducted by Alyona Gromoushkina after the seminar held by Professor
Brown at MSU on January 26.
A test writer takes on great responsibility when making a test.
Sometimes the students’ lives depend on passing the test. What is it like to be a test
writer? How do you feel?
Being a test writer makes me very careful. It makes me take many steps
to do the job as well as I possibly can. I don’t just write a series of items and give
them as a test. Instead, I write items, then get feedback from people about the quality of
the items, and I revise them, and then I get more feedback, and I revise them again, and I
make them as good as I possibly can. Then, I pilot them with a group of students, who are
similar to the group that I am designing the test for in such a way that I can analyze the
test, item by item, and even option by option, and see what happened during the piloting.
Who answered correctly in the upper group, the middle group, and the bottom group? Who, in
the upper, middle, and bottom groups, picked which options? And so forth.
Then I select the items that are going to do the job that I need done. If it’s a
standardized test, say for proficiency testing or placement purposes, then I’m looking
for items that spread people out; so I select those items that the upper students answer
correctly and the bottom students answer incorrectly. Those are the items that spread
people out. Then I get rid of the items that aren’t working, that aren’t spreading
people out, and create a new revised version of the test. That way my test and its quality
are maximally improved before I ever use the test to make decisions that will affect
students’ lives in important ways.
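The item analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Professor Brown's own procedure: the upper and bottom groups are taken as the top and bottom thirds of the pilot sample, responses are scored 0 or 1, and the cut-off of 0.3 and the pilot data are hypothetical.

```python
from typing import Dict, List


def item_analysis(responses: List[List[int]]) -> List[Dict[str, float]]:
    """Compute item facility and discrimination for 0/1-scored pilot data.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    # Assumption: upper and bottom groups are each one third of the sample.
    k = max(1, n_students // 3)
    upper, lower = ranked[:k], ranked[-k:]

    stats = []
    for i in range(n_items):
        # Facility: proportion of all pilot students answering correctly.
        facility = sum(r[i] for r in responses) / n_students
        p_upper = sum(r[i] for r in upper) / k
        p_lower = sum(r[i] for r in lower) / k
        # Discrimination: how much better the upper group did than the bottom group.
        stats.append({"item": i, "facility": facility,
                      "discrimination": p_upper - p_lower})
    return stats


def select_items(stats, min_discrimination=0.3):
    """Keep items that spread students out (the threshold is hypothetical)."""
    return [s["item"] for s in stats if s["discrimination"] >= min_discrimination]


# Hypothetical pilot data: 6 students x 3 items.
pilot = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
stats = item_analysis(pilot)
kept = select_items(stats)
```

In this made-up sample, everyone answers the first item correctly, so it cannot spread students out and is dropped. Only the second item separates the upper group from the bottom group.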
One other important point should be made here: in the US, we always try to use multiple
sources of information in our decision making processes. For example, a university
admission decision might be based in part on a test score, but other information will also
be used – information like the student’s grade point average in high school, the
quality of the school, the types and number of extracurricular activities the student
participated in, the number of advanced placement courses the student took, the quality of
the student’s statement of purpose, and so forth. A single test score would never be
used to make such an important decision in America because we know that multiple
observations are more reliable than any single observation.
How do you select students for piloting? Do you know them?
I am looking for students who are like the ones I’m going to test for
decision-making purposes. Let me describe one way this can be done. Let’s say we need to
administer twelve different tests a year. In such a situation we can put experimental
items on each test…say, for every randomly selected thousand students, we put five
experimental items that are not scored – they are just in there, and the students
don’t know which ones they are. So if one hundred thousand people are taking the test,
that’s five items for every thousand times 100 (the number of thousands), for a total
of 500 items that are piloted. We can
then do item analysis and determine exactly what the difficulty level of each item is and
the degree to which it spreads the students out, that is, we can record those statistical
characteristics along with what each item is designed to test and “deposit” these 500
items into an item bank from which we can draw the items we want, in terms of what they
are testing and how they are working statistically, for future versions of the test. And
then, whenever we need to, we can build a test. So if we want a test that has X, Y, and Z
structures, we go to our computer and pick out the items that test those structures and
are working well statistically, and then we build the test just the way we want it.
Naturally, we also monitor the results of the test afterwards, when we actually administer
the test for decision making purposes, to see what actually happened statistically. And we
find that, yes, it worked well, but this item wasn’t quite right and that one could have
been better, so we learn from the experience, and start building the next test.
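The arithmetic and the item bank idea in this answer can be sketched as follows. This is a minimal illustration, assuming each block of a thousand students receives its own five experimental items. The `BankItem` fields, the item identifiers, the statistics, and the 0.3 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BankItem:
    item_id: str
    structure: str         # what the item is designed to test
    facility: float        # proportion answering correctly in piloting
    discrimination: float  # how well the item spreads students out


def piloted_item_count(test_takers: int, block_size: int, items_per_block: int) -> int:
    """Number of experimental items piloted when each block gets its own set."""
    return (test_takers // block_size) * items_per_block


def build_test(bank: List[BankItem], structures: List[str],
               per_structure: int, min_discrimination: float = 0.3) -> List[BankItem]:
    """Pick the best-discriminating banked items for each requested structure."""
    chosen = []
    for structure in structures:
        candidates = [b for b in bank
                      if b.structure == structure
                      and b.discrimination >= min_discrimination]
        candidates.sort(key=lambda b: b.discrimination, reverse=True)
        chosen.extend(candidates[:per_structure])
    return chosen


# 100,000 test takers, 5 unscored items per block of 1,000 students.
total_piloted = piloted_item_count(100_000, 1_000, 5)

# Hypothetical bank entries recorded after item analysis.
bank = [
    BankItem("A1", "past tense", 0.55, 0.45),
    BankItem("A2", "past tense", 0.90, 0.10),
    BankItem("B1", "articles", 0.60, 0.35),
    BankItem("B2", "articles", 0.50, 0.50),
    BankItem("C1", "conditionals", 0.40, 0.40),
]
test_form = build_test(bank, ["past tense", "articles"], per_structure=1)
selected_ids = [b.item_id for b in test_form]
```

Here `total_piloted` comes to 500, matching the figure in the interview, and the form built from the bank contains one well-functioning item for each requested structure.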
So there is much work after the test… not only to write the test,
but scoring and analyzing…
Actually, writing a test is really just the beginning, though
unfortunately it’s the last step in some countries. That’s a mistake. There should be
careful piloting of a relatively large number of items and item-by-item analysis of the
pilot results, followed by production of a shorter, more efficient final version of the
test. I’ve seen tests that were designed without all those steps and they created a real
mess: answer keys that made no sense, two or three possible answers, items that no native
speaker could ever answer … a real mess. The danger is that such tests are used to make
important decisions about children’s lives, young people’s futures, and on the basis
of what? On the basis of items that can only be described as a mess.
There are three key concepts: testing, evaluation and assessment.
What is the difference?
People use these terms in different ways. I’ll explain my specific
way of using them. I use “testing” only for measurements that we use to determine
students’ knowledge about language or their language abilities, whether
criterion-referenced or norm-referenced measurements.
I use “evaluation” only when I’m referring to program evaluation: the quality of the
teaching and/or the program in general. We use program evaluation to answer questions
like: are we meeting the needs of the students; are the objectives adequate; are the test
measurements effective; is the teaching good? And … around and around we go. That’s
evaluation to me.
“Assessment” is a broader term than “testing” to me. It involves testing, of
course. But it also involves other aspects of grading, so it would be all of the
information that teachers use: attendance records, class participation, all of that.
What is your favourite kind of test?
I like cloze tests. Cloze tests are very complex. So, they fascinate
me. I’ve been spending much of my last twenty-five years trying to figure out what they
do, and in the process, I’ve published 15 or so articles on the topic.
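For readers unfamiliar with cloze tests, one common way to construct them is fixed-ratio deletion, where every nth word of a passage is replaced with a blank. The sketch below assumes that construction. It is not necessarily the procedure Professor Brown uses, and the function name and sample sentence are hypothetical.

```python
from typing import List, Tuple


def make_cloze(text: str, n: int = 7) -> Tuple[str, List[str]]:
    """Replace every nth word with a numbered blank; return passage and answer key.

    Words are split on whitespace, so punctuation attached to a deleted
    word is removed along with it (a simplification in this sketch).
    """
    answers: List[str] = []
    out: List[str] = []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            answers.append(word)
            out.append(f"({len(answers)}) ______")
        else:
            out.append(word)
    return " ".join(out), answers


passage, key = make_cloze(
    "Being a test writer makes me very careful about every item", n=3
)
```

With `n=3`, the third, sixth, and ninth words of the sample sentence ("test", "me", "about") become blanks and go into the answer key.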
You use the term “assessment literacy”. What does it mean?
It’s an important issue to me. In our master’s degree program every
student takes a language testing course, because we find that “assessment literacy”
– knowing what assessment is all about – is important from two perspectives. First,
teachers need to be able to design good tests for their students for diagnostic, progress,
and achievement purposes. Second, they need to understand that in our country
there are many forms of testing that they need to be able to think about in rational ways.
Many myths are circulating in our culture about testing: for example, many Americans
believe that Americans are getting dumber because average SAT scores have declined over
the years. But in reality, this trend may simply be reflecting the fact that opportunities
have opened up for more people in our country to get access to higher education. Such a
trend might therefore mean that Americans are actually getting more educated and smarter.
Teachers need assessment literacy so they can think clearly about issues like these.
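The point about declining averages can be checked with simple arithmetic. The numbers below are invented purely for illustration and are not real SAT data: a small group takes the test in an earlier year, and later the same kind of group is joined by an equally large group of newly admitted test takers who score lower on average.

```python
from typing import List, Tuple


def pooled_mean(groups: List[Tuple[int, float]]) -> float:
    """Overall mean for groups given as (number_of_test_takers, mean_score)."""
    total = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total


# Hypothetical figures: the original group's mean never changes.
earlier = pooled_mean([(100, 550)])
later = pooled_mean([(100, 550), (100, 450)])
```

The overall average falls from 550 to 500 even though no group's performance declined. The drop comes entirely from more people taking the test.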
Do you think that future teachers should study the basics of testing,
or should there be specialists who do the job?
Essentially, I think both are important. There should be specialists,
of course. It takes quite a bit of learning to become a specialist. It is hard work, which
includes learning a lot about linguistics, statistics, psychology, etc, to do the job. But
at the same time, I think all the teachers should have at least basic knowledge of testing
so they can do a responsible job of designing and administering their classroom tests and
so they can be assessment literate.
Testing is not the only way to check knowledge, is it? In your
articles you suggest other methods such as portfolios, conferences, and self-assessment.
I think these are very rich and very important pedagogical tools. In
fact, they are forms of testing; I prefer to think of them as part of what we do as
language testers. You can do good portfolio assessments or bad, the same with conferences
and self-assessments. In terms of large-scale testing they are probably too logistically
cumbersome to be useful, but in terms of pedagogical criterion-referenced testing they are
very, very useful and provide a rich source of information about students. The students
are brought into the process. In fact, the process, including the language points to be
assessed and the criteria to be used in assessing those language points, can and should be
negotiated with the students; they need to understand what’s going on. I think such
strategies are useful.
However, there is one aspect of the literature on this topic that I disagree with: I
don’t think of these sorts of measures as alternatives to language testing;
rather I believe they are alternatives within language testing.
It’s a good thing when teachers have a right to choose. But can
you imagine if teachers in the USA were ordered one day to have a kind of Unified exam?
What would they do?
I think teachers in America would react forcefully to any such order
coming from the federal government. The National Education Association is large and
powerful in Washington. Virtually all teachers and professors belong to this organization.
The NEA would lobby and fight for the teachers’ point of view in Washington. Ultimately,
it is an issue of academic freedom, which is a crucial freedom for the teachers in my
country. If somebody comes to my class and tries to tell me how to teach or test, it is my
professional duty to fight. Bureaucrats know nothing about teaching. Only teachers really
know what to teach and how to test it.
What do you think about the idea of combining school final exams and
University entrance exams?
It’s not a good idea. That’s combining two different, very
different in my view, varieties of tests: I mean norm-referenced and criterion-referenced
tests. A test like this should focus on the university entrance decision, which would mean
that it would have to be norm-referenced so that it would create a normal distribution.
The test can certainly be based on material covered in high school, but only those items
that spread students out should be included. Such a strategy will work out well for the
university entrance decision, but would be antithetical to testing achievement (that is,
what the students actually know) in any theoretically reasonable way.
What can our teachers do in this situation? They find themselves in
a tight corner; most of them see it as a “no-go” situation. Some of them say: “OK, I
just have to do my best to coach my students for this imposed examination. That’s all I
can do.”
That really worries me because it means that they are teaching to the
test instead of teaching English. If the exam is constructed properly and is working well,
it should be the case that, if students learn English, their scores will go up, regardless
of how they studied.
If you were involved in the present situation in Russia with this
new system of testing, what would you recommend doing first?
There are certain steps that have to be decided. And they haven’t
been decided yet.
I think professional Russian English teachers have to decide, as a group, what their
testing approach will be. Probably, in the process, people need to be trained. It is
possible to send people to the UK or the United States to become specialists in language
testing. There should be no hurry in such an important process. Take your time. A big rush
can only result in disaster.
In one country where I work a lot, for example, the decision to implement a communicative
curriculum in the high schools was made in 1993 and they still haven’t managed to
implement it because teachers were not adequately trained to do so. Similarly, some of
them would like to implement a single national examination, but they have encountered so
much resistance from the universities, each of which has its own entrance exams, that they
have not been able to implement this policy either.
No such decisions can be responsibly implemented overnight. Perhaps in Russia’s case,
the big universities and the professional organizations representing the teachers should
come together and decide how they want to proceed with this entrance examination policy.
They would then need to approach all the other universities and the bureaucrats, and
convince them that a unified exam that does such-and-such would be in the best interests
of all involved.
The main question is probably how to deal with a language teaching situation where the
grammar-translation approach and communicative approach may flourish side by side. Perhaps
a university entrance test should include bits of both traditions. I’ve noticed that the
details of the language are important to the Russian teachers I have met – every Russian
I talk to seems to want to talk about the details of vocabulary or grammar. That’s
important, but also, the communicative aspects of language may be important. Maybe the
national test should reflect a balance. But that is a political decision, one that
language teaching professionals in Russia will have to decide for themselves and as a
group.
Then, with all such political decisions made, the steps in test design are relatively
easy: get item writers to create items (a lot of items); pilot those items in some secure
manner; item-analyze the results; create tests that are norm-referenced, reliable, and
valid; train teachers to interpret the scores; and train university admission officers to
correctly interpret the scores in combination with multiple other sources of information.
Ultimately such a process must respect all the participants (especially the teachers)
enough to send teams out to the main cities to inform teachers of what the test is going
to be and how it will work; then, the teachers can in turn tell their students what to
expect. To me these are the normal steps that should be taken one at a time so the test
will function well for all involved and not create a huge disaster.