C-TESTS: THEIR EVOLUTION AND FUTURE

A way of boosting report cards, including the teacher’s?

Abstract

In 1982, Christine Klein-Braley examined the “suitability of cloze tests as measures of reading comprehension” (Toegepaste Taalwetenschap in Artikelen, 13, 49-61). Later that year she published a paper titled “Der C-Test, ein neuer Ansatz zur Messung allgemeiner Sprachbeherrschung” [The C-Test, a new approach to measuring general language proficiency] (AKS Rundbrief 4, 23-37). Now, twenty years and several thousand research pages later, she and her disciples (Ulrich Raatz, Ruediger Grotjahn, Undine Roos, James Coleman, and Brigitte Stemmer, along with over one hundred other researchers) speak with relative harmony about the C-Test version of the cloze test: they “strongly recommend incorporating it into everyday teaching and testing activities” (Lucy Katona, 1993). Their disagreements are minor and concern only methods of design, not validity, reliability, frugality, or ease of use. The purpose of this study is to evaluate the C-test’s acceptance in, and possible adaptation to, U.S. ESL markets.

 

Introduction

In 2003, while studying language assessment (Bailey, 1998), I developed a professional interest in an ingenious adaptation of the cloze test: the C-Test. I became curious about how far it had been accepted in the United States and how our ESL students would respond to a survey about its use. That curiosity led to this three-pronged study of the C-Test’s past, present, and future, that is, the original Cloze, the Cloze2 (the original C-test), the ClozeX (the nth-word-deletion C-test), and now the ClozeT and ClozeTT (their online tailored adaptations). I hope that this study’s outcome prompts you to try them yourself and thereby help collect additional data.

Comments

During the C-Test’s early stage of development, research showed a significant difference between testing only new words and also including previously learned words. The paper that best caught my attention was “The C-Test: The Teacher-friendly way to test language comprehension” by Lucia Katona and Zoltan Doernyei (1993). It was an update of a study conducted in Budapest the previous year with 120 first-year English majors. Katona and Doernyei reported a significant direct improvement in language comprehension through the use of the then ten-year-old C-Test format[1]. They stated that the C-Test was the test of choice (out of five[2]) for testing GLP (general language proficiency). Their bulleted list of advantages was led by this recommendation:

“We, therefore, recommend it to be used (a) to select and place students in appropriate groups, (b) to assess their achievement at end-of-term exams by selecting several typical passages from the term’s materials, (c) to test certain grammar areas (e.g., tenses or word formation) by including texts that contain several examples of the structures in question, and (d) to check home reading or homework by taking passages from the texts the students had to work on.  We consider it, without doubt, the most versatile test type.”

Katona and Doernyei also suggested that students be brought into the process of creating these tests (for fun and learning and even competitions) because such involvement had a positive washback by “taking away the fear of the unknown” and making the reading and learning a more integrated process.

Other C-test material follows, starting with excerpts from Learning About Language Assessment (Bailey, 1998). The second item is the rebuttal and counter-rebuttal that largely led me to conduct this survey. I also hope to receive a few responses to my email request for C-test updates. (My request was, “Please be so kind and help me present the last five years of development of the C-Test [any version] in its most accurate manner. Your expert opinion would be greatly appreciated.”)

Kathleen M. Bailey quotes her friend Tim Hacker, who gained experience with C-tests (designed by experts from the British Council) while teaching English to graduate students in Sri Lanka for the Peace Corps, saying: “The C-test was at that time deemed ‘cutting edge’ for measuring proficiency rather than achievement. The problem was, the students hated this proficiency test like every other; it pointed out too obviously what they did not as yet know. During the process of follow-through, however, he started to use the tests as a learning tool rather than testing device and found that the students began to like them … especially when they were allowed to help create them.”

By contrast, Abdoljavad Jafarpur (1995) attempts to refute certain claims about C-testing. Specifically, on page 195 of his article, he attributes the following five claims to Klein-Braley and Raatz (1984):

  1. it is easy to construct and to score C-tests;
  2. adult native speakers should obtain virtually perfect scores;
  3. the deletions affect a representative sample of the text;
  4. even previously untried material produces satisfactory reliability and validity coefficients; and
  5. C-tests have face validity.

 

In the conclusion of his article (page 209), Jafarpur (hereafter J) states his findings with respect to these five claims:

  1. it is easy to construct and to score C-tests; but
  2. native speakers do not achieve perfect scores;
  3. the deletions do not affect a representative sample of the text: different deletion starts and deletion ratios produce different tests, which is suggestive of the invalidity of the procedure;
  4. previously untried material demonstrates satisfactory reliability but does not show acceptable validity against cloze testing; and
  5. C-tests do not possess face validity.

Ashley Hastings counters with: “There is general agreement on point (1). As for the remaining points, the purpose of this [Hastings’s] paper is to show that J has apparently misconstrued both the essence of the C-testing procedure and the claims made about it; that his study relies on instruments that are wholly inadequate to address the issues raised; and that he has consequently failed to make a case against C-testing.” The page at http://www.su.edu/icfs/indefense.htm gives a detailed summary of what Jafarpur actually tested and of Raatz’s comments in rebuttal. J’s data in Table 1 demonstrate clearly that, if nothing else, the C-test’s mean score (using “exact” scoring) was equal to that of the conventional cloze test (using “acceptable” scoring), which is a remarkable improvement over the low scores of the “exact” cloze.

 

Details

Jafarpur’s criticism of the C-Test brought to mind the all-important question of student satisfaction and student reactions here in America, not under the name “C-Test” but under a variation of its original name: cloze-something or something-cloze. Since we are dealing with partially deleting every 2nd or nth word, I settled on Cloze2 (for the original C-Test), ClozeX (for the nth-word deletion), and ClozeT and ClozeTT (for their tailored adaptations). I wanted to know which of these tests would give ESL students the greatest feeling of accomplishment. To this end, I set out to do the following:

First, I canvassed textbooks for standardized vocabulary lists and noted that only at the “academic” level was there some consensus. The other lists (mostly categorized as Level 1, 2, 3, and 4 lists) were too varied to meet validity requirements. I settled, therefore, on using the University Word List, augmented by words included in some lists but not others. In other words, I decided to consolidate all lists termed “academic” or “scholastic” to create the primary table to search against.

For one variable, I chose to include the middle range of the 10,000 most common words (words from the 3000 and 5000 lists). For the other, I varied the order of the three selected formats to see whether presentation order would influence the scores and feedback. I chose this higher level of proficiency because it might be only at this level that students are proficient enough to select “more” rather than less.

To research student satisfaction not only through paper tests but also via LANs and the Internet, I commissioned the best programmer I could find (Rob Craig, the developer of Euphoria) to create for us the most user-friendly program imaginable. Well, he did; clozeonline.us was up and running in time for this round of research. It generated formats B and C of the paper tests and successfully administered all online retakes.

The main task was to find a school willing to let me administer the test and then collect feedback. Since the focus was on student satisfaction, it didn’t really matter whether a school had an adequate testing lab; what mattered was that we had a large sample and provided for a random distribution of the three versions of the test. My preliminary agreement with Dr. Myrna Creasman at CMMS (Center for Multilingual Multicultural Studies) was to let me coordinate this data collection during the second week of their next group of students, the last week of June 2003. (“More Details” covers what actually happened and what conclusions I was able to reach.)

The last preparatory task was to create a post-test questionnaire that was brief but valid. I needed it to be professional but friendly, and meaningful without appearing to pry. In addition to asking the obvious preference questions, I needed to ask the students about the clarity of my instructions, their perceived learning style, their preferences about the number of truncated words per line, and the extent to which they agreed with reviewing the words or tasks of the week while being tested.

 

More Details

THE FACILITY

To establish what I had to work with, I first made an appointment with the administrator of the language school to collect current vocabulary tests, settle on dates and timeframe, go over the data-collection format, and get a list of all students in the participating classes.  I also needed to know whether I would get permission to ask participants to do an online retake of the test … this time not randomly assigned but randomly selected.  (I needed to determine whether online testing affects questionnaire responses.)

DESIGN OF TEST

Once in possession of the facility’s word list, I turned to Brown’s (1995a) list of similar studies and searched for ideas. As a result, I developed the three attached testing packets. Each consists of clear instructions to set the pace, an easy-to-read sheet on which students cast their votes, and a one-sheet test covering the 40+ new vocabulary items in three different formats.

NOTE: I settled on calling the conventional cloze test (one fully blanked-out word per statement) format A and its two main derivatives formats B and C.

To permit the use of the same questionnaire for all three versions of the test, I chose to have all three types of cloze tests on the same sheet but in a varying order, i.e.:  Version 1 uses the format order A:B:C, Version 2 uses the order B:C:A, and Version 3 uses C:A:B.
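The three version orders above are cyclic rotations of one sequence, so each format appears exactly once in each position across the versions. A minimal Python sketch of that rotation follows; the function name `rotated_orders` is hypothetical and is not taken from the study's materials.

```python
def rotated_orders(formats):
    """Return every cyclic rotation of the given list of format labels."""
    return [formats[i:] + formats[:i] for i in range(len(formats))]


# Rotating A, B, C yields the three orders used for Versions 1, 2, and 3.
versions = rotated_orders(["A", "B", "C"])
for number, order in enumerate(versions, start=1):
    print(f"Version {number}: {':'.join(order)}")
```

Rotating the order this way lets position effects be compared across versions, since no format is always first or always last.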

 

DAY OF TEST

On the day of the test, I randomly assigned the three versions of the test and asked that everyone turn to page 2 (the Data Collection Sheet) while I read page 1.  I proceeded to read the instructions and asked that no one look at the rest until asked to start.  The instruction, survey, and test sheets were stapled in such a manner that the test sheet was facing down while the instruction and survey sheets were facing up.  The time allotted to the test was 10 minutes plus an additional 5 minutes for filling out the questionnaire. 

NOTE: “Random assignment” consisted of distributing the packets sequentially from a stack prepared in repeating order (ABC, ABC, etc.) so that no two students sitting next to one another had the same version of the test.

After everyone was finished, I picked up the test sheets, thanked everyone for agreeing to help with the research, and finished by asking them to again read the instructions and then fill out the questionnaire. I said, “The instructions ask that you place a check mark or ‘x’ on the line before the word(s) that best describe your preferred choice of answer.  When you’re through, keep the instructions and hand the other up front.”  I also told the class that a compilation of the data would be available at the site within 30 days.

ONLINE RETAKE

This vital part of the survey is simple: sometime within seven days of the test, students log on, enter the STUDENT domain of teacher BLUM, select one of the three “research” tests, engage their keyboard’s “Insert” feature, and type the missing characters over the underscores of the partially deleted words. When finished, they click Submit. The program lists the number of correct answers. If the score is below 100%, the student is invited to go back and replace more underscores. For this research, two additional tries are acceptable (to test its purported usefulness as a user-friendly teaching tool). Upon completion of the third try, the student clicks on “Give up” and then selects “Research participants click here” to load a brief version of the feedback questionnaire.
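The retake procedure (score a submission, allow up to two further tries, stop early on a perfect score) can be sketched as follows. This is only an illustration under stated assumptions: the names `score_attempt` and `run_retake` are hypothetical, the case-insensitive comparison is my assumption, and nothing here describes how clozeonline.us is actually implemented.

```python
def score_attempt(answers, responses):
    """Count responses that match the expected missing letters.

    Assumption: matching ignores case and surrounding whitespace.
    """
    return sum(
        1
        for expected, given in zip(answers, responses)
        if given.strip().lower() == expected.lower()
    )


def run_retake(answers, attempts, max_tries=3):
    """Score successive attempts; stop at a perfect score or after max_tries."""
    scores = []
    for responses in attempts[:max_tries]:
        correct = score_attempt(answers, responses)
        scores.append(correct)
        if correct == len(answers):
            break  # 100% correct, so no further try is needed
    return scores


# Three blanks, improved on each of the three permitted tries.
answers = ["at", "n", "at"]
attempts = [["at", "x", "a"], ["at", "n", "a"], ["at", "n", "at"]]
print(run_retake(answers, attempts))
```

Keeping the score of every try, rather than only the last one, would let a researcher see whether repeated attempts actually improve results.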

NOTE: Only responses that match one of the addresses collected during the paper test are counted; therefore, urge students to note their address on the take-home page.

 

In Summary

It looks as though Christine Klein-Braley and her disciples were right in saying that there was a better way than the standard cloze [Format A in this study]. It looks as though partially deleting words and providing no cheat list is indeed better, or at least perceived to be better, than completely blanking words that are listed in a table just above. It also seems that the C-test is in fact accepted in the USA; at least it was by this first group of students at CMMS (albeit by reverting to its original name, CLOZE).

CONCLUSION

To accommodate “Exit” surveys, it is now more than just “prudent” to ask for student input; it has become a requirement. Therefore, since the tailored cloze (C-test) formats have been shown to increase at least the perception of learning, using them for feedback requests might, indeed, be a very smart idea. Future tests may bring different results, but for now the data support the usefulness of the next generation of cloze tests, especially when used on an ongoing basis as a learning tool. I, for one, will continue to collect pertinent data on this issue. I hope you do, too.



[1] Fixed-ratio (nth-word) deletion truncates a little more or a little less than half of every nth (usually 2nd) word, starting with the 2nd word of the 2nd sentence. Selected deletion truncates words on the basis of some rational decision.

[2] The five categories were: 1) English Department Proficiency Test (on vocabulary, grammar, listening, & comprehension), 2) TOEIC, 3) an oral interview, 4) a cloze test, and 5) the fixed-ratio C-test.
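The fixed-ratio rule in note [1] can be sketched in Python as below. This is a rough illustration, not the procedure used by any published C-test or by clozeonline.us. The function name `make_c_test`, the `round_up` switch (which chooses between removing "a little more" or "a little less" than half of odd-length words), the naive sentence splitting, and the choice to leave one-letter words intact are all my assumptions.

```python
import re


def make_c_test(text, n=2, round_up=True):
    """Return (damaged_text, answers) for a fixed-ratio C-test sketch."""
    # Assumption: sentences end in ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return text, []  # nothing to truncate without a second sentence

    answers = []
    position = 0  # word position counted from the start of the 2nd sentence

    def truncate(match):
        nonlocal position
        word = match.group(0)
        position += 1
        # Target the 2nd word, then every nth word after it.
        # Assumption: one-letter words are left intact.
        if position < 2 or (position - 2) % n != 0 or len(word) < 2:
            return word
        removed = (len(word) + 1) // 2 if round_up else len(word) // 2
        kept = len(word) - removed
        answers.append(word[kept:])
        return word[:kept] + "_" * removed

    body = " ".join(sentences[1:])
    damaged = re.sub(r"[A-Za-z]+", truncate, body)
    # The first sentence is left whole, as the rule starts in the 2nd sentence.
    return sentences[0] + " " + damaged, answers


damaged, answers = make_c_test("Tests vary. The cat sat on the mat today.")
print(damaged)
print(answers)
```

With the default settings, the sample text becomes "Tests vary. The c__ sat o_ the m__ today." and the answer key is "at", "n", "at". Setting `round_up=False` keeps the larger half of odd-length words instead.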