So here’s the big project that’s been keeping me away from my blog for the last little while: the Corpus of Learner English (CoLE). I have been working on this project for a couple of years, and we are finally ready to start sharing it with the world. Every step of the way has been an adventure, from designing the corpus, to applying for IRB approval, to compiling the data. Here are the nitty-gritty details.
I’ve always been interested in corpora, more in the idea of them than in any particular research question. A few years ago, I initiated the transition of the American Language Program (ALP, Ohio State’s intensive English program) from paper-based placement testing to computer-based testing. Shortly after that, I started thinking about how much easier the 30-minute placement compositions would be to analyze now that they were in electronic form. Word counts, for example, could be compared with a couple of keystrokes. And, of course, more complex comparisons were possible, such as differences in pronoun use between male and female students. (One interesting preliminary finding there was that our male students used a lot more first-person pronouns while our female students used a lot more third-person pronouns. Was this some sort of cultural artifact? Not my research question! But it could be yours…)
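For the curious, here is a rough sketch of what that kind of pronoun comparison might look like in Python. This is not the analysis we actually ran. The pronoun lists, the tokenizer, and the function names are my own assumptions for illustration, and a real study would want proper tokenization and a more careful pronoun inventory.

```python
import re

# Assumed pronoun inventories for illustration; not an official list.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
THIRD_PERSON = {"he", "him", "his", "himself",
                "she", "her", "hers", "herself",
                "it", "its", "itself",
                "they", "them", "their", "theirs", "themselves"}


def tokenize(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_rates(text):
    """Return the word count and the share of first- and third-person pronouns."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {"words": 0, "first": 0.0, "third": 0.0}
    first = sum(token in FIRST_PERSON for token in tokens)
    third = sum(token in THIRD_PERSON for token in tokens)
    return {"words": total, "first": first / total, "third": third / total}


# A made-up sentence, not taken from the corpus.
sample = "I think my family is great. They love me."
rates = pronoun_rates(sample)
```

Averaging these rates over the essays in each group (by sex, country, placement level, and so on) would give a first look at differences like the one described above.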
A year or two after we moved to computer-based testing in ALP, Ohio State’s ESL Composition Program also moved its testing online, in part to make testing accessible to students before they arrive in the U.S. Previously, students could not take their placement tests until they arrived. Because test results were prerequisites for many classes, students often registered late and found that many classes had already filled. Again, I saw data that could make an interesting corpus.
I talked with my colleague Jack Rouzer about the potential for such a corpus, and he was also very enthusiastic about the project. We immediately began working out the details and submitted an IRB application. This was my first experience with our IRB and it was an interesting one. For one, I don’t think our IRB is as familiar with linguistic corpora (or even “data repositories” as the project was classified) as it is with medical testing or psychological experiments. Once we were able to create a protocol that would reasonably protect our student participants’ privacy, we were approved. Here’s what we came up with:
First, obviously, we ask for students’ informed consent. We explain that we will make their writing available online in a de-identified way with only some demographic information attached. In the corpus, we include each student’s age range, sex, country of origin, college of study, graduate or undergraduate status, and placement level (1, 2, or out, which means they are exempt from taking ESL Composition classes). We ask for their consent after they have written their essays so that they know exactly what will be included. Second, we read each placement essay to be sure students don’t self-identify in any way within the content of their essays. Third, we only include an essay in the corpus if every demographic category its writer belongs to has at least fifty members. So, for example, we will include an essay if it was written by one of 500 students aged 18-21, one of 1000 female students, one of 400 Chinese students, one of 400 College of Arts & Sciences students, one of 300 graduate students, and one of 200 students who placed into undergraduate level 2. It is extremely unlikely that you would be able to identify who wrote this essay based on these demographics. However, we would not release an essay written by one of 3 Botswanans or one of 20 students over age 25 because it is more likely that you could identify them if, for example, you know a student from Botswana. The good news is that, as we add more essays each year, every population will grow, and we will be able to include more essays in the corpus as different demographic categories reach this threshold.
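To make the third rule concrete, here is a minimal sketch of the threshold check in Python. It is only an illustration of the idea, not the process we actually use: the field names and the dictionary layout are hypothetical, and I am assuming each demographic category is counted on its own (as in the example above) rather than in combination.

```python
from collections import Counter

THRESHOLD = 50  # minimum number of students in each demographic category

# Hypothetical field names for the six demographics described above.
ATTRIBUTES = ("age_range", "sex", "country", "college", "status", "placement")


def releasable(essays, threshold=THRESHOLD):
    """Keep essays whose writer falls into large enough groups on every attribute."""
    # Count how many essays share each value, one attribute at a time.
    counts = {attr: Counter(essay[attr] for essay in essays)
              for attr in ATTRIBUTES}
    return [essay for essay in essays
            if all(counts[attr][essay[attr]] >= threshold
                   for attr in ATTRIBUTES)]


# Made-up records: 60 students who share every category, plus 3 from Botswana.
common = {"age_range": "18-21", "sex": "F", "country": "China",
          "college": "Arts & Sciences", "status": "graduate", "placement": "2"}
essays = [dict(common) for _ in range(60)]
essays += [dict(common, country="Botswana") for _ in range(3)]

kept = releasable(essays)  # the three Botswanan essays are held back
```

Because the counts are recomputed from the full pool each time, essays that are held back one year can be released later once their categories pass the threshold.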
In the first semester, we were only able to include male and female, grad and undergrad Chinese students under 25 years old in Business, Arts and Sciences, and Engineering with low and intermediate placements, but subsequent semesters have broadened the pool to include more age bands, countries, colleges, and placement levels.
If you are interested in accessing this corpus, please contact me for more information.