CoLE: Corpus of Learner English

photo of students taking an exam.

The exam by bitjungle.

So here’s the big project that’s been keeping me away from my blog for the last little while: the Corpus of Learner English (CoLE).  This is a project I have been working on for a couple of years and we are finally ready to start sharing it with the world.  Every step of the way has been an adventure: From designing the corpus, to applying for IRB approval, to compiling the data.  Here are the nitty gritty details.

I’ve always been interested in corpora — more in the idea of them than in any particular research question.  A few years ago, I initiated the American Language Program’s (ALP — Ohio State’s intensive English program) transition from paper-based placement testing to computer-based testing.  Shortly after that, I started thinking about how much easier the 30-minute placement composition components would be to analyze.  Word counts, for example, could be compared with a couple of keystrokes.  And, of course, more complex comparisons were possible such as differences in pronoun use between male and female students.  (One interesting preliminary finding there was that our male students used a lot more first person pronouns while our female students used a lot more third person pronouns.  Was this some sort of cultural artifact?  Not my research question!  But it could be yours…)
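For anyone curious what "a couple of keystrokes" might look like, here is a minimal Python sketch of a word count and a pronoun tally for a single essay. The pronoun lists and the sample sentence are my own placeholders, not anything taken from the corpus, and a real study would want a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical pronoun lists; a real study would choose these deliberately.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
THIRD_PERSON = {"he", "him", "his", "himself",
                "she", "her", "hers", "herself",
                "it", "its", "itself",
                "they", "them", "their", "theirs", "themselves"}

def tokenize(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pronoun_profile(text):
    """Return the word count and pronoun counts for one essay."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "words": len(tokens),
        "first_person": sum(counts[p] for p in FIRST_PERSON),
        "third_person": sum(counts[p] for p in THIRD_PERSON),
    }

# Made-up sample text, not a real placement essay.
sample = "I think my parents are right. They told me that their advice helps."
profile = pronoun_profile(sample)
print(profile)
```

Running the same function over every essay and grouping the results by a demographic field would give the kind of comparison described above.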

A year or two after we moved to computer-based testing in ALP, Ohio State’s ESL Composition Program also moved its testing online, in part to make testing accessible to students before they arrive in the U.S.  Previously, students could not take their placement tests until they arrived.  Because test results were prerequisites for many classes, students often registered late and found many classes had already filled.  Again, I saw some data that could be an interesting corpus.

I talked with my colleague Jack Rouzer about the potential for such a corpus, and he was also very enthusiastic about the project.  We immediately began working out the details and submitted an IRB application.  This was my first experience with our IRB and it was an interesting one.  For one, I don’t think our IRB is as familiar with linguistic corpora (or even “data repositories” as the project was classified) as it is with medical testing or psychological experiments.  Once we were able to create a protocol that would reasonably protect our student participants’ privacy, we were approved.  Here’s what we came up with:

First, obviously, we ask for students’ informed consent.  We explain that we will make their writing available online in a de-identified way with only some demographic information attached.  In the corpus, we include each student’s age range, sex, country of origin, college of study, graduate or undergraduate status, and placement level (1, 2, or out, which means they are exempt from taking ESL Composition classes).  We ask for their consent after they have written their essays so that they know exactly what will be included. Second, we read each placement essay to be sure students don’t self-identify in any way within the content of their essays.  Third, we only include an essay in the corpus if every demographic category it belongs to has at least fifty members.  So, for example, we will include an essay if it is written by one of 500 students aged 18-21, one of 1000 female students, one of 400 Chinese students, one of 400 College of Arts & Sciences students, one of 300 graduate students, and one of 200 students that placed into undergraduate level 2.  It is extremely unlikely that you would be able to identify who wrote this essay based on these demographics.  However, we would not release an essay written by one of 3 Botswanans or one of 20 students over age 25 because it is more likely that you could identify them if, for example, you know a student from Botswana.  The good news is that, as we add more essays each year, every population will grow and we will be able to release more essays as this threshold is reached in different demographics.
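The third rule can be sketched in a few lines of Python. This is only an illustration of the threshold idea, not the script we actually use: the field names and sample records are hypothetical, and it checks each category on its own (as described above) rather than combinations of categories.

```python
from collections import Counter

MIN_GROUP_SIZE = 50  # the threshold described above

# Hypothetical field names for the demographic information attached to each essay.
DEMOGRAPHIC_FIELDS = ["age_range", "sex", "country",
                      "college", "status", "placement"]

def releasable(essays, min_size=MIN_GROUP_SIZE):
    """Keep essays whose value in every demographic field is shared
    by at least min_size essays in the whole pool."""
    counts = {field: Counter(essay[field] for essay in essays)
              for field in DEMOGRAPHIC_FIELDS}
    return [essay for essay in essays
            if all(counts[field][essay[field]] >= min_size
                   for field in DEMOGRAPHIC_FIELDS)]

def make_record(country):
    """Build a made-up record that differs only by country."""
    return {"age_range": "18-21", "sex": "F", "country": country,
            "college": "Arts & Sciences", "status": "undergraduate",
            "placement": "2"}

# Made-up pool: 60 essays from one country and 3 from another.
pool = ([make_record("China") for _ in range(60)]
        + [make_record("Botswana") for _ in range(3)])
kept = releasable(pool)
print(len(kept), "of", len(pool), "essays pass the threshold")
```

Because the counts are recomputed from the whole pool each time, essays that are held back one year would pass automatically once their group grows past the threshold.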

In the first semester, we were only able to include male and female, grad and undergrad Chinese students under 25 years old in Business, Arts and Sciences, and Engineering with low and intermediate placements, but subsequent semesters have broadened the pool to include more age bands, countries, colleges, and placement levels.

If you are interested in accessing this corpus please contact me for more information.

Filed under Projects, Research, Resources

More Free Photos

Free photo example: hands of people working together on a project. From Flickr via OSU Open Photo (License)

OSU Open Photo is a fantastic “collection of high quality, openly licensed photos from around the web” put together by Ashley Miller at Ohio State.  Images include original sources and licenses.  Most of the photos relate to higher education, technology, and people in contemporary educational or work settings.  The photos are tagged and searchable.  There are also links to other resources for finding free photos.  Although there are larger collections out there, this set is useful because it is so nicely curated.

Filed under Resources

Studio Usage Heat Map

studio usage heat map - by day

If you’ve been following along, you know that I’ve been working to pull together a recording studio on a budget. Our first step was clearing out the old office that was destined to become the studio, minimizing the echo in the room, and painting one wall Sparkling Apple to use as a green screen. This is where our first $100 went. Next, we spent another $50 or so to light both the green screen and the talent in front of it. I’m currently working on sorting out the best solution for audio and video. (Stay tuned for updates!)

Fortunately, the lack of A/V equipment hasn’t prevented our staff from using the studio.  In fact, since the doors first opened in July, it has seen over 150 hours of use.  At this point, it is interesting to look at the patterns of usage that have emerged. Thus, the heat map, above.

To make the heat map, I added a “1” to each half-hour timeslot that the studio was reserved each week in an Excel spreadsheet. I then color-coded the data in the sheet with hotter colors reflecting higher numbers. The colors help to visualize trends in usage. For example, usage increases as the week goes on with Thursday and Friday afternoons appearing in oranges and reds. In contrast, there are times early on Monday and Tuesday that have never been reserved.
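If you would rather script this than do it by hand in Excel, the "add a 1 to each half-hour timeslot" step might look something like the Python sketch below. The reservations are invented for illustration and the function names are my own.

```python
from collections import Counter

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def to_minutes(hhmm):
    """Convert "HH:MM" to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def to_label(minutes):
    """Convert minutes after midnight back to "HH:MM"."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

def build_grid(reservations):
    """Add 1 to each (day, half-hour slot) cell a reservation covers."""
    grid = Counter()
    for day, start, end in reservations:
        for slot in range(to_minutes(start), to_minutes(end), 30):
            grid[(day, slot)] += 1
    return grid

# Made-up reservations: (day, start, end).
reservations = [
    ("Thu", "13:00", "14:30"),
    ("Thu", "14:00", "15:00"),
    ("Fri", "14:00", "15:00"),
]
grid = build_grid(reservations)

# Print a small text version of the grid for the early afternoon.
slots = range(to_minutes("13:00"), to_minutes("15:00"), 30)
print("     " + " ".join(to_label(s) for s in slots))
for day in DAYS:
    print(f"{day:<4} " + " ".join(f"{grid[(day, s)]:>5}" for s in slots))
```

Feeding in every week's reservations makes the counts accumulate, which is what gives the hotter cells their higher numbers.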

Studio usage heat map - by week

I also have a heat map that compresses all of the days into one, which I made by totaling the times for each half-hour block on the spreadsheet and then color-coding it. Again, it’s pretty easy to see the studio warm up as the day goes on, indicating increased usage.  Having a couple of regular evening reservations also contributes to this pattern.
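The totaling and color-coding can also be sketched in Python. The counts below are invented, and the three color buckets are an arbitrary choice of mine rather than the scale Excel applies.

```python
from collections import Counter

# Hypothetical per-day counts: (day, slot) -> number of weeks reserved.
by_day = {
    ("Mon", "09:00"): 0, ("Mon", "16:00"): 2,
    ("Thu", "09:00"): 1, ("Thu", "16:00"): 4,
    ("Fri", "09:00"): 1, ("Fri", "16:00"): 5,
}

def totals_by_slot(grid):
    """Collapse the days by summing the counts for each half-hour slot."""
    totals = Counter()
    for (_day, slot), count in grid.items():
        totals[slot] += count
    return totals

def heat_bucket(count, peak):
    """Map a count to a coarse color name, hotter for higher counts."""
    if count == 0:
        return "none"
    ratio = count / peak
    if ratio > 2 / 3:
        return "red"
    if ratio > 1 / 3:
        return "orange"
    return "yellow"

totals = totals_by_slot(by_day)
peak = max(totals.values())
for slot in sorted(totals):
    print(slot, totals[slot], heat_bucket(totals[slot], peak))
```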

Color coding numbers in a spreadsheet isn’t rocket science, but it is an easy way to visualize the data to quickly get a read on the studio. And, I can see that I’m going to have to start coming in earlier on Mondays if I want to use the studio.

Filed under Projects

America’s Secret Slang

If you haven’t seen it yet, America’s Secret Slang, which is produced by the History Channel, is worth checking out.  There are currently 9 episodes available, most of which are 44 minutes long.

I happened to catch this show one day when I was channel surfing and quickly got sucked in.  I haven’t seen all of the episodes, but I’ve been impressed by what I’ve seen.  Each episode takes on a general theme and then examines the origins of related slang (including idioms).  Most of the segments include a person-on-the-street interview asking native speakers if they use a slang term (spoiler: they do) and if they know its origin (they usually don’t, but they often try making one up).  The origin and explanation are then revealed in an interesting and visual way, including animated words and historical re-enactments.

I’ve linked to one episode, above, and the rest are available on the History Channel website and YouTube.  Be aware that the show is rated PG, so you may want to preview episodes before watching them in class or assigning them to your students.  Non-native speakers will appreciate being able to rewind and review the videos online.  They can also turn on captions if they find that helpful.  Overall, the shows are very well made, include a ton of information, and are interesting to native and non-native speakers alike.

Filed under Resources

Make a Google Form in 5 Minutes

I was once sitting in a meeting of the Gaming Special Interest Group at a CALICO Conference (I mention these details because this is a great group within a great organization, so check them out) when we got to the point in the agenda where we needed to collect the names and email addresses of everyone in the group.

Rather than passing around a pen and a pad of paper, I whipped up a Google Form on my iPad and passed that around instead. Not only was it so quick and easy that I had the form created and the information collected before the end of the 30-minute meeting, but I didn’t have to try to decipher anyone’s handwriting in order to get their email address.

The simplest Google Forms look like online surveys.  As the form is completed, the answers are uploaded to a Google Spreadsheet. And, like all of the different types of Google documents in Google Drive, the form and the spreadsheet can be made public, private, or unlisted and multiple collaborators can be given various levels of access from owning to editing to viewing.  Of course, private information entered into the form is still archived by Google.  If your institution, like mine, has protocols involving what information can and can’t be stored in the cloud, you may want to investigate those before using these tools.

If you’ve never created a Google Form, take a look at the above video for a 5-minute walkthrough.  Then open Google Drive, sign up for a free Google account (or sign in if you already have one) and create your form.  It’s easier than you think.

Filed under Resources

“Privacy”

Fingerprint (not mine – combination of this image and this image)

Maybe you’ve noticed that Facebook is separating its messenger application from its mobile application. “That’s strange,” you think, “I like things the way they are. They’re integrated, which works well. Why would they change that?” Good question. According to Facebook, there are lots of reasons that your new experience will be richer and better.

But, according to this article on the Huffington Post, users who download the Messenger app agree to terms of service that are “unprecedented and, quite frankly, frightening.” For example, by installing it, you agree that the Facebook Messenger app can:

  • call phone numbers and send text messages without your intervention
  • record audio, take pictures, and take video at any time without your confirmation
  • share data about your contacts
  • share your phone’s profile information including the phone number, device IDs, whether a call is active, and the remote number you are connected to
  • access a log of your incoming and outgoing calls, emails, and other communication

Some of these are a bit scary — recording me without my confirmation? who are you, the NSA? But maybe you’re not surprised that Facebook is collecting and sharing your information because users get the app for “free,” which basically means you pay for it by giving over your data. And anyone who agrees to those terms and conditions gets what they signed up for, right? Well what if something similar was happening on the World Wide Web? Spoiler alert: it is.

Think turning off cookies keeps websites from tracking you? Take a look at the Electronic Frontier Foundation’s Panopticlick. Even if you don’t let websites store cookies — small files that websites use to track you — on your machine, it’s likely that the combination of your operating system, browser version, browser plugins, time zone, screen size, fonts downloaded, and a few other configurations are as unique as a fingerprint. And websites recognize you by your device’s fingerprint every time you visit.
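To make the idea concrete, here is a small Python sketch of how a handful of configuration details could be boiled down to a single identifier. This is my own illustration, not how Panopticlick or any particular website works, and the attribute names and values are made up.

```python
import hashlib
import json

def fingerprint(config):
    """Hash a dictionary of configuration details into a short identifier.
    Sorting the keys makes the result independent of attribute order."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two made-up visitors whose setups differ only by one installed font.
visitor_a = {
    "os": "Windows 7",
    "browser": "Firefox 31",
    "time_zone": "UTC-5",
    "screen": "1366x768",
    "fonts": ["Arial", "Calibri", "Comic Sans MS"],
}
visitor_b = dict(visitor_a, fonts=["Arial", "Calibri"])

print(fingerprint(visitor_a))
print(fingerprint(visitor_b))
print(fingerprint(visitor_a) == fingerprint(visitor_b))
```

No cookie is stored anywhere in this sketch. The identifier is recomputed from the configuration on each visit, which is why clearing cookies does not change it.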

In fact, your browser history alone is another giveaway. Think about how links to sites you have visited are purple while links you haven’t are blue, then consider this thought experiment: If a website picked a handful of websites and linked to them on its webpage, it would learn about you when you visited based on your combination of blue and purple links. As the number of links grows, there would be a greater and greater chance that your specific combination would be unique. And, based on your combination of blue and purple, and the demographics of visitors to those sites, some information about you could be predicted. For example, if you have visited Martha Stewart’s website on your computer and I’ve visited Hot Rod Magazine’s website on mine, a website could predict a few ways in which we are different. And, again, the longer the list of links, the more accurate the prediction becomes.
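The thought experiment is easy to simulate. In the Python sketch below, each made-up visitor has visited a random subset of 40 placeholder sites, and we check how many visitors have a blue/purple pattern that nobody else shares as the list of probed links grows. The visitor count, the site names, and the 30% visit rate are all assumptions for illustration, and the sketch says nothing about what any real browser exposes.

```python
import random

def visited_pattern(history, probe_sites):
    """Return one True/False per probed site: True means 'purple' (visited)."""
    return tuple(site in history for site in probe_sites)

def share_unique(histories, probe_sites):
    """Fraction of visitors whose pattern is shared by no other visitor."""
    patterns = [visited_pattern(h, probe_sites) for h in histories]
    unique = sum(patterns.count(p) == 1 for p in patterns)
    return unique / len(patterns)

rng = random.Random(0)  # fixed seed so the run is repeatable
all_sites = [f"site{i}.example" for i in range(40)]

# 200 made-up visitors, each having visited roughly 30% of the sites.
histories = [{s for s in all_sites if rng.random() < 0.3}
             for _ in range(200)]

for n in (2, 5, 10, 20, 40):
    possible = 2 ** n  # number of distinct blue/purple combinations
    print(n, possible, share_unique(histories, all_sites[:n]))
```

With n links there are 2 to the power n possible combinations, so the number of patterns quickly outgrows the number of visitors, and the share of unique visitors can only stay the same or rise as links are added.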

All of this information isn’t intended to cause a panic, but rather to raise awareness. Before you bust out your tinfoil hat, consider other alternatives that are more likely to keep you safe online: Check your browser’s security settings, keep your operating system up to date, and look into antivirus and anti-malware tools. And, be aware that what you are doing online is likely trackable and traceable, so be thoughtful of where you go and what you do there. As a friend of mine recently observed in response to all of this, “It’s a scary world. But also a great one.” Be careful out there.

Filed under Resources

Build a $150 Studio

Our $100 studio gets $50 worth of lighting.

If you’ve been following along, you’ve already read about the $100 studio we built in an old office to record better audio and video resources for our students. We’ve recently installed $50 worth of lights to get the studio ready for video production.  Here’s what we used:

Item                            #   Cost     Total
4′ two-light shop light         2   $14.98   $29.96
8 1/2″ clamp light              2   $7.85    $15.70
CFL bulbs – daylight (2 pack)   1   $9.98    $9.98
Total:                                       $55.64

Again, we did come in a few dollars over our target of $50, but we’re in the neighborhood. Our list does not include bulbs for the shop lights (I brought in four bulbs from a twelve-pack I had in my garage) or the power strips we plugged the lights into because we scrounged those from around the office.
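If you want to adapt the shopping list to your own prices, the arithmetic is simple enough to script. This Python sketch uses the quantities and prices from the list above; the variable names are my own, and Decimal is used so the cents add up exactly.

```python
from decimal import Decimal

BUDGET = Decimal("50.00")  # our lighting target

# (item, quantity, unit price) from the list above.
items = [
    ("4' two-light shop light", 2, Decimal("14.98")),
    ('8 1/2" clamp light', 2, Decimal("7.85")),
    ("CFL bulbs, daylight (2 pack)", 1, Decimal("9.98")),
]

line_totals = {name: qty * price for name, qty, price in items}
grand_total = sum(line_totals.values())
over_budget = grand_total - BUDGET

for name, total in line_totals.items():
    print(f"{name:<30} ${total}")
print(f"{'Total':<30} ${grand_total}")
print(f"{'Over budget by':<30} ${over_budget}")
```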

The installation was relatively straightforward. We hung the shop lights as close to our green screen wall as possible in order to wash the wall with light evenly. An evenly lit green screen is easier to replace with another image or video in postproduction using iMovie or a similar application. We attached a paper baffle using magnets to try to keep the light from the shop lights from backlighting the subject. Green paper was not necessary, but it was readily available so we used it.

We hung the clamp lights from the ceiling at approximately a 45-degree angle from the subject. The goal is to light the subject from just above her eyes, which means these lights may be a little high, but the ceiling was an easy way to hang them and keep them out of the way. We used binder clips to attach parchment paper over the bulbs to diffuse the light, making it less harsh. In the photo, you can see that we have added a second light (for two on each side). We did this to make sure there was plenty of light on the subject. Although the CFL lightbulbs do warm up and become brighter after about five minutes, they still have to compete with all of the light reflecting off of the green screen. So, we added the second set of lights to be sure there was plenty of light, though these may not be absolutely necessary.

Each set of lights, left and right, is plugged into a power strip on the wall. None of the lights have switches, so the switch on the power strip becomes an easy way to turn them on and off without having to plug or unplug them. Finally, the last critical detail was to get “daylight” bulbs rated at 6500K. This is the best color temperature for most cameras. Fortunately, daylight bulbs were easy to acquire and not any more expensive than other temperatures (warm, cool, etc.).

So, for a few bucks at your local home improvement warehouse, you can find plenty of lights to outfit your studio on a budget. Our next step is to test a few camera / microphone combinations to see what will fit our budget and be quick and easy to use for anyone in our program who wants to make a video. Stay tuned.

Filed under Projects