Tag Archives: linguistics

The List of Lists

dictionaries

I’ve been tinkering with AntConc, Laurence Anthony’s free concordancer, which has led me down a bit of a rabbit hole of lists generated by corpus linguists over the past 60 years.  I’ve listed a few that I’ve used, sometimes within AntConc, to analyze students’ writing.  If you’ve taught students to investigate their linguistic hunches via the Corpus of Contemporary American English (COCA), you might also consider teaching them to load their own writing into a tool like AntConc and analyze it as well.  By using the lists below as a blacklist (do not show) or a whitelist (show only these), students can home in on a more specific part of their vocabulary.  Most of these lists are available for download, which means you can be up and running with your own analysis very quickly.
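The blacklist and whitelist idea is easy to try outside of a concordancer, too.  Here is a minimal Python sketch of it.  It is not how AntConc works internally; the function names and the two tiny word lists are invented for illustration, and a real run would load one of the downloadable lists instead.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def filter_tokens(tokens, word_list, mode="blacklist"):
    """Hide listed words (blacklist) or show only listed words (whitelist)."""
    listed = set(word_list)
    if mode == "blacklist":
        return [t for t in tokens if t not in listed]
    if mode == "whitelist":
        return [t for t in tokens if t in listed]
    raise ValueError("mode must be 'blacklist' or 'whitelist'")

# Hypothetical stand-ins for a general list and an academic list
general_words = ["the", "of", "a", "we", "in"]
academic_words = ["analyze", "data", "method"]

tokens = tokenize("We analyze the data in a method of analysis.")
print(filter_tokens(tokens, general_words, "blacklist"))
print(filter_tokens(tokens, academic_words, "whitelist"))
```

With a general list as the blacklist, what remains is the less common vocabulary; with an academic list as the whitelist, only the academic words in the student's text are shown.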

The lists (in chronological order):

General Service List (GSL) – developed by Michael West in 1953; based on a 2.5 million word corpus.  (Can you imagine doing corpus linguistics in 1953?  Much of it must have been by hand, which is mind boggling.)  Despite criticism that it is out of date (words such as plastic and television are not included, for example), this pioneering list still provides about 80% coverage of English.

Academic Word List (AWL) – developed by Averil Coxhead in 2000; 570 words (word families) selected from a purpose-built academic corpus with the 2000 most frequent GSL words removed; organized into 9 lists of 60 and one of 30, sorted by frequency.  Scores of textbooks have been written based on these lists, and for good reason.  In fact, we have found that students are so familiar with these materials, they test disproportionately highly on these words versus other advanced vocabulary.

Academic Vocabulary List (AVL) – the 3000 most frequent words in the 120 million words in the academic portion of the 440 million word Corpus of Contemporary American English (COCA). This word list includes groupings by word families, definitions, and an online interface for browsing or uploading texts to be analyzed according to the list.

New General Service List (NGSL) – developed by Charles Browne, Brent Culligan, and Joseph Phillips in 2013; based on the two-billion-word Cambridge English Corpus (CEC); 2368 words that cover 90.34% of the CEC.

New Academic Word List (NAWL) – based on three components: the CEC Academic Corpus; two oral corpora, the Michigan Corpus of Academic Spoken English (MICASE) and the British Academic Spoken English (BASE) corpus; and a corpus of published textbooks, for a total of 288 million words. The NAWL is to the NGSL what the AWL is to the GSL in that it contains the 964 most frequent words in the academic corpus after the NGSL words have been removed.
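Two ideas recur in the descriptions above: coverage (the share of running words in a text that a list accounts for) and building a second list from whatever is most frequent once a general list is removed.  The sketch below illustrates both in Python on a toy sentence.  It works on raw word forms, whereas the published lists group words into families or lemmas, and the function names and sample data are made up for the example.

```python
from collections import Counter

def coverage(tokens, word_list):
    """Return the share of running words (tokens) that appear in word_list."""
    listed = set(word_list)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in listed) / len(tokens)

def residual_list(tokens, general_list, size):
    """Most frequent words left after removing the general list."""
    general = set(general_list)
    counts = Counter(t for t in tokens if t not in general)
    return [word for word, _ in counts.most_common(size)]

# Toy data standing in for a corpus and a general service list
tokens = "the study of the data shows the data support the theory".split()
general = ["the", "of", "shows", "support"]

print(coverage(tokens, general))          # fraction of tokens covered
print(residual_list(tokens, general, 1))  # top word outside the general list
```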

Filed under Resources

The Largest Vocabulary in Hip Hop

Turntable: “technics sl-1200 mk2” by Rick Harrison / Flickr

I spent much of my youth listening to hip hop, or, as it was called back then, rap music.  This was long before MP3 players and long before you could Google your favorite song lyrics.  It was also long before I knew anything about textual analysis, let alone before I thought about using unique words per n words as a measure of variety in vocabulary.

So, when Matt Daniels published this piece called The Largest Vocabulary in Hip Hop last month, it was both a flash back to the music of my youth and a flash forward to some of my current interests in corpus linguistics.

Daniels does a very nice analysis, so I won’t repeat much of it here.  Just follow the link and scroll down to see the details.  Be aware that some of the analysis incorporates a bit of slang that may not make it completely kid friendly.

Most noteworthy in the analysis are the two baselines of comparison:  Shakespeare (5,170 unique words per 35,000 words) and Herman Melville (6,022 unique words in the first 35,000 words of Moby Dick).  Of the 85 rappers analyzed, 16 use a wider vocabulary than Shakespeare and 3 are above Melville.  So, if you ever thought all hip hop was a simplistic art form, you may want to take another look.  It’s amazing what an analysis of the data can show us.
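The measure itself, unique words within the first n words, takes only a few lines to compute.  This is a rough Python sketch, not Daniels’s actual method: his tokenization and counting rules are not described here, so the simple lowercase word pattern below is an assumption, as is the function name.

```python
import re

def unique_words_in_first_n(text, n=35000):
    """Count distinct word forms among the first n word tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens[:n]))

sample = "the cat sat on the mat and the dog sat too"
print(unique_words_in_first_n(sample, n=6))
```

Fixing n matters because longer texts naturally contain more distinct words, so comparing authors only makes sense over the same number of words.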

Filed under Inspiration

Arst Arsw: Star Wars in Alphabetical Order

Baby Darth: Father’s Day by Artiee / Flickr

A friend recently lent me the book Uncharted: Big Data as a Lens on Human Culture, which discusses the development of the Google N-Gram Corpus.  After scanning millions of books, Google could not simply make them all freely available because this would essentially be republishing copyrighted works.  Instead, Google has made them all searchable by N-Grams (one-, two-, three-word phrases and so on up to n-words) which protects the copyrighted works because they are really only viewable in aggregate.  The corpus is, of course, limited in that it only includes books (as opposed to also including magazines, newspapers, oral texts, etc.), but given that it goes back hundreds of years, the size and the scope of the corpus are pretty amazing.
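If n-grams are new to you, a small example may help.  The Python sketch below only illustrates the concept on a six-word phrase; it says nothing about how Google builds or stores its corpus, and the function name is my own.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Counting the tuples rather than storing the running text is what makes the data viewable only in aggregate: the counts say how often a phrase occurs, but not where.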

Early on in Uncharted, a book called Legendary Lexical Loquacious Love, a concordance of a romance novel, is affectionately described as a conceptual art piece that helped to inspire the N-Gram Corpus.  In Love, every word from a romance novel is presented in alphabetical order.  So, a word like a, which appears again and again throughout the original source novel, is repeated scores of times in a row.  The authors talk about how different the experience of reading a concordance of a romance novel is from reading the original romance novel, but how the former is compelling in its own way.  For example, they offer the following quote:

beautiful beautiful beautiful beautiful beautiful beautiful beautiful
beautiful beautiful beautiful beautiful beautiful beautiful beautiful
beautiful beautiful beautiful,  beautiful, beautiful, beautiful, beautiful,
beautiful, beautiful, beautiful,” beautiful. beautiful. beautiful.”
beautiful… beautiful…

These 29 occurrences of the word beautiful are, presumably, spread throughout the original novel.  But seeing them juxtaposed with other words that begin with b (and with the scores of occurrences of the word a) gives you a different perspective on a romance novel.
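You can get a feel for this kind of alphabetized text with a few lines of Python.  This is only a guess at the general idea, since how the book was actually produced is not described here; the function name and the sample passage are invented.  Punctuation is left attached to each word, as it is in the quote above.

```python
def alphabetize(text):
    """Return every whitespace-separated word of text in alphabetical order."""
    words = text.split()
    # Sort case-insensitively; punctuation stays attached to each word
    return sorted(words, key=str.lower)

passage = "She was beautiful. A beautiful day, a beautiful night."
print(" ".join(alphabetize(passage)))
```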

What does this have to do with Star Wars?  Great question.  While reading Uncharted, I came across the following YouTube video:

Created by Tom Murphy, the video is “meant to be provocative in its uselessness.”  It took 42 hours to produce the 43-minute video, which is oddly compelling to watch.  In addition to the video, a small data bar at the bottom graphs the frequency of each word, which is also tallied onscreen throughout the video.  It’s a different experience, much like reading a concordance is different from reading the original source text.  For example, the famous scene in which Obi-Wan uses a Jedi mind trick on a couple of Stormtroopers appears in the original movie as follows:

Stormtrooper: Let me see your identification.
Obi-Wan: [with a small wave of his hand] You don’t need to see his identification.
Stormtrooper: We don’t need to see his identification.
Obi-Wan: These aren’t the droids you’re looking for.
Stormtrooper: These aren’t the droids we’re looking for.

(Source: imdb.com)

In Arst Arsw, this interaction is best summarized by the three occurrences of the word identification, which are the only three times that this word appears in the film.  Identification appears at 16:08 of the video.  There are many other interesting moments, particularly when different voices utter the same word several times (for example, leader by several rebel pilots) or when only one character uses the same word several times (for example, kid by Han Solo).  For me, longer words are generally more interesting because they take longer to say, whereas the shorter words can fly by so quickly that they can be hard to comprehend.  One exception, however, is the word know, all 32 occurrences of which fly by in under 5 seconds.  But because the 26th know is so emphatic, it stands out against the rest.

I’m not sure if there are any other video concordances out there, but if there are, I would love to see them.  Especially if the original source material is as compelling as the original Star Wars.

Filed under Inspiration

Building a New Language

throne

I’ve somehow managed to avoid the pop cultural phenomenon that is Game of Thrones. I’m aware that it exists, and that it’s adapted from a series of fantasy novels, but I’ve never seen an episode.  An awareness of the show is hard to avoid.  For example, one of my favorite podcasts, Nerdist, hosted by Chris Hardwick, references it all the time.  I bring this up because one of the recent guests on the podcast was David J. Peterson, a linguist who created Dothraki, the language that is used by characters in Game of Thrones.  (Actually, as Peterson explains, George R. R. Martin, the author of the novels, invented the language and then Peterson had to flesh it out further, develop the phonology, etc.)

So, if you’re interested in linguistics and Game of Thrones (or either of these things) you will probably enjoy Nerdist episode #502, in which Peterson goes into depth on creating Dothraki and several other topics.  Please note, as often happens on the Nerdist, the hosts and guests occasionally drop an F-bomb or two out of enthusiasm, which means that the entire episode may not be appropriate for younger audiences.  Enjoy your burrito!

Filed under Inspiration

How Do You Spell Success?

Statue of Rocky in Philadelphia, his arms raised in triumph.

To find the prescriptive answer to this question, look in a dictionary.  To find the descriptive answer to this question, look in a corpus.

In ESL Programs at Ohio State, I have been working towards building a couple of corpora of learner language not only for our own analysis, but also for researchers around the world to access.  Our plan is to include the English placement compositions that all international students write when they arrive on campus in the first corpus and the Intensive ESL Program (IEP) students’ placement and end-of-term compositions in the second.  Because almost all of these compositions are now written on computers instead of paper, it is relatively easy to take the next step and format them for analysis by corpus tools.

Both corpora should be interesting.  The former could grow by more than a thousand compositions per year as international students are admitted to Ohio State in ever increasing numbers.  Because these students have met the English proficiency requirements to be admitted, their level of proficiency is relatively high.  The latter will include fewer students, but will include longitudinal data because each student will write multiple compositions as they progress through the program.

As I was scoring some of the recent end-of-semester IEP compositions, and encountering the usual and frequent errors in our lowest-level students’ writing, I began thinking about how our students’ creative spelling would affect, and possibly inhibit, searches of this corpus.  For example, how can you search for past tense verbs when so many of them are misspelled?  Then it occurred to me that these misspellings could themselves be quite interesting.  So, to answer the question posed in the title of this post, here are some of the ways our students spell success (and its cognates), listed in order of frequency:

successful, success, succeed, sucessful, successfull, succesful, secessful, succes, succed, sucssed, successfully, succeful, seccsessful, suessful, suecess, suceessful, succsful, succsess, successul, successufl, successfufl, successeful, succeshul, succefull, succeess, succees, succeeded, succeccful, secuessful, secssed, seccssful, seccessful, scuccess, sccesful.

We are currently working on securing IRB (Institutional Review Board) approval for this project, after which we will be able to share the data and results more publicly.  As part of our IRB application, we are alpha testing our procedures and this question about the spelling of success became an interesting test case.  To create this list, I took a set of student compositions and fed them through AntConc, a free concordancer written by Laurence Anthony.  In addition to the frequency of words, lots of other interesting queries are possible with this application and others.
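For readers who would rather script this kind of query, here is one possible approach in Python.  To be clear, this is not what AntConc does and not how the list above was produced: it is a sketch that counts every word form whose spelling is close to a target form, using the standard library’s difflib.  The target forms, the 0.75 cutoff, and the two sample sentences are all assumptions made for the example, and a cutoff like this will miss some of the more creative spellings while occasionally catching unrelated words.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

TARGETS = ["success", "succeed", "successful"]  # hypothetical target forms

def looks_like_target(word, cutoff=0.75):
    """True if word is close in spelling to any target form."""
    return any(SequenceMatcher(None, word, t).ratio() >= cutoff for t in TARGETS)

def variant_frequencies(compositions):
    """Count each word form that resembles a target, most frequent first."""
    counts = Counter()
    for text in compositions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if looks_like_target(word):
                counts[word] += 1
    return counts.most_common()

# Invented sample sentences, not real student data
compositions = [
    "I want to be successful. Success is important.",
    "To be sucessful you must work. I will be successful.",
]
print(variant_frequencies(compositions))
```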

All of the compositions will be coded with the demographic information we have for each student (age, gender, country of origin, first language, major or degree program) as well as information about each composition (score, topic, date).  By sorting for whatever factor is interesting, we’ll be able to make any comparison we like.  Want to see what the compositions above and below a certain score look like?  No problem.  Want to see how Chinese speakers compare to Arabic speakers?  Male to female?  Grad to undergrad?  We will be able to do it.
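As a rough picture of what those comparisons might look like in code, here is a small Python sketch.  Everything in it is hypothetical: the field names, the score scale, and the three sample compositions are invented, and our actual coding scheme may end up looking quite different.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Composition:
    text: str
    first_language: str  # hypothetical metadata field
    score: float         # hypothetical score scale

def mean_length_by(compositions, key):
    """Average word count of compositions, grouped by a metadata field."""
    groups = {}
    for c in compositions:
        groups.setdefault(getattr(c, key), []).append(len(c.text.split()))
    return {group: mean(lengths) for group, lengths in groups.items()}

# Invented sample records, not real student data
corpus = [
    Composition("I like my school very much", "Chinese", 3.0),
    Composition("My city is big", "Arabic", 2.5),
    Composition("Education is important for success", "Arabic", 4.0),
]

print(mean_length_by(corpus, "first_language"))
high = [c for c in corpus if c.score >= 3.0]  # compositions at or above a score
print(len(high))
```

Because each composition carries its own metadata, any field can serve as the grouping or filtering key without restructuring the corpus.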

We’re looking forward to bringing this Big Data approach to our programs.  Not only will this data inform our curriculum, but it will also become a useful resource for researchers across our campus and around the world.

Filed under Projects

Edupunk Eye-Tracking = DIY Research

One of my favorite presentations at the 2011 Ohio University CALL Conference was made by Jeff Kuhn, who presented a small research study he’d done using an eye-tracking device that he put together himself.

If you’re not familiar with eye-tracking, it’s a technology that records what a person is looking at and for how long.  In the example video below, which uses the technology to examine the use of a website, the path that the eyes take is represented by a line.  A circle represents each time the eye pauses, with larger circles indicating longer pauses.  This information can be viewed as a session map of all of the circles (0:45) and as a heat map of the areas of concentration (1:15).

This second video shows how this technology can be used in an academic context to study reading.  Notice how the reader’s eyes do not move smoothly and that the pauses occur for different lengths of time.

Jeff’s study examined the noticing of errors.  He tracked the eyes of four ESL students as they read passages with errors and found that they spent an extra 500 milliseconds on errors that they noticed.  (Some learners are not ready to notice some errors.  The participants in the study did not pause on those errors.)

The study was interesting, but the hardware Jeff built to do the study was completely captivating to me.  He started by removing the infrared filter from a web cam and mounting it to a bike helmet using a piece of scrap metal, some rubber bands and zip ties.  Then he made a couple of infrared LED arrays to shine infrared light towards the eyes being tracked.  As that light is reflected by the eyes, it is picked up by the webcam, and translated into data by the free, open-source Ogama Gaze Tracker.

So, instead of acquiring access to a specialized eye-tracking station costing thousands of dollars, Jeff has built a similar device for a little over a hundred bucks, most of which went to the infrared LED arrays.  With a handful of these devices deployed, almost anyone could gather a large volume of eye-tracking data quickly and cheaply.

Incidentally, if you are thinking that there are a few similarities between this project and the Wii-based interactive whiteboard, a personal favorite, there are several: Both cut the price of hardware by a factor of at least ten and probably closer to one hundred, both use free open-source software, both use infrared LEDs (though this point is mostly a coincidence), both have ties to gaming (the interactive whiteboard is based on a Nintendo controller; eye-tracking software is being used and refined by gamers to select targets in first-person shooters), and both are excellent examples of the ethos of edupunk, which embraces a DIY approach to education.

Do you know of other interesting edupunk projects?  Leave a comment.

Filed under Inspiration

Blank or Blank: a Concordancer Game

This is a 10-minute demo of a web-based game I’ve been thinking about.  At its heart, it is a concordancer, but the game is also a repeatable, user-directed tool that could be used to study many interesting linguistic structures.  It could be used in any language and in other, non-linguistic disciplines.  I’ve also incorporated crowdsourcing and social networking to make it more useful and more fun.  And it’s so simple, it just might work.

Don’t believe me?  Too good to be true?  Perhaps.  Watch the demo and decide for yourself.  Then, share your reaction in the comments.

Filed under Projects