Tag Archives: big data

Raw. What is it good for?

students vs teachers-1 cropped

When I first came across Raw, a free, online data visualization tool, I channeled my inner Edwin Starr and asked, “What is it good for?”  It turns out the answer is “absolutely everything.”  Or pretty close to it.

Raw is extremely user friendly.  It’s built on D3.js, which is pretty powerful.  If you, like me, haven’t had time to explore D3 in depth (or if, also like me, you’re not sure you have the skills to take it on), Raw greatly simplifies the process.  And all of the data is processed in your browser, which means your data is never copied to or stored on Raw’s servers.

So, what can Raw do for you?  Well, take your favorite data set and paste it into the text box (or choose one of the four example data sets provided).  Then choose one of the 15 chart types and drag components of your data onto the axes or other options for the chart type you have chosen.  You can repeat this as many times as you like to try your data in different configurations.  Finally, customize your visualization by adjusting its size, scale, and colors before choosing how you want to export your results.  It’s amazingly easy!

I created the visualization at the top of this post by feeding in some data on teachers (left) and students (right).  The lines connecting them represent classes that the students had with each teacher with thin lines for one semester and thick ones for the next.  I wanted to explore how students move through our program.  Here, it’s easy to see that most students move up from one level to the next, but there are some that skip levels and some that repeat levels.  The students and teachers are not arranged in order from lowest to highest level, though this would be possible and might make it easier to see these trends.

There are lots of other options within Raw and, depending on what your data include, some may be more useful than others.  But the beauty of Raw is that you are only a couple of clicks away from any of them, making it very easy to try several visualizations until you find one you like.


Filed under Resources

The Largest Vocabulary in Hip Hop

Turntable: “technics sl-1200 mk2” by Rick Harrison / Flickr

I spent much of my youth listening to hip hop, or, as it was called back then, rap music.  This was long before MP3 players and long before you could Google your favorite song lyrics.  It was also long before I knew anything about textual analysis, let alone before I thought about using unique words per n words as a measure of variety in vocabulary.

So, when Matt Daniels published a piece called The Largest Vocabulary in Hip Hop last month, it was both a flashback to the music of my youth and a flash forward to some of my current interests in corpus linguistics.

Daniels does a very nice analysis, so I won’t repeat much of it here.  Just follow the link and scroll down to see the details.  Be aware that some of the analysis incorporates a bit of slang that may not make it completely kid friendly.

Most noteworthy in the analysis are the two baselines of comparison:  Shakespeare (5,170 unique words per 35,000 words) and Herman Melville (6,022 unique words in the first 35,000 words of Moby Dick).  Of the 85 rappers analyzed, 16 use a wider vocabulary than Shakespeare and 3 are above Melville.  So, if you ever thought all hip hop was a simplistic art form, you may want to take another look.  It’s amazing what an analysis of the data can show us.
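For anyone curious how a "unique words per n words" figure can be computed, here is a minimal sketch in Python.  It is not Daniels's actual code; the function name and the very simple tokenizer (lowercased runs of letters and apostrophes) are my own assumptions, and a different tokenizer would give different counts.

```python
import re

def unique_words_in_first_n(text, n=35_000):
    """Return (unique word count, window size) for the first n tokens of text."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Only the first n tokens count, so long and short texts are comparable.
    window = tokens[:n]
    return len(set(window)), len(window)

# Toy example with a made-up sentence and a 10-word window.
sample = "the cat sat on the mat and the dog sat on the log"
unique, total = unique_words_in_first_n(sample, n=10)
print(f"{unique} unique words in the first {total} words")
```

Fixing the window size matters because longer texts naturally contain more distinct words, so raw vocabulary counts from texts of different lengths can't be compared directly.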


Filed under Inspiration

How Do You Spell Success?

Statue of Rocky in Philadelphia, his arms raised in triumph.

To find the prescriptive answer to this question, look in a dictionary.  To find the descriptive answer to this question, look in a corpus.

In ESL Programs at Ohio State, I have been working towards building a couple of corpora of learner language not only for our own analysis, but also for researchers around the world to access.  Our plan is to include the English placement compositions that all international students write when they arrive on campus in the first corpus and the Intensive ESL Program (IEP) students’ placement and end-of-term compositions in the second.  Because almost all of these compositions are now written on computers instead of paper, it is relatively easy to take the next step and format them for analysis by corpus tools.

Both corpora should be interesting.  The former could grow by more than a thousand compositions per year as international students are admitted to Ohio State in ever-increasing numbers.  Because these students have met the English proficiency requirements to be admitted, their level of proficiency is relatively high.  The latter will include fewer students, but will include longitudinal data because each student will write multiple compositions as they progress through the program.

As I was scoring some of the recent end-of-semester IEP compositions, and encountering the usual and frequent errors in our lowest-level students’ writing, I began thinking about how our students’ creative spelling would affect, and possibly inhibit, searches of this corpus.  For example, how can you search for past tense verbs when so many of them are misspelled?  Then it occurred to me that these misspellings could themselves be quite interesting.  So, to answer the question posed in the title of this post, here are some of the ways our students spell success (and its cognates), listed in order of frequency:

successful, success, succeed, sucessful, successfull, succesful, secessful, succes, succed, sucssed, successfully, succeful, seccsessful, suessful, suecess, suceessful, succsful, succsess, successul, successufl, successfufl, successeful, succeshul, succefull, succeess, succees, succeeded, succeccful, secuessful, secssed, seccssful, seccessful, scuccess, sccesful.

We are currently working on securing IRB (Institutional Review Board) approval for this project, after which we will be able to share the data and results more publicly.  As part of our IRB application, we are alpha testing our procedures and this question about the spelling of success became an interesting test case.  To create this list, I took a set of student compositions and fed them through AntConc, a free concordancer written by Laurence Anthony.  In addition to the frequency of words, lots of other interesting queries are possible with this application and others.
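If you would rather script this kind of query than use a concordancer, here is a rough sketch in Python.  To be clear, this is not how AntConc works and not how I built the list above; it simply counts every word in a set of texts and keeps those that look similar to a few reference spellings.  The function name, the reference words, and the 0.75 similarity threshold are all assumptions, and a threshold like this will miss the most creative spellings while occasionally catching unrelated words.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Reference spellings to compare against (an assumed, incomplete set).
TARGETS = ("success", "succeed", "successful")

def variant_counts(texts, threshold=0.75):
    """Count words whose similarity to any target spelling meets the threshold."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    variants = {}
    for word, freq in counts.items():
        # Similarity ratio between 0 and 1; take the best match over all targets.
        best = max(SequenceMatcher(None, word, t).ratio() for t in TARGETS)
        if best >= threshold:
            variants[word] = freq
    # Most frequent first, ties broken alphabetically.
    return sorted(variants.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up sentences, not real student data.
texts = [
    "I want to be sucessful",
    "She is sucessful and has success",
    "They succed in school",
]
for word, freq in variant_counts(texts):
    print(word, freq)
```

Lowering the threshold catches more unusual spellings at the cost of more false matches, so the output still needs to be checked by eye.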

All of the compositions will be coded with the demographic information we have for each student (age, gender, country of origin, first language, major or degree program) as well as information about each composition (score, topic, date).  By sorting for whatever factor is interesting, we’ll be able to make any comparison we like.  Want to see what the compositions above and below a certain score look like?  No problem.  Want to see how Chinese speakers compare to Arabic speakers?  Male to female?  Grad to undergrad?  We will be able to do it.
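As a sketch of what those comparisons might look like once the compositions are coded, here is a small Python example.  The record layout, field names, and sample rows are all hypothetical (our actual coding scheme isn't finalized), and average length in words is just a stand-in for whatever measure is of interest.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Composition:
    # Hypothetical fields; a real record would carry more metadata.
    text: str
    score: float
    first_language: str
    level: str  # e.g. "grad" or "undergrad"

def split_by_score(compositions, cutoff):
    """Separate compositions scoring at or above the cutoff from those below it."""
    above = [c for c in compositions if c.score >= cutoff]
    below = [c for c in compositions if c.score < cutoff]
    return above, below

def mean_length_by(compositions, key):
    """Average composition length in words, grouped by a metadata field."""
    groups = {}
    for c in compositions:
        groups.setdefault(getattr(c, key), []).append(len(c.text.split()))
    return {group: mean(lengths) for group, lengths in groups.items()}

# Made-up records for illustration only.
corpus = [
    Composition("I study English every day", 72, "Chinese", "undergrad"),
    Composition("My family is big", 55, "Arabic", "grad"),
    Composition("I like Columbus", 80, "Arabic", "undergrad"),
]
above, below = split_by_score(corpus, cutoff=60)
print(len(above), len(below))
print(mean_length_by(corpus, "first_language"))
```

Passing a different field name, such as "level", to mean_length_by gives the grad versus undergrad comparison without any other changes.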

We’re looking forward to bringing this Big Data approach to our programs.  Not only will this data inform our curriculum, but it will also become a useful resource for researchers across our campus and around the world.


Filed under Projects