Introducing “corpus”: a tool for finding language patterns

This post is available both as a podcast and a YouTube video — the video is found in the middle of the post. Do check it out! :)


In my writing class, you will have a chance to examine texts and analyse the language features used in your own writing, in a large set of texts from other students, or even in a body of expert texts. These are the examples you’ve been asking for to help you with writing or with understanding grammar use in texts. But I’m sure you’d ask: how are we going to process such a large number of texts to find the language patterns you want us to recognise?

Let me introduce a tool that saves us from painstaking manual searching, and that generates plenty of examples from real texts instead of made-up ones from textbooks, or from me.

The tool is called a “corpus”: an archive of texts, or a text database. Working like the index pages of a book, an online corpus lets you type in key words or phrases to examine how they’re used in texts produced by real people, such as students, researchers, news reporters, authors, and so on. When you run a search, the web interface shows you lines of text in which the key words or phrases appear; we call these lines “concordances”, and the corpus tool function that produces them is called a “concordancer”. This function lets you look at “KWIC”, or “key word in context”, meaning that you’re looking at how a key word or phrase is used together with the language features around it.
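If you are curious what a concordancer does behind the scenes, here is a minimal sketch in Python. It is not the code behind any particular corpus tool; the function name kwic, the width setting and the three sample sentences are all made up for illustration.

```python
import re

def kwic(text, phrase, width=30):
    """Return concordance lines: each hit with `width` characters of context on both sides."""
    # Match the phrase as whole words, ignoring upper/lower case.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the key word lines up in one column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text (an assumption, not taken from a real corpus).
sample = (
    "Previous research suggests that feedback helps. "
    "This research focuses on first-year students. "
    "More research is needed on peer review."
)

lines = kwic(sample, "research")
for line in lines:
    print(line)
```

Each printed line puts the key word in the middle with its context on either side, which is the layout you will see in a concordance display.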

For example, the corpus tool developed by the Department of English at PolyU draws on 760 journal articles from 38 disciplines, totalling over 6 million words. You can choose which discipline-specific databases, or “sub-corpora”, to search, based on your interests or needs. In my case, as I teach social sciences students, I would choose the sub-corpora of psychology and sociology, so that my searches are more refined.

Next, all you have to do is to key in the words or phrases you wish to look up in “Input Search Word/Phrase”, like the phrase “it is”, so you can see what usually follows or precedes this phrase. This kind of patterning of key words or phrases with other features is called “collocation” – how words are “located” near each other.

You can also add an additional word or phrase, such as “to” in my case, to examine which language features commonly appear between the phrase “it is” and the word “to”. You can also adjust how many letters, or “characters”, are shown in each concordance line; the default setting is fine for our current search. Then we can click “Search”. As the tool deals with a large database, it may take a bit of time to process, so please be patient.
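For those who like to see the logic spelled out, a two-part search can be sketched in Python as follows. This is only my guess at the general idea, not how the web tool is actually built: the names tokenize, find_pattern and max_gap are hypothetical, and I have used a different pair of search terms (“in” and “of”) with invented sentences so that the results of our own search stay a surprise.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def find_pattern(tokens, first, second, max_gap=2):
    """Find `first` followed by `second` with at most `max_gap` words in between.

    Returns the list of in-between words for every hit.
    Simplification: sentence boundaries are ignored.
    """
    first_words = first.lower().split()
    n = len(first_words)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != first_words:
            continue
        # Look at the words right after the first phrase.
        window = tokens[i + n:i + n + max_gap + 1]
        if second in window:
            hits.append(window[:window.index(second)])
    return hits

# Invented sample text (an assumption, not taken from a real corpus).
sample = (
    "We discuss this in terms of cost. "
    "In the context of teaching, this matters. "
    "The data were read in light of earlier work."
)

hits = find_pattern(tokenize(sample), "in", "of")
print(hits)
```

The words collected in hits are exactly the ones sitting between the two search terms, which is what we want to inspect in the concordance lines.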

When the concordance lines have loaded on the page, you can find 167 examples, or “instances”, of “it is” and “to”, with “to” commonly used shortly after “it is”. We can look at what comes between “it is” and “to”, but the lines seem a little disorganised. Now you can use the sort function, right above the concordance lines, to sort the lines according to the words on the left or right of the key word or phrase. I want to sort by what comes immediately after “it is”, so I’ll sort by the first word on the right, or “R1”.
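Sorting by R1 can be pictured with a small Python sketch. The concordance lines below are invented (they are not from the real corpus, and they use a different key word), and the helper name r1 is my own; the point is only to show what “sort by the first word on the right” means and how you might count the R1 words afterwards.

```python
from collections import Counter

# Hypothetical concordance hits: (left context, key word, right context).
hits = [
    ("Previous", "research", "suggests that feedback helps"),
    ("This", "research", "focuses on first-year students"),
    ("More", "research", "is needed on peer review"),
    ("Recent", "research", "suggests a different view"),
]

def r1(hit):
    """Return the first word to the right of the key word, lowercased."""
    right_words = hit[2].split()
    return right_words[0].lower() if right_words else ""

# Sort the lines alphabetically by their R1 word.
sorted_hits = sorted(hits, key=r1)
for left, key, right in sorted_hits:
    print(f"{left:>10} | {key} | {right}")

# Count how often each word appears in the R1 position.
r1_counts = Counter(r1(hit) for hit in hits)
print(r1_counts.most_common())
```

Once the lines are sorted, identical R1 words sit next to each other, so repeated patterns become easy to spot by eye.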

Here comes the question for you: out of these 167 instances, (i) what words are in the R1 position? You can give five examples; and (ii) out of these words in R1, what part of speech, or “grammatical class”, is the most common? Send me your answers via the form in the link I give you, and I’ll show you the answers!

So in this short podcast, we’ve covered a few concepts related to “corpus”, including the terms “collocation”, “concordancer”, “KWIC” and “sub-corpora”. I’ve also shown you how to operate the concordancer to find language patterns that commonly collocate with the key phrases I’ve tried. You can do the same when you want to confirm your assumptions about whether certain phrases are common in academic writing, what prepositions typically follow a particular word, and so on.

I’ll introduce another corpus tool next time, given how much information we’ve covered for now. Do go over this podcast again if there are certain details you want to revisit. See you next time!
