February 23, 2016

If I May Use Some of It

If you’ve ever surfed your way to the Oxford English Dictionary’s website, you’ll know that there are approximately 600,000 words in the English language. This represents quite a staggering number of ways to order pizza, shout at fellow motorists, or troll a celebrity on Twitter. Yet a December 2015 update to the OED online reveals that, even when disregarding archaic nouns and verbs, only a tiny fraction of this 600,000 is commonly used.

In some ways this is hardly a surprise, but the update’s introduction of a frequency rating nonetheless provides us with some interesting insights into English. It suggests not only that the vast bulk of the language is underutilized almost to the point of being dormant, but also that most of this Germanic tongue is composed of words restricted to specific domains, discourses and groups of people. In other words, the language is socially fragmented, since most of it is generally used and understood only within particular contexts.

Aside from being an interesting curio and helpful guide in its own right, this new update is therefore confirmation that we all face significant challenges when it comes to communicating with people outside our particular circles and comfort zones. It yields further indication that far too much of ‘public’ discourse unfolds within self-enclosed bubbles that are sometimes impenetrable and irrelevant to those on the outside. Conversely, it hints that everyday life might be impoverishing and emasculating itself through the constant recycling of the same limited set of words. Yet even more profoundly than these two issues, it also raises the age-old problem of just how to define a language, of how to establish that millions of people with divergent vocabularies speak the very same tongue.

From 1 to 8

As for the particular features and mechanics of this new update, it assigns each English word in the OED a frequency rating from 1 to 8, with 1 being the lowest and 8 being the highest. As the dictionary’s website explains, each grade represents a different frequency of appearance in “typical modern English usage.” For example, words with a rating of 8 occur more than 1,000 times for every 1,000,000 words ‘typically’ used by an English speaker, while band 1 “contains extremely rare words” that scarcely appear even in a sample of 1,000,000.

Such samples are derived from Google Books data, which incorporates over 8,000,000 books and close to half a trillion words of English. As extensive as this coverage undoubtedly is, it could be objected that most English is in fact spoken rather than written, and that spoken English is generally poorer than its written counterpart. It’s therefore equally arguable that the frequency rating doesn’t cover “typical modern English usage,” since it doesn’t incorporate anything of what we mutter to each other on a day-to-day basis. As a result, a more accurate description would be that it covers ‘written English usage,’ or perhaps even ‘atypical modern English usage,’ so long as we wanted to be contentious and highlight the often contrived makeup of so much writing.

But this ‘grapho-centrism’ of the OED isn’t one of the main revelations to be drawn from its new update. On the one hand, this is because such a partial focus on the written word could already be used to critique the dictionary, which claims to be the “definitive record of the English language,” as if this language were encompassed solely and entirely in the Shakespeare quotes it uses to exemplify words. On the other, it’s because the other lessons the update can teach us are more interesting and novel.

This becomes apparent with a single glance at the “Key to frequency” page, which informs us of what percentage of the English lexicon each frequency band covers:

Band	Frequency per million words	% of entries in OED
8	> 1,000	0.02%
7	100 – 999	0.18%
6	10 – 99	1%
5	1 – 9.9	4%
4	0.1 – 0.99	11%
3	0.01 – 0.099	20%
2	< 0.0099	45%
1	–	18%

What this table shows is that the most frequently used words in the English language — conjunctions like if, pronouns like I, and verbs like may and use — comprise only 0.02% of its total lexicon. Even when we move down to Band 7, which includes the nouns and adjectives that “form the substance of ordinary, everyday speech and writing,” we still find that we’re dealing with only 0.18% of non-obsolete words in the dictionary. Once again, even when we progress to the “wide range of descriptive vocabulary” included in Band 6, the explosions and headaches we find stretch over only 1% of the language.

This is all quite astonishing, but what’s equally striking is that the words in the less everyday brackets constitute the majority of English vocab. With Band 3 and its collection of words “not commonly found in general text types like novels and newspapers,” the language stores 20% of its vocabulary. Within Band 2 and its “technical terms from specialized discourses,” an incredible 45% of English is hidden away, in words like sesquipedalian and ambs-ace. Even in the “obscure technical” and “occasional historical” terms of Band 1 we’re confronted with 18% of the OED, an 18% that’s the lexical equivalent of that dark, dusty section of the library nobody ever dares enter.
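For anyone who wants to check the tallies, the percentages in the “Key to frequency” table can be summed by group. What follows is only a minimal sketch: the band shares are copied from the table above, while the names `BAND_SHARE` and `share_of` are invented for this illustration and belong to neither the OED nor any library. `Decimal` is used simply so that the small percentages add up exactly.

```python
from decimal import Decimal

# Percentage of OED entries per frequency band, copied from the table above.
BAND_SHARE = {
    8: Decimal("0.02"),
    7: Decimal("0.18"),
    6: Decimal("1"),
    5: Decimal("4"),
    4: Decimal("11"),
    3: Decimal("20"),
    2: Decimal("45"),
    1: Decimal("18"),
}


def share_of(bands):
    """Return the combined percentage of entries covered by the given bands."""
    return sum((BAND_SHARE[band] for band in bands), Decimal("0"))


# Hypothetical groupings, chosen to mirror how the essay carves up the bands.
groups = {
    "Bands 8-7 (the most common words)": [8, 7],
    "Bands 8-5 (everyday vocabulary, read generously)": [8, 7, 6, 5],
    "Bands 3-1 (rare and specialized words)": [3, 2, 1],
}

for label, bands in groups.items():
    print(f"{label}: {share_of(bands)}%")
```

On these figures the top two bands come to 0.2% of entries, the top four to 5.2%, and the bottom three to 83%. The eight published percentages sum to 99.2% rather than 100%, presumably because the dictionary rounds them, so every total here is approximate.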

Night of the Living Dead

But what’s more remarkable than the fact that large swathes of the English language are almost as lifeless as a ‘dead language’ like Latin, is the reminder that so much of it is understood only by small, select and specialized groups. As stated above for Bands 3, 2 and 1, most words in the language are “not commonly found in general text types,” are “technical terms from specialized discourses” and are “obscure technical” words. Put differently, they aren’t known to most English speakers, and because they embody 83% of English, they manifest how most of the language’s lexicon is kept alive, not by most of the people who speak it, but by a scattered range of experts.

Not only that, but their infrequency also implies that these experts are isolated from each other and from the general public. It implies that academics, scientists and writers aren’t doing enough in the way of communicating and mixing with the population at large, since otherwise the distribution of percentages would be less skewed towards their end of the frequency spectrum. Because they’re not introducing new terms into common parlance, common parlance is therefore remaining confined to only some 5% of the English language (Bands 8 through to 5).

This confinement has potentially grave ramifications for politics, culture, education, journalism, economics or any domain that depends on communication to survive. It threatens to hamper the attempts of laypeople and experts alike to exchange accurate information with each other, and correspondingly it stops us all from being as educated and knowledgeable as we might possibly need to be. Because we know only a slice of our own language, we’re in less of a position to hold the politicians, technocrats and bankers who use jargon to account, enabling them to run roughshod over us while we struggle to make sense of what they’re actually saying.

Given that democracies rest on an informed electorate making equally informed decisions during elections, this ignorance might obstruct us from participating in democratic life, thereby contributing to the corrosion of our political systems. Even if its implications aren’t quite so gloomy, it may at least stymie our attempts to understand each other and move from one sphere of activity to another. That’s why an egalitarian educational system should always remain one of our highest priorities, since there’s only so much ‘dumbing down’ or translation that can be done to render the terms of a politician into those of, say, a retail assistant.

Language as Life

This isn’t to say that the average retail assistant necessarily has a more reduced vocabulary than the average politician (knowing the soundbite-loving tendencies of today’s parliamentarians, it might even be the other way around), but it is to draw attention to the relatively minuscule section of the OED that spans “everyday speech and writing.” In some ways, this section should be minuscule, since by definition everyday English needs to consist in the ‘lowest common denominators’ of the language, that is, in terms we can all understand and use.

However, if we remember that language does not exist in a vacuum, but is rather an accessory or complement to the various objects, activities and events that make life what it is, then the poverty of everyday speech and writing points to a corresponding poverty of everyday life. It points to insular lives and lifestyles, lives that neither look outwards nor strive to evolve themselves. But even more than that, it conceivably alludes to a mental homogeneity or sterility, to an inability to conceive of one’s self and life in anything but the same basic terms, recycled over and over again.

This is complete speculation, however, and once again the OED’s data was taken exclusively from written material, so it can’t really be considered the most reliable indicator of how people talk and live. It should also be added that, even though everyday language has been interpreted here as making up, for example, only some 1% (Band 6) of the lexicon, this 1% would equal 6,000 of the 600,000 words the dictionary claims to document, minus the obsolete terms. Regardless of its scantiness in comparison to the remainder of the English language, this is still quite a wide sweep of vocabulary, so it’s being a little too harsh to assert that most modern-day speakers have sunk to the level of Orwell’s Newspeak, unable to express themselves with any nuance or complexity.
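The 6,000 figure is plain proportional arithmetic, and the same sum can be run for the other common bands. The sketch below assumes the essay’s round total of 600,000 entries (which, as just noted, still includes obsolete terms, so the results are upper-end estimates). The names `TOTAL_ENTRIES`, `SHARE_HUNDREDTHS` and `estimated_words` are hypothetical, made up for this illustration.

```python
# Round total quoted in the essay; an assumption, not an exact OED count.
TOTAL_ENTRIES = 600_000

# Band shares in hundredths of a percent (0.02% -> 2, 1% -> 100),
# so that the arithmetic stays in whole numbers.
SHARE_HUNDREDTHS = {8: 2, 7: 18, 6: 100, 5: 400}


def estimated_words(bands, total=TOTAL_ENTRIES):
    """Estimate how many entries fall within the given frequency bands."""
    hundredths = sum(SHARE_HUNDREDTHS[band] for band in bands)
    return total * hundredths // 10_000


for label, bands in [
    ("Band 6", [6]),
    ("Bands 8-7", [8, 7]),
    ("Bands 8-5", [8, 7, 6, 5]),
]:
    print(f"{label}: roughly {estimated_words(bands):,} words")
```

Under those assumptions Band 6 alone holds roughly 6,000 words, Bands 8 and 7 together about 1,200, and Bands 8 to 5 about 31,200.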

The OED data also suffers from a more severe limitation, in that it lumps together the entire period from 1970 to the present day, meaning that it doesn’t discriminate between different decades and years. This is a shame, since it would be fascinating to see how the distributions have changed up until the present day, and to see whether the word power of common discourse has expanded or contracted. It’s tempting to reason that because of mass culture, textspeak and social media sites like Twitter, the percentage of the English lexicon that covers everyday language has actually shrunk. Unfortunately, there isn’t much empirical evidence to settle this question either way, although a recent study did find that the number of words used in pop songs has decreased over the past ten years.

Languages and Lexicons

On a more general level, however, the OED’s new frequency data confronts us with an interesting question. Namely, its disclosure that around 83% of English words are not commonly used or known asks us to reconsider what defines a natural language. It poses us the question of what exactly a person needs to possess in order to be counted as speaking a language like English, since as the foregoing indicates, if it were strictly a matter of using and understanding most of a language’s vocabulary, very few people would actually speak English. As the dictionary’s frequency key affirms, ordinary speech and writing occupies anything from 0.2% (Bands 8 and 7) to 5.2% (Bands 8 to 5) of the lexicon, yet even with this surprisingly narrow range, very few of us are willing to contend that Tom, Dick and Sally do not speak the language. Because we’re not willing to discount them in this way, it would seem that we can’t possibly equate English with any one body of words. By an extension of the same logic, it would also seem that we can’t equate it with any one body of grammatical rules, since few of us perfectly uphold such rules in our speech.

Instead, I suggest that we could entertain one of at least two options. The first involves adopting the kind of theoretical approach Wittgenstein advocated in Philosophical Investigations, asserting that no language has a single ‘essence’ that we must all equally grasp in order to qualify as speakers of that language. On this view, the only thing qualification requires of us is that the speech we produce exhibits certain ‘family resemblances,’ traits which overlap in some but not necessarily all respects. These overlapping traits would include not just words but also the grammatical principles we use to structure them into utterances, enabling us to recognize something of what we say to each other without necessarily recognizing everything.

Alternatively, we could take a slightly modified approach, proposing that a language like English is, strictly speaking, something none of us ‘possess.’ On this reading, English as it’s described by grammarians and lexicographers is a regulative construct, an abstraction they propound so as to ensure that our utterances stay within practical and actionable bounds. As such, to say that we are ‘speaking’ English is simply to say that our speech is subject and accountable to a particular grammar, that it can be regulated against this grammar if we fail to make sense to our interlocutors. Of course, we still require some kind of Wittgensteinian theory of family resemblances to determine which particular rule our ramblings should be checked against, but this still doesn’t mean we have to master a definite set of words or a definite set of grammatical principles to determine such accountability.

Nescience and Benightedness

Which, perhaps, should be some relief if you’re liable to worry about your language skills, especially in light of the OED’s announcement that most of us know only a fraction of the English language we claim to speak. With its new frequency feature, our ability to confirm our own nescience (Band 4) has been expanded even further. Yet at the same time, it offers us the opportunity to correct this benightedness (Band 3), to rediscover the richness of the English language as it exists in dictionaries and books, and perhaps to move closer to recreating it in our speech. At the very least, it will permit the pseuds among us to confirm that the word we’re about to use in closing an article is as inenubilable (Band 1) as possible.
