The Great Vowel Shift, or why English vowel spellings confuse the world

Why is there such a mismatch between the sounds represented by the vowel letters in English and those in virtually every other language that uses the Latin alphabet? For instance, “oo” makes the /uː/ sound that rightfully belongs to the letter U, “ee” and “ea” make the /iː/ sound that is normally written I, and “a” can be either the expected /ɑ/ or the unusual /eɪ/.

Contrast this with a language like Spanish, where the symbols A E I O U represent the sounds “ah” “eh” “ee” “oh” “oo,” with the letters and their values inherited from Latin. Other European languages have more complex vowel inventories, but at least the most fundamental values of the vowel letters — sometimes called the “continental” values — tend to be preserved.

Why don’t we spell “eye” ai, “food” fuud, or “treaty” triti? The explanation depends on something called the Great Vowel Shift.

Introduced by Danish linguist Otto Jespersen in the early 20th century, the GVS refers to the wholesale restructuring of the long vowels that occurred roughly 1300-1700, between the times of Chaucer and Shakespeare. This transformation has often been considered the defining boundary between Middle English and Modern English.

Jespersen noted that the long vowels of Middle English underwent a series of changes that did not seem at all arbitrary or unconnected. He produced the following diagram to describe the pattern:

Jespersen’s diagram (1909)

The series on the left shows what happened to the front long vowels: each vowel raised, taking the place of the next one in the series. The highest vowel /iː/, with no further room to raise (if the tongue went any higher, it would hit the palate), was instead “pushed out” and formed the diphthong /ei/, which today is the long I /aɪ/. This hypothesized process is known as a chain shift: one sound change triggers a restructuring of the entire system, because speakers tend to avoid mergers that compromise intelligibility.

Similarly, a parallel set of changes occurred in the back long vowels: Each vowel raised one step, until the highest one /u:/ was “pushed out” and formed the diphthong /ou/.

While the sounds changed, our spellings are still based on the pre-shift system. You can get a better sense of how the spelling of long vowels worked in Middle English from this video of Chaucer’s Canterbury Tales read aloud.

We can also summarize the shift using a table of the Middle English long vowels and their present-day equivalents (adapted from Wikipedia):

Word | Originally sounded like | Middle English vowel | Modern vowel
bite | “beet” | /iː/ | /aɪ/
meet | “mate” | /eː/ | /iː/
meat | “met” | /ɛː/ | /iː/
mate | “maht” | /ɑː/ | /eɪ/
out | “oot” | /uː/ | /aʊ/
boot | “bote” | /oː/ | /uː/
boat | “bot” | /ɔː/ | /oʊ/

(Interestingly, the contemporary “ah” sound, as in “father,” is a post-shift innovation that filled the empty space left behind when the low vowel raised.)
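The correspondences in the table can be sketched as a small lookup. This is only an illustration of the summary pattern (IPA values adapted from Wikipedia’s table), not a phonological model of how the shift actually unfolded:

```python
# The Great Vowel Shift summarized as a lookup table mapping
# Middle English long vowels (IPA) to their Modern English reflexes.
# Illustrative only: the real changes were gradual and dialect-specific.
GVS = {
    "iː": "aɪ",  # bite: the highest front vowel diphthongized
    "eː": "iː",  # meet: raised one step
    "ɛː": "iː",  # meat: raised (later merging with "meet")
    "ɑː": "eɪ",  # mate: raised and eventually diphthongized
    "uː": "aʊ",  # out: the highest back vowel diphthongized
    "oː": "uː",  # boot: raised one step
    "ɔː": "oʊ",  # boat: raised
}

def shift(me_vowel: str) -> str:
    """Return the Modern English reflex of a Middle English long vowel."""
    return GVS[me_vowel]

print(shift("iː"))  # -> aɪ (why "bite" is no longer pronounced "beet")
```

Note that the mapping is not one-to-one: /eː/ and /ɛː/ both end up as /iː/, which is the meet/meat merger rather than part of the chain shift proper.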

One of the most remarkable things about linguistics is how organized abstract patterns emerge from tacit, unintentional, embodied processes that are distributed over time and space. Jespersen’s neat diagram summarizes the systematic differences between “Middle English” and “Modern English.” But of course, there was not just one Middle English but rather a vast array of highly divergent dialects. Nor is there agreement among speakers of Modern English on vowel sounds. And we know from textual, historical and linguistic analysis that the vowel changes took place over a span of hundreds of years.

In the 1980s, Stockwell and Minkova introduced an analysis that deconstructed this long-standing standard model. They showed that the “chain shift” story applies at best to the higher vowels /e/, /i/, /o/, and /u/, whereas the changes in the lower vowels (in the shaded boxes) occurred several generations later, at different times in different dialects, and, rather than participating in a chain shift, caused mergers with vowels occurring in other contexts. What appears from afar as a unified series of changes turns out to be the accumulation of centuries of sound change, only parts of which are structurally connected.

Stockwell’s diagram (2002)

Matthew Giancarlo (2001) presents a fascinating and readable discussion of the history of the Great Vowel Shift as a concept. He shows how Jespersen and his colleagues constructed a unified, linear conception of the English language as a single object developing over time toward the higher and more rational condition of Modern English, and how this conceptualization allowed a more complex, pluralistic history to be packaged into the textbook story that all students of the history of the English language are familiar with.

In many ways this debate about the concept of the GVS reflects the eternal debate between the lumpers and the splitters, the modelers and the descriptivists, the abstract versus the concrete. It might be beyond our ability to understand what really happened; certainly the truth is too intricate to represent in a diagram with a handful of letters and arrows. Perhaps the GVS is a helpful abstraction, a useful fiction; but then again maybe it is misleading, harmful, even, as Giancarlo argues, perpetuating traces of nationalistic and colonial ideologies. Perhaps it is all those things at once.

Low-Yield Searches: Availability of Information on Wikipedia Affects Tourist Decisions

Have you ever looked something up on Wikipedia or Google and failed to find much relevant or high-quality content? Every time that happens, it’s a subtle hint to us that the topic isn’t important or interesting. However, this implicit message is often biased, especially against diversity and in favor of the dominant culture and its language and trends. 

A clever study (Hinnosaar, Hinnosaar, Kummer, & Slivko, 2019) used a controlled experiment on Wikipedia to estimate how much of an effect this has on a real-world outcome. The authors took Spanish cities with low-quality articles on the Italian, German, and French Wikipedias and randomly assigned them to two groups: they added relevant information (mostly translated from the Spanish and English Wikipedias) to 120 treatment-group articles while leaving the 120 control-group articles unchanged. They estimate that, compared to control cities, hotel stays by Italian, German, and French tourists increased by an average of 9% in cities whose article was improved in the language of the tourist’s country of origin.
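The logic of the design can be sketched in a few lines. With random assignment, the average treatment effect can be estimated as a simple difference between group means; the numbers below are made up for illustration and are not the paper’s data or method in detail:

```python
# Hypothetical sketch of a randomized experiment's core comparison.
# The hotel-stay counts are invented; only the estimator logic is real.
from statistics import mean

treated = [109, 112, 105, 110]  # cities whose articles were improved
control = [100, 103, 97, 100]   # cities left unchanged

# Random assignment makes the groups comparable in expectation,
# so the relative difference in means estimates the treatment effect.
effect = mean(treated) / mean(control) - 1
print(f"estimated effect: {effect:.1%}")  # prints: estimated effect: 9.0%
```

The actual paper works with panel data and regression controls, but the randomization is what licenses the causal interpretation of a comparison like this one.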

The authors consider the paper’s implications from an economic perspective, discussing the failure of interested parties to respond to the apparent economic incentive to provide the missing information. But the paper also serves as a reminder that informational bias has real impacts. Given the prevalence of male editors on Wikipedia, the commercial interests of Google, and the lower representation of minority cultures and languages online generally, it’s no surprise that low-yield searches disproportionately affect people who are interested in minority or foreign cultures or notable women, BIPOC and/or LGBTQ figures, and those who seek information in non-English languages or in English about non-English-dominant topics. Regardless of their intent, whenever content producers neglect a topic, and whenever algorithms privilege popular topics at the expense of others, they risk exerting a substantial marginalizing effect.

Did Homo erectus have language? According to Daniel Everett, they invented it

Thoughts on the book How Language Began: The Story of Humanity’s Greatest Invention:

In grad school, I remember hearing about Daniel Everett as a controversial and somewhat heterodox figure in the world of linguistics, but until now I had never read any of his work.

Everett’s controversial claim is that a lot of the structural and especially syntactic features of human languages that are commonly thought to be universal, and without which language would be unimaginable, are actually not universal and not fundamental at all.

He bases this claim on his observations of the languages spoken in the Brazilian Amazon, especially Pirahã. Everett claims that Pirahã lacks hierarchical and recursive structures, yet is still a complete human language and perfectly adequate for communication. This contradicts the Chomskyan idea that language is at its core an innate, computational cognitive system that manipulates and combines symbols into larger structures. Instead, Everett claims that the essence of language consists in the use of symbols (i.e. words) for communication, and that this naturally leads to a whole host of auxiliary features and behaviors, of which syntactic communication is just one.

The big discontinuity

In this book, Everett tackles a big question: How did language begin? Did it evolve, was it invented, or some combination of the two?

It’s a particularly hard question because our nearest relatives, the great apes, don’t seem to have a communication system that could serve as a “protolanguage” or direct precursor of human language. Chimpanzees, bonobos, and gorillas do communicate with each other with sounds and gestures, but they don’t seem to have any direct equivalent of the words or phonological and syntactic structures found in human languages. Although they are capable of learning to communicate with symbols if trained by researchers, non-human primates cannot speak using their vocal tract (larynx, tongue, lips, etc.) the same way we do.

So, how did our hominin ancestors start talking? In the Chomskyan paradigm, language originated in Homo sapiens as a result of biological evolution. The standard argument for this conclusion is indirect and somewhat subtle, but, simplifying: if our linguistic abilities surpass what we could do with general-purpose learning and cognition alone, they must depend on innate cognitive adaptations, and therefore language could not have been invented. This holds even though the details of how the innate linguistic endowment is implemented in our genomes and brains, and of how this evolutionary event occurred, remain highly speculative. And because Homo sapiens shows a larger brain, changes in vocal tract anatomy, and more complex behavior than previous hominins, the event is thought to have occurred relatively late in human evolution.

Everett argues for a different story. He thinks language was invented, probably almost 2 million years ago, by Homo erectus. According to his theory, erectus were the first to invent symbols. This, in turn, created a context where the brain and body could coevolve with the nascent languages: an example of the Baldwin effect, in which individuals who were better talkers had an advantage given the cultural and behavioral context, and the languages themselves also changed over time to meet the needs and abilities of their users.

It all started with symbols

According to Everett, Homo erectus had already reached a level of behavioral complexity 1-2 million years ago that suggests they probably had highly developed cultures and communication systems. Although most non-stone tools don’t leave any direct evidence, there are indications that erectus could do a lot more than hit two rocks together. They migrated all over the world and colonized islands, which suggests they must have built rafts or canoes, and probably planned their migrations, hunts, and fishing expeditions.

First, these hunter-gatherer communities would have invented symbols: in order to coordinate their collective activities, they would have made specific sounds (probably accompanied by body movements) to refer to objects, actions, and relations. At this point, Everett’s big idea kicks in: once you have symbols, fully-fledged human languages would have emerged gradually from the interactions between speakers, their bodies, their cultures and communicative needs, and the structure of information in their environment and in social interaction. Further refinements made by Homo sapiens would have been the result of a coevolutionary process that presupposes an existing linguistic environment.

In the rest of the book, Everett elaborates how the existence of a symbolic communication system within a cultural context might have led to the emergence of a wide range of linguistic phenomena of interest to linguists. For example, phonetics with vowels, consonants, and syllables might have emerged because this is the most efficient way to transmit information in a way that is easily perceptible by the auditory system. Phonological systems might have converged on “double articulation” of phonemes (small meaningless elements) that combine to form words (larger, meaningful elements) because of the necessity to produce a large set of easily distinguishable symbols. Grammar with hierarchical phrases and sentences might have emerged as a result of demands on working memory and to help distinguish the different roles a symbol can play in an utterance (e.g. topic, comment, subject, object). From the start, speech would have been accompanied by gesture because our interconnected brain tends to recruit the whole body and its neural representation in the service of complex tasks. The first language users would also have resolved ambiguities using shared background knowledge and cultural assumptions, just like we do.

We may never know exactly when and how language originated. Nevertheless, attempts to address this question can help us interrogate our concepts and assumptions about what language is and how it works. Whether or not you think Everett is onto something here, we shouldn’t limit ourselves by dogmatic theoretical beliefs in this area, given how much there remains to be discovered, and, probably, re-conceptualized about language.

The Scots Wikipedia Thing

The Scots Wikipedia thing is changing the way I think about the Internet and minority languages.

What happened?

Earlier this week, a viral Reddit post alleged that a single editor of the Scots-language Wikipedia (a non-Scots-speaking American) had flooded the wiki with, essentially, mangled English articles translated into fake Scots, mostly by substituting words one-for-one from an English-Scots dictionary. (The user in question seems to have acted in good faith; he started as a child and has apologized.) Other non-Scots-speaking editors have also made many low-quality contributions. Apparently, there was never a large enough community of actual Scots speakers on Wikipedia to keep the poor-quality “Scotched English” in check and fill the wiki with authentic Scots articles. Now the Scots Wikipedia community is stuck with the question of what to do with this fiasco of a wiki: delete it, roll back to an older version, or mobilize the Scots-speaking community to fix all the articles?

For background, Scots is one of Scotland’s two native minority languages. Whereas Scottish Gaelic is a Celtic language related to Irish, Scots is essentially an English dialect, having split off from Middle English and developed mostly separately for centuries. Scots split off earlier, and is more distinct from standard English, than other English varieties around the world. “Broad” Scots (i.e., Scots with less influence from English) is largely unintelligible to English speakers who aren’t familiar with it, although many speakers use mixed varieties that are heavily influenced by Scottish Standard English.

It’s difficult to say for sure how many people speak Scots; data from the 2011 census suggest about 1.5 million self-reported speakers. The future of the language is uncertain, as new generations might abandon it for English, while others might renew their commitment out of resurgent Scottish nationalism or pride. What’s for sure is that most people who speak Scots at home and in their communities still prefer to read and write English, especially for the sort of content that typically goes on Wikipedia.

Wikipedia is a uniquely hard challenge for poorly-standardized spoken languages

I’ve noticed a lot of Scots speakers commenting along the lines of “I want to help, but I don’t feel qualified.” It isn’t only that Scots speakers don’t feel comfortable writing encyclopedic articles in a mostly-spoken language. There is a more pervasive problem with this kind of project: unlike most major languages, there isn’t a standard form of Scots that everyone can agree on. For starters, there are several variants (Doric, Lallans, and Ulster) that are pretty distinct from each other, and even within each dialect group, there aren’t established spelling norms. When an editor sees an article written in something very different from their personal Scots idiolect, it can be hard to add to it without starting over. It makes a lot of sense that most Scots speakers might speak, and even sometimes write, Scots, but still not want to edit Wikipedia, especially given the wiki’s sorry past and present state, which has now been handed to them with the message, “we screwed this up, but you fix it.”

A case study of behavior on the Internet?

There have been three main kinds of responses to the situation: first, hate and harassment directed at the offending editor, along with outrage at the frankly offensive state of Scots Wikipedia as of August 2020. Second, mockery of the ridiculousness of the whole situation. Third, constructive discussion about what to do next.

We’ve fragmented ourselves into specific little communities, and you can see the different character of responses on Wikipedia, Twitter, and different subreddits like r/linguistics, r/scotland, r/badlinguistics, etc. Like a prism, the structure of internet communities has divided the aspects of communication into parallel streams. That puts a huge responsibility on us to be proactive about staying informed and thinking critically, while also ensuring civility wins out over trolls.

Wikipedia is often held up as a triumph of collective human goodness (All of human knowledge! For free! For everyone and by everyone!). But under that surface there is a lot of chaos. Although the vast majority of information on Wikipedia is true, it cannot avoid bias when its editors are a small, not very diverse group that neither represents its communities of readers nor consistently includes people from the communities relevant to articles with historical or cultural content. There have been big problems: most notably, Croatian Wikipedia was taken over by Neo-Nazis (there is a Serbo-Croatian Wikipedia, in substantially the same language, where things seem to be more normal).

Is this kind of thing a threat to the language?

The original Reddit post suggested that the user might have “done more damage to the Scots language than anyone else in history.” If that sounds crazy, think of how many people might have followed a link to Scots Wikipedia not knowing what it was, seen something that looks like English transcribed in a mocking Scottish accent, and concluded that Scots isn’t a real language — and a ridiculous non-language at that — unintentionally compounding centuries of cultural marginalization. Although I’m somewhat optimistic that the reaction to this situation could be constructive, it should go without saying: Don’t pass off a shallow imitation of someone else’s culture as the real thing on the internet!

For more about Scots language, I suggest looking at the Scots Language Centre, the Open University course, the Dictionary of the Scots Language, or the works of the great poet Robert Burns.

What I Learned at the Polyglot Conference Ljubljana

Last October, I attended the Polyglot Conference 2018 in beautiful Ljubljana, Slovenia. The conference is an annual event where people from around the world who love learning languages get together for a weekend. There are talks by all different kinds of people about learning languages, teaching languages, linguistics, and cultural topics related to language. Above all, it’s a great place to meet people who share a passion for language. (And no, you don’t have to know a lot of languages to get in – you just have to want to!)

I had a blast and learned a lot. Here are some of the highlights for me:

Speaking is not the only way to practice a language

A popular idea in the language learning community (most prominently represented by Benny Lewis) is that you should speak, speak early, and speak a lot. It makes a sort of intuitive sense and plays into a motivational narrative – just get out there and do it! But as an introverted language learner, this never rang true to me. I like to learn languages alone, and I don’t mind that I don’t always get a lot of opportunities to speak all of them. But even if I’m not chatting people up in Swedish all the time, it would still be nice to know that I can.

So it was good to see a couple of presentations about practicing active language skills without having to go out and meet speakers. Gareth Popkins spoke about methods for solo speaking practice. For example, one technique is called faithful retelling. You read or listen to a short text, take notes, and then, using only your notes, try to retell it as accurately as you can. Another one is paraphrasing: you take each sentence and rewrite it using as many different words as you can. Another talk by Lindsay Williams was about the “forgotten skill” of writing. She reminded us that writing in a new language is something you can do all the time to practice, and all you need is yourself and a pen and paper!

Studying languages systematically

Another trend in the talks on language learning was the idea of studying systematically: rather than studying whatever strikes your fancy, it can be helpful to set goals and routines. These can take different forms. You might set a goal to study a certain amount of time per week, “check in” with a certain language each day, or assign different languages or study techniques to each day, week, or month.

Olly Richards spoke about his self-imposed challenge to learn Italian by huge amounts of passive exposure in a short time. Judith Meyer spoke about designing a language course, a Teach Yourself book for Esperanto. And Lýdia Machová talked about her experiences coaching language learners and how she helped them by making sure they each had a system tailored to their own needs. As for me, I’ve decided to try focusing on a different language each month.

The strength of a language is in how it’s used for expression, not how many speakers it has 

How do you know if a language is thriving? You could rank languages by the number of native speakers – in that case, Mandarin Chinese is number one. Or maybe the number of total speakers? Then English comes out on top, which makes sense. But this is clearly missing something. What if a language is spoken by a small community of people, but they all use it for everything and it shows no signs of dying out? That is clearly better off than a language spoken by a larger group, but whose speakers are abandoning it in favor of another dominant language, or are embarrassed to speak it in public.

Several talks at the conference focused on the status of efforts to preserve minority languages. Samanta Baranja spoke – in Slovene – about the Roma community in Slovenia and especially efforts to include them in mainstream education while accommodating, and not marginalizing, the Roma language(s). Claudia Ferigo spoke about the Friulian language of northeastern Italy (not a dialect of Italian!) as well as the Suns Europe festival of minority-language arts. Anoushka Dufeil talked about marginalization from another perspective: women and the struggle to come up with ways of using the French language in a gender-equal way (the Académie française seems to get in the way a lot). Finally, Alex Rawlings exhorted us polyglots, as a community, to use languages to bring down the world’s Grenzen (borders or limits). He told the story of how speaking Greek made his experience as a boy spending a summer in Greece incomparably more meaningful than it could ever have been in English. He pointed out that, despite the rise of global English as a means of communication, people still live their fullest lives in a diversity of languages. He spoke out against the increasing trend for people to express themselves publicly and professionally only in English, even losing the ability to use their native language to its fullest extent.

Now, with global audiences available on the Internet, smaller languages increasingly risk being relegated to domestic, social, and traditional uses while being left out of things like pop music, movies, science, and blog posts. Many universities in Europe offer degrees taught in English that are not available in the local language, and many writers have never published in their native tongue. As a small gesture to counteract this trend, I decided to make this blog fully multilingual: all the content is available in English and Portuguese in my own words (not a translation), and I might add more languages later.

Everyone is hanging out without me on social media and I had no idea

I mean, I knew there was a Facebook group. But apparently everyone was on a Telegram channel and meeting up? I didn’t even have a working phone in Slovenia. But luckily Ljubljana is small enough that I kept running into other people from the conference anyway, and made some good friends.

Next year’s conference will be in Fukuoka, Japan. I really hope I can go, but I might be too busy, as I’m juggling work and grad school right now.