Are Languages with the Zero-Copula Construction More Efficient? Preliminary Exploration


WHO NEEDS A COPULA ANYWAY


The first time I ran into the concept of the copula was when I was studying Arabic some eight years ago. The way it was explained to me was that to say “She is beautiful” would sound sort of absurd; it would be literally equating “her” with “beautiful”, essentially saying she is the Embodiment of Beauty. Which, ok, you can do in poetry or something, but for everyday conversation there’s no need to be so dramatic – just say “She beautiful” and everyone understands what you mean. There’s no need to say “is” here. That word (usually a verb) that indicates that two things in a sentence are the same is called the “copula”. In English, we almost always use it, except for certain cases, usually stylistic, like in poetry, headlines, or just certain idioms. But in Arabic, they don’t require it in basic sentences like “She is beautiful” or “He is the manager here”. This is called the “zero copula” – pretty self-explanatory. Turns out languages vary in which situations call for the copula. In Arabic you use the copula in the past tense: “They were photographers”, but not in the present tense: “They photographers.” In other languages you might not use it even in the past tense.


I gave up on Arabic (or at least, stopped studying it) in fairly short order, so it wasn’t until recently, when I realized my life was too easy and started studying Russian, that I was reminded of the zero copula again. Russian is the same as Arabic in this way – you wouldn’t use the copula in the present tense: “He tall” instead of “He is tall.” And confronting the concept a second time around, I found myself thinking that this makes a lot of sense. Using the copula in these simple present-tense sentences really seems unnecessary. Inefficient. It’s pretty clear what you mean. For the past tense you want something to mark the tense, so you may as well use a past-tense copula, but for the present tense, why bother? We all have human brains, and for a human brain, equating the two words sure seems like the most obvious way to interpret such a sentence, so the information contained in the word “is” is redundant (as well as resulting in confusing phrases like “the word ‘is’ is redundant”). These zero-copula languages must be more efficient – that is, mathematically, each word must contain more information, on average, so that you require fewer words to convey the same amount of information. (Yes, there’s more to language than purely conveying information ROBIN HANSON, but you can still analyze it from this perspective, since that’s clearly a significant part of it.)


It does make sense to think about how much information a single word contains. Consider the sentence, “The wolf is black.” The word “the” tells you a lot less than the word “wolf”. I could say “wolf is black” and you get pretty much the same information as from the first sentence. (Incidentally, Russian doesn’t have any articles like “the” or “a” either – I guess it’s an EXTRA efficient language.) But if I drop out the word “wolf”: “The ---- is black”, well, you have no idea what I’m talking about. So “the” contains a lot less information than “wolf”.


HERE COMES THE MATH


So our lord Shannon gave an actual mathematical definition to information, more precise than “oh clearly ‘the’ tells you less than ‘wolf’”. How much less?

The Shannon information content of something is related to its probability. The less likely something is to occur, the more information you gain when it does occur. For example, if someone wears a t-shirt and jeans every day, then seeing them in a t-shirt and jeans doesn’t give you much information. But if one day you see them in a fancy suit, then by nature of that being a less probable, rarer event, you’ve gained some information – something must be happening today!

Probability, in this case, is actually more like frequency – how frequently does a word actually get used in a language? We figure this out by reading everything that’s ever been written in the language, and just tallying up how many times each word occurs. Divide that count by the sum of counts of all words, and you have a word’s frequency. This collection of words and their frequencies is called an ensemble. Take the log of the inverse of that frequency – and voila! A measure of information:


Shannon information content: h(x = a_i) = log2(1 / p_i)

(h is the function that calculates information content; feed it a variable x with the value a_i – if you list all the words of the ensemble and number them, i is the number of the word and p_i is that word’s frequency)
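To put some made-up numbers on that (these frequencies are just for illustration): if “the” shows up with a frequency of 0.05 – one word in every twenty – then h = log2(1/0.05) ≈ 4.3 bits, while a rarer word like “wolf” at a frequency of 0.0001 gives h = log2(1/0.0001) ≈ 13.3 bits. The rarer the word, the more bits you get when it actually shows up.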


Now that’s only for this one word. What I’m actually interested in is the information of the entire language (this is called entropy, for cool reasons that I don’t remember but I think yes, descended from thermodynamic entropy somehow). Luckily, that’s friggin’ easy. Just multiply the information of each word by its probability and then add all those mothers up. Bam! Entropy of an ensemble:

H(X) = sum_i (p_i * log2(1 / p_i))

So in the case of a language, you get one number that measures the information of that entire language. A higher number means that there is a higher average information per word. So now my naïve rambling of ‘hey man I bet languages with zero copula are like more efficient and stuff’ can actually be measured and evaluated.
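In case the formula reads better as code, here’s a minimal sketch of the calculation in Python, assuming you already have raw word counts sitting in a dictionary (the function name and the toy counts are mine, just for illustration):

```python
import math

def entropy(counts):
    """Shannon entropy (bits per word) of an ensemble, given raw word counts."""
    total = sum(counts.values())
    h = 0.0
    for count in counts.values():
        p = count / total             # word frequency p_i
        h += p * math.log2(1 / p)     # p_i * log2(1/p_i)
    return h

# Toy "language" with only four words:
print(entropy({"the": 50, "is": 40, "black": 7, "wolf": 3}))   # ~1.45 bits per word
```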


LET’S DO IT

To see if it’s a general trend I’d have to get this number for as many languages as possible. This isn’t all that straightforward. I did a bit of searching around and couldn’t seem to find a table anywhere that just listed it for all the languages in the world. Ok, that’s not a problem – it’s pretty easy to calculate, so instead I just need all the texts of all the languages in the world so I can count up all the words and their occurrences. Well, the internet is almost that, but it’s not very well organized. So I went looking around for already existing corpuses. Of which there actually are plenty, sort of, the majority of them on the awesome website for the awesome Linguistic Data Consortium:

https://catalog.ldc.upenn.edu/

except I can’t actually GET the data without being a member, and it’s supposed to be for, like, real people at real institutions, so I’m not sure I even COULD join as an individual even if I wanted to pay $2000 just to pursue this random thought of mine. So instead I kept looking elsewhere until eventually I found THIS wonderful project:

https://github.com/hermitdave/FrequencyWords/

which is even closer to exactly what I wanted: just a list, by language, of all the words and how many times those words occurred. The source is movie subtitles, actually, so it’s going to be a different set of words than the ones that come from written text, but it’s still going to be representative of the language, right? And it’s definitely a treasure trove, but it’s only about 30 languages, which is not a whole lot compared to the some 6000 that exist. But it’s a start.
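For what it’s worth, reading one of those files is about as simple as it gets – if I’m remembering the format right, each line is just a word and its count separated by a space. Assuming that, something like this would load one language’s list and feed it to the entropy function sketched earlier (the filename is just illustrative, not an exact path from the repo):

```python
def load_counts(path):
    """Read a FrequencyWords-style file: one 'word count' pair per line (assumed format)."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                word, count = parts
                counts[word] = int(count)
    return counts

# e.g. entropy(load_counts("ru_full.txt"))   # illustrative filename
```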


The next step was actually harder, which was to label each language as zero-copula or not. Again, there’s no damn table of languages with a “zero copula? Yes/no” column. You couldn’t even damn go to the Wikipedia page on the language and look, because it’s not a piece of metadata or listed anywhere in any easy-to-find format. The thing is that the whole zero-copula question is really “are there grammatical situations in this language where the copula gets dropped”, and that’s always yes; you just have to find out in WHICH situations and how universal it is, and blah de blah. The best I could do was the wonderful wonderful World Atlas of Language Structures, which had a chapter on the zero copula just for me:

http://wals.info/chapter/120

which was very close to exactly the table that I was looking for, except I don’t know what criteria they use to decide which languages to list at all, so some that were present in my corpus weren’t there. I did my best to investigate those missing languages to see if I could label them “zero copula yes/no” but without studying or knowing the first thing about them in most cases I really didn’t feel confident labelling them, so I had to exclude a couple languages from the dataset. That brought the number of data points down even more, but, ok, for a first pass, a few dozen is nothing to sneeze at.


And that’s it!


Language      Entropy        Zero Copula
Greek         11.036768      1
English        9.91789644    1
Icelandic     10.6334764     1
Italian       10.8204408     1
Arabic        12.5347624     0
Czech         11.7213583     0
Estonian      11.3217516     1
Indonesian     9.93464691    0
Spanish       10.5502229     1
Russian       11.6426266     0
Dutch          9.82311298    1
Portuguese    10.3395604     1
Turkish       12.6328705     0
Latvian       11.6488848     0
Lithuanian    12.1635837     0
Romanian      10.5867293     1
Polish        11.8310817     0
French        10.1053896     1
Bulgarian     10.8745093     1
Ukrainian     11.4108081     0
Croatian      11.4931293     1
German        10.5569109     1
Farsi         11.0020978     1
Finnish       12.557681      1
Hungarian     11.9924467     0
Hebrew        11.7588125     0
Albanian      10.4102589     1
Korean        12.1798089     1
Swedish       10.0132267     1
Malay         10.3752016     1

Table 1 – Languages and their entropies. In the Zero Copula column, 0 means the zero copula is possible in the language, 1 means it is not.



So obviously zero-copula == highest entropy is not a law – Finnish has one of the highest values and is not a zero-copula language. But of course there are many other factors that go into this. All I’m claiming is that the copula is going to be one contributor, so overall, on average, zero-copula languages will tend to have higher entropies. And yes, if you take the average of all the zero-copula languages you get 11.75, whereas the average of the others is 10.77. Ok, cool! Let’s do some statistics.


Here’s the little game I played:

Sure, it looks like I’ve found a difference between zero-copula languages and non-zero-copula languages, but is it a significant difference? In these samples I ended up having 19 non-zero-copula languages and 11 zero-copula languages. If I divide the group of 30 languages into random groups of 19 and 11, of course one group is going to have a higher mean entropy than the other. How do I know this is telling me that zero-copula languages *actually* have a higher entropy? Well, you play the re-sampling game: select 11 random languages out of the group of 30 and see what their mean entropy is. Then you do it again, a couple million times, and that’ll tell you how often you end up with entropies as high as 11.75 with any random group of languages. If you get something like 11.75 pretty often with any random group, then that probably means it was just a coincidence.
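To make the game concrete, here’s a rough sketch of how the resampling might look in Python with numpy – the two arrays are copied straight from Table 1, the variable names are mine, and I’ve cut the trial count down from ten million so it runs in a reasonable time:

```python
import numpy as np

# Entropies in the same order as Table 1; True = zero copula possible (0 in the table)
entropies = np.array([
    11.036768, 9.91789644, 10.6334764, 10.8204408, 12.5347624, 11.7213583,
    11.3217516, 9.93464691, 10.5502229, 11.6426266, 9.82311298, 10.3395604,
    12.6328705, 11.6488848, 12.1635837, 10.5867293, 11.8310817, 10.1053896,
    10.8745093, 11.4108081, 11.4931293, 10.5569109, 11.0020978, 12.557681,
    11.9924467, 11.7588125, 10.4102589, 12.1798089, 10.0132267, 10.3752016,
])
is_zero_copula = np.array([
    False, False, False, False, True, True, False, True, False, True,
    False, False, True, True, True, False, True, False, False, True,
    False, False, False, False, True, True, False, False, False, False,
])

observed_mean = entropies[is_zero_copula].mean()   # ~11.75

rng = np.random.default_rng(0)
n_trials = 1_000_000                               # the real run used ten million
group_size = int(is_zero_copula.sum())             # 11
sample_means = np.array([
    rng.choice(entropies, size=group_size, replace=False).mean()
    for _ in range(n_trials)
])
```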


Fig 1 – Entropies of groups of 11 languages, divided into 50 bins. Entropies on x axis, number of groups within that bin on y axis.


If you do this ten million times, you get entropies ranging between about 10.5 and 12. I divided that range into 50 bins and counted how many of the randomly selected groups of 11 languages fell into each bin. You can see that most of the time you’ll get a mean entropy somewhere around 11.2 or so, which is definitely below our actual zero-copula result of 11.75. How often can you expect a random group of 11 languages to have a mean entropy at or above 11.75? Well, in this simulation only about 0.08% of the groups were that extreme – that’s a p value of 0.0017, if you consider both ends. And as we all know, p ≤ 0.05 means you’re a good person, so I have been vindicated!
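Continuing the sketch above, the p value and a histogram along the lines of Fig 1 would come out of something like this (the matplotlib styling is just a guess at what the original figure looked like):

```python
import matplotlib.pyplot as plt

# Fraction of random groups at least as extreme as the real zero-copula group,
# doubled to count both tails
p_two_tailed = 2 * (sample_means >= observed_mean).mean()
print(p_two_tailed)

# Histogram of the resampled group means, 50 bins as in Fig 1
plt.hist(sample_means, bins=50)
plt.axvline(observed_mean, color="red")   # the actual zero-copula group mean
plt.xlabel("Mean entropy of a random group of 11 languages")
plt.ylabel("Number of groups")
plt.show()
```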


OK BUT SO


This was fun, but there are problems. Oh, so many problems.

First, the dataset: it’s limited to only the languages that were listed in both the OpenSubtitles-based frequency project and the World Atlas of Language Structures, and the words come from movie subtitles only. To get more reliable data we’d need more languages, and more (or at least a wider range of) sources for the words.

Calculating entropy: this may be an intractable problem here. I went with the word divisions already present in the dataset, which, from what I could see, kept all the different spellings, cases, versions, etc. of a word as separate words. So “run”, “runs”, and “running” are all different. This may be valid: “run” provides different information than “ran” (information about when it happened), but isn’t the difference between “run” and “ran” a different kind of difference than the one between “run” and “walk”? How does this factor into what I’m doing here?
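One crude way to poke at that question would be to collapse inflected forms down to a common stem before counting, and see how much the entropy moves. This isn’t something I actually did – just a sketch of the idea, using NLTK’s Snowball stemmer for the languages it happens to support:

```python
from collections import Counter
from nltk.stem import SnowballStemmer

def stemmed_counts(counts, language="english"):
    """Merge counts of inflected forms ("run", "runs", "running") under one stem."""
    stemmer = SnowballStemmer(language)
    merged = Counter()
    for word, count in counts.items():
        merged[stemmer.stem(word)] += count
    return dict(merged)

# e.g. compare entropy(counts) with entropy(stemmed_counts(counts))
```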

Categorizing languages: I may just be plain wrong about how I’ve categorized them, since I am unfamiliar with most of these languages, and it may not be as simple as two categories – and even if it were, what are the criteria for those categories? Someone with much greater scholarly depth here could maybe come up with a spectrum rather than categories, but then there’d also be the problem of rating languages on that spectrum.

Confounding factors: my first thought was that this could also (or instead) be explained by how inflected a language’s words are – you know, where in order to indicate something about a word you change the word itself instead of adding a helper word. Like how a verb is conjugated to match I/you/they/we/he, so that you might not even need to use the pronoun, since it’s understood just from the verb itself. This could certainly make a language more efficient. For a brief moment I thought this might explain away everything I’ve found here – except of course there’s not really much relation between how inflected a language is and whether or not it uses a copula. So that would be an additional factor, but an unrelated one.


AND NOW?


Have I actually learned anything from this? I suppose I’ve verified my instinct, but I don’t know for sure what else that might imply or indicate or lead to. I would be interested to see what other language features correspond to different entropies, particularly the level of inflection in a language – how synthetic vs. analytic it is. Does it mean anything that one language is “more efficient” than another? Are there any practical results? Depending on how seriously you take the Sapir-Whorf hypothesis, maybe there are real effects on how people think or communicate?



References

1. MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

2. Stassen, Leon. (2013). Zero Copula for Predicate Nominals. In: Dryer, Matthew S. & Haspelmath, Martin (eds.), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/120, accessed 2018-02-19.)

3. Lison, Pierre & Tiedemann, Jörg. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). (http://www.opensubtitles.org/, accessed 2018-02-19.)