Swedish language analysis – part II: verbs and word lengths

This post is part 2 of a series of posts about an analysis of the Swedish language. Other parts are:
Part I: nouns

Verbs

Passive and deponent

Swedish has an interesting way of using verbs in a passive way. Where in English the form “<object> is being <verbed>” and in Dutch “<zelfstandig naamwoord> wordt ge<werkwoord>” are used, Swedish uses an inflection of the verb. So the English sentence “the doors are being closed” becomes “dörrarna stängs“. Stängs is a passive inflection of the verb att stänga (to close). Passive forms often end in an “s”.

Now the interesting thing is that there are a handful of verbs of which only the passive form is used, even though the meaning is active. Some examples are the verbs att hoppas (to hope) and att träffas (to meet). These verbs have no active form and are called deponent verbs. Before I started my analysis I could name a few, but I was wondering how many more there are. It turns out that there are 248 of them, out of a total of 8345 (≈3%) according to SAOL.

Groups

Swedish has multiple groups in which verbs are divided. Depending on how a verb is inflected, it belongs to a particular group. Wikipedia has a pretty elaborate article on this, but the basic idea is that there are 4 different groups: groups 1 and 2 are regular, group 3 are short verbs and group 4 are strong and irregular verbs.

I wanted to know how many verbs there are in each group, but unfortunately SAOL doesn’t contain this information. What I did was try to find a list of verbs belonging to groups 3 and 4 and infer groups 1 and 2 by looking at the present tense inflection of the verb. The present tense of a group 1 verb always ends in -ar and the present tense of a group 2 verb always ends in -er. For some verbs (mostly the deponent ones), I couldn’t figure out which group they belong to, so I marked them as unknown. Also, if you read the Wikipedia article, groups 1, 2 and 4 have subgroups, where group 4a for example contains the strong verbs and group 4b contains the irregular verbs, but since I didn’t have enough information, I only created 4 groups in my data set: 1, 2, 3 and irregular.
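The inference described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `infer_group` and the tiny lookup sets for group 3 and the irregular verbs are made up for the example and are not the lists used for the actual analysis.

```python
# Hypothetical, tiny lookup sets; real lists of group 3 and
# irregular verbs would be much longer.
GROUP_3 = {"bo", "tro", "sy"}
IRREGULAR = {"vara", "säga", "stinka"}


def infer_group(infinitive: str, present: str) -> str:
    """Guess the verb group from the infinitive and the present tense."""
    # Known lists are checked first, because strong verbs such as
    # "stinka" also have a present tense ending in -er.
    if infinitive in IRREGULAR:
        return "irregular"
    if infinitive in GROUP_3:
        return "3"
    if present.endswith("ar"):
        return "1"
    if present.endswith("er"):
        return "2"
    # Deponent verbs such as "hoppas" end in -s and fall through here.
    return "unknown"


for inf, pres in [("tala", "talar"), ("stänga", "stänger"),
                  ("bo", "bor"), ("vara", "är"), ("hoppas", "hoppas")]:
    print(inf, infer_group(inf, pres))
```

The order of the checks matters: the ending is only trusted once the verb is known not to be on one of the lists.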

Now if we create a pie chart out of the data we get this:

Verb groups

As you can see, most verbs belong to group 1, and about 88% of the Swedish verbs are regular. Slightly unfortunate is the fact that the irregular verbs (the ones you have to know by heart) are also verbs that are used quite often, but I guess frequently used verbs are irregular in many languages. Verbs like att vara (to be), att säga (to say) and att stinka (to stink) are all irregular, for example.

Word length

Another thing that I wanted to know is how word lengths are distributed in Swedish. This is basically counting how long each word is and then graphing that. And it looks like this:

Length dist

Here we see that about 12.5% of the words are 9-letter words. By changing this graph a little bit, we can see how many words are longer than x letters:

Length dist greater than

Here we can see, for example, that about 53% of all Swedish words are longer than 9 letters.
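Both graphs boil down to simple counting. Below is a minimal Python sketch of that counting; the function names and the six sample words are made up for the example, and the numbers it prints have nothing to do with the SAOL figures.

```python
from collections import Counter


def length_distribution(words):
    """Return {length: share of words with exactly that length}."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {n: counts[n] / total for n in sorted(counts)}


def share_longer_than(words, x):
    """Return the share of words that have more than x letters."""
    return sum(1 for w in words if len(w) > x) / len(words)


# Made-up sample, not the SAOL data.
sample = ["hus", "bil", "stol", "klocka", "vatten", "ordlista"]
print(length_distribution(sample))
print(share_longer_than(sample, 4))  # 3 of the 6 sample words are longer than 4 letters
```

Note that `len()` counts Unicode code points, so å, ä and ö each count as one letter as long as the text is stored in composed form.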

All of the words in the Swedish language were used in the previous two graphs, but it’s also interesting to see how different types of words are distributed:

Length dist per type

Here we can see some interesting things. Not unexpectedly, nouns follow the same curve as all the words, because they are the biggest group of words and contribute the most to the shape of the curve. But verbs and prepositions are generally shorter than nouns. Also, there are more 3-letter pronouns than pronouns of other lengths.

If we normalize the data by graphing what share of each group falls into each word length, the graph looks like this:

Length of total

Here we see that the share of 8-letter words seems to be quite equal for all the groups, but that 24% of the pronouns are 3 letters long, while only 1% of the adjectives are.
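The normalization step is just dividing each count by the size of its own group. Here is a hedged Python sketch of that idea; the function name `length_share_per_type`, the `(word, word_type)` input format and the sample entries are assumptions made for the example.

```python
from collections import Counter, defaultdict


def length_share_per_type(entries):
    """entries: iterable of (word, word_type) pairs.

    Returns {word_type: {length: share of that type's words}}.
    """
    counts = defaultdict(Counter)
    for word, word_type in entries:
        counts[word_type][len(word)] += 1
    shares = {}
    for word_type, by_length in counts.items():
        total = sum(by_length.values())
        shares[word_type] = {n: c / total for n, c in sorted(by_length.items())}
    return shares


# Made-up sample, not the SAOL data.
entries = [("jag", "pronoun"), ("han", "pronoun"), ("vi", "pronoun"),
           ("ingen", "pronoun"), ("hus", "noun"), ("klocka", "noun")]
shares = length_share_per_type(entries)
print(shares["pronoun"][3])  # 2 of the 4 sample pronouns have 3 letters
```

Because every group is divided by its own total, a small group like the pronouns becomes comparable to a huge group like the nouns.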

 


Swedish language analysis – Part I: nouns

This post is part 1 of a series of posts about an analysis of the Swedish language. Other parts are:
Part II: verbs and word lengths

Swedes like to think that Swedish is a very irregular language. Especially when it comes to nouns. There are two genders in Swedish: common and neuter, that respectively go with the indefinite articles “en” and “ett“. Swedish doesn’t have definite articles, but instead uses a system of suffixes that are also defined by gender. I’ll not try to explain exactly how this works, because Wikipedia does a much better job than me.

But how do you know if a noun is common or neuter? Most Swedes will tell you that you just have to know this (Dutch people will tell you the same thing about Dutch, by the way). There are no rules, unlike in many other languages where you can infer the gender (and thus the article) from the ending of a noun. As a foreigner, this makes it pretty hard to speak the language fluently, since you basically have to know all the nouns and their genders. In Swedish, this is especially important, because adjectives are inflected by the gender (and number) of the noun (something that isn’t done in Dutch).

The topic of nouns and language difficulties comes up quite often during conversations I have with Swedes and non-Swedes. Because I work for an international company where just about all nationalities are represented at our headquarters, lunch conversations often gravitate to languages. During one of those, a colleague of mine said that there is a small rule in Swedish that one might use to better guess the gender of a word. He told me that most nouns that end in an “a” are common. Not all, he said, but betting on a common noun is probably the best you can do when you don’t know the word.

This “hypothesis” has been in my head for a few years now, and I asked many Swedes if they could give me counter-examples of neuter words that ended in an “a”. Over the last few years, I found 3: öga (eye), öra (ear) and hjärta (heart). But I had no way of knowing exactly how many others there were. Until today.

Last year I got hold of a digital copy of the official Swedish word list (Svenska Akademiens ordlista or SAOL). I really wanted to analyze this data, but what I had was a Windows application that didn’t allow for exporting the data. Through some hacking I managed to export all the data from it and ran some analysis. How I managed to extract the data I’ll explain in a follow-up post.

Swedish language statistics

During my quest to prove or disprove my hypothesis, I gathered some general statistics about the Swedish language:

Total words in SAOL: 123274

Because in Swedish you can create compound words, this is not the total number of words in the language.

The next thing I looked at was the distribution of words in SAOL:

nouns: 91808
adjectives: 17403
verbs: 8345
variants: 2679
adverbs: 1451
references: 757
interjections: 248
numerals: 141
in_compounds: 131
pronouns: 83
prepositions: 79
conjunctions: 73
names: 70
articles: 3
adverbial suffixes: 1
infinitive particles: 1

Nouns

Now on to the nouns. First I thought it would be interesting to look at the distribution of nouns:

Gender distribution

Apart from common (68946 of them) and neuter (21529) nouns, there are 749 nouns that only have a plural form, 579 nouns that can’t be inflected or can’t have an article (these are words like cash, happy hour and heavy metal) and 6 of which I couldn’t figure out what the gender was.

And now for the big question: “How many words that end in an ‘a’ are neuter?”

It turns out that 176 nouns that end in an ‘a’ are neuter. The rest of the nouns that end in an ‘a’ are common. The total number of nouns that end in an ‘a’ is 8620, so neuter nouns make up about 2% of them. So, I guess that one could say that the hypothesis is correct.

But since this ‘rule’ only applies to 8620 words, it’s only helpful for about 9.4% of the nouns.
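For anyone who wants to check the arithmetic, here it is as a small Python snippet. It only uses counts quoted in this post (the noun total comes from the word-type table); the variable names are mine. With that noun total, the coverage comes out at roughly 9.4%.

```python
# Figures quoted in the text above.
nouns_total = 91808
ending_in_a = 8620
neuter_ending_in_a = 176

common_ending_in_a = ending_in_a - neuter_ending_in_a
share_neuter = neuter_ending_in_a / ending_in_a
coverage = ending_in_a / nouns_total

print(common_ending_in_a)     # 8444
print(f"{share_neuter:.1%}")  # 2.0%
print(f"{coverage:.1%}")      # 9.4%
```

So guessing “common” for a noun ending in an ‘a’ is right about 98% of the time, but the guess is only available for fewer than one in ten nouns.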

Singular and plural

Another thing I noticed about Swedish is that in certain cases the plural indefinite form of a noun is the same as the singular indefinite form (e.g. words like träd (tree): ett träd, flera träd). It seemed apparent that this was more often the case for neuter nouns than for common ones, but I wanted to make sure if my instinct was right. So, what did I find:

4789 out of 68946 common nouns are the same in plural as in singular, which is about 7%, and 13839 out of 21529 neuter nouns are the same in plural as in singular, which is about 64%.
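A count like this could be sketched in Python as follows. The `(singular, plural, gender)` tuple format, the function name and the four sample nouns are assumptions for the example; the last line simply redoes the percentages from the counts above.

```python
from collections import Counter


def invariant_plural_share(nouns):
    """nouns: iterable of (singular, plural, gender) tuples.

    Returns {gender: share of nouns whose plural equals the singular}.
    """
    totals, same = Counter(), Counter()
    for singular, plural, gender in nouns:
        totals[gender] += 1
        if singular == plural:
            same[gender] += 1
    return {g: same[g] / totals[g] for g in totals}


# Made-up sample, not the SAOL data.
nouns = [("hus", "hus", "neuter"), ("barn", "barn", "neuter"),
         ("bil", "bilar", "common"), ("stol", "stolar", "common")]
print(invariant_plural_share(nouns))

# The counts quoted above, rounded to whole percentages.
print(round(4789 / 68946 * 100), round(13839 / 21529 * 100))  # 7 64
```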

Next time

Obviously, there are many more crazy statistics that I can pull and I have a few I want to explore. One thing I’m interested in is how many verbs are deponent. Deponent verbs are verbs that are conjugated in a passive way, even though they are used in an active way. Some examples are hoppas (to hope) and träffas (to meet).

Since SAOL is copyrighted material, I’m not going to share the dataset I have, but if you have ideas of statistics you’d like to see, or hypotheses you want to test, let me know.

 

 

Pronunciation problems

Swedes are good at English, as I wrote before, but they often have problems with pronouncing certain letters or letter combinations in English. Some examples are the pronunciation of the English “j” and “ch” in words like “joke” and “cheap”. Swedes often pronounce them as “y” and “sh”. Joke becomes yoke and cheap becomes sheep (“Hey shek it out! That’s really sheep! Nah, I was just yoking..”). When I correct people, they often tell me that they had no idea that their pronunciation was wrong, since teachers in school do it wrong as well.

Tele2, a Swedish telecom operator, uses this as a joke (or yoke) in their commercials, where they say they are “sheep”, while they obviously mean that they are cheap:

On the other hand, I also have difficulties pronouncing Swedish words, since there are several letters or letter combinations that are pronounced as either “sh”, “sk”, “ch” or “k”. For example (and bear with me, since I really should learn how to write phonetic symbols):

  • Skägg (beard) is pronounced chegg (where the ch is a soft g)
  • While skal (shell) is pronounced as you write it
  • Kort (short or card) is pronounced as you write it
  • While kör (drive) is pronounced shur (with the u as the u in burden)

And obviously, there are exceptions. When kör means drive, it’s pronounced with a sh, while when it’s meant as a choir, it’s pronounced as you write it, with a k. And the word kort can actually mean card or short, but is either pronounced koort (with the oo as in poo) or kort (with the o as in short).

Lost in translation

Swedes are good at speaking English. Actually, most Swedes (or at least most from my generation or younger) like speaking English and being able to speak Swedish isn’t very necessary if you live here. I have several colleagues who make no effort at learning Swedish, simply because they don’t feel the need or necessity to do so. And honestly, sometimes Swedes don’t make it easy either; they often like to show off how good their English is. Quite a few times Swedes switched to English when they figured out I wasn’t a native viking. Apparently my pronunciation is pretty good, but I do make quite a lot of mistakes when it comes to the correct use of articles and adjectives, so as soon as I start stumbling through a sentence, a lot of Swedes switch to English. Very polite, but it’s not helping me in my quest to master the language.

Swedes are good at speaking English because children learn English in school from a young age and most English movies and TV-shows are subtitled*. But one thing surprises me; titles (and only the titles) of TV-shows and movies are often translated to Swedish. Sometimes it’s a very direct translation (the TV-show “Friends” becomes “Vänner”), or sometimes it’s something completely different (Shawshank Redemption becomes Nyckeln till frihet (literally: The Key to Freedom)). Some more strange examples can be found at thelocal.se.

Why this is done is a complete mystery to me and even to most Swedes that I’ve spoken to about this.

* In European countries where media are subtitled (the Nordics, the Netherlands, Belgium and Portugal, for example), people tend to speak better English than in countries where they aren’t (Germany, France, Spain, Italy and most of central and eastern Europe).

Numbers

A difference between Dutch and Swedish is the way you use numbers. Swedish is similar to English, where the groups of ten come before single digits, like 23 is “twenty-three” (Swedish: tjugotre). However, in Dutch we switch things around: “twenty-three” becomes “drieëntwintig“, which means “three and twenty“. This is similar to the German dreiundzwanzig. 

Strange? It can be stranger.. Like the Danish number system.

The time, the clock..

An interesting detail in Swedish I bumped into last week is the fact that the word for clock is used for describing time. In English you would ask “what time is it?“, while in Swedish you would ask “vad är klockan?” (what is the clock?). “The clock” is also used when telling time: “klockan är 7” (the clock is 7 / it’s 7 o’clock). Interestingly enough, English uses the clock as a substitute for “hour” here as well, but that’s only used for whole hours; in Swedish you would say “klockan är kvart i 7” (the clock is quarter to 7), while in English you would say “it’s a quarter to 7“.

This is different in Dutch, where we use the word for hour when talking about whole hours; “het is 7 uur” (it’s 7 hour) and “het is kwart voor 7” (it’s a quarter to 7). Half hours in Swedish are similar to Dutch: “half 7” or “halv 7” means “half past 6” (although I’ve heard native English speakers using “half 7” as well). Where it gets complicated for me as a Dutch native is when using the minutes between a quarter past the hour and a quarter to the hour, where in Dutch, we use a somewhat strange construction. Twenty minutes past the hour (say twenty past 4) would be “tien voor half 5” (ten before half past 4) and “twenty to 5” would be “tien over half 5” (ten past half past 4), while Swedish follows the more English variant of “tjugo över 4” and “tjugo i 5“.

Using the clock for time looks like Swedish efficiency to me; when the clock is something abstract, we talk about time, otherwise, it’s an actual clock (“var är klockan?” / “where is the clock?“, “hur stor är klockan” / “how big is the clock“), but we just use one word.

Swearing

I hadn’t really realized how awful Dutch swearing actually was until I moved abroad. New colleagues pointed out that swearing with diseases isn’t really a normal thing to do, but we Dutch do it a lot. We use all sorts of diseases when we swear at ourselves (“klere!” – an old word for cholera) or wish others the most horrible things (“krijg de tiefus/tering/klere!” – “get typhus/tuberculosis/cholera!”, synonymous with “fuck you!”). Some diseases are more accepted than others, and using cancer is really pushing it (although definitely heard). And next to all this, we use genitalia a lot.

What I noticed is that in Swedish using genitalia to swear is considered very rude. While in Dutch “kut” (vagina) is used in the same way as the English word “shit” (even though its direct translation would be “cunt”), using the same word in Swedish (“fitta”) is somewhat of a no-go.

Swedes tend to swear with things that relate to hell or the devil, which, to someone used to swearing with diseases and genitalia, sounds rather silly and decent. English speakers would say “what the fuck?!”, while Swedes would say “vad fan?!”, which translates to “what the hell?!”, or more directly “what the devil?!”. Also “jävlar” (damn) is heard a lot, which comes from the word djävul (devil). The same goes for “jävla” and “jävligt”, which are used to intensify adjectives (“det var jävligt gott!” / “that was fucking tasty!”) and also seem to come from something that has to do with the devil.

So I think from now on I’ll be using “devilishly tasty!” or “duivels lekker!”.