Swedish language analysis – part II: verbs and word lengths

This post is part 2 of a series of posts about an analysis of the Swedish language. Other parts are:
Part I: nouns

Verbs

Passive and deponent

Swedish has an interesting way of forming the passive. Where English uses the form “<object> is being <verbed>” and Dutch uses “<noun> wordt ge<verb>”, Swedish uses an inflection of the verb. So the English sentence “the doors are being closed” becomes “dörrarna stängs”. Stängs is a passive inflection of the verb att stänga (to close). Passive forms often end in an “s”.

Now the interesting thing is that there are a number of verbs for which only the passive form is used, even though the meaning is active. Some examples are the verbs att hoppas (to hope) and att träffas (to meet). These verbs have no active form and are called deponent verbs. Before I started my analysis I could name a few, but I was wondering how many more there are. It turns out that there are 248 of them, out of a total of 8345 verbs (≈3%) according to SAOL.
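As a rough illustration, here is a minimal Python sketch of how deponent candidates could be picked out of a verb list. It assumes that the dictionary lists deponent verbs under their -s form and that other infinitives end in a vowel. That is a heuristic for illustration, not necessarily how the numbers above were produced. The sample list and the function name are made up; only the counts 248 and 8345 come from the analysis.

```python
def is_deponent_candidate(lemma):
    """Heuristic (an assumption): a verb listed only in its -s form."""
    return lemma.endswith("s")

# Tiny hypothetical sample, not the SAOL data.
sample_lemmas = ["hoppas", "träffas", "stänga", "vara", "säga"]
candidates = [verb for verb in sample_lemmas if is_deponent_candidate(verb)]

# Share of deponent verbs, using the counts reported above.
deponent_share = 248 / 8345

print(candidates)
print(f"{deponent_share:.1%}")
```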

Groups

Swedish has multiple groups in which verbs are divided. Depending on how a verb is inflected, it belongs to a particular group. Wikipedia has a pretty elaborate article on this, but the basic idea is that there are 4 different groups: groups 1 and 2 are regular, group 3 are short verbs and group 4 are strong and irregular verbs.

I wanted to know how many verbs there are in each group, but unfortunately SAOL doesn’t contain this information. What I did was find a list of verbs belonging to groups 3 and 4 and infer groups 1 and 2 by looking at the present tense inflection of the verb. The present tense of a group 1 verb always ends in -ar and the present tense of a group 2 verb always ends in -er. For some verbs (mostly the deponent ones), I couldn’t figure out which group they belong to, so I marked them as unknown. Also, if you read the Wikipedia article, you’ll see that groups 1, 2 and 4 have subgroups: group 4a, for example, contains the strong verbs and group 4b contains the irregular verbs. Since I didn’t have enough information, I only created 4 groups in my data set: 1, 2, 3 and irregular.
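The inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two lookup sets are tiny hypothetical stand-ins for the group 3 and irregular lists (which lists were actually used is not stated), and the function name is invented.

```python
# Hypothetical stand-ins for the looked-up lists of group 3 and irregular verbs.
GROUP_3 = {"bo", "tro", "sy"}
IRREGULAR = {"vara", "säga", "stinka"}

def infer_group(infinitive, present):
    """Assign a verb to group 1, 2, 3, irregular or unknown."""
    # Known lists take priority over the ending-based rules.
    if infinitive in IRREGULAR:
        return "irregular"
    if infinitive in GROUP_3:
        return "3"
    # Otherwise fall back on the present tense ending.
    if present.endswith("ar"):
        return "1"
    if present.endswith("er"):
        return "2"
    return "unknown"

examples = [("tala", "talar"), ("stänga", "stänger"), ("bo", "bor"),
            ("vara", "är"), ("hoppas", "hoppas")]
groups = {inf: infer_group(inf, pres) for inf, pres in examples}
print(groups)
```

Checking the lists before the endings matters, because an irregular verb can still have a present tense that ends in -er.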

Now if we create a pie chart out of the data we get this:

Verb groups

As you can see, most verbs belong to group 1, and about 88% of Swedish verbs are regular. Slightly unfortunate is the fact that the irregular verbs (the ones you have to know by heart) are also verbs that are used quite often, but I guess frequently used verbs are irregular in many languages. Verbs like att vara (to be), att säga (to say) and att stinka (to stink) are all irregular, for example.

Word length

Another thing I wanted to know is how word lengths are distributed in Swedish. This basically means counting how long each word is and then graphing the result. It looks like this:

Length dist

Here we see that about 12.5% of the words are 9 letters long. By changing this graph a little bit, we can see what share of the words is longer than x letters:

Length dist greater than

Here we can see, for example, that about 53% of all Swedish words are longer than 9 letters.
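Both graphs come down to counting word lengths. Below is a minimal Python sketch, assuming the words are available as a plain list of strings. The sample words and function names are illustrative only, and the words are normalized to NFC so that letters like “ä” count as a single character.

```python
import unicodedata
from collections import Counter

def length_distribution(words):
    """Map each word length to the share of words with that length."""
    lengths = [len(unicodedata.normalize("NFC", w)) for w in words]
    counts = Counter(lengths)
    return {n: counts[n] / len(lengths) for n in sorted(counts)}

def share_longer_than(distribution, x):
    """Share of words that are strictly longer than x letters."""
    return sum(share for n, share in distribution.items() if n > x)

# Tiny hypothetical sample, not the SAOL data.
sample = ["en", "ett", "träd", "öga", "dörr", "hjärta", "stänga", "ordlista"]
dist = length_distribution(sample)
print(dist)
print(share_longer_than(dist, 4))
```

The first function gives the data for the plain distribution, and the second gives one point of the “longer than x” graph for each value of x.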

All of the words in SAOL were used in the previous two graphs, but it’s also interesting to see how the lengths of different types of words are distributed:

Length dist per type

Here we can see some interesting things. Not unexpectedly, nouns follow the same curve as all the words, because they are the biggest group of words and contribute most to the shape of the curve. But verbs and prepositions are generally shorter than nouns. Also, there are more 3 letter pronouns than pronouns of any other length.

If we normalize the data by graphing, for each group, the share of its words that has each length, the graph looks like this:

Length of total

Here we see that the share of 8 letter words is about equal for all the groups, while 24% of the pronouns are 3 letters long compared to only 1% of the adjectives.
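The normalization step can be sketched like this in Python, assuming each entry is a (word, word type) pair. The sample entries and the function name are made up for illustration. Each count is divided by the size of its own group, so small groups like pronouns become comparable to nouns.

```python
from collections import Counter, defaultdict

def length_shares_by_type(entries):
    """For each word type, map word length to its share within that type."""
    counts = defaultdict(Counter)
    for word, word_type in entries:
        counts[word_type][len(word)] += 1
    shares = {}
    for word_type, by_length in counts.items():
        group_size = sum(by_length.values())
        shares[word_type] = {n: c / group_size for n, c in sorted(by_length.items())}
    return shares

# Tiny hypothetical sample, not the SAOL data.
entries = [("jag", "pronoun"), ("han", "pronoun"), ("hon", "pronoun"), ("vi", "pronoun"),
           ("dörr", "noun"), ("träd", "noun"), ("hjärta", "noun"), ("öga", "noun")]
shares = length_shares_by_type(entries)
print(shares)
```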

 

Swedish language analysis – Part I: nouns

This post is part 1 of a series of posts about an analysis of the Swedish language. Other parts are:
Part II: verbs and word lengths

Swedes like to think that Swedish is a very irregular language, especially when it comes to nouns. There are two genders in Swedish, common and neuter, which go with the indefinite articles “en” and “ett” respectively. Swedish doesn’t have definite articles, but instead uses a system of suffixes that are also determined by gender. I’ll not try to explain exactly how this works, because Wikipedia does a much better job than I could.

But how do you know if a noun is common or neuter? Most Swedes will tell you that you just have to know this (the Dutch will tell you the same thing about Dutch, by the way). There are no rules, unlike in many other languages where you can infer the gender (and thus the article) from the ending of a noun. As a foreigner, this makes it pretty hard to speak the language fluently, since you basically have to know all the nouns and their genders. In Swedish, this is especially important, because adjectives are inflected according to the gender (and number) of the noun (something that isn’t done in Dutch).

The topic of nouns and language difficulties comes up quite often during conversations I have with Swedes and non-Swedes. Because I work for an international company where just about all nationalities are represented at our headquarters, lunch conversations often gravitate to languages. During one of them, a colleague of mine said that there is a little rule in Swedish that you can use to better guess the gender of a word. He told me that most nouns that end in an “a” are common. Not all, he said, but betting on common is probably the best you can do when you don’t know the word.

This “hypothesis” has been in my head for a few years now, and I asked many Swedes if they could give me counter-examples of neuter words that ended in an “a”. Over the last few years, I found 3: öga (eye), öra (ear) and hjärta (heart). But I had no way of knowing exactly how many others there were. Until today.

Last year I got hold of a digital copy of the official Swedish word list (Svenska Akademiens ordlista or SAOL). I really wanted to analyze this data, but what I had was a Windows application that didn’t allow for exporting the data. Through some hacking I managed to export all the data from it and ran some analysis. How I managed to extract the data I’ll explain in a follow-up post.

Swedish language statistics

During my quest to prove or disprove my hypothesis, I gathered some general statistics about the Swedish language:

Total words in SAOL: 123274

Because in Swedish you can create compound words, this is not the total number of words in the language.

The next thing I looked at was the distribution of words in SAOL:

nouns                 91808
adjectives            17403
verbs                  8345
variants               2679
adverbs                1451
references              757
interjections           248
numerals                141
in_compounds            131
pronouns                 83
prepositions             79
conjunctions             73
names                    70
articles                  3
adverbial suffixes        1
infinitive particles      1

Nouns

Now on to the nouns. First I thought it would be interesting to look at the distribution of nouns by gender:

Gender distribution

Apart from common (68946 of them) and neuter (21529) nouns, there are 749 nouns that only have a plural form, 579 nouns that can’t be inflected or can’t have an article (these are words like cash, happy hour and heavy metal) and 6 whose gender I couldn’t figure out.

And now for the big question: “How many words that end in an ‘a’ are neuter?”

It turns out that 176 nouns that end in an ‘a’ are neuter. The rest of the nouns that end in an ‘a’ are common. The total number of nouns that end in an ‘a’ is 8620, so neuter nouns make up about 2% of them. So, I guess that one could say that the hypothesis is correct.

But since this ‘rule’ only applies to 8620 words, it’s only helpful for about 9.4% of the nouns.
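For anyone who wants to redo this check on their own word list, here is a minimal Python sketch. It assumes the nouns are available as (word, gender) pairs. The sample list and function name are hypothetical; the counts 176, 8620 and 91808 are the ones reported in this post.

```python
def a_ending_stats(nouns):
    """Return (number of nouns ending in 'a', how many of those are neuter)."""
    ending_in_a = [(word, gender) for word, gender in nouns if word.endswith("a")]
    neuter = sum(1 for _, gender in ending_in_a if gender == "neuter")
    return len(ending_in_a), neuter

# Tiny hypothetical sample, not the SAOL data.
sample = [("flicka", "common"), ("gata", "common"), ("öga", "neuter"),
          ("hus", "neuter"), ("bil", "common")]
total_a, neuter_a = a_ending_stats(sample)

# The same ratios with the counts reported in the post.
neuter_share = 176 / 8620    # neuter nouns among nouns ending in 'a'
coverage = 8620 / 91808      # nouns ending in 'a' among all nouns

print(total_a, neuter_a)
print(f"{neuter_share:.1%} {coverage:.1%}")
```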

Singular and plural

Another thing I noticed about Swedish is that in certain cases the plural indefinite form of a noun is the same as the singular indefinite form (e.g. words like träd (tree): ett träd, flera träd). It seemed apparent that this was more often the case for neuter nouns than for common ones, but I wanted to check whether my instinct was right. So, what did I find?

4789 out of 68946 common nouns are the same in plural as in singular, which is about 7%. For neuter nouns it is 13839 out of 21529, or about 64%.
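A comparison like this can be sketched in Python as follows, assuming each noun comes with its indefinite singular, its indefinite plural and its gender. The sample data and function name are hypothetical; the counts at the end are the ones reported above.

```python
from collections import defaultdict

def same_plural_share(nouns):
    """Per gender, the share of nouns whose plural equals the singular."""
    totals = defaultdict(int)
    same = defaultdict(int)
    for singular, plural, gender in nouns:
        totals[gender] += 1
        if singular == plural:
            same[gender] += 1
    return {gender: same[gender] / totals[gender] for gender in totals}

# Tiny hypothetical sample, not the SAOL data.
sample = [("träd", "träd", "neuter"), ("hus", "hus", "neuter"), ("öga", "ögon", "neuter"),
          ("bil", "bilar", "common"), ("flicka", "flickor", "common")]
sample_shares = same_plural_share(sample)

# The reported counts, expressed as shares.
common_share = 4789 / 68946
neuter_share = 13839 / 21529

print(sample_shares)
print(f"{common_share:.0%} {neuter_share:.0%}")
```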

Next time

Obviously, there are many more crazy statistics that I can pull and I have a few I want to explore. One thing I’m interested in is how many verbs are deponent. Deponent verbs are verbs that are conjugated in a passive way, even though they are used in an active way. Some examples are hoppas (to hope) and träffas (to meet).

Since SAOL is copyrighted material, I’m not going to share the dataset I have, but if you have ideas for statistics you’d like to see, or hypotheses you want to test, let me know.