Swedish language analysis – Part I: nouns

This post is Part I of a series of posts about an analysis of the Swedish language. Other parts are:
Part II: verbs and word lengths

Swedes like to think that Swedish is a very irregular language, especially when it comes to nouns. There are two genders in Swedish, common and neuter, which go with the indefinite articles “en” and “ett” respectively. Swedish doesn’t have definite articles, but instead uses a system of suffixes that are also determined by gender. I won’t try to explain exactly how this works, because Wikipedia does a much better job than I could.

But how do you know if a noun is common or neuter? Most Swedes will tell you that you just have to know this (the Dutch will tell you the same thing about Dutch, by the way). There are no rules, unlike in many other languages where you can infer the gender (and thus the article) from the ending of a noun. For a foreigner, this makes it pretty hard to speak the language fluently, since you basically have to know all the nouns and their genders. In Swedish, this is especially important, because adjectives are inflected by the gender (and number) of the noun (something that isn’t done in Dutch).

The topic of nouns and language difficulties comes up quite often in conversations I have with Swedes and non-Swedes. Because I work for an international company where just about every nationality is represented at our headquarters, lunch conversations often gravitate to languages. During one of them, a colleague of mine mentioned a little rule of thumb in Swedish for guessing the gender of a word. He told me that most nouns that end in an “a” are common. Not all, he said, but betting on common is probably the best you can do when you don’t know the word.

This “hypothesis” has been in my head for a few years now, and I have asked many Swedes if they could give me counter-examples: neuter words that end in an “a”. Over the last few years, I found three: öga (eye), öra (ear) and hjärta (heart). But I had no way of knowing exactly how many others there were. Until today.

Last year I got hold of a digital copy of the official Swedish word list (Svenska Akademiens ordlista, or SAOL). I really wanted to analyze this data, but what I had was a Windows application that didn’t allow for exporting the data. Through some hacking I managed to export all the data from it and ran some analysis. How I managed to extract the data I’ll explain in a follow-up post.

Swedish language statistics

During my quest to prove or disprove my hypothesis, I gathered some general statistics about the Swedish language:

Total words in SAOL: 123274

Because in Swedish you can create compound words, this is not the total number of words in the language.

The next thing I looked at was the distribution of words in SAOL:

nouns: 91808
adjectives: 17403
verbs: 8345
variants: 2679
adverbs: 1451
references: 757
interjections: 248
numerals: 141
in_compounds: 131
pronouns: 83
prepositions: 79
conjunctions: 73
names: 70
articles: 3
adverbial suffixes: 1
infinitive particles: 1

Nouns

Now on to the nouns. First I thought it would be interesting to look at the gender distribution of nouns:

[Figure: Gender distribution]

Apart from common (68946 of them) and neuter (21529) nouns, there are 749 nouns that only have a plural form, 579 nouns that can’t be inflected or can’t have an article (these are words like cash, happy hour and heavy metal) and 6 whose gender I couldn’t figure out.

And now for the big question: “How many words that end in an ‘a’ are neuter?”

It turns out that 176 nouns that end in an ‘a’ are neuter. The rest of the nouns that end in an ‘a’ are common. The total number of nouns that end in an ‘a’ is 8620, so the neuter ones make up about 2% of them. So, I guess that one could say that the hypothesis is correct.

But since this ‘rule’ only applies to 8620 of the 91808 nouns, it’s only helpful for about 9.4% of the nouns.
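For readers who want to try a similar count on their own word list, here is a minimal Python sketch. It is not the code used for the numbers above. The sample list, its (noun, gender) layout and the function names are all assumptions made up for illustration, since the SAOL data itself can't be shared. The last lines simply redo the arithmetic on the figures reported in this post.

```python
from collections import Counter

# Hypothetical toy sample; the real SAOL data is not reproduced here.
# Each entry is (noun, gender), with gender "common" (en) or "neuter" (ett).
SAMPLE_NOUNS = [
    ("flicka", "common"),
    ("gata", "common"),
    ("klocka", "common"),
    ("öga", "neuter"),
    ("öra", "neuter"),
    ("hjärta", "neuter"),
    ("hus", "neuter"),
    ("bil", "common"),
]


def gender_counts_for_ending(nouns, ending):
    """Count genders among the nouns whose spelling ends with `ending`."""
    return Counter(gender for word, gender in nouns if word.endswith(ending))


def share(part, whole):
    """Return part/whole as a percentage, or 0.0 for an empty whole."""
    return 100.0 * part / whole if whole else 0.0


counts = gender_counts_for_ending(SAMPLE_NOUNS, "a")
print(counts)

# The same arithmetic applied to the figures reported in this post.
neuter_a, total_a, total_nouns = 176, 8620, 91808
print(f"neuter share among a-nouns: {share(neuter_a, total_a):.1f}%")
print(f"a-nouns among all nouns: {share(total_a, total_nouns):.1f}%")
```

The toy sample is deliberately skewed towards the known exceptions, so its proportions say nothing about Swedish; only the last two lines use real numbers.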

Singular and plural

Another thing I noticed about Swedish is that in certain cases the plural indefinite form of a noun is the same as the singular indefinite form (e.g. träd (tree): ett träd, flera träd). It seemed to me that this was more often the case for neuter nouns than for common ones, but I wanted to check whether my instinct was right. So, what did I find?

4789 out of 68946 common nouns are the same in plural as in singular, which is about 7%. For neuter nouns it is 13839 out of 21529, which is about 64%.
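A comparison like this could be sketched in Python as follows. Again, this is only an illustration under assumptions: the (singular, plural, gender) records and the function name are hypothetical, and the handful of sample words stands in for the real word list.

```python
# Hypothetical toy sample: (indefinite singular, indefinite plural, gender).
SAMPLE_FORMS = [
    ("träd", "träd", "neuter"),
    ("hus", "hus", "neuter"),
    ("äpple", "äpplen", "neuter"),
    ("bil", "bilar", "common"),
    ("flicka", "flickor", "common"),
    ("lärare", "lärare", "common"),
]


def unchanged_plural_rate(forms):
    """Map each gender to (unchanged count, total count, percentage unchanged)."""
    totals = {}
    unchanged = {}
    for singular, plural, gender in forms:
        totals[gender] = totals.get(gender, 0) + 1
        if singular == plural:
            unchanged[gender] = unchanged.get(gender, 0) + 1
    return {
        gender: (unchanged.get(gender, 0), total, 100.0 * unchanged.get(gender, 0) / total)
        for gender, total in totals.items()
    }


rates = unchanged_plural_rate(SAMPLE_FORMS)
for gender, (same, total, pct) in rates.items():
    print(f"{gender}: {same} of {total} unchanged ({pct:.0f}%)")

# The same percentages computed from the counts reported in this post.
print(f"common: {100 * 4789 / 68946:.0f}%")
print(f"neuter: {100 * 13839 / 21529:.0f}%")
```

Grouping by gender in a single pass keeps the sketch short; with a real export, the records would first have to be filtered down to nouns that actually have both a singular and a plural form.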

Next time

Obviously, there are many more crazy statistics that I can pull and I have a few I want to explore. One thing I’m interested in is how many verbs are deponent. Deponent verbs are verbs that are conjugated in a passive way, even though they are used in an active way. Some examples are hoppas (to hope) and träffas (to meet).

Since SAOL is copyrighted material, I’m not going to share the dataset I have, but if you have ideas for statistics you’d like to see, or hypotheses you want to test, let me know.