Zipf's Law
Zipf's law states that, given a corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
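The rank-frequency table that Zipf's law describes is straightforward to compute. A minimal sketch in plain Python (the sample string is invented for illustration; real Zipfian behaviour only emerges on large corpora):

```python
from collections import Counter

def rank_frequency(text):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(text.lower().split()).most_common()

# Toy corpus, purely illustrative.
sample = "the cat sat on the mat and the cat ran"
table = rank_frequency(sample)
# Under Zipf's law, the count at rank r is roughly (count at rank 1) / r.
for rank, (word, count) in enumerate(table, start=1):
    print(rank, word, count)
```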
Word frequency alone, however, is often of limited use. For example, if a text is scrambled, the word frequencies stay the same, yet the text no longer makes sense. Information is organized into sentences, which are built from words in interaction with each other. Words interact in many ways: some words co-occur with certain other words at much higher probability than chance would suggest.
A lexicon is a catalogue of a language's words (its wordstock), and a grammar is a system of rules that allows those words to be combined into meaningful sentences.
The Word Co-occurrence Graph
Many co-occurrences arise from syntactic relationships between words. In the basic method, links are formed between words that co-occur within a short distance of each other in the same sentence; the improved method keeps only the statistically significant co-occurrences. The largest connected component that results from the basic method is called the Unrestricted Word Network (UWN), and the one from the improved method the Restricted Word Network (RWN).
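The basic linking rule can be sketched in a few lines. This assumes links between words at most two positions apart within the same sentence (a common choice for such networks; the sentence list here is invented):

```python
from collections import defaultdict

def cooccurrence_graph(sentences, window=2):
    """Link words that appear within `window` positions of each other
    in the same sentence (the 'basic' linking rule)."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            # Link w to the next `window` words in the sentence.
            for j in range(i + 1, min(i + window + 1, len(words))):
                if words[j] != w:
                    graph[w].add(words[j])
                    graph[words[j]].add(w)
    return graph

g = cooccurrence_graph(["the cat sat", "the dog sat"])
```

On a real corpus, the UWN would then be the largest connected component of this graph; the RWN would additionally filter edges by statistical significance.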
The figure below shows the degree distribution curves for the British National Corpus. They decay with an average exponent γ = -2.7. The exponent in the second regime is close to that of the Barabási-Albert (BA) model (γ = -3), which produces scale-free distributions through preferential attachment.
|Degree distribution of the UWN (filled circles) and RWN (open circles)|
Further, the word networks have small-world features: the average minimum distance between vertices is about 2.6, and the clustering coefficient is far higher than that of a comparable random graph.
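Both small-world quantities are easy to compute on a graph stored as an adjacency dict like the one above. A self-contained sketch on an invented four-node toy graph (pure Python; a library such as NetworkX would normally be used at scale):

```python
from collections import deque
from itertools import combinations

def avg_shortest_path(graph):
    """Mean BFS distance over all connected ordered vertex pairs."""
    total, pairs = 0, 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(graph):
    """Average fraction of each vertex's neighbour pairs that are linked."""
    coeffs = []
    for node, nbrs in graph.items():
        nbrs = list(nbrs)
        if len(nbrs) < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        coeffs.append(links / (len(nbrs) * (len(nbrs) - 1) / 2))
    return sum(coeffs) / len(coeffs)

# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
L = avg_shortest_path(toy)
C = clustering_coefficient(toy)
```

A network is small-world when L stays small (growing roughly logarithmically with size) while C stays far above the random-graph expectation.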
|Connectivity distribution of the 5000 most connected vertices in the RWN|
The exponent of the power-law tail is nearly 3, again indicating preferential attachment.
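Such an exponent is typically estimated as the slope of the degree distribution on log-log axes. A minimal least-squares sketch, verified here on synthetic data generated with the γ = -2.7 reported above (real degree data is noisy, so maximum-likelihood estimators are preferred in practice):

```python
import math

def fit_loglog_slope(points):
    """Least-squares slope of log(y) against log(x); for a power law
    y = C * x**gamma this recovers gamma."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic noise-free power law with gamma = -2.7.
data = [(k, k ** -2.7) for k in range(1, 51)]
slope = fit_loglog_slope(data)
```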
Some studies of lexical networks with small-world characteristics suggest that polysemy (many possible meanings for a word) and homophony (two words of different origin that have identical pronunciation) are what give rise to this structure.
Two things follow if the small-world property holds:
- The existence of words that speed up navigation to other words. These are called particles and carry no semantic value, e.g. articles and prepositions.
- The existence of brain disorders characterized by navigation deficits involving such words, e.g. agrammatism.
Language Clustering with Word Co-occurrences
Statistics from different languages show that a word co-occurrence network and the corresponding syntactic dependency network (two words connected if they are syntactically dependent) are highly similar in network topology. A word co-occurrence network can therefore substitute for a syntactic dependency network in studies of linguistic networks.
|Clustering of 14 word co-occurrence networks with eight complex network parameters|
Running network algorithms on the same text in parallel languages (as a substitute for syntactic dependency networks) can help us cluster languages. This work was done by LIU HaiTao and CONG Jin in 2012. They constructed 14 word co-occurrence networks from parallel texts of 12 Slavic languages and 2 non-Slavic languages, and analyzed them using different combinations of major complex network parameters. They were able to distinguish the Slavic and non-Slavic groups and also to place some of the Slavic languages in their respective sub-branches. The clustering thus captured the genetic relationships of some of these Slavic languages within their sub-branches.
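The idea behind this kind of clustering can be sketched simply: represent each language's network as a vector of network parameters and group languages whose vectors lie close together. The parameter values below are invented for illustration (not the ones measured by Liu and Cong), and a simple distance-threshold grouping stands in for the paper's actual clustering procedure:

```python
import math

# Hypothetical parameter vectors per language, e.g.
# [average degree, clustering coefficient, average path length].
# Illustrative numbers only.
params = {
    "Russian": [12.1, 0.41, 2.9],
    "Polish":  [12.3, 0.40, 2.8],
    "Czech":   [11.9, 0.42, 2.9],
    "English": [15.0, 0.55, 2.4],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(vectors, threshold):
    """Single-linkage grouping via union-find: merge any two items
    whose parameter vectors lie within `threshold` of each other."""
    names = list(vectors)
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if euclidean(vectors[a], vectors[b]) <= threshold:
                parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

groups = cluster(params, 1.0)
```

With these toy vectors, the three Slavic languages fall into one group and English into another, mirroring the Slavic/non-Slavic split the paper reports at a much larger scale.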