How does it work?

Some words in the English language frequently occur together. One example is "Turing" and "Bletchley": the former refers to the famous British computer scientist Alan Turing and the latter to Bletchley Park, where he worked. We call such words semantically related, meaning that their co-occurrence is not due to chance but to some non-trivial relationship. Such relationships include, for example, similar syntactic roles or similar meanings.

Semantic Link analyzes the text of the English Wikipedia and attempts to find all pairs of words which are semantically related. For that purpose it uses a statistical measure called mutual information, or MI for short. The higher the MI for a given pair of words, the higher the chance that they are related.
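
To make the idea of MI more concrete, here is a minimal sketch of how it could be computed for one pair of words from simple counts. It is an illustration only, not necessarily how Semantic Link computes it: the function name `mutual_information`, its parameters, and the choice of counting "text units" (for example sentences or articles) in which each word appears are all assumptions made for this example.

```python
import math

def mutual_information(n_xy, n_x, n_y, n):
    """Mutual information (in bits) between two binary events:
    'word x appears in a text unit' and 'word y appears in a text unit'.

    n_xy: units containing both words (hypothetical count)
    n_x:  units containing word x
    n_y:  units containing word y
    n:    total number of units
    """
    # Each entry: (joint count, marginal count for x-state, marginal count for y-state)
    cells = [
        (n_xy, n_x, n_y),                           # both words present
        (n_x - n_xy, n_x, n - n_y),                 # only x present
        (n_y - n_xy, n - n_x, n_y),                 # only y present
        (n - n_x - n_y + n_xy, n - n_x, n - n_y),   # neither present
    ]
    mi = 0.0
    for joint, margin_x, margin_y in cells:
        if joint > 0:  # empty cells contribute nothing to the sum
            # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
            mi += (joint / n) * math.log2(joint * n / (margin_x * margin_y))
    return mi
```

With these definitions, two words that appear independently of each other score 0, and the score grows as the presence of one word tells you more about the presence of the other.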

Semantic Link displays the top 100 related words for each query. The search is currently limited to words that have at least 1,000 occurrences in Wikipedia, but if there is enough interest I will gladly expand the database.
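
The lookup described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function `top_related` and the in-memory dictionaries `scores` (pair of words to MI) and `word_counts` (word to number of occurrences) are hypothetical and say nothing about how the actual database is stored.

```python
import heapq

MIN_OCCURRENCES = 1000  # frequency cutoff mentioned in the text
TOP_K = 100             # number of related words displayed per query

def top_related(query, scores, word_counts, k=TOP_K, min_count=MIN_OCCURRENCES):
    """Return up to k words related to `query`, highest MI first.

    scores:      dict mapping (word_a, word_b) -> MI score (hypothetical layout)
    word_counts: dict mapping word -> number of occurrences (hypothetical layout)
    """
    # Words below the frequency cutoff are not searchable at all.
    if word_counts.get(query, 0) < min_count:
        return []
    candidates = []
    for (a, b), score in scores.items():
        # A pair may list the query word in either position.
        if a == query:
            other = b
        elif b == query:
            other = a
        else:
            continue
        if word_counts.get(other, 0) >= min_count:
            candidates.append((score, other))
    # Keep only the k highest-scoring partners.
    return [word for _, word in heapq.nlargest(k, candidates)]
```

Lowering `min_count` in this sketch corresponds to expanding the database to rarer words.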

The tools used to generate the dataset for Semantic Link are licensed under the GPL and are available for download at http://mpacula.com/autocorpus/.