
king - man + woman is queen; but why?

6 January 2017 | by Piotr Migdał | 12 min read

Intro

word2vec is an algorithm that transforms words into vectors, so that words with similar meanings end up lying close to each other. Moreover, it allows us to use vector arithmetic to work with analogies, for example, the famous king - man + woman = queen.

I will try to explain how it works, with special emphasis on the meaning of vector differences, while omitting as many technicalities as possible.

If you would rather explore than read, here is an interactive exploration by my mentee Julia Bazińska, now a freshman in computer science at the University of Warsaw:

Counts, coincidences and meaning

I love letter co-occurrence in the word co-occurrence.

Sometimes a seemingly naive technique gives powerful results. It turns out that merely looking at word coincidences, while ignoring all grammar and context, can give us insight into the meaning of a word. Consider this sentence:

A small, fluffy roosety climbed a tree.

What's a roosety? I would say it's something like a squirrel, since the two words can be easily interchanged. Such reasoning is called the distributional hypothesis and can be summarized as:

a word is characterized by the company it keeps - John Rupert Firth

If we want to teach this to a computer, the simplest approximate approach is to make it look only at word pairs. Let P(a|b) be the conditional probability that, given a word b, there is a word a within a short distance (let's say, spaced by no more than 2 words). Then we claim that two words a and b are similar if

P(w|a) = P(w|b)

for every word w. In other words, if this equality holds, then every other word occurs near a with the same frequency as it does near b.
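To make this concrete, here is a minimal sketch in plain Python (the toy corpus is made up purely for illustration) that counts co-occurrences within a window of 2 words and estimates P(w|a) from them:

```python
from collections import Counter, defaultdict

# A made-up toy corpus; in practice you need many millions of words.
corpus = [
    "a small fluffy squirrel climbed a tree",
    "a small fluffy cat climbed a tree",
    "the dog chased the cat up the tree",
]

window = 2                          # words spaced by no more than 2 positions
pair_counts = defaultdict(Counter)  # pair_counts[a][w] = co-occurrence count

for sentence in corpus:
    words = sentence.split()
    for i, a in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                pair_counts[a][words[j]] += 1

def p_given(w, a):
    """Estimate P(w|a): the fraction of a's neighbours that are the word w."""
    total = sum(pair_counts[a].values())
    return pair_counts[a][w] / total if total else 0.0

# Words with similar neighbour distributions are candidates for similar meaning.
print(p_given("fluffy", "squirrel"), p_given("fluffy", "cat"))
```

Comparing the full distributions p_given(·, a) and p_given(·, b) for two words a and b is exactly the check from the equality above.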

Even simple word counts, compared by source, can give interesting results, e.g. in the lyrics of metal songs the words cries, eternity or ashes are popular, while particularly or approximately are not, well, particularly common - see Heavy Metal and Natural Language Processing. See also Gender Roles with Text Mining and N-grams by Julia Silge.

Looking at co-occurrences can provide much more information. For example, one of my projects, TagOverflow, gives insight into the structure of programming, based only on the usage of tags on Stack Overflow. It also shows that I am in love with pointwise mutual information, which brings us to the next point.

Pointwise mutual information and compression

The Compressors: View cognition as compression. Compressed sensing, approximate matrix factorization - The (n) Cultures of Machine Learning - HN discussion

In principle, we can compute P(a|b) for every word pair. But even with a small dictionary of 100k words (bear in mind that we need to keep all declensions, proper names and things that are not in official dictionaries, yet are in use) keeping track of all pairs would require 8 gigabytes of space.

Often instead of working with conditional probabilities, we use the pointwise mutual information (PMI), defined as:

PMI(a, b) = \log \left[ \frac{P(a,b)}{P(a)P(b)} \right] = \log \left[ \frac{P(a|b)}{P(a)} \right].

Its direct interpretation is how much more likely it is to see these two words together than if they occurred independently at random. The logarithm makes it easier to work with words appearing at frequencies of different orders of magnitude. We can approximate PMI as a scalar product:

PMI(a, b) = \vec{v}_a \cdot \vec{v}_b,

where \vec{v}_i are vectors, typically of 50-300 dimensions.

At first glance, it may be strange that all words can be compressed to a space of much smaller dimensionality. But some words can be trivially interchanged (e.g. John to Peter) and there is a lot of structure in general.

The fact that this compression is lossy may give it an advantage, as it can discover patterns rather than only memorize each pair. For example, in recommendation systems for movie ratings, each rating is approximated by a scalar product of two vectors - a movie's content and a user's preference. This is used to predict scores for as yet unseen movies, see Matrix Factorization with TensorFlow - Katherine Bailey.
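To illustrate this kind of compression on words, here is a sketch using only numpy. The co-occurrence counts are a made-up toy, and it uses the common positive-PMI variant (clipping negative values to zero), which the text above does not describe; the idea is just that a truncated SVD yields low-dimensional vectors with PMI(a, b) ≈ \vec{v}_a \cdot \vec{v}_b:

```python
import numpy as np

# Toy co-occurrence counts (rows and columns share the same vocabulary);
# in practice these come from scanning a large corpus.
vocab = ["king", "queen", "man", "woman", "crown"]
counts = np.array([
    [0, 8, 6, 2, 9],
    [8, 0, 2, 6, 9],
    [6, 2, 0, 5, 1],
    [2, 6, 5, 0, 1],
    [9, 9, 1, 1, 0],
], dtype=float)

total = counts.sum()
p_ab = counts / total                # joint probabilities P(a, b)
p_a = counts.sum(axis=1) / total     # marginal probabilities P(a)

# Positive PMI: negative values (and log of zero counts) are clipped to 0.
with np.errstate(divide="ignore"):
    pmi = np.log(p_ab / np.outer(p_a, p_a))
ppmi = np.maximum(pmi, 0)

# Truncated SVD: keep only k dimensions, so the reconstruction is lossy.
k = 2
U, S, Vt = np.linalg.svd(ppmi)
vectors = U[:, :k] * np.sqrt(S[:k])  # one k-dimensional vector per word

print(vectors @ vectors.T)           # low-rank approximation of the PPMI matrix
```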

Word similarity and vector closeness

Stock Market ≈ Thermometer - Five crazy abstractions my word2vec model just did

Let us start with the simple stuff - showing word similarity in a vector space. The condition that P(w \vert a) = P(w \vert b) is equivalent to

PMI(w, a) = PMI(w, b),

by dividing both sides by P(w) and taking their logarithms. After expressing PMI with vector products, we get

\vec{v}_w \cdot \vec{v}_a = \vec{v}_w \cdot \vec{v}_b

\vec{v}_w \cdot \left( \vec{v}_a - \vec{v}_b \right) = 0

If it needs to work for every \vec{v}_w, then

\vec{v}_a = \vec{v}_b.

Of course, in every practical case we won't get an exact equality, just words being close to each other. Words close in this space are often synonyms (e.g. happy and delighted), antonyms (e.g. good and evil) or other easily interchangeable words (e.g. yellow and blue). In particular, most opposing ideas (e.g. religion and atheism) will have similar contexts. If you want to play with that, look at this word2sense phrase search.

What I find much more interesting is that words form a linear space. In particular, a zero vector represents a totally uncharacteristic word, occurring with every other word at the random chance level (as its scalar product with every word is zero, so is its PMI).

It is one of the reasons why for vector similarity people often use the cosine similarity, i.e.

\frac{\vec{v}_a \cdot \vec{v}_b}{\vert \vec{v}_a \vert \vert \vec{v}_b \vert}.

That is, it puts emphasis on the direction in which a given word co-occurs with other words, rather than the strength of this effect.
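A minimal numpy sketch of this measure (the vectors below are made-up placeholders, not trained embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors; ignores their lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Placeholder 4-dimensional "word vectors"; real ones have 50-300 dimensions.
v_happy = np.array([0.9, 0.1, 0.3, 0.0])
v_delighted = np.array([0.8, 0.2, 0.4, 0.1])
v_tree = np.array([0.0, 0.9, 0.1, 0.7])

print(cosine_similarity(v_happy, v_delighted))  # close to 1: similar contexts
print(cosine_similarity(v_happy, v_tree))       # closer to 0: unrelated words
```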

Analogies and linear space

If we want to make word analogies (a is to b as A is to B), one may argue that they can be expressed as an equality of conditional probability ratios

\frac{P(w|a)}{P(w|b)} = \frac{P(w|A)}{P(w|B)}

for every word w. Sure, it looks (and is!) a questionable assumption, but it is still the best thing we can do with conditional probability. I will not try defending this formulation - I will show its equivalent formulations.

For example, if we want to say dog is to puppy as cat is to kitten, we expect that if e.g. the word nice co-occurs with both dog and cat (likely with different frequencies), then it co-occurs with puppy and kitten scaled by the same factor. It appears to be true, with a factor of two favoring the cubs - compare pairs to words in the Google Books Ngram Viewer (while n-grams look only at adjacent words, they serve as some sort of approximation).

By proposing ratios for word analogies we implicitly assume that the probabilities of words can be factorized with respect to different dimensions of a word. For the discussed case it would be:

P(w \vert dog) = f(w \vert species=dog) \times f(w \vert age=adult) \times P(w \vert is\_a\_pet)
P(w \vert puppy) = f(w \vert species=dog) \times f(w \vert age=cub) \times P(w \vert is\_a\_pet)
P(w \vert cat) = f(w \vert species=cat) \times f(w \vert age=adult) \times P(w \vert is\_a\_pet)
P(w \vert kitten) = f(w \vert species=cat) \times f(w \vert age=cub) \times P(w \vert is\_a\_pet)

So, in particular:

\frac{P(w|dog)}{P(w|puppy)} = \frac{f(w \vert age=adult)}{f(w \vert age=cub)} = \frac{P(w|cat)}{P(w|kitten)}.

How does the equality of conditional probability ratios translate to word vectors? If we express it with pointwise mutual information (again, dividing by P(w) and taking logarithms), we get

\vec{v}_w \cdot \vec{v}_a - \vec{v}_w \cdot \vec{v}_b = \vec{v}_w \cdot \vec{v}_A - \vec{v}_w \cdot \vec{v}_B,

which is the same as

\vec{v}_w \cdot \left( \vec{v}_a - \vec{v}_b - \vec{v}_A + \vec{v}_B \right) = 0.

Again, if we want it to hold for any word w, this vector difference needs to be zero.
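In practice the difference is never exactly zero, so we rearrange it and look for the word whose vector lies closest (by cosine similarity) to \vec{v}_a - \vec{v}_b + \vec{v}_B. A minimal numpy sketch with made-up placeholder vectors:

```python
import numpy as np

# Hypothetical low-dimensional vectors, just to show the arithmetic;
# trained embeddings would have 50-300 dimensions.
vectors = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.2, 0.8, 0.1]),
    "woman": np.array([0.2, 0.1, 0.8]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
print(max(candidates, key=lambda w: cosine(vectors[w], target)))  # queen
```

The query words themselves are excluded from the candidates, as is customary, since they tend to be the nearest neighbours of the result.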

We can use analogies for meaning (e.g. changing gender with vectors), grammar (e.g. changing tenses) or other analogies (e.g. cities into their zip codes). It seems that analogies are not only a computational trick - we may actually use them to think all the time, see:

Differences and projections

woman - man = female - male = she - he, so wo = fe = s (a joke)

Differences of word vectors, like

\vec{v}_{she} - \vec{v}_{he}

are not word vectors themselves. However, it is interesting to project a word onto this axis. We can see that the projection

\vec{v}_w \cdot \left( \vec{v}_a - \vec{v}_b \right) = \log\left[ P(w|a) \right] - \log\left[ P(w|b) \right]

is exactly the relative occurrence of a word within the two contexts.
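A sketch of such a projection onto the she - he axis (the vectors are again made-up placeholders): positive values mean a word occurs relatively more often near she, negative values near he.

```python
import numpy as np

# Hypothetical word vectors, purely for illustration.
vectors = {
    "she":      np.array([0.1, 0.9, 0.2]),
    "he":       np.array([0.9, 0.1, 0.2]),
    "actress":  np.array([0.2, 0.8, 0.5]),
    "actor":    np.array([0.8, 0.2, 0.5]),
    "engineer": np.array([0.5, 0.5, 0.6]),
}

axis = vectors["she"] - vectors["he"]   # the "she minus he" direction

for word in ("actress", "actor", "engineer"):
    projection = np.dot(vectors[word], axis)
    # log P(word|she) - log P(word|he): > 0 leans towards she, < 0 towards he
    print(f"{word:9s} {projection:+.2f}")
```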

Bear in mind that when we want to look at the common aspects of two words, it is more natural to average their vectors rather than take their sum. While people use the two interchangeably, it only works because cosine similarity ignores the absolute vector length. So, for a gender-neutral pronoun use (\vec{v}_{she} + \vec{v}_{he})/2 rather than the sum.

Just looking at word collocations can give interesting results - look at these artistic projects: Word Spectrum and Word Associations from Visualizing Google's Bi-Gram Data by Chris Harrison.

I want to play!

If you want to explore, there is Word2viz by Julia Bazińska. You can choose one of several pre-defined plots, or create one from scratch (choosing words and projections). I've just realized that Google Research released a tool for that as well: Open sourcing the Embedding Projector: a tool for visualizing high dimensional data - Google Research blog (and the actual live demo: Embedding Projector).

If you want to use pre-trained vectors, see the Stanford GloVe or Google word2vec download sections. Some examples in Jupyter Notebooks are in our playground, github.com/lamyiowce/word2viz (warning: raw state, look at them at your own risk).
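For example, with gensim you can load a pre-trained word2vec file and query it directly (a sketch; the file name is a placeholder for whichever download you pick):

```python
from gensim.models import KeyedVectors

# Placeholder path - point it at e.g. the word2vec binary you downloaded.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Analogy by vector arithmetic and a plain similarity query.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(vectors.similarity("happy", "delighted"))
```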

If you want to train it on your own dataset, use the Python library gensim: Topic modelling for humans. Often some preprocessing is needed - see especially Sense2vec with spaCy and Gensim by Matthew Honnibal.
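A minimal training sketch with gensim (parameter names follow recent gensim versions; the whitespace tokenization here is exactly the kind of naive preprocessing you would normally improve):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; real preprocessing (lowercasing, phrase
# detection, lemmatization) matters a lot for the quality of the vectors.
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a small fluffy squirrel climbed a tree".split(),
    # ... many more sentences
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20)

print(model.wv.most_similar("squirrel", topn=3))
```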

If you want to create it from scratch, the most convenient way is to start with Vector Representations of Words - TensorFlow Tutorial (see also the corresponding Jupyter Notebook from the Udacity course).

If you want to learn how it works, I recommend the following materials:

Julia Bazińska, at the rooftop garden of the Warsaw University Library - the building in which we worked

Technicalities

I tried to give some insight into algorithms transforming words into vectors. Every practical approach needs a bit more tweaking. Here are a few clarifications:

For further reading, I recommend:

Some backstory

I got interested in word2vec and related techniques since it combines quite a few things I appreciate:

I had a motivation to learn more on the subject as I was tutoring Julia Bazińska during a two-week summer internship at DELab, University of Warsaw, supported by the Polish Children's Fund. See also my blog posts:

Thanks

This draft benefited from feedback from Grzegorz Uriasz (what's simple and what isn't) and Sarah Martin (readability and grammar remarks). I want to especially thank Omer Levy for pointing out weak points (and shady assumptions) of word vector arithmetic.