Simple word2vec and doc2vec from PMI matrix decomposition

Earlier I made a post about decomposing a PMI matrix to get label embeddings. That was turning the sparse label representation into a dense one in a multilabel classification problem, i.e. operating on the right-hand side of our AX = Y problem. The same technique works on the left-hand side too. Specifically, in a text classification problem, we can use PMI matrix decomposition to get word vectors as well as a simple doc2vec transformation.

Here is a short code snippet demonstrating how it works:

We first construct the simplest binary doc-term matrix M, use it to build the PMI matrix PP, and then decompose that into an embedding matrix E consisting of 300-dimensional word vectors.
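A minimal sketch of these steps, assuming scikit-learn and SciPy, a list of raw documents docs, and document-level co-occurrence (everything except M, PP, and E is an illustrative name):

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import CountVectorizer

# docs: a list of raw text documents (assumed available)
vectorizer = CountVectorizer(binary=True, min_df=5, stop_words="english")
M = vectorizer.fit_transform(docs)            # binary doc-term matrix, (n_docs, n_terms)
vocab = np.array(vectorizer.get_feature_names_out())

# word-word co-occurrence counts: two words co-occur if they share a document
C = (M.T @ M).toarray().astype(float)

# positive PMI: max(0, log p(i, j) / (p(i) * p(j)))
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1) / total
with np.errstate(divide="ignore", invalid="ignore"):
    PP = np.log(p_ij / np.outer(p_i, p_i))
PP[~np.isfinite(PP)] = 0.0
PP = np.maximum(PP, 0.0)

# decompose PP into 300-dimensional word vectors (vocabulary must exceed 300 terms)
U, S, _ = svds(PP, k=300)
E = U * np.sqrt(S)                            # embedding matrix: one 300-d vector per word
```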

Given the word boston, we can find similar words by computing the cosine similarity between word vectors and sorting the vocabulary by it. The top 20 words most similar to boston turn out to be cities, places, and sports-related words.
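A small helper along these lines, reusing vocab and E from the sketch above (most_similar is a hypothetical name, not from the original post):

```python
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(word, topn=20):
    # rank the whole vocabulary by cosine similarity to the query word's vector
    idx = np.where(vocab == word)[0][0]
    sims = cosine_similarity(E[idx:idx + 1], E).ravel()
    best = np.argsort(-sims)[1:topn + 1]      # drop the query word itself
    return list(zip(vocab[best], sims[best]))

most_similar("boston")
```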

We can also transform the original news articles into 300-dimensional vectors by simply taking the dot product of M and E, and feed the resulting matrix into downstream modeling tasks. Here we fit a logistic regression that achieves 0.82 on both recall and precision (and hence 0.82 on F1 as well).
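A sketch of that pipeline, assuming a label vector y is available (the split and solver settings are illustrative; the 0.82 figures above are from the original run, not something this sketch guarantees):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 300-d document vectors: each document is the sum of its word vectors
D = M @ E                                     # shape (n_docs, 300)

# y: the article labels (assumed available)
X_train, X_test, y_train, y_test = train_test_split(D, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```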

We can also visualize the word vectors via t-SNE.
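A minimal plotting sketch with matplotlib, reusing E and vocab from above (annotating a random sample of 100 words to keep the plot readable; t-SNE on a large vocabulary can be slow):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# project the 300-d word vectors down to 2-D
coords = TSNE(n_components=2, random_state=0).fit_transform(E)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for i in np.random.choice(len(vocab), 100, replace=False):  # label a random sample
    plt.annotate(vocab[i], coords[i])
plt.show()
```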

It does show that the more frequently two words co-occur, the closer they sit in the embedding space.