# Simple word2vec and doc2vec from PMI matrix decomposition

Earlier I made a post about decomposing the PMI matrix for label embedding. There we turned the sparse label representation into a dense one for a multilabel classification problem, i.e. we operated on the right-hand side of our $AX = Y$ problem. The same technique works on the left-hand side too. Specifically, for text classification problems we can use PMI matrix decomposition to get word vectors as well as a simple doc2vec transformation.

Here is a short code snippet demonstrating how it works:

We first construct the simplest binary doc-term matrix $M$, use it to build the PMI matrix, and then decompose that into an embedding matrix $E$ whose rows are word vectors of dimension 300.
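The construction can be sketched roughly as follows. This is a minimal sketch, not the original snippet: the toy corpus, variable names, and the small embedding dimension are illustrative stand-ins.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the news articles.
docs = [
    "boston celtics win the game",
    "new york yankees win the game",
    "boston red sox lose the game",
    "stock market rises in new york",
]

# Binary doc-term matrix M: 1 if the word appears in the document.
vectorizer = CountVectorizer(binary=True)
M = vectorizer.fit_transform(docs).toarray().astype(float)

# PMI[i, j] = log(P(i, j) / (P(i) * P(j))), with probabilities
# estimated from document-level co-occurrence counts.
n_docs = M.shape[0]
co = M.T @ M / n_docs            # P(i, j): fraction of docs with both words
p = M.mean(axis=0)               # P(i): fraction of docs with word i
with np.errstate(divide="ignore"):
    pmi = np.log(co / np.outer(p, p))
pmi[np.isneginf(pmi)] = 0.0      # zero out pairs that never co-occur

# Decompose PMI into word vectors: E = U_k * sqrt(Sigma_k).
k = 2                            # 300 in the post; 2 for this toy corpus
U, S, _ = np.linalg.svd(pmi)
E = U[:, :k] * np.sqrt(S[:k])    # one row of E per vocabulary word
```

Each row of `E` is then the dense vector for one vocabulary word.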

Given the word boston, we can find similar words by computing and sorting cosine similarities between word vectors. The top 20 words most similar to boston turn out to be cities, places, and sports-related words.
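The similarity lookup amounts to ranking rows of $E$ by cosine similarity. A small self-contained sketch, where the tiny hand-made vocabulary and embedding stand in for the real 300-dimensional vectors:

```python
import numpy as np

def most_similar(word, vocab, E, topn=5):
    """Rank vocabulary words by cosine similarity to `word` in E."""
    idx = {w: i for i, w in enumerate(vocab)}
    v = E[idx[word]]
    # Cosine similarity between v and every row of E.
    sims = E @ v / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-12)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

# Tiny illustrative embedding: "celtics" is deliberately close to "boston".
vocab = ["boston", "celtics", "chicago", "stock"]
E = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [0.0, 1.0]])
print(most_similar("boston", vocab, E, topn=3))
```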

We can also transform the original news articles into 300-dimensional vectors by simply taking the dot product between $M$ and $E$, and feed the resulting matrix into downstream modeling tasks. Here, for example, a logistic regression on these vectors achieves 0.82 on both recall and precision (and hence 0.82 on F1 too).
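The doc2vec step is literally one matrix product. A sketch with synthetic stand-in data (so the scores will not match the 0.82 reported above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for the real data: binary doc-term matrix M (docs x vocab)
# and word-embedding matrix E (vocab x k) from the PMI decomposition.
n_docs, vocab_size, k = 200, 50, 8
M = (rng.random((n_docs, vocab_size)) < 0.2).astype(float)
E = rng.standard_normal((vocab_size, k))

# doc2vec: each document becomes the sum of its word vectors.
doc_vecs = M @ E                          # shape: (n_docs, k)

# Feed the dense document vectors into a downstream classifier.
y = (M[:, 0] + M[:, 1] > 0).astype(int)   # synthetic labels for illustration
clf = LogisticRegression(max_iter=1000).fit(doc_vecs, y)
print("train accuracy:", clf.score(doc_vecs, y))
```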

We can visualize the word vectors via t-SNE.

The plot shows that the more frequently two words co-occur, the closer they sit in the embedding space.
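The projection itself is a one-liner with scikit-learn; a sketch using a random matrix as a stand-in for the real (vocab × 300) embedding:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
E = rng.standard_normal((40, 16))   # stand-in for the word-embedding matrix

# Project word vectors to 2-D; the cosine metric matches the
# similarity measure used for the nearest-word lookups.
coords = TSNE(n_components=2, metric="cosine", perplexity=5,
              init="random", random_state=0).fit_transform(E)
# coords can then be scattered with matplotlib, one point per word.
```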

# Label Embedding in Multi-label classification

In a recent Kaggle competition, the Inclusive Images Challenge, I tried out the label embedding technique for training multilabel classifiers, outlined in this paper by François Chollet.

The basic idea is to decompose the pointwise mutual information (PMI) matrix of the training labels and use it to guide the training of the neural network model. The steps are as follows:

1. Encode the training labels as you normally would in a multilabel classification setting. Let $M$ (of size n by m, i.e. n training examples with m labels) denote the matrix formed by vertically stacking the label vectors.
2. The PMI matrix (of size m by m) has entries $PMI_{i,j}=\log\left(\frac{P(i,j)}{P(i)P(j)}\right)$. It can be implemented with vectorized operations, so it is cheap to compute even on large datasets. See more explanation of the PMI here.
3. The embedding matrix $E$ is obtained by computing the singular value decomposition of the PMI matrix and taking the dot product of $U$ with the first k columns of $\sqrt{\Sigma}$.
4. We can then use the embedding matrix to transform the original sparse-encoded labels into dense vectors.
5. When training the deep learning model, instead of ending with m sigmoid activations and a BCE loss, we can now use k linear activations with a cosine proximity loss.
6. At inference time, we take the model's prediction, search among the rows of the embedding matrix $E$ for the most similar vectors, and map those back to their corresponding labels.
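The non-network steps above (building PMI, factorizing it, producing dense targets, and decoding predictions) can be sketched end to end. This is a minimal sketch on a random label matrix; sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: binary multilabel matrix M (n examples x m labels).
n, m, k = 500, 12, 4
M = (rng.random((n, m)) < 0.3).astype(float)

# Step 2: PMI[i, j] = log(P(i, j) / (P(i) * P(j))), fully vectorized.
co = M.T @ M / n                 # P(i, j): co-occurrence frequencies
p = M.mean(axis=0)               # P(i): per-label frequencies
with np.errstate(divide="ignore"):
    pmi = np.log(co / np.outer(p, p))
pmi[~np.isfinite(pmi)] = 0.0     # zero out label pairs that never co-occur

# Step 3: embedding E = U_k * sqrt(Sigma_k) from the SVD of PMI.
U, S, _ = np.linalg.svd(pmi)
E = U[:, :k] * np.sqrt(S[:k])    # shape: (m, k)

# Step 4: dense k-dimensional targets for training the network.
targets = M @ E                  # shape: (n, k)

# Step 6: decode a prediction by cosine similarity to the rows of E.
def top_labels(pred, E, topn=3):
    sims = E @ pred / (np.linalg.norm(E, axis=1) * np.linalg.norm(pred) + 1e-12)
    return np.argsort(-sims)[:topn]

print(top_labels(targets[0], E))
```

Step 5, the k linear outputs with a cosine proximity loss, lives inside whatever network framework you train with and is omitted here.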

Below is a toy example of the label embedding procedure. The two pictures show the pairwise cosine similarities between item labels in the embedding space and a 2D layout of the items in that space.

In my own experiments, I find that models trained on label embeddings are a bit more robust to label noise, converge faster, and return higher top-k precision than models with logistic outputs.

I believe this is due to the high number of labels in the competition (m ≈ 7000) contrasted with the small batches the model is trained on. Since the label embedding comes from matrix factorization, it is similar to PCA in that we keep the crucial information and throw out unnecessary detail/noise, except we do so on the labels instead of the inputs.

# CUR Decomposition for Dimension Reduction

The drawbacks of SVD are:

1. It is time-consuming to compute.
2. Storing the $U$ and $V$ matrices takes a lot of space.

CUR decomposition is an alternative. Its goal is to find a decomposition of the input matrix into a product of three matrices, $A = CUR$, that is "as good as possible"; where SVD is an exact decomposition, CUR speeds up the computation by allowing some error. In it:

1. The matrix $C$ is built from columns of $A$.
2. The matrix $R$ is built from rows of $A$.
3. The matrix $U$ is obtained by taking the pseudoinverse of the intersection of $C$ and $R$.

Obtaining the $C$ matrix:

• Input: matrix $A \in \mathbb{R}^{m \times n}$, sample count $c$
• Output: $C_d \in \mathbb{R}^{m \times c}$
1. for x = 1:n (compute the column distribution)
2.     $P(x) = \sum_i A(i,x)^2 / \sum_{i,j} A(i,j)^2$
3. for i = 1:c
4.     Pick $j \in 1:n$ with probability $P(j)$
5.     Compute $C_d(:,i) = A(:,j)/\sqrt{cP(j)}$

The $R$ matrix is obtained analogously, sampling rows instead of columns.

The $U$ matrix is computed as follows:

1. Take the "intersection" $W$ of the $C$ and $R$ matrices.
2. $U$ is the pseudoinverse of $W$.
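The whole procedure fits in a few lines of numpy. A sketch that follows the recipe above literally, with $U$ taken directly as the pseudoinverse of the intersection $W$ (some formulations additionally rescale $W$ or square the inverted singular values; this follows the simpler recipe described here):

```python
import numpy as np

def cur_decompose(A, c, r, seed=None):
    """Randomized CUR sketch: sample c columns and r rows of A with
    probability proportional to their squared norms, rescale them, and
    compute U as the pseudoinverse of the intersection W."""
    rng = np.random.default_rng(seed)
    total = (A ** 2).sum()

    # Column distribution and sampling (the C matrix).
    p_col = (A ** 2).sum(axis=0) / total
    cols = rng.choice(A.shape[1], size=c, p=p_col)
    C = A[:, cols] / np.sqrt(c * p_col[cols])

    # Row distribution and sampling (the R matrix), done analogously.
    p_row = (A ** 2).sum(axis=1) / total
    rows = rng.choice(A.shape[0], size=r, p=p_row)
    R = A[rows, :] / np.sqrt(r * p_row[rows])[:, None]

    # U: pseudoinverse of the intersection W of the sampled rows/columns.
    W = A[np.ix_(rows, cols)]
    U = np.linalg.pinv(W)
    return C, U, R

A = np.arange(20, dtype=float).reshape(4, 5) + 1
C, U, R = cur_decompose(A, c=3, r=3, seed=0)
approx = C @ U @ R               # low-rank sketch of A
```

Note that $C$ and $R$ are just (rescaled) columns and rows of $A$, so if $A$ is sparse they stay sparse, which is exactly the storage advantage over SVD's dense $U$ and $V$.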