Simple word2vec and doc2vec from PMI matrix decomposition

Earlier I made a post about decomposing the PMI matrix as a label embedding. There, the idea was to turn a sparse label representation into a dense one in a multilabel classification problem, i.e., to operate on the right-hand side of our AX = Y problem. The same technique works on the left-hand side too. Specifically, for text classification problems, we can use PMI matrix decomposition to get word vectors as well as a simple doc2vec transformation.


We first construct the simplest binary doc-term matrix M, use it to build the PMI matrix P, and then decompose it into an embedding matrix E consisting of word vectors of dimension 300.
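This construction can be sketched in numpy as follows (the function name `pmi_embeddings` and the `eps` smoothing constant are my own choices, not from the post):

```python
import numpy as np

def pmi_embeddings(M, dim, eps=1e-8):
    """Word embeddings from the PMI matrix of a binary doc-term matrix M.

    M : (n_docs, n_words) binary matrix; returns an (n_words, dim) matrix E.
    """
    co = M.T @ M                                  # word-word co-occurrence counts
    p_ij = co / co.sum()                          # joint co-occurrence probabilities
    p_i = p_ij.sum(axis=1)                        # marginal word probabilities
    pmi = np.log((p_ij + eps) / (np.outer(p_i, p_i) + eps))
    # Decompose PMI and keep the top `dim` components: E = U_k * sqrt(S_k)
    U, S, _ = np.linalg.svd(pmi)
    k = min(dim, len(S))
    return U[:, :k] * np.sqrt(S[:k])
```

For a large vocabulary a truncated decomposition (e.g. scikit-learn's TruncatedSVD) avoids computing the full SVD.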

Given the word boston, we can find similar words by computing the cosine similarity between word vectors and sorting. The top 20 words most similar to boston turn out to be cities, places, and sports-related words.
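A small sketch of the similarity lookup (the function name and any toy vectors are illustrative, not from the post):

```python
import numpy as np

def most_similar(word, vocab, E, topn=20):
    """Return the topn words closest to `word` by cosine similarity of rows of E."""
    idx = vocab.index(word)
    norms = np.linalg.norm(E, axis=1)
    sims = (E @ E[idx]) / (norms * norms[idx] + 1e-12)
    order = np.argsort(-sims)                      # most similar first
    return [(vocab[i], float(sims[i])) for i in order if i != idx][:topn]
```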

We can also transform the original news articles into 300-dimensional vectors by simply taking the dot product between M and E, and feed the resulting matrix into downstream modeling tasks. For example, a logistic regression on these features achieves 0.82 on both recall and precision (and hence 0.82 F1 as well).
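The doc2vec step is just one matrix product; a toy sketch (the matrices here are random stand-ins for the real M and E):

```python
import numpy as np

rng = np.random.default_rng(0)
M = (rng.random((6, 10)) < 0.3).astype(float)  # toy binary doc-term matrix
E = rng.standard_normal((10, 3))               # toy 3-d word embeddings

# Each document vector is the sum of the embeddings of its words
D = M @ E                                      # (n_docs, k) dense features
# D can then be fed to any downstream model, e.g. sklearn's LogisticRegression
```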

We can also visualize the word vectors via t-SNE. It does showcase that the more frequently two words co-occur, the closer they are in the embedding space.

Label Embedding in Multi-label classification

In the recent Kaggle competition, the Inclusive Images Challenge, I tried out the label embedding technique for training multilabel classifiers, outlined in this paper by François Chollet.

The basic idea is to decompose the pointwise mutual information (PMI) matrix of the training labels and use it to guide the training of the neural network model. The steps are as follows:

1. Encode the training labels as you normally would in a multilabel classification setting. Let $M$ (of size n by m, i.e., n training examples with m labels) denote the matrix constructed by vertically stacking the label vectors.
2. The PMI matrix (of size m by m) has entries $PMI_{i,j}=\log(\frac{P(i,j)}{P(i)P(j)})$. It can be implemented via vectorized operations and is therefore efficient to compute, even on large datasets. See more explanation of the PMI here.
3. The embedding matrix $E$ is obtained by computing the singular value decomposition of the PMI matrix and taking the dot product between $U$ and the first k columns of $\sqrt{\Sigma}$.
4. We can then use the embedding matrix to transform the original sparsely encoded labels into dense vectors.
5. During the training of the deep learning model, instead of using m sigmoid activations with a BCE loss at the end, we can use k linear activations with a cosine proximity loss.
6. At inference time, we take the model prediction, search the rows of the embedding matrix $E$ for the most similar vectors, and return their corresponding labels.
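The steps above can be sketched as follows (function names, the smoothing constant, and any toy matrices are my own; the neural network training in step 5 is omitted):

```python
import numpy as np

def label_embedding(M, k, eps=1e-8):
    """Steps 2-3: PMI matrix of the (n, m) label matrix M, then SVD -> (m, k) embedding."""
    co = M.T @ M                              # label co-occurrence counts
    p_ij = co / co.sum()
    p_i = p_ij.sum(axis=1)
    pmi = np.log((p_ij + eps) / (np.outer(p_i, p_i) + eps))
    U, S, _ = np.linalg.svd(pmi)
    return U[:, :k] * np.sqrt(S[:k])          # E = U_k sqrt(Sigma_k)

def decode(pred, E, topn=5):
    """Step 6: map a k-dim model prediction back to label indices by cosine similarity."""
    sims = (E @ pred) / (np.linalg.norm(E, axis=1) * np.linalg.norm(pred) + 1e-12)
    return np.argsort(-sims)[:topn]

# Step 4: the dense training targets are simply Y = M @ E
```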

Below is a toy example calculation of the label embedding procedure. The two pictures show the pairwise cosine similarity between item labels in the embedding space and a 2d display of the items in the embedding space. In my own experiments, I find that models trained on label embeddings are a bit more robust to label noise, converge faster, and return higher top-k precision compared with models with logistic outputs.

I believe this is due to the high number of labels in the competition (m ≈ 7000) contrasted with the small batches the model is trained on. As the label embedding is obtained from matrix factorization, it is similar to PCA in that we keep the crucial information and throw out unnecessary detail/noise, except that we do so on the labels instead of the inputs.

SVD Dimension Reduction

1. Uncover hidden correlations
2. Remove meaningless features
3. Easier to understand, visualize, and store

Method 1: Singular Value Decomposition

$A = U\Sigma V^T$

• $A_{[m \times n]}$ represents $m$ documents and $n$ terms

1. $U_{[m\times r]}$ represents $m$ documents and $r$ concepts
2. $\Sigma_{[r\times r]}$, where $r$ equals the rank of matrix $A$
3. $V_{[n\times r]}$ represents $n$ terms and $r$ concepts

1. Any real matrix $A$ always has a singular value decomposition
2. The factors $U, \Sigma, V$ in the decomposition are unique (up to the signs of the singular vectors)
3. $U, V$ are both column-orthonormal (every column is a unit vector, and the columns are mutually orthogonal)
4. $\Sigma$ is a diagonal matrix whose diagonal entries are the singular values; they are all positive and arranged in decreasing order from the top-left down

• Document 1 is most related to concepts 2 and 4
• Document 2 is most related to concepts 2 and 3
• Document 3 is most related to concept 3
• Document 4 is most related to concept 1

To reduce the dimensionality:

1. Perform the SVD
2. To reduce to $k$ dimensions, take the dot product of (the matrix formed by the first $k$ columns of $U$) with (the top-left $k \times k$ block of $\Sigma$)
3. To recover the low-dimensional data in the original high-dimensional space, take the dot product of the result from step 2 with the first $k$ rows of $V^T$. A common rule of thumb is to choose $k$ so that $\frac{\sum_{i = 1}^{k}\sigma_i^2}{\sum_{i = 1}^{r}\sigma_i^2} \geq 0.8$
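A minimal numpy sketch of these steps on a toy rank-2 doc-term matrix (the matrix is illustrative):

```python
import numpy as np

A = np.array([[1., 1., 0., 0.],      # toy doc-term matrix, rank 2
              [2., 2., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 2., 2.]])

# Step 1: SVD (singular values in s come back in decreasing order)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Step 2: reduce to k dimensions with the first k columns of U and of Sigma
A_reduced = U[:, :k] * s[:k]                   # (m, k)
# Step 3: map back to the original space with the first k rows of V^T
A_recon = A_reduced @ Vt[:k, :]
# Fraction of "energy" retained by the top-k singular values
energy = (s[:k] ** 2).sum() / (s ** 2).sum()
```

Because this toy matrix has rank 2, the reconstruction is exact and the retained energy is 1; on real data the reconstruction is only approximate.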

Spectral Clustering

1 Measuring the "goodness" of a clustering

1. We want as many connections as possible among nodes within a cluster
2. We want as few connections as possible between nodes inside a cluster and nodes outside it

The Cut value by itself is not a very good metric: if the graph contains a node of degree 1, then the "cluster" consisting of that single node has a Cut value of only 1.

$\phi(A) = \frac{\text{cut}(A)}{\min(vol(A), 2m-vol(A))}$

1. $cut(A)$ is the cut score of cluster $A$
2. $vol(A)$ is the sum of the degrees of all nodes in cluster $A$
3. $m$ is the total number of edges in the network

2 Spectral Graph Theory and Spectral Clustering

Properties of the graph Laplacian:

1. All eigenvalues are non-negative real numbers
2. The eigenvectors are real and mutually orthogonal

The second eigenvector $x$:

1. Is a unit vector: $\sum_ix_i^2=1$
2. Is orthogonal to the first eigenvector: $\sum_ix_i\cdot 1 = 0$

$\min_{x}\frac{x^TMx}{x^Tx}=\min_{x}\sum_{(i,j)\in E}(x_i-x_j)^2$

1. Preprocessing: represent the network by its Laplacian matrix
2. Decomposition: compute the eigenvalues and eigenvectors of the matrix, and find the eigenvector $x$ corresponding to $\lambda_2$ (so each node is now represented by a single value)
3. Grouping: decide on one or more split values, and assign each node to a cluster according to how its entry in the eigenvector compares with the split values

How to choose the split value:

1. Simply choose 0 or the median
2. Try to minimize the normalized 1-dimensional cut value

How to get more than two clusters:

1. Recursively apply bi-partitioning
2. Run K-means on the eigenvectors corresponding to $\lambda_2$, $\lambda_3$, and later eigenvalues
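A minimal sketch of the whole pipeline, bi-partitioning by the sign of the $\lambda_2$ eigenvector (the example graph, two triangles joined by an edge, is my own):

```python
import numpy as np

def spectral_bipartition(adj):
    """Split a graph in two by the sign of the Fiedler vector of L = D - A."""
    deg = adj.sum(axis=1)
    L = np.diag(deg) - adj                   # graph Laplacian
    vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    fiedler = vecs[:, 1]                     # eigenvector for lambda_2
    return (fiedler > 0).astype(int)         # split value 0

# Two triangles {0,1,2} and {3,4,5} connected by the edge (2,3)
adj = np.zeros((6, 6))
for i, j in [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(2,3)]:
    adj[i, j] = adj[j, i] = 1.0
labels = spectral_bipartition(adj)           # recovers the two triangles
```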

Singular Value Decomposition (SVD)

Geometrically, a linear map combines:

1. Rotation
2. Scaling

In (a) the map both rotates and scales, in (b) it only rotates, and in (c) it only scales.

A hyperellipse in $\mathbb{R}^m$ can be viewed as the result of stretching the unit ball in $\mathbb{R}^m$ along orthogonal directions $u_1, u_2, \ldots, u_m \in \mathbb{R}^m$ with scaling factors $\sigma_1, \sigma_2, \ldots, \sigma_m$.

1. If matrix $A$ has rank $r$, then exactly $r$ of the lengths $\sigma_j$ are nonzero
2. If $A$ is an $m \times n$ matrix with $m > n$, then at most $n$ of the lengths $\sigma_j$ are nonzero
3. If $A$ has full rank, the $n$ singular values of $A$ are the lengths $\sigma_j$ of the principal semiaxes of $AS$, the image of the unit sphere $S$ under $A$

By convention the singular values are arranged in decreasing order: $\sigma_1\geq\sigma_2\geq\ldots\geq\sigma_n>0$.

$Av_j = \sigma_j u_j, \quad 1 \leq j \leq n$

$AV=\hat{U}\hat{\Sigma}$

1. $A$ is $m\times n$
2. $V$ is an $n\times n$ unitary matrix
3. $\hat{U}$ is $m\times n$, with orthonormal columns
4. $\hat{\Sigma}$ is an $n\times n$ diagonal matrix; if $A$ has full rank, the diagonal entries of $\hat{\Sigma}$ are all positive

$A=\hat{U}\hat{\Sigma}V^{*}$

To extend this reduced SVD to the full SVD:

1. Append to $\hat{U}$ an additional $m-n$ columns orthonormal to the existing columns of $\hat{U}$, forming the matrix $U$; the resulting $U$ is $m\times m$ and unitary
2. Append to $\hat{\Sigma}$ an additional $m-n$ rows of zeros

$A=U\Sigma V^{*}$

1. $U\in\mathbb{C}^{m\times m}$ is a unitary matrix
2. $V\in\mathbb{C}^{n\times n}$ is a unitary matrix
3. $\Sigma\in\mathbb{R}^{m\times n}$ is a diagonal matrix

• The diagonal entries of $\Sigma$ are non-negative and arranged in decreasing order: $\sigma_1\geq\sigma_2\geq\ldots\geq\sigma_p\geq 0$, where $p=\min(m,n)$

Geometrically, applying $A$ to the unit ball:

1. First, the unitary transform $V^{*}$ is applied; for the unit ball in the example, this is just a rotation
2. Then, the scaling transform $\Sigma$ stretches the rotated unit ball into a hyperellipse
3. Finally, the unitary transform $U$ rotates the resulting hyperellipse further

• Every matrix $A\in\mathbb{C}^{m\times n}$ has a singular value decomposition, and the singular values $\sigma_j$ are uniquely determined.

$A^{*}A = (U\Sigma V^{*})^{*}(U\Sigma V^{*}) = V\Sigma U^{*}U\Sigma V^{*} = V\Sigma^{2}V^{*}$

$AA^{*} = (U\Sigma V^{*})(U\Sigma V^{*})^{*} = U\Sigma V^{*}V\Sigma U^{*} = U\Sigma^{2}U^{*}$

$A^{*}AV = V\Sigma^{2}$

$AA^{*}U = U\Sigma^{2}$
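These two identities can be checked numerically on a random real matrix (for a real matrix $A^{*}=A^{T}$; the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
S2 = np.diag(s ** 2)

lhs_v = A.T @ A @ V   # should equal V @ S2: columns of V are eigenvectors of A*A
lhs_u = A @ A.T @ U   # should equal U @ S2: columns of U are eigenvectors of AA*
```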
