2019 CS224N Assignment 2: word2vec
Written: Understanding word2vec
Note: in the answers below, the vectors and matrices follow the conventions stated in the assignment. $U$ and $V$ are both word-vector matrices whose columns are the word vectors; likewise $v_c$, $u_o$, $u_w$, etc. are column vectors. $y$ is a one-hot row vector and $\hat y$ is the row vector produced by the softmax; both $y$ and $\hat y$ have length equal to the vocabulary size.
(a) Because $\boldsymbol{y}$ is a one-hot vector, only the true outside word $o$ has probability 1, i.e. $y_i = 1$ only when $i = o$. The proof is as follows:
$$
\begin{aligned}
-\sum_{w\in Vocab} y_w\log(\hat{y}_w) &= -[y_1\log(\hat{y}_1) + \cdots + y_o\log(\hat{y}_o) + \cdots + y_{|Vocab|}\log(\hat{y}_{|Vocab|})] \\
&= -y_o\log(\hat{y}_o) \\
&= -\log(\hat{y}_o)\\
&= -\log \mathrm{P}(O = o \mid C = c)
\end{aligned}
$$
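The collapse of the cross-entropy sum to a single term is easy to sanity-check numerically. A minimal sketch; the vocabulary size, logits, and index `o` are made-up illustration values:

```python
import numpy as np

# Made-up illustration values: a 5-word vocabulary with true outside word o = 2.
vocab_size, o = 5, 2
logits = np.array([0.1, -1.3, 2.0, 0.5, -0.2])
y_hat = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary

y = np.zeros(vocab_size)
y[o] = 1.0  # one-hot vector for the true outside word

# Cross-entropy against a one-hot y collapses to -log(y_hat[o]).
cross_entropy = -np.sum(y * np.log(y_hat))
assert np.isclose(cross_entropy, -np.log(y_hat[o]))
```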
(b)
$$
\begin{aligned}
\frac{\partial J}{\partial v_c} &= -\frac{\partial \log p(O=o \mid C=c)}{\partial v_c}\\
&= -\frac{\partial \log \exp(u_o^T v_c)}{\partial v_c} + \frac{\partial \left(\log \sum_{w \in Vocab}\exp(u_w^T v_c)\right)}{\partial v_c}\\
&= -\frac{\partial\, u_o^T v_c}{\partial v_c} + \frac{\sum_{w \in Vocab}\frac{\partial \exp(u_w^T v_c)}{\partial v_c}}{\sum_{w \in Vocab}\exp(u_w^T v_c)}\\
&= -u_o + \sum_{x \in Vocab}\frac{\exp(u_x^T v_c)}{\sum_{w \in Vocab}\exp(u_w^T v_c)}\,u_x\\
&= -u_o + \sum_{x \in Vocab} p(O=x \mid C=c)\,u_x\\
&= -U y^T + U \hat y^T\\
&= U(\hat y - y)^T
\end{aligned}
$$
Note: $\sum_{x \in Vocab} p(O=x \mid C=c)\,u_x$ is a linear combination of the vectors $u_x$ with weights $p(O=x \mid C=c)$; written in matrix form it is $U \hat y^T$. Likewise, in $U y^T$ the vector $y$ is one-hot, with a 1 only at position $o$, so $U y^T$ simply picks the column $u_o$ out of $U$. If these matrix identities are unfamiliar, write them out entry by entry and verify them yourself. As a sanity check, the gradient must have the same dimensions as $v_c$, i.e. a column vector of size $embedding \times 1$, and indeed $U(\hat y - y)^T$ has dimensions $(embedding \times vocab_{size}) * (vocab_{size} \times 1) = embedding \times 1$.
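The closed form $U(\hat y - y)^T$ can be verified against finite differences. A sketch with made-up sizes, keeping the column convention above ($U$'s columns are the outside vectors):

```python
import numpy as np

np.random.seed(0)
emb, vocab, o = 4, 6, 1          # made-up sizes and true outside-word index
U = np.random.randn(emb, vocab)  # columns are the outside vectors u_w
v_c = np.random.randn(emb)

def loss(v):
    """Naive-softmax loss J = -log softmax(U^T v)[o]."""
    scores = U.T @ v
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return -np.log(y_hat[o])

# Analytic gradient U(y_hat - y)^T from the derivation above.
scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()
y = np.zeros(vocab)
y[o] = 1.0
grad = U @ (y_hat - y)

# Central finite differences, one coordinate of v_c at a time.
eps = 1e-6
num = np.array([(loss(v_c + eps * np.eye(emb)[i]) - loss(v_c - eps * np.eye(emb)[i])) / (2 * eps)
                for i in range(emb)])
assert np.allclose(grad, num, atol=1e-5)
```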
(c)
$$
\begin{aligned}
\frac{\partial J}{\partial u_w} &= -\frac{\partial \log p(O=o \mid C=c)}{\partial u_w}\\
&= -\frac{\partial \log \exp(u_o^T v_c)}{\partial u_w} + \frac{\partial \left(\log \sum_{x \in Vocab}\exp(u_x^T v_c)\right)}{\partial u_w}
\end{aligned}
$$
When $w = o$:
$$
\begin{aligned}
\frac{\partial J}{\partial u_o} &= -\frac{\partial \log \exp(u_o^T v_c)}{\partial u_o} + \frac{\partial \left(\log \sum_{x \in Vocab}\exp(u_x^T v_c)\right)}{\partial u_o}\\
&= -v_c + \frac{1}{\sum_{x \in Vocab}\exp(u_x^T v_c)}\exp(u_o^T v_c)\,v_c\\
&= -v_c + p(O=o \mid C=c)\,v_c\\
&= (p(O=o \mid C=c) - 1)\,v_c
\end{aligned}
$$
When $w \ne o$:
$$
\begin{aligned}
\frac{\partial J}{\partial u_w} &= -\frac{\partial \log \exp(u_o^T v_c)}{\partial u_w} + \frac{\partial \left(\log \sum_{x \in Vocab}\exp(u_x^T v_c)\right)}{\partial u_w}\\
&= \frac{1}{\sum_{x \in Vocab}\exp(u_x^T v_c)}\exp(u_w^T v_c)\,v_c\\
&= p(O=w \mid C=c)\,v_c
\end{aligned}
$$
Therefore, the gradient for the true outside word $u_o$ is $(p(O=o \mid C=c)-1)\,v_c$, and the gradient for every other word $u_w \, (w \in Vocab,\ w \ne o)$ is $p(O=w \mid C=c)\,v_c$. Combining the two cases:
$$
\begin{aligned}
\frac{\partial J}{\partial U} &= -\frac{\partial \log p(O=o \mid C=c)}{\partial U}\\
&= v_c(\hat y - y)
\end{aligned}
$$
You can check that the result has the same dimensions as $U$.
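This matrix gradient can also be checked entry by entry against finite differences. A sketch with made-up sizes, again using the column convention ($v_c(\hat y - y)$ is `np.outer(v_c, y_hat - y)` in numpy):

```python
import numpy as np

np.random.seed(0)
emb, vocab, o = 4, 6, 1          # made-up sizes and true outside-word index
U = np.random.randn(emb, vocab)  # columns are the outside vectors u_w
v_c = np.random.randn(emb)

def loss(Umat):
    """Naive-softmax loss as a function of the outside-vector matrix."""
    scores = Umat.T @ v_c
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return -np.log(y_hat[o])

scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()
y = np.zeros(vocab)
y[o] = 1.0
grad_U = np.outer(v_c, y_hat - y)  # shape (emb, vocab), same as U

# Central finite differences over every entry of U.
eps = 1e-6
num = np.zeros_like(U)
for i in range(emb):
    for j in range(vocab):
        E = np.zeros_like(U)
        E[i, j] = eps
        num[i, j] = (loss(U + E) - loss(U - E)) / (2 * eps)
assert np.allclose(grad_U, num, atol=1e-5)
```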
(d)
Since $\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1} = 1 - \frac{1}{e^x+1}$:

$$
\begin{aligned}
\frac{\partial \sigma(x)}{\partial x} &= -\frac{\partial \frac{1}{e^x+1}}{\partial x}\\
&= \frac{e^x}{(e^x+1)^2}\\
&= \sigma(x)(1-\sigma(x))
\end{aligned}
$$
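A quick numerical check that $\sigma'(x) = \sigma(x)(1-\sigma(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare the closed-form derivative against central finite differences.
x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1.0 - sigmoid(x))
assert np.allclose(numeric, analytic, atol=1e-8)
```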
(e)
- Derivative with respect to $v_c$:

$$
\begin{aligned}
\frac{\partial J}{\partial v_c} &= -\frac{1}{\sigma(u_o^T v_c)}\sigma(u_o^T v_c)(1-\sigma(u_o^T v_c))\,u_o - \sum_{k=1}^K \frac{1}{\sigma(-u_k^T v_c)}\sigma(-u_k^T v_c)(1-\sigma(-u_k^T v_c))(-u_k)\\
&= -(1-\sigma(u_o^T v_c))\,u_o + \sum_{k=1}^K (1-\sigma(-u_k^T v_c))\,u_k
\end{aligned}
$$

The gradient of $v_c$ is a linear combination of $u_o$ and the $u_k$; each coefficient is one minus the sigmoid of that word vector's inner product with $v_c$ (with the sign inside the sigmoid flipped for the negative samples).
- Derivative with respect to $u_o$:

$$
\begin{aligned}
\frac{\partial J}{\partial u_o} &= -\frac{1}{\sigma(u_o^T v_c)}\sigma(u_o^T v_c)(1-\sigma(u_o^T v_c))\,v_c\\
&= -(1-\sigma(u_o^T v_c))\,v_c
\end{aligned}
$$

The gradient of $u_o$ depends only on $u_o$ itself and $v_c$.
- Derivative with respect to $u_i \, (i \in \{1,2,3,\dots,K\})$:

$$
\begin{aligned}
\frac{\partial J}{\partial u_i} &= -\frac{1}{\sigma(-u_i^T v_c)}\sigma(-u_i^T v_c)(1-\sigma(-u_i^T v_c))(-v_c)\\
&= (1-\sigma(-u_i^T v_c))\,v_c
\end{aligned}
$$

Likewise, the gradient of a negative sample $u_i$ depends only on itself and $v_c$; the only difference from $u_o$ is the sign in front of the word vector inside the sigmoid.
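The $v_c$ gradient above can be checked numerically too. A sketch with made-up sizes; `U_k` stacks the $K$ negative-sample vectors as rows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(1)
emb, K = 4, 3                  # made-up embedding size and number of negatives
u_o = np.random.randn(emb)
U_k = np.random.randn(K, emb)  # rows are the negative-sample vectors u_k
v_c = np.random.randn(emb)

def loss(v):
    """Negative-sampling loss from (e)."""
    return -np.log(sigmoid(u_o @ v)) - np.sum(np.log(sigmoid(-U_k @ v)))

# Analytic gradient from the derivation above.
grad = -(1.0 - sigmoid(u_o @ v_c)) * u_o + (1.0 - sigmoid(-U_k @ v_c)) @ U_k

# Central finite differences, one coordinate of v_c at a time.
eps = 1e-6
num = np.array([(loss(v_c + eps * np.eye(emb)[i]) - loss(v_c - eps * np.eye(emb)[i])) / (2 * eps)
                for i in range(emb)])
assert np.allclose(grad, num, atol=1e-5)
```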
From (b) and (c), with the naive-softmax loss every word vector participates: computing the gradients of $v_c$ and $u_o$ requires matrix multiplications over the whole vocabulary, and the forward pass that computes the loss also touches every word vector. From (e), negative sampling only involves $u_o$, the $K$ negative samples, and $v_c$, so it requires far less computation.
(f)
- (i)
$$
\frac{\partial J_{skip\text{-}gram}(v_c, w_{t-m},\dots,w_{t+m}, U)}{\partial U} = \sum_{-m \le j \le m,\, j \ne 0}\frac{\partial J(v_c, w_{t+j}, U)}{\partial U}
$$
- (ii)
$$
\frac{\partial J_{skip\text{-}gram}(v_c, w_{t-m},\dots,w_{t+m}, U)}{\partial v_c} = \sum_{-m \le j \le m,\, j \ne 0}\frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}
$$
- (iii)
$$
\frac{\partial J_{skip\text{-}gram}(v_c, w_{t-m},\dots,w_{t+m}, U)}{\partial v_w} = 0 \quad (w \ne c)
$$
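Because the window loss is a plain sum of per-context-word losses, its gradient is the sum of the per-word gradients. A sketch verifying this for $v_c$, with made-up sizes and context indices:

```python
import numpy as np

np.random.seed(2)
emb, vocab = 3, 5                # made-up sizes
U = np.random.randn(emb, vocab)  # columns are the outside vectors
v_c = np.random.randn(emb)
context = [0, 3, 4]              # made-up outside-word indices in one window

def word_grad(o):
    """Per-word naive-softmax gradient wrt v_c, from (b): U(y_hat - y)^T."""
    scores = U.T @ v_c
    y_hat = np.exp(scores) / np.exp(scores).sum()
    y = np.zeros(vocab)
    y[o] = 1.0
    return U @ (y_hat - y)

def window_loss(v):
    """Skip-gram window loss: sum of per-context-word losses."""
    scores = U.T @ v
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return -sum(np.log(y_hat[o]) for o in context)

grad_sum = sum(word_grad(o) for o in context)
eps = 1e-6
num = np.array([(window_loss(v_c + eps * np.eye(emb)[i]) - window_loss(v_c - eps * np.eye(emb)[i])) / (2 * eps)
                for i in range(emb)])
assert np.allclose(grad_sum, num, atol=1e-5)
```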
Coding: Implementing word2vec
The complete code is on my GitHub.
When setting up the environment, the install kept failing because downloads were slow, so I added the Tsinghua mirror to the env.yml file.

(a) Implement the sigmoid, the naive-softmax and negative-sampling loss and gradient, and the skip-gram model in word2vec.py.
- sigmoid:
$$
s = \frac{1}{1 + e^{-x}}
$$
so the code is `s = 1/(1+np.exp(-x))`.
- naiveSoftmaxLossAndGradient:
- In the code, word vectors are row vectors: $U$ and $V$ are both organized with one word vector per row, unlike the column convention in the written part above; just transpose the formulas.
- numpy.array([1,2,3]) has shape (3,), i.e. it degenerates to a 1-D array that can act as either a row or a column vector depending on which side of a matrix product it appears on; see this blog post: https://blog.csdn.net/alxe_made/article/details/80492649
- Following the formulas derived above, the only real work is getting the dimensions right.
```python
op = np.dot(centerWordVec, np.transpose(outsideVectors))  # (embedding,) x (embedding, n) -> (n,)
y_hat = softmax(op)                                       # row vector, shape (n,)
delta = y_hat.copy()
delta[outsideWordIdx] -= 1                                # y_hat - y
loss = -np.log(y_hat[outsideWordIdx])
gradCenterVec = np.dot(delta, outsideVectors)             # row vector, shape (embedding,)
gradOutsideVecs = np.dot(delta[:, np.newaxis], centerWordVec[np.newaxis, :])  # np.newaxis adds a dimension
### END YOUR CODE
return loss, gradCenterVec, gradOutsideVecs
```
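The (embedding,) shapes in the comments come from numpy's 1-D array semantics. A small sketch of how a shape-(3,) array behaves on either side of np.dot:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])      # shape (3,): neither row nor column
M = np.arange(12.0).reshape(3, 4)  # shape (3, 4)

left = np.dot(a, M)    # a acts like a row vector:    (3,) x (3, 4) -> (4,)
right = np.dot(M.T, a) # a acts like a column vector: (4, 3) x (3,) -> (4,)
assert left.shape == (4,) and right.shape == (4,)
assert np.allclose(left, right)  # both compute sum_i a[i] * M[i, j]
```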
- negSamplingLossAndGradient:
- Again, remember the word vectors are row vectors here.
- If several negative samples are the same word, that word's gradient is the sum of the gradients from each occurrence.
- This part borrows from another implementation.
- np.outer(a, b) computes the outer product; also note the difference between numpy arrays and matrices.
```python
### YOUR CODE HERE
gradCenterVec = np.zeros(centerWordVec.shape)
gradOutsideVecs = np.zeros(outsideVectors.shape)
loss = 0.0
z = sigmoid(np.dot(outsideVectors[outsideWordIdx], centerWordVec))  # (embedding,) . (embedding,) -> scalar
loss = -np.log(z)
gradCenterVec += (z - 1.0) * outsideVectors[outsideWordIdx]
gradOutsideVecs[outsideWordIdx] += (z - 1.0) * centerWordVec
# Vectorized implementation over the negative samples
u_k = outsideVectors[negSampleWordIndices]  # gather the negative-sample word vectors
z = sigmoid(-np.dot(u_k, centerWordVec))    # shape (K,)
loss += np.sum(-np.log(z))
gradCenterVec += np.dot((1.0 - z), u_k)     # (K,) x (K, embedding) -> (embedding,)
for i, negSampleWordIdx in enumerate(negSampleWordIndices):  # duplicated negatives must each contribute
    gradOutsideVecs[negSampleWordIdx] += (1.0 - z[i]) * centerWordVec
### END YOUR CODE
return loss, gradCenterVec, gradOutsideVecs
```
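The explicit loop over negSampleWordIndices matters: fancy-index `+=` does not accumulate repeated indices, which is exactly the duplicate-negative-sample pitfall noted above. A small sketch (np.add.at is an alternative to the loop):

```python
import numpy as np

idx = np.array([1, 1, 2])   # word 1 appears twice among the negative samples

grad = np.zeros(4)
grad[idx] += 1.0            # fancy indexing writes each slot only once
assert grad.tolist() == [0.0, 1.0, 1.0, 0.0]

grad2 = np.zeros(4)
np.add.at(grad2, idx, 1.0)  # unbuffered add: duplicates accumulate correctly
assert grad2.tolist() == [0.0, 2.0, 1.0, 0.0]
```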
- skipgram
- A window contains several context words, so call the loss-and-gradient function once per context word and sum the losses and gradients.

```python
### YOUR CODE HERE
centerWordVec = centerWordVectors[word2Ind[currentCenterWord]]
for word in outsideWords:
    idx = word2Ind[word]
    l, v_c, U = word2vecLossAndGradient(centerWordVec, idx, outsideVectors, dataset)
    loss += l
    gradCenterVecs[word2Ind[currentCenterWord]] += v_c
    gradOutsideVectors += U
### END YOUR CODE
return loss, gradCenterVecs, gradOutsideVectors
```
(b) sgd.py
This one is simple; just add the code below:

```python
### YOUR CODE HERE
loss, gradient = f(x)
x -= step * gradient
### END YOUR CODE
```
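This update rule is easy to exercise on its own. A toy sketch where `f` plays the role of the loss-and-gradient callback (the quadratic objective is a made-up illustration, not part of the assignment):

```python
import numpy as np

# Toy objective f(x) = ||x - 3||^2 with gradient 2(x - 3).
def f(x):
    return np.sum((x - 3.0) ** 2), 2.0 * (x - 3.0)

x = np.zeros(2)
step = 0.1
for _ in range(200):
    loss, gradient = f(x)
    x -= step * gradient  # the same SGD step as in sgd.py

assert np.allclose(x, 3.0, atol=1e-3)  # converges to the minimizer
```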
(c)
- This part needs no code, but it requires stanfordSentimentTreebank.zip, which downloads slowly; I found a copy online. If you need it, grab it from my GitHub, under the util/dataset directory.
- Run run.py to start training. Because we use SGD, the loss sometimes goes up, but the overall trend is downward. Training takes quite a long time.


- Because the training data is small, the result is not great, but you can clearly see that some words are starting to cluster: coffee and tea, female and woman, and the adjectives around (0.00, -0.04).