2019 CS224N Assignment 2: word2vec

Written: Understanding word2vec

Note: in the answers below, vectors and matrices follow the conventions in the handout. $U$ and $V$ are both word-embedding matrices whose columns are the word vectors, so $v_c$, $u_o$, $u_w$, etc. are column vectors. $y$ is a one-hot row vector, $\hat{y}$ denotes the softmax output (also a row vector), and both $y$ and $\hat{y}$ have length equal to the vocabulary size.

(a) Since $\boldsymbol{y}$ is a one-hot vector, only word $o$ has probability 1; that is, $y_w = 1$ only when $w = o$. The proof is therefore:

$$\begin{aligned} - \sum_{w\in \mathrm{Vocab}} y_w\log(\hat{y}_w) &= - [y_1\log(\hat{y}_1) + \cdots + y_o\log(\hat{y}_o) + \cdots + y_w\log(\hat{y}_w)] \\ &= - y_o\log(\hat{y}_o) \\ &= -\log(\hat{y}_o) \\ &= -\log \mathrm{P}(O = o \mid C = c) \end{aligned}$$
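As a quick numerical sanity check (my own sketch, not part of the handout), the full cross-entropy sum over the vocabulary collapses to $-\log \hat{y}_o$ whenever $y$ is one-hot:

```python
import numpy as np

np.random.seed(0)
vocab_size, o = 5, 2                      # illustrative sizes
logits = np.random.randn(vocab_size)
y_hat = np.exp(logits) / np.sum(np.exp(logits))  # softmax
y = np.zeros(vocab_size)
y[o] = 1.0                                # one-hot target

full_sum = -np.sum(y * np.log(y_hat))     # cross entropy over all words
shortcut = -np.log(y_hat[o])              # the single surviving term
assert np.isclose(full_sum, shortcut)
```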

(b)
$$\begin{aligned} \frac{\partial J}{\partial v_c} &= - \frac{\partial \log p(O=o \mid C = c)}{\partial v_c} \\ &= -\frac{\partial \log \exp(u_o^T v_c)}{\partial v_c} + \frac{\partial \left( \log \sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c) \right)}{\partial v_c} \\ &= - \frac{\partial u_o^T v_c}{\partial v_c} + \frac{\sum_{w \in \mathrm{Vocab}} \frac{\partial \exp(u_w^T v_c)}{\partial v_c}}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c)} \\ &= -u_o + \sum_{x \in \mathrm{Vocab}} \frac{\exp(u_x^T v_c)}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c)}\, u_x \\ &= -u_o + \sum_{x \in \mathrm{Vocab}} p(O=x \mid C=c)\, u_x \\ &= - U y^T + U \hat{y}^T \\ &= U(\hat{y} - y)^T \end{aligned}$$
Note: $\sum_{x \in \mathrm{Vocab}} p(O=x \mid C=c)\, u_x$ is a linear combination of the vectors $u_x$ with weights $p(O=x \mid C=c)$; written in matrix form this is exactly $U \hat{y}^T$. Similarly, since $y$ is one-hot with a 1 in position $o$, $U y^T$ simply picks the column $u_o$ out of $U$. If you are not comfortable with the matrix algebra, write it out element by element and convince yourself. As a dimension check, the gradient should have the same shape as $v_c$, i.e. a column vector of size $\text{embedding} \times 1$, and indeed $U(\hat{y} - y)^T$ has shape $(\text{embedding} \times \text{vocab}_{size}) \cdot (\text{vocab}_{size} \times 1) = \text{embedding} \times 1$.
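A finite-difference check (my own sketch, names are illustrative) that $\partial J / \partial v_c = U(\hat{y} - y)^T$ when the columns of $U$ are the outside vectors:

```python
import numpy as np

np.random.seed(1)
d, V, o = 4, 6, 3               # embedding dim, vocab size, outside word
U = np.random.randn(d, V)       # column i is u_i
v_c = np.random.randn(d)

def loss(v):
    scores = U.T @ v            # (V,)
    y_hat = np.exp(scores) / np.sum(np.exp(scores))
    return -np.log(y_hat[o])

y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
y = np.zeros(V); y[o] = 1.0
analytic = U @ (y_hat - y)      # the derived formula, shape (d,)

eps = 1e-6                      # central differences, one coordinate at a time
numeric = np.array([(loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
assert np.allclose(analytic, numeric, atol=1e-5)
```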

(c)
$$\begin{aligned} \frac{\partial J}{\partial u_w} &= - \frac{\partial \log p(O=o \mid C = c)}{\partial u_w} \\ &= -\frac{\partial \log \exp(u_o^T v_c)}{\partial u_w} + \frac{\partial \left( \log \sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c) \right)}{\partial u_w} \end{aligned}$$
When $w = o$:
$$\begin{aligned} &= -\frac{\partial \log \exp(u_o^T v_c)}{\partial u_o} + \frac{\partial \left( \log \sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c) \right)}{\partial u_o} \\ &= -v_c + \frac{1}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c)} \cdot \exp(u_o^T v_c) \cdot v_c \\ &= -v_c + p(O=o \mid C=c)\, v_c \\ &= (p(O=o \mid C=c) - 1)\, v_c \end{aligned}$$
When $w \ne o$:
$$\begin{aligned} &= -\frac{\partial \log \exp(u_o^T v_c)}{\partial u_w} + \frac{\partial \left( \log \sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c) \right)}{\partial u_w} \\ &= \frac{1}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^T v_c)} \cdot \exp(u_w^T v_c) \cdot v_c \\ &= p(O=w \mid C=c)\, v_c \end{aligned}$$
Therefore the gradient for $u_o$ is $(p(O=o \mid C=c) - 1)\, v_c$, while for every other word $u_w$ ($w \in \mathrm{Vocab}$, $w \ne o$) it is $p(O=w \mid C=c)\, v_c$. Combining the two cases gives:
$$\begin{aligned} \frac{\partial J}{\partial U} &= - \frac{\partial \log p(O=o \mid C = c)}{\partial U} \\ &= v_c(\hat{y} - y) \end{aligned}$$
You can check that the result has the same dimensions as $U$.
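An analogous finite-difference check (again my own sketch) that $\partial J / \partial U = v_c(\hat{y} - y)$, an outer product with the same $(\text{embedding} \times \text{vocab}_{size})$ shape as $U$:

```python
import numpy as np

np.random.seed(2)
d, V, o = 4, 6, 1
U = np.random.randn(d, V)       # columns are outside vectors
v_c = np.random.randn(d)

def loss(Umat):
    scores = Umat.T @ v_c
    y_hat = np.exp(scores) / np.sum(np.exp(scores))
    return -np.log(y_hat[o])

y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
y = np.zeros(V); y[o] = 1.0
analytic = np.outer(v_c, y_hat - y)   # shape (d, V), same as U

eps = 1e-6
numeric = np.zeros_like(U)            # perturb one entry of U at a time
for i in range(d):
    for j in range(V):
        Up, Um = U.copy(), U.copy()
        Up[i, j] += eps; Um[i, j] -= eps
        numeric[i, j] = (loss(Up) - loss(Um)) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-5)
```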

(d)
Since $\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1} = 1 - \frac{1}{e^x+1}$:
$$\begin{aligned} \frac{\partial \sigma(x)}{\partial x} &= -\frac{\partial}{\partial x}\left( \frac{1}{e^x+1} \right) \\ &= \frac{e^x}{(e^x+1)^2} \\ &= \sigma(x)(1 - \sigma(x)) \end{aligned}$$
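The identity is easy to confirm numerically; a small check of my own comparing a central difference against $\sigma(x)(1-\sigma(x))$:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))                     # derived formula
assert np.allclose(numeric, analytic, atol=1e-6)
```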

(e)

  • Derivative with respect to $v_c$:
    $$\begin{aligned} \frac{\partial J}{\partial v_c} &= -\frac{1}{\sigma(u_o^T v_c)}\sigma(u_o^T v_c)(1-\sigma(u_o^T v_c))\, u_o - \sum_{k=1}^K \frac{1}{\sigma(-u_k^T v_c)}\sigma(-u_k^T v_c)(1-\sigma(-u_k^T v_c))(-u_k) \\ &= -(1-\sigma(u_o^T v_c))\, u_o + \sum_{k=1}^K (1-\sigma(-u_k^T v_c))\, u_k \end{aligned}$$
    Note that the gradient of $v_c$ is a linear combination of $u_o$ and the $u_k$, where each coefficient is 1 minus the sigmoid of the (possibly negated) inner product of that word's vector with $v_c$.

  • Derivative with respect to $u_o$:
    $$\begin{aligned} \frac{\partial J}{\partial u_o} &= -\frac{1}{\sigma(u_o^T v_c)}\sigma(u_o^T v_c)(1-\sigma(u_o^T v_c))\, v_c \\ &= -(1-\sigma(u_o^T v_c))\, v_c \end{aligned}$$
    The gradient of $u_o$ depends only on its own vector and $v_c$.

  • Derivative with respect to $u_i$ $(i \in \{1,2,3,\dots,K\})$:
    $$\begin{aligned} \frac{\partial J}{\partial u_i} &= -\frac{1}{\sigma(-u_i^T v_c)}\sigma(-u_i^T v_c)(1-\sigma(-u_i^T v_c))(-v_c) \\ &= (1-\sigma(-u_i^T v_c))\, v_c \end{aligned}$$
    Likewise, the gradient of each negative sample $u_i$ depends only on itself and $v_c$; the only difference is the sign in front of the inner product.
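The $v_c$ gradient above can be verified by finite differences as well; a self-contained sketch of mine (vectors held as rows here purely for convenience):

```python
import numpy as np

np.random.seed(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
d, K = 4, 3
u_o = np.random.randn(d)        # true outside word
u_neg = np.random.randn(K, d)   # rows are the K negative samples
v_c = np.random.randn(d)

def loss(v):
    # J = -log sigma(u_o^T v) - sum_k log sigma(-u_k^T v)
    return -np.log(sigmoid(u_o @ v)) - np.sum(np.log(sigmoid(-u_neg @ v)))

analytic = (sigmoid(u_o @ v_c) - 1) * u_o \
         + (1 - sigmoid(-u_neg @ v_c)) @ u_neg   # derived formula

eps = 1e-6
numeric = np.array([(loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
assert np.allclose(analytic, numeric, atol=1e-5)
```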

From (b) and (c), with naive softmax the gradients of both $v_c$ and $u_o$ require matrix products in which every word vector participates, and the forward pass that computes the loss also touches the whole vocabulary. From (e), negative sampling only involves $u_o$, the $K$ sampled vectors, and $v_c$, so it needs far less computation.
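To make the difference concrete, here is a small sketch of my own showing that the negative-sampling update writes to only $K{+}1$ (distinct) rows of the outside-vector matrix, whereas naive softmax would update every row:

```python
import numpy as np

np.random.seed(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
V, d, o = 20, 5, 7              # illustrative vocab size, dim, outside word
neg = [2, 11, 11]               # K = 3 negative samples (one drawn twice)
U = np.random.randn(V, d)       # rows are outside vectors
v_c = np.random.randn(d)

grad_U = np.zeros_like(U)
grad_U[o] += (sigmoid(U[o] @ v_c) - 1.0) * v_c          # true outside word
for k in neg:                                           # each sample, incl. repeats
    grad_U[k] += (1.0 - sigmoid(-U[k] @ v_c)) * v_c

touched = sorted(set([o] + neg))
untouched = [i for i in range(V) if i not in touched]
assert np.all(grad_U[untouched] == 0.0)  # all other rows stay exactly zero
```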
(f)

  • (i)
    $$\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)}{\partial U} = \sum_{-m \le j \le m,\, j \ne 0} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}$$
  • (ii)
    $$\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)}{\partial v_c} = \sum_{-m \le j \le m,\, j \ne 0} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}$$
  • (iii)
    $$\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)}{\partial v_w} = 0 \quad (w \ne c)$$

Coding: Implementing word2vec

The complete code is on my GitHub.
When setting up the environment, the install kept failing because of slow downloads, so I added the Tsinghua mirror to the env.yml file.
(a) Implement sigmoid, the softmax and negative-sampling loss and gradient, and the skip-gram model in word2vec.py.

  • sigmoid: $s = \frac{1}{1 + e^{-x}}$
    so the code is simply: s = 1/(1+np.exp(-x))
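One caveat worth noting: for very negative x the one-liner overflows inside np.exp (NumPy still returns the correct limit of 0, but with a RuntimeWarning). A warning-free variant, as a sketch, splits on the sign of x:

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid for a NumPy array x."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp argument <= 0 here
    ex = np.exp(x[~pos])                       # safe: x < 0, so exp(x) <= 1
    out[~pos] = ex / (1.0 + ex)
    return out

assert np.allclose(sigmoid(np.array([-1000.0, 0.0, 1000.0])), [0.0, 0.5, 1.0])
```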
  • naiveSoftmaxLossAndGradient:
    • In the code the word vectors are row vectors: $U$ and $V$ are both organized with one word vector per row, which differs from the column-vector convention used in the written part above; just transpose the formulas.
    • numpy.array([1,2,3]) has shape (3,), i.e. it degenerates to a 1-D array. It can act as either a row or a column vector depending on which side of a matrix product it appears on; see this post: https://blog.csdn.net/alxe_made/article/details/80492649
    • Following the formulas derived above, you just need to get the dimensions right:
    ### YOUR CODE HERE
    op = np.dot(centerWordVec, outsideVectors.T)  # (embedding,) × (embedding, n) -> (n,)
    y_hat = softmax(op)  # row vector, shape (n,)
    delta = y_hat.copy()
    delta[outsideWordIdx] -= 1  # y_hat - y
    loss = -np.log(y_hat[outsideWordIdx])
    gradCenterVec = np.dot(delta, outsideVectors)  # (n,) × (n, embedding) -> (embedding,)
    gradOutsideVecs = np.dot(delta[:, np.newaxis], centerWordVec[np.newaxis, :])  # outer product; np.newaxis adds a dimension
    ### END YOUR CODE

    return loss, gradCenterVec, gradOutsideVecs
  • negSamplingLossAndGradient:
    ### YOUR CODE HERE
    gradCenterVec = np.zeros(centerWordVec.shape)
    gradOutsideVecs = np.zeros(outsideVectors.shape) 
    loss = 0.0
    z = sigmoid(np.dot(outsideVectors[outsideWordIdx],centerWordVec)) #(embedding,) × (embedding,)
    loss = -np.log(z)
    gradCenterVec += (z-1.0)*outsideVectors[outsideWordIdx]
    gradOutsideVecs[outsideWordIdx] += (z-1.0)*centerWordVec
    # vectorized implementation for the negative samples
    u_k = outsideVectors[negSampleWordIndices]  # gather the word vectors of all negative samples
    z = sigmoid(-np.dot(u_k,centerWordVec)) #(k,)
    loss += np.sum(-np.log(z))
    gradCenterVec += np.dot((1.0-z),u_k)  #(k,) × (k,embedding) = (embedding,)
    for i, negSampleWordIdx in enumerate(negSampleWordIndices):  # note: repeated negative samples must each be counted
        gradOutsideVecs[negSampleWordIdx] += (1.0 - z[i]) * centerWordVec
    ### END YOUR CODE
    return loss, gradCenterVec, gradOutsideVecs
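The explicit loop over negSampleWordIndices (rather than a single fancy-indexed update) matters: NumPy's buffered `+=` on repeated indices applies the update only once, so duplicate negative samples would be undercounted. A minimal demonstration, with `np.add.at` as the standard vectorized alternative:

```python
import numpy as np

grad = np.zeros(4)
idx = np.array([1, 1, 2])      # index 1 sampled twice
grad[idx] += 1.0               # buffered update: index 1 only receives +1
assert grad[1] == 1.0

grad2 = np.zeros(4)
np.add.at(grad2, idx, 1.0)     # unbuffered: duplicates accumulate correctly
assert grad2[1] == 2.0
```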
  • skipgram
    • A window contains many context words, so we call the function implemented above once per context word and sum up the losses and gradients.
    ### YOUR CODE HERE
    centerWordVec = centerWordVectors[word2Ind[currentCenterWord]]
    for word in outsideWords:
        idx = word2Ind[word]
        l, v_c, U_grad = word2vecLossAndGradient(centerWordVec, idx, outsideVectors, dataset)
        loss += l
        gradCenterVecs[word2Ind[currentCenterWord]] += v_c
        gradOutsideVectors += U_grad
    ### END YOUR CODE
    return loss, gradCenterVecs, gradOutsideVectors

(b) sgd.py
This one is straightforward; just add the code below.

### YOUR CODE HERE
loss,gradient = f(x)
x -= step*gradient
### END YOUR CODE
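A tiny smoke test of that update rule on $f(x) = \lVert x \rVert^2$, whose gradient is $2x$ (names here are illustrative, not from sgd.py):

```python
import numpy as np

def f(x):
    """Return (loss, gradient) for f(x) = ||x||^2."""
    return np.sum(x**2), 2 * x

x = np.array([3.0, -2.0])
step = 0.1
for _ in range(100):           # repeated SGD steps drive x to the minimum
    loss, gradient = f(x)
    x -= step * gradient
assert np.allclose(x, 0.0, atol=1e-6)
```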

(c)

  • This part needs no code, but you must download stanfordSentimentTreebank.zip. The download is slow, so I found a copy online; if you need it, grab it from my GitHub, under the util/dataset directory.
  • Run run.py to start training. Because this is SGD, the loss sometimes goes up, but the overall trend is downward. Training takes quite a long time.
  • With so little training data the results are not great, but you can clearly see that some words have started to move closer together: for example coffee and tea, female and woman, and the adjectives around (0.00, -0.04).