Differentiating the Cross-Entropy Loss
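Cross entropy is a common loss function for classification problems. Earlier articles covered the derivation of binary cross entropy and the role cross entropy plays. This one works through the derivative of the cross-entropy loss.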
First, denote the outputs of the neurons in the model's last layer by $f_{0}, \dots, f_{C-1}$, where $C$ is the number of classes, and denote the outputs after softmax activation by $p_{0}, \dots, p_{C-1}$. Then:

$$p_{i} = \frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}}$$
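The softmax definition can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `softmax` and the input values are chosen here for the example, and the subtraction of the maximum is an added numerical safeguard that does not appear in the formula (it cancels between numerator and denominator).

```python
import math

def softmax(f):
    """Map raw outputs f_0..f_{C-1} to probabilities p_0..p_{C-1}."""
    # Assumption for this sketch: shift by max(f) before exponentiating.
    # The shift cancels in the ratio, so the result matches the formula,
    # but it keeps math.exp from overflowing on large inputs.
    m = max(f)
    exps = [math.exp(x - m) for x in f]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
print(p)       # three positive values, largest for the largest input
print(sum(p))  # 1 up to floating-point rounding
```

Because the denominator sums over every $f_{k}$, changing a single input changes every output, which is what makes the derivative below depend on all classes.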
Denote the true class labels by $y_{0}, \dots, y_{C-1}$. The cross-entropy loss $L$ is then:

$$L = -\sum_{i=0}^{C-1} y_{i} \log p_{i}$$
Here $\log$ is shorthand. To simplify the differentiation that follows, its base is taken to be $e$, so $\log$ means $\ln$.
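As a small illustration of the loss with the natural logarithm, the sketch below evaluates $L$ for one sample. The function name `cross_entropy` and the example vectors are assumptions made for this example. Skipping terms with $y_{i} = 0$ is a convenience added here so that a zero probability paired with a zero label does not trigger a math error.

```python
import math

def cross_entropy(y, p):
    """L = -sum_i y_i * ln(p_i), using the natural logarithm."""
    # Terms with y_i == 0 contribute nothing, so they are skipped.
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi != 0)

y = [0.0, 0.0, 1.0]   # one-hot label: the true class is index 2
p = [0.1, 0.2, 0.7]   # hypothetical softmax outputs
loss = cross_entropy(y, p)
print(loss)           # equals -ln(0.7)
```

With a one-hot label only the true class contributes, so the loss reduces to the negative log-probability assigned to that class.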
Now take the partial derivative $\frac{\partial L}{\partial f_{i}}$ of $L$ with respect to the output $f_{i}$ of the $i$-th neuron. Writing $L_{j} = -y_{j}\log p_{j}$ for the $j$-th term of the loss, the chain rule gives:

$$\frac{\partial L}{\partial f_{i}} = \sum_{j=0}^{C-1} \frac{\partial L_{j}}{\partial p_{j}}\frac{\partial p_{j}}{\partial f_{i}}$$
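A note on the indices: the softmax definition uses the subscripts $i$ and $k$, and the cross-entropy sum also uses $i$, but these two $i$ do not play the same role. The softmax denominator contains the output $f$ of every neuron, so the partial derivative of every activated $p$ with respect to any $f_{i}$ is nonzero, and $L$ in turn contains every $p$. To avoid a clash, we introduce a new subscript $j$ for $p$, where $j$ takes the $C$ values $0, \dots, C-1$.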
Differentiating each factor in turn, and recalling that $\log$ is taken to be $\ln$:

$$\frac{\partial L_{j}}{\partial p_{j}} = \frac{\partial (-y_{j}\log p_{j})}{\partial p_{j}} = -\frac{y_{j}}{p_{j}}$$
Next we need $\frac{\partial p_{j}}{\partial f_{i}}$. Every $p_{j}$ contains $f_{i}$, so none of these derivatives is zero, but the result differs between $j = i$ and $j \neq i$, so the two cases are handled separately. Below, a prime denotes differentiation with respect to $f_{i}$.

- First, when $j = i$:

$$\frac{\partial p_{j}}{\partial f_{i}} = \frac{\partial p_{i}}{\partial f_{i}} = \frac{\partial}{\partial f_{i}}\left(\frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}}\right)$$

$$= \frac{(e^{f_{i}})' \sum_{k=0}^{C-1} e^{f_{k}} - e^{f_{i}}\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)'}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}}$$

$$= \frac{e^{f_{i}}\sum_{k=0}^{C-1} e^{f_{k}} - (e^{f_{i}})^{2}}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}} = \frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}} - \left(\frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}}\right)^{2}$$

$$= p_{i} - (p_{i})^{2} = p_{i}(1-p_{i})$$

- Then, when $j \neq i$, the numerator $e^{f_{j}}$ does not depend on $f_{i}$, so $(e^{f_{j}})' = 0$, while the denominator still has derivative $e^{f_{i}}$:

$$\frac{\partial p_{j}}{\partial f_{i}} = \frac{\partial}{\partial f_{i}}\left(\frac{e^{f_{j}}}{\sum_{k=0}^{C-1} e^{f_{k}}}\right)$$

$$= \frac{(e^{f_{j}})' \sum_{k=0}^{C-1} e^{f_{k}} - e^{f_{j}}\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)'}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}}$$

$$= \frac{-e^{f_{i}} e^{f_{j}}}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}} = -\frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}} \cdot \frac{e^{f_{j}}}{\sum_{k=0}^{C-1} e^{f_{k}}}$$

$$= -p_{i}p_{j}$$
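The two cases can be sanity-checked numerically. The sketch below compares the analytic entries $p_{i}(1-p_{i})$ and $-p_{i}p_{j}$ against a central finite-difference estimate. All function names, the step size `eps`, and the test vector are assumptions made for this example, and a small difference only suggests agreement at this one point rather than proving the formulas.

```python
import math

def softmax(f):
    m = max(f)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(x - m) for x in f]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(p):
    """Analytic entries J[j][i] = dp_j/df_i from the two cases."""
    C = len(p)
    return [[p[i] * (1 - p[i]) if j == i else -p[i] * p[j]
             for i in range(C)]
            for j in range(C)]

def numeric_jacobian(f, eps=1e-6):
    """Central-difference estimate of dp_j/df_i."""
    C = len(f)
    J = [[0.0] * C for _ in range(C)]
    for i in range(C):
        plus = list(f)
        plus[i] += eps
        minus = list(f)
        minus[i] -= eps
        p_plus, p_minus = softmax(plus), softmax(minus)
        for j in range(C):
            J[j][i] = (p_plus[j] - p_minus[j]) / (2 * eps)
    return J

f = [0.5, -1.0, 2.0]
analytic = softmax_jacobian(softmax(f))
numeric = numeric_jacobian(f)
max_diff = max(abs(analytic[j][i] - numeric[j][i])
               for j in range(3) for i in range(3))
print(max_diff)  # small when the analytic and numeric values agree
```

Each column of this matrix sums to zero, which reflects the fact that the softmax outputs always sum to 1 regardless of the inputs.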
For the final partial derivative, add the two parts together, the single $j = i$ term and the sum over all $j \neq i$:

$$\frac{\partial L}{\partial f_{i}} = \frac{\partial L_{i}}{\partial p_{i}}\frac{\partial p_{i}}{\partial f_{i}} + \sum_{j\neq i}^{C-1} \frac{\partial L_{j}}{\partial p_{j}}\frac{\partial p_{j}}{\partial f_{i}}$$

$$= -\frac{y_{i}}{p_{i}}\,p_{i}(1-p_{i}) + \sum_{j\neq i}^{C-1} (-p_{i}p_{j})\left(-\frac{y_{j}}{p_{j}}\right)$$

$$= -y_{i}(1-p_{i}) + \sum_{j\neq i}^{C-1} p_{i}y_{j}$$

$$= y_{i}p_{i} - y_{i} + \sum_{j\neq i}^{C-1} p_{i}y_{j}$$
In the expression above, the sum over $j \neq i$ is missing exactly the $j = i$ term, and $y_{i}p_{i}$ supplies it, so the expression can be rewritten as:

$$\frac{\partial L}{\partial f_{i}} = \sum_{j=0}^{C-1} p_{i}y_{j} - y_{i}$$

$$= p_{i}\sum_{j=0}^{C-1} y_{j} - y_{i}$$
Since $\sum_{j=0}^{C-1} y_{j} = 1$ (as is the case for one-hot labels), we get:

$$\frac{\partial L}{\partial f_{i}} = p_{i}\sum_{j=0}^{C-1} y_{j} - y_{i} = p_{i} - y_{i}$$
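The final result can be checked the same way. The sketch below compares $p_{i} - y_{i}$ against a central finite-difference estimate of $\frac{\partial L}{\partial f_{i}}$. The function names, the step size `eps`, and the example inputs are assumptions made for this illustration, and the label vector is one-hot so that it sums to 1 as the derivation requires.

```python
import math

def softmax(f):
    m = max(f)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(x - m) for x in f]
    total = sum(exps)
    return [e / total for e in exps]

def loss(f, y):
    """Cross-entropy of softmax(f) against labels y, natural log."""
    p = softmax(f)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi != 0)

def analytic_grad(f, y):
    """dL/df_i = p_i - y_i, valid when the labels sum to 1."""
    return [pi - yi for pi, yi in zip(softmax(f), y)]

def numeric_grad(f, y, eps=1e-6):
    """Central-difference estimate of dL/df_i."""
    grad = []
    for i in range(len(f)):
        plus = list(f)
        plus[i] += eps
        minus = list(f)
        minus[i] -= eps
        grad.append((loss(plus, y) - loss(minus, y)) / (2 * eps))
    return grad

f = [0.5, -1.0, 2.0]
y = [0.0, 1.0, 0.0]  # one-hot label, sums to 1
ga = analytic_grad(f, y)
gn = numeric_grad(f, y)
max_diff = max(abs(a - b) for a, b in zip(ga, gn))
print(ga)
print(max_diff)  # small when the analytic and numeric gradients agree
```

The gradient is negative only at the true class and positive elsewhere, so a gradient-descent step raises the true class's output and lowers the others.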