(5) Machine Learning: K-Fold Cross-Validation (an iris Dataset Example)
1. What is K-fold cross-validation
Definition: split the training set into K parts. In each round, one part serves as the test (validation) fold and the remaining K-1 parts as the training folds. Repeat for K rounds and take the average of the K scores as the final evaluation.
class sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)
n_splits: the number of folds to split the training set into; 10 is a common choice
shuffle: whether to shuffle the data before splitting; default=False
random_state: seed for the random number generator (only takes effect when shuffle=True)
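To see how these parameters behave, here is a minimal sketch on toy data showing KFold handing out (train, test) index pairs:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 2 features, so each of the 5 test folds holds 10/5 = 2 samples.
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
splits = list(kf.split(X))  # list of (train_indices, test_indices) pairs

for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: test indices = {test_idx}")
```

Every sample appears in exactly one test fold, which is what lets K-fold use the whole dataset for both training and evaluation.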

2. Why introduce K-fold cross-validation
When training a model, we have to hold out part of the data as a test set, so in a sense we lose part of the dataset. K-fold cross-validation makes full use of that data: every sample gets to serve in both training and evaluation across the K rounds.
3. How to implement K-fold cross-validation
Here I use the iris dataset bundled with sklearn and the support vector classifier SVC; of course, other classifiers work too. In addition, sklearn provides a method that directly reports the model's K-fold cross-validation scores:
sklearn.model_selection.cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
params:
estimator: the classifier (or other model) to evaluate
X: feature matrix
y: target values
scoring: model evaluation metric
cv: cross-validation strategy; defaults to 5-fold. An int specifies the number of folds for (Stratified)KFold: StratifiedKFold is used for classifiers, KFold for everything else
n_jobs: set to -1 to use all processors
returns:
An array with one score per cross-validation round; cross_val_score(...).mean() is commonly used to read off the average directly
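A minimal usage sketch, using the same iris data and SVC classifier as the walkthrough below:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf')

# cv=5 (the default) yields one accuracy score per held-out fold.
scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
print(scores)
print(scores.mean())
```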
The steps are as follows:
3.1 Import the necessary packages:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler
import math
3.2 Load the iris dataset and preprocess it
iris = load_iris()
X = iris.data
y = iris.target
std = StandardScaler()
X = std.fit_transform(X)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=0)
3.3 Configure KFold
KF = KFold(n_splits=10,random_state=7,shuffle=True)
3.4 Tune hyperparameters to find a good combination
indexi = -1
indexj = -1
bestscore = -1
# Grid-search over powers of 3 for gamma (g) and C, scored by 10-fold CV accuracy
for i in range(5, -18, -2):
    for j in range(-3, 18, 2):
        g = math.pow(3,i)
        c = math.pow(3,j)
        clf = SVC(C=c,gamma=g,kernel='rbf',probability=True,random_state=7)
        score = cross_val_score(clf,X_train,y_train,cv=KF,scoring='accuracy',n_jobs=-1).mean()
        if score > bestscore:
            indexi = i
            indexj = j
            bestscore = score
print(indexi,indexj,bestscore)
3.5 Train the model with the best parameters and check the result
g = math.pow(3,indexi)
c = math.pow(3,indexj)
clf = SVC(C=c,gamma=g,kernel='rbf',probability=True,random_state=7)
clf.fit(X_train,y_train)
print(clf.score(X_test,y_test))
Appendix:
Full code:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler
import math
iris = load_iris()
X = iris.data
y = iris.target
std = StandardScaler()
X = std.fit_transform(X)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=0)
KF = KFold(n_splits=10,random_state=7,shuffle=True)
indexi = -1
indexj = -1
bestscore = -1
for i in range(5, -18, -2):
    for j in range(-3, 18, 2):
        g = math.pow(3,i)
        c = math.pow(3,j)
        clf = SVC(C=c,gamma=g,kernel='rbf',probability=True,random_state=7)
        score = cross_val_score(clf,X_train,y_train,cv=KF,scoring='accuracy',n_jobs=-1).mean()
        if score > bestscore:
            indexi = i
            indexj = j
            bestscore = score
print(indexi,indexj,bestscore)
g = math.pow(3,indexi)
c = math.pow(3,indexj)
clf = SVC(C=c,gamma=g,kernel='rbf',probability=True,random_state=7)
clf.fit(X_train,y_train)
print(clf.score(X_test,y_test))
4. Stratified cross-validation
Stratified cross-validation is aimed at classification problems, and matters most when the classes are imbalanced (the classic case being binary classification with few positives).
The difference from standard cross-validation: standard KFold splits by sample position only, while stratified splitting preserves the overall class proportions in every fold. (The original figure illustrating this is omitted.)
The core code is:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
# pass skf as the cv argument of cross_val_score, and you're done
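To see the stratification at work, the sketch below (with toy imbalanced labels, assumed purely for illustration) counts the classes in each test fold; every fold keeps the overall 2:1 class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 8 samples of class 0, 4 of class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((12, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_counts = [np.bincount(y[test_idx]).tolist() for _, test_idx in skf.split(X, y)]

# Each of the 4 test folds contains 2 samples of class 0 and 1 sample of class 1.
print(fold_counts)
```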
5. Repeated cross-validation
If a single split of the training data does not represent the population well, even stratified cross-validation loses its meaning. In that case, repeated cross-validation can be used: the whole K-fold procedure is run several times with different shuffles, and all the scores are averaged.
The core code is:
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5,n_repeats=2,random_state=0)
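Passing rkf as the cv argument multiplies the number of evaluations: 5 folds times 2 repeats gives 10 scores. A sketch on iris with SVC, as elsewhere in this article:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

# n_splits * n_repeats = 10 scores; averaging them smooths split-to-split variance.
scores = cross_val_score(SVC(), X, y, cv=rkf, scoring='accuracy')
print(len(scores), scores.mean())
```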