Financial Risk Control in Practice: Feature Engineering (Part 1)
Feature Engineering

Business modeling workflow
- Abstract the business problem as classification or regression
- Define the label and obtain y
- Select suitable samples and match in all available information as the feature source
- Feature engineering + model training + model evaluation and tuning (these steps can feed back into each other)
- Produce the model report
- Deploy and monitor
What is a feature
In the context of machine learning, a feature is an individual property, or a set of properties, used to explain a phenomenon. Once such properties are converted into some measurable form, they are called features.
For example, suppose you have a list of students that contains each student's name, hours studied, IQ, and total score on previous exams. A new student arrives whose hours studied and IQ you know, but whose exam score is missing, and you need to estimate the score he or she is likely to obtain.
Here you would build a model that estimates the missing score from IQ and study_hours, so IQ and study_hours become the features of that model.
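As a minimal sketch of this idea (all numbers below are made up for illustration), a linear model can estimate the missing score from the two features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: each row is [IQ, study_hours]
X = np.array([[110, 40], [120, 30], [100, 50], [125, 45]], dtype=float)
y = np.array([75.0, 80.0, 70.0, 88.0])  # known exam scores

model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[115.0, 35.0]]))  # estimate for the new student
```

The model and data here are only a stand-in; any regressor that accepts the two feature columns would play the same role.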
What feature engineering may include
- Basic feature construction
- Data preprocessing
- Feature derivation
- Feature selection
This is one complete feature-engineering pipeline, but not the only one: any of the steps may swap order, and the concrete scenario has to be analyzed on its own terms.
import pandas as pd
import numpy as np
df_train = pd.read_csv('/Users/zhucan/Desktop/金融风控实战/第三课资料/train.csv')
df_train.head()
Output:
# inspect the basic shape of the data
df_train.shape
#(891, 12)
df_train.info()
Output:

df_train.describe()
Output:

# box plot
df_train.boxplot(column = "Age")
Output:

import seaborn as sns
sns.set(color_codes = True)
np.random.seed(sum(map(ord,"distributions"))) # fix the random seed
sns.distplot(df_train.Age, kde = True, bins = 20, rug = True) # distplot is deprecated in recent seaborn; histplot/displot is the modern replacement
Output:
set(df_train.label)
#{0, 1}
Data preprocessing
(1) Missing values
The two main tools:
- pandas fillna
- sklearn Imputer (renamed SimpleImputer in sklearn >= 0.22)
df_train['Age'].sample(10)
#299 50.0
#408 21.0
#158 NaN
#672 70.0
#172 1.0
#447 34.0
#86 16.0
#824 2.0
#527 NaN
#327 36.0
#Name: Age, dtype: float64
df_train['Age'].fillna(value=df_train['Age'].mean()).sample(10)
#115 21.000000
#372 19.000000
#771 48.000000
#379 19.000000
#855 18.000000
#231 29.000000
#641 24.000000
#854 44.000000
#303 29.699118
#0 22.000000
#Name: Age, dtype: float64
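The sklearn counterpart of `fillna` mentioned above is `Imputer` in old releases; since sklearn 0.22 it lives at `sklearn.impute.SimpleImputer`. A minimal sketch on a made-up column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[50.0], [21.0], [np.nan], [70.0], [34.0]])
imputer = SimpleImputer(strategy='mean')   # same effect as fillna with the column mean
ages_filled = imputer.fit_transform(ages)  # NaN -> 43.75, the mean of the observed values
```

Unlike `fillna`, the imputer remembers the training-set mean, so the same fill value can be reapplied to new data via `transform`.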
(2) Numerical features
Scaling
"""Transforms such as taking the log can ease a heavily skewed distribution
by compressing the large values"""
import numpy as np
log_age = df_train['Age'].apply(lambda x: np.log(x))
df_train.loc[:, 'log_age'] = log_age
df_train.head(10)
Output:
""" Min-max scaling: rescale each value into the [0, 1] interval """
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
fare_trans = mm_scaler.fit_transform(df_train[['Fare']])
""" Standardization: rescale each column to zero mean and unit variance """
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
fare_std_trans = std_scaler.fit_transform(df_train[['Fare']])
""" Robust scaling: center by the median and scale by the quartiles, so it is insensitive to outliers """
from sklearn.preprocessing import robust_scale
fare_robust_trans = robust_scale(df_train[['Fare','Age']])
""" Normalization: rescale each row (each sample) to unit norm """
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
fare_normal_trans = normalizer.fit_transform(df_train[['Age','Fare']])
(3) Statistics-based features
""" max / min """
max_age = df_train['Age'].max()
min_age = df_train["Age"].min()
""" Quantiles; the crudest way to handle extreme values is to replace everything below the 1st / above the 99th percentile with those two endpoints """
age_quarter_01 = df_train['Age'].quantile(0.01)
age_quarter_99 = df_train['Age'].quantile(0.99)
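The crude capping just described can be implemented with `clip`, sketched here on a synthetic series (applying it to `df_train['Age']` would look the same):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(101, dtype=float))      # stand-in for a numeric column
low, high = s.quantile(0.01), s.quantile(0.99)  # the two endpoints
s_capped = s.clip(lower=low, upper=high)        # values beyond them are replaced by the endpoints
```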
""" 四则运算 """
df_train.loc[:,'family_size'] = df_train['SibSp']+df_train['Parch']+1
df_train.loc[:,'tmp'] = df_train['Age']*df_train['Pclass'] + 4*df_train['family_size']
""" 多项式特征 """
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
df_train[['SibSp','Parch']].head()
poly_fea = poly.fit_transform(df_train[['SibSp','Parch']])
pd.DataFrame(poly_fea,columns = poly.get_feature_names()).head()
(4) Discretization / binning / bucketing
""" equal-width binning """
df_train.loc[:, 'fare_cut'] = pd.cut(df_train['Fare'], 20)
df_train.head()
""" equal-frequency binning: each bin holds roughly the same number of samples """
""" equal-frequency binning is the usual choice in practice """
df_train.loc[:, 'fare_qcut'] = pd.qcut(df_train['Fare'], 10)
df_train.head()
Output:
(5) BiVar plot
""" A BiVar plot puts the feature bins in ascending order on the x-axis and the bad rate on the y-axis """
""" bad-rate curve """
df_train = df_train.sort_values('Fare')
alist = list(set(df_train['fare_qcut']))
badrate = {}
for x in alist:
    a = df_train[df_train.fare_qcut == x]
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    badrate[x] = bad / (bad + good)
f = zip(badrate.keys(), badrate.values())
f = sorted(f, key=lambda x: x[1], reverse=True)
badrate = pd.DataFrame(f)
badrate.columns = pd.Series(['cut', 'badrate'])
badrate = badrate.sort_values('cut')
print(badrate)
badrate.plot("cut", "badrate", figsize=(10,4)) # .plot works directly on a DataFrame or Series
Output:
Equal-frequency binning is generally preferred; equal-width binning is rarely used because it can leave the bins extremely unbalanced.
Usually 5-6 bins are used, with the goal of turning the badrate curve from non-strictly monotonic into strictly monotonic while keeping the bin proportions balanced.
A BiVar plot should (1) be explainable by the business; (2) not be too flat: a flat curve (like a "zodiac sign" variable) carries little signal, so the variable should be dropped; (3) become strictly monotonic after coarse binning.
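That coarse-binning check can be sketched on synthetic data (the variable names here are invented): cut into 5 equal-frequency bins and inspect the badrate per bin:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
value = rng.rand(2000)                              # synthetic feature
label = (rng.rand(2000) < value * 0.5).astype(int)  # bad rate rises with the feature value

df = pd.DataFrame({'value': value, 'label': label})
df['bin'] = pd.qcut(df['value'], 5)                 # 5 equal-frequency bins
badrate = df.groupby('bin', observed=True)['label'].mean()
monotone = badrate.is_monotonic_increasing          # what coarse binning should achieve
```

With real data, bins would be merged or the cut points adjusted until `monotone` holds and the curve is explainable.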
(6) One-hot encoding
""" OneHot encoding """
""" Low-cardinality categorical variables such as sex (male/female) are typically one-hot encoded into 0/1 columns, mainly via pd.get_dummies """
fare_qcut_oht = pd.get_dummies(df_train[['fare_qcut']])
fare_qcut_oht.head()
embarked_oht = pd.get_dummies(df_train[['Embarked']])
embarked_oht.head()
Output:

One-hot encoding can blow up the dimensionality; binning first and then one-hot encoding keeps it manageable.
Binning loses information, but it buys stability and robustness.
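The "bin first, then one-hot" advice, as a quick sketch on a synthetic fare column:

```python
import numpy as np
import pandas as pd

fare = pd.Series(np.arange(100, dtype=float))
binned = pd.qcut(fare, 4, labels=['q1', 'q2', 'q3', 'q4'])  # 100 distinct values -> 4 bins
oht = pd.get_dummies(binned, prefix='fare')                 # only 4 one-hot columns
```

One-hot encoding the raw values would have produced 100 columns; binning first caps it at the number of bins.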
(7) Date/time features
''' parsing dates '''
car_sales = pd.read_csv('/Users/zhucan/Desktop/金融风控实战/第三课资料/car_data.csv')
print(car_sales.head())
car_sales.loc[:, 'date'] = pd.to_datetime(car_sales['date_t'])
print(car_sales.head())
Output:
car_sales.info() # the raw column is a string; after the conversion it is a datetime
Output:

""" extract the key date components """
""" month """
car_sales.loc[:, 'month'] = car_sales['date'].dt.month
""" day of the month """
car_sales.loc[:, 'dom'] = car_sales['date'].dt.day
""" day of the year """
car_sales.loc[:, 'doy'] = car_sales['date'].dt.dayofyear
""" day of the week """
car_sales.loc[:, 'dow'] = car_sales['date'].dt.dayofweek
print(car_sales.head())
Output:
(8) Text features
""" bag-of-words model """
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is a very good class',
    'students are very very very good',
    'This is the third sentence',
    'Is this the last doc',
    'PS teacher Mei is very very handsome'
]
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # use get_feature_names() on sklearn < 1.0
X.toarray()
Output:
This yields a word-count vector for every sample.
''' unigrams, bigrams and trigrams '''
vec = CountVectorizer(ngram_range=(1, 3))
X_ngram = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
X_ngram.toarray()
Output:
""" TF-IDF """
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
tfidf_X = tfidf_vec.fit_transform(corpus)
print(tfidf_vec.get_feature_names_out())
tfidf_X.toarray()
Output:
Visualization
""" A word cloud gives an intuitive picture of which words carry the most weight """
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is a very good class',
    'students are very very very good',
    'This is the third sentence',
    'Is this the last doc',
    'teacher Mei is very very handsome'
]
X = vectorizer.fit_transform(corpus)
L = []
for item in list(X.toarray()):
    L.append(list(item))
# column-wise sums: the total count of each word over the corpus
value = [0 for i in range(len(L[0]))]
for i in range(len(L[0])):
    for j in range(len(L)):
        value[i] += L[j][i]
from pyecharts import WordCloud # pyecharts 0.x API; pyecharts >= 1.0 moved it to pyecharts.charts
wordcloud = WordCloud(width=800, height=500)
wordcloud.add('', vectorizer.get_feature_names(), value, word_size_range=[20, 100])
wordcloud
Output:
(9) Combined features
""" build combined features from logical conditions """
df_train.loc[:, 'alone'] = (df_train['SibSp'] == 0) & (df_train['Parch'] == 0)
Feature derivation from time series
import pandas as pd
import numpy as np
data = pd.read_excel('/Users/zhucan/Desktop/金融风控实战/第三课资料/textdata.xlsx')
data.head()
""" ft and gt are two variable-name prefixes; the suffixes 1-12 index the value for each of the latest 12 months """
''' ft1 is the number of refuelings computed from the data within one month of the application date '''
''' gt1 is the refueling amount computed from the data within one month of the application date '''
Output:
""" 基于时间序列进行特征衍生 """
""" 最近p个月,inv>0的月份数 inv表示传入的变量名 """
def Num(data,inv,p):
df=data.loc[:,inv+'1':inv+str(p)]
auto_value=np.where(df>0,1,0).sum(axis=1)
return data,inv+'_num'+str(p),auto_value
data_new = data.copy()
for p in range(1,12):
for inv in ['ft','gt']:
data_new,columns_name,values=Num(data_new,inv,p)
data_new[columns_name]=values
结果:
''' build the time-series derived features: a class bundling the helper functions '''
import numpy as np
import pandas as pd

class time_series_feature(object):
    def __init__(self):
        pass

    def Num(self, data, inv, p):
        """Number of months with inv > 0 in the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.where(df > 0, 1, 0).sum(axis=1)
        return inv + '_num' + str(p), auto_value

    def Nmz(self, data, inv, p):
        """Number of months with inv == 0 in the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.where(df == 0, 1, 0).sum(axis=1)
        return inv + '_nmz' + str(p), auto_value

    def Evr(self, data, inv, p):
        """Flag: 1 if at least one of the most recent p months has inv > 0."""
        df = data.loc[:, inv + '1':inv + str(p)]
        arr = np.where(df > 0, 1, 0).sum(axis=1)
        auto_value = np.where(arr, 1, 0)
        return inv + '_evr' + str(p), auto_value
    def Avg(self, data, inv, p):
        """Mean of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmean(df, axis=1)
        return inv + '_avg' + str(p), auto_value

    def Tot(self, data, inv, p):
        """Sum of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nansum(df, axis=1)
        return inv + '_tot' + str(p), auto_value

    def Tot2T(self, data, inv, p):
        """Sum of inv over months 2 to p+1; contrasting it with the window including the current month exposes volatility."""
        df = data.loc[:, inv + '2':inv + str(p + 1)]
        auto_value = df.sum(axis=1)
        return inv + '_tot2t' + str(p), auto_value

    def Max(self, data, inv, p):
        """Maximum of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmax(df, axis=1)
        return inv + '_max' + str(p), auto_value

    def Min(self, data, inv, p):
        """Minimum of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmin(df, axis=1)
        return inv + '_min' + str(p), auto_value
    def Msg(self, data, inv, p):
        """Months elapsed since the most recent month with inv > 0 (0 if no such month in the window)."""
        df = data.loc[:, inv + '1':inv + str(p)]
        df_value = np.where(df > 0, 1, 0)
        auto_value = []
        for i in range(len(df_value)):
            row_value = df_value[i, :]
            if row_value.max() <= 0:
                auto_value.append(0)  # no month with inv > 0 in the window
            else:
                indexs = 1
                for j in row_value:
                    if j > 0:
                        break
                    indexs += 1
                auto_value.append(indexs)
        return inv + '_msg' + str(p), auto_value

    def Msz(self, data, inv, p):
        """Months elapsed since the most recent month with inv == 0 (0 if no such month in the window)."""
        df = data.loc[:, inv + '1':inv + str(p)]
        df_value = np.where(df == 0, 1, 0)
        auto_value = []
        for i in range(len(df_value)):
            row_value = df_value[i, :]
            if row_value.max() <= 0:
                auto_value.append(0)  # no month with inv == 0 in the window
            else:
                indexs = 1
                for j in row_value:
                    if j > 0:
                        break
                    indexs += 1
                auto_value.append(indexs)
        return inv + '_msz' + str(p), auto_value
    def Cav(self, data, inv, p):
        """Current-month inv divided by the mean of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] / np.nanmean(df, axis=1)
        return inv + '_cav' + str(p), auto_value

    def Cmn(self, data, inv, p):
        """Current-month inv divided by the minimum of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] / np.nanmin(df, axis=1)
        return inv + '_cmn' + str(p), auto_value

    def Mai(self, data, inv, p):
        """Largest month-over-month increase of inv within the most recent p months."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                # column 1 is the most recent month, so [k] - [k+1] is the increase into month k
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            auto_value.append(np.nanmax(value_lst))
        return inv + '_mai' + str(p), auto_value
    def Mad(self, data, inv, p):
        """Largest month-over-month decrease of inv within the most recent p months."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k + 1] - df_value[k]
                value_lst.append(minus)
            auto_value.append(np.nanmax(value_lst))
        return inv + '_mad' + str(p), auto_value

    def Std(self, data, inv, p):
        """Standard deviation of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanstd(df, axis=1)
        return inv + '_std' + str(p), auto_value

    def Cva(self, data, inv, p):
        """Coefficient of variation (std / mean) of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanstd(df, axis=1) / np.nanmean(df, axis=1)
        return inv + '_cva' + str(p), auto_value
    def Cmm(self, data, inv, p):
        """Current-month inv minus the mean of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] - np.nanmean(df, axis=1)
        return inv + '_cmm' + str(p), auto_value

    def Cnm(self, data, inv, p):
        """Current-month inv minus the minimum of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] - np.nanmin(df, axis=1)
        return inv + '_cnm' + str(p), auto_value

    def Cxm(self, data, inv, p):
        """Current-month inv minus the maximum of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] - np.nanmax(df, axis=1)
        return inv + '_cxm' + str(p), auto_value

    def Cxp(self, data, inv, p):
        """(Current-month inv minus the minimum over the most recent p months) divided by that minimum."""
        df = data.loc[:, inv + '1':inv + str(p)]
        temp = np.nanmin(df, axis=1)
        auto_value = (df[inv + '1'] - temp) / temp
        return inv + '_cxp' + str(p), auto_value

    def Ran(self, data, inv, p):
        """Range (max minus min) of inv over the most recent p months."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmax(df, axis=1) - np.nanmin(df, axis=1)
        return inv + '_ran' + str(p), auto_value
    def Nci(self, data, inv, p):
        """Within the most recent min(time on book, p) months, the number of month-over-month increases."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            value_ng = np.where(np.array(value_lst) > 0, 1, 0).sum()
            auto_value.append(value_ng)
        return inv + '_nci' + str(p), auto_value

    def Ncd(self, data, inv, p):
        """Within the most recent min(time on book, p) months, the number of month-over-month decreases."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            value_ng = np.where(np.array(value_lst) < 0, 1, 0).sum()
            auto_value.append(value_ng)
        return inv + '_ncd' + str(p), auto_value

    def Ncn(self, data, inv, p):
        """Within the most recent min(time on book, p) months, the number of adjacent months with equal inv."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            value_ng = np.where(np.array(value_lst) == 0, 1, 0).sum()
            auto_value.append(value_ng)
        return inv + '_ncn' + str(p), auto_value
    def Bup(self, data, inv, p):
        """Flag: 1 if inv is strictly increasing toward the present over the window (inv[i] > inv[i+1] for every i), else 0."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            index = 0
            for k in range(len(df_value) - 1):
                if df_value[k] <= df_value[k + 1]:  # monotonicity broken
                    break
                index += 1
            value = 1 if index == p - 1 else 0  # all p-1 comparisons passed
            auto_value.append(value)
        return inv + '_bup' + str(p), auto_value

    def Pdn(self, data, inv, p):
        """Flag: 1 if inv is strictly decreasing toward the present over the window (inv[i] < inv[i+1] for every i), else 0."""
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            index = 0
            for k in range(len(df_value) - 1):
                if df_value[k] >= df_value[k + 1]:  # monotonicity broken
                    break
                index += 1
            value = 1 if index == p - 1 else 0  # all p-1 comparisons passed
            auto_value.append(value)
        return inv + '_pdn' + str(p), auto_value
    def Trm(self, data, inv, p):
        """Trimmed mean of inv over the window: drop one max and one min, then average."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = []
        for i in range(len(df)):
            trm_mean = list(df.iloc[i, :])
            trm_mean.remove(np.nanmax(trm_mean))
            trm_mean.remove(np.nanmin(trm_mean))
            auto_value.append(np.nanmean(trm_mean))
        return inv + '_trm' + str(p), auto_value

    def Cmx(self, data, inv, p):
        """(Current-month inv minus the maximum over the most recent p months) divided by that maximum."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = (df[inv + '1'] - np.nanmax(df, axis=1)) / np.nanmax(df, axis=1)
        return inv + '_cmx' + str(p), auto_value

    def Cmp(self, data, inv, p):
        """(Current-month inv minus the mean over the most recent p months) divided by that mean."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = (df[inv + '1'] - np.nanmean(df, axis=1)) / np.nanmean(df, axis=1)
        return inv + '_cmp' + str(p), auto_value

    def Cnp(self, data, inv, p):
        """(Current-month inv minus the minimum over the most recent p months) divided by that minimum."""
        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = (df[inv + '1'] - np.nanmin(df, axis=1)) / np.nanmin(df, axis=1)
        return inv + '_cnp' + str(p), auto_value
    def Msx(self, data, inv, p):
        """Months elapsed since the month holding the maximum inv within the window."""
        df = data.loc[:, inv + '1':inv + str(p)].copy()
        df['_max'] = np.nanmax(df, axis=1)
        for i in range(1, p + 1):
            df[inv + str(i)] = list(df[inv + str(i)] == df['_max'])
        del df['_max']
        df_value = np.where(df == True, 1, 0)
        auto_value = []
        for i in range(len(df_value)):
            row_value = df_value[i, :]
            indexs = 1
            for j in row_value:
                if j == 1:
                    break
                indexs += 1
            auto_value.append(indexs)
        return inv + '_msx' + str(p), auto_value

    def Rpp(self, data, inv, p):
        """Mean of inv over the most recent p months divided by its mean over months (p, 2p)."""
        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmean(df1, axis=1)
        df2 = data.loc[:, inv + str(p):inv + str(2 * p)]
        value2 = np.nanmean(df2, axis=1)
        auto_value = value1 / value2
        return inv + '_rpp' + str(p), auto_value

    def Dpp(self, data, inv, p):
        """Mean of inv over the most recent p months minus its mean over months (p, 2p)."""
        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmean(df1, axis=1)
        df2 = data.loc[:, inv + str(p):inv + str(2 * p)]
        value2 = np.nanmean(df2, axis=1)
        auto_value = value1 - value2
        return inv + '_dpp' + str(p), auto_value

    def Mpp(self, data, inv, p):
        """Maximum of inv over the most recent p months divided by its maximum over months (p, 2p)."""
        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmax(df1, axis=1)
        df2 = data.loc[:, inv + str(p):inv + str(2 * p)]
        value2 = np.nanmax(df2, axis=1)
        auto_value = value1 / value2
        return inv + '_mpp' + str(p), auto_value

    def Npp(self, data, inv, p):
        """Minimum of inv over the most recent p months divided by its minimum over months (p, 2p)."""
        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmin(df1, axis=1)
        df2 = data.loc[:, inv + str(p):inv + str(2 * p)]
        value2 = np.nanmin(df2, axis=1)
        auto_value = value1 / value2
        return inv + '_npp' + str(p), auto_value
    def auto_var(self, data_new, inv, p):
        """Call every two-parameter derivation function for one (inv, p) pair and append the resulting columns."""
        funcs = [self.Num, self.Nmz, self.Evr, self.Avg, self.Tot, self.Tot2T,
                 self.Max, self.Min, self.Msg, self.Msz, self.Cav, self.Cmn,
                 self.Std, self.Cva, self.Cmm, self.Cnm, self.Cxm, self.Cxp,
                 self.Ran, self.Nci, self.Ncd, self.Ncn, self.Pdn, self.Cmx,
                 self.Cmp, self.Cnp, self.Msx, self.Trm, self.Bup, self.Mai,
                 self.Mad, self.Rpp, self.Dpp, self.Mpp, self.Npp]
        for func in funcs:
            try:
                columns_name, values = func(data_new, inv, p)
                data_new[columns_name] = values
            except Exception:
                # some derivations need more history than p months (e.g. Rpp needs 2p)
                # and are skipped silently when the columns are missing
                pass
        return data_new
auto_var2 = time_series_feature()
for p in range(1, 12):
    for inv in ['ft', 'gt']:
        data = auto_var2.auto_var(data, inv, p)
data
Output: