[A-Xu Machine Learning in Practice] [24] Credit Card Customer Churn Prediction

Published: 2022-12-06 08:45:17

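The [A-Xu Machine Learning in Practice] series covers machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn together.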

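This article works with a real overseas credit card dataset that has been anonymized, and builds a model to determine whether a customer has churned. It covers feature processing, then building and evaluating a classification model.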

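Contents

  • Problem description
  • 1. Loading the data and separating features from labels
  • 2. Feature engineering
  • 2.1 Removing uninformative features
  • 2.2 Encoding string features
  • 2.3 Standardizing the feature data
  • 3. Modeling, prediction, and evaluation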

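Problem description

Given a real overseas dataset that has been anonymized, build a model that determines whether a customer has churned.

1. Loading the data and separating features from labels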

import pandas as pd
import numpy as np
# Load the training and test sets
train_data = pd.read_csv('./Churn-Modelling.csv')
test_data = pd.read_csv('./Churn-Modelling-Test-Data.csv')
# The last column (Exited) is the label; every other column is a feature
x_train = train_data.iloc[:,:-1]
y_train = train_data.iloc[:,-1].astype(int)
x_test = test_data.iloc[:,:-1]
y_test = test_data.iloc[:,-1].astype(int)
x_train.head()

   RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
0          1    15634602  Hargrave          619     France  Female   42       2       0.00              1          1               1        101348.88
1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58
2          3    15619304      Onio          502     France  Female   42       8  159660.80              3          1               0        113931.57
3          4    15701354      Boni          699     France  Female   39       1       0.00              2          0               0         93826.63
4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1         79084.10

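Column descriptions:
RowNumber: row number
CustomerId: customer ID
Surname: customer surname
CreditScore: credit score
Geography: the customer's country/region
Gender: the customer's gender
Age: age
Tenure: number of years the customer has been with the bank
Balance: account balance (deposits and loans)
NumOfProducts: number of bank products used
HasCrCard: whether the customer holds a credit card from this bank
IsActiveMember: whether the customer is an active member
EstimatedSalary: estimated salary
Exited: whether the customer has churned; this column serves as the label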
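
Since Exited is the label, it can help to check how its two classes are distributed before modeling. The following is a minimal sketch on a hypothetical toy frame (`toy` is not part of the original code); its CreditScore values come from the first five rows shown above and its Exited values from the first five labels printed in section 2.1.

```python
import pandas as pd

# Hypothetical toy frame; values copied from the first five rows in this article
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850],
    "Exited": [1, 0, 1, 0, 0],
})

# Separate features and label by column name instead of by position
features = toy.drop(columns=["Exited"])
label = toy["Exited"].astype(int)

# Share of each class in the label (0 = retained, 1 = churned)
class_share = label.value_counts(normalize=True).sort_index()
print(class_share)
```

On the full dataset the same `value_counts(normalize=True)` call would show how imbalanced the label is, which matters when reading the accuracy figures later on.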

2. Feature engineering

2.1 Removing uninformative features

# Drop the first three columns (RowNumber, CustomerId, Surname), which carry no predictive information
x_train = x_train.drop(labels=x_train.columns[[0,1,2]], axis=1)
x_test = x_test.drop(labels=x_test.columns[[0,1,2]], axis=1)
x_train.head()

   CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
0          619     France  Female   42       2       0.00              1          1               1        101348.88
1          608      Spain  Female   41       1   83807.86              1          0               1        112542.58
2          502     France  Female   42       8  159660.80              3          1               0        113931.57
3          699     France  Female   39       1       0.00              2          0               0         93826.63
4          850      Spain  Female   43       2  125510.82              1          1               1         79084.10

y_train[:5]
0    1
1    0
2    1
3    0
4    0
Name: Exited, dtype: int32
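
Dropping by position works as long as the column order never changes. A slightly more defensive variant, sketched here on a hypothetical two-row frame (`toy` and `id_columns` are illustrative names, not from the original code), drops the identifier columns by name.

```python
import pandas as pd

# Hypothetical toy frame with the three identifier columns plus two real features
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [15634602, 15647311],
    "Surname": ["Hargrave", "Hill"],
    "CreditScore": [619, 608],
    "Age": [42, 41],
})

id_columns = ["RowNumber", "CustomerId", "Surname"]

# drop() returns a new frame; by default it raises KeyError if a listed column is missing
cleaned = toy.drop(columns=id_columns)
print(cleaned.columns.tolist())
```

Naming the columns makes the intent explicit and fails loudly if the input file ever arrives with a different layout.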

2.2 Encoding string features

# Geography and Gender are non-numeric columns; LabelEncoder converts them to integer codes
from sklearn.preprocessing import LabelEncoder
# Fit each encoder on the training set only, then reuse it on the test set
Lb1 = LabelEncoder()
x_train['Geography'] = Lb1.fit_transform(x_train['Geography'])
x_test['Geography'] = Lb1.transform(x_test['Geography'])
Lb2 = LabelEncoder()
x_train['Gender'] = Lb2.fit_transform(x_train['Gender'])
x_test['Gender'] = Lb2.transform(x_test['Gender'])
x_train[:5]

   CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
0          619          0       0   42       2       0.00              1          1               1        101348.88
1          608          2       0   41       1   83807.86              1          0               1        112542.58
2          502          0       0   42       8  159660.80              3          1               0        113931.57
3          699          0       0   39       1       0.00              2          0               0         93826.63
4          850          2       0   43       2  125510.82              1          1               1         79084.10

x_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null int64
Gender             10000 non-null int64
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
dtypes: float64(2), int64(8)
memory usage: 781.3 KB
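
LabelEncoder assigns integers in sorted order (in the encoded output above, France became 0 and Spain became 2), and a linear model will treat those codes as an ordered quantity. One alternative, which the original article does not use, is one-hot encoding. The sketch below assumes hypothetical toy frames (`train_geo`, `test_geo`), and "Germany" is an assumed third category that does not appear in the rows shown in this article.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "Germany" is an assumed third category
train_geo = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})
test_geo = pd.DataFrame({"Geography": ["Spain", "France"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit on the training data only, then reuse the fitted encoder on the test data
train_onehot = encoder.fit_transform(train_geo).toarray()
test_onehot = encoder.transform(test_geo).toarray()

print(list(encoder.categories_[0]))  # column order of the one-hot matrix
print(test_onehot)
```

Each category gets its own 0/1 column, so no artificial ordering is introduced. For a binary column such as Gender, a single 0/1 code is already equivalent.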

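2.3 Standardizing the feature data

from sklearn.preprocessing import StandardScaler
# StandardScaler rescales each feature to zero mean and unit variance.
# Fit on the training set only, then apply the same statistics to the test set.
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
x_train[:5]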
array([[-0.32622142, -0.90188624, -1.09598752,  0.29351742, -1.04175968,
        -1.22584767, -0.91158349,  0.64609167,  0.97024255,  0.02188649],
       [-0.44003595,  1.51506738, -1.09598752,  0.19816383, -1.38753759,
         0.11735002, -0.91158349, -1.54776799,  0.97024255,  0.21653375],
       [-1.53679418, -0.90188624, -1.09598752,  0.29351742,  1.03290776,
         1.33305335,  2.52705662,  0.64609167, -1.03067011,  0.2406869 ],
       [ 0.50152063, -0.90188624, -1.09598752,  0.00745665, -1.38753759,
        -1.22584767,  0.80773656, -1.54776799, -1.03067011, -0.10891792],
       [ 2.06388377,  1.51506738, -1.09598752,  0.38887101, -1.04175968,
         0.7857279 , -0.91158349,  0.64609167,  0.97024255, -0.36527578]])
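
To make the transformation concrete: for each column, StandardScaler computes z = (x - mean) / std, using the population standard deviation. The sketch below checks this by hand on a toy column built from the five CreditScore values shown earlier (`scores` is a hypothetical name). Because the statistics come from only five rows, the resulting numbers differ from the output above, which was fitted on all 10,000 training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column: the five CreditScore values shown earlier in this article
scores = np.array([[619.0], [608.0], [502.0], [699.0], [850.0]])

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
manual = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# StandardScaler applies the same formula column by column
scaled = StandardScaler().fit_transform(scores)

print(np.allclose(manual, scaled))
```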

3. Modeling, prediction, and evaluation

# Build a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
lr_y_predict = lr.predict(x_test)
# Use the model's built-in score method to get accuracy on the test and training sets
print('LogisticRegression test accuracy:', lr.score(x_test, y_test))
print('LogisticRegression training accuracy:', lr.score(x_train, y_train))
LogisticRegression test accuracy: 0.761
LogisticRegression training accuracy: 0.809
from sklearn.metrics import classification_report
# Use classification_report to get precision, recall, and F1 for LogisticRegression.
# target_names follows the sorted label order: 0 = UnExited (retained), 1 = Exited (churned)
print(classification_report(y_test, lr_y_predict, target_names=['UnExited', 'Exited']))
             precision    recall  f1-score   support

   UnExited       0.77      0.97      0.86       740
     Exited       0.68      0.15      0.25       260

avg / total       0.74      0.76      0.70      1000
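
One way to put the accuracy in context is to compare it with a trivial baseline that always predicts the majority class. The small calculation below uses only the support counts and the test accuracy reported in this article; the variable names are illustrative.

```python
# Class supports read from the classification report in this article
support_majority = 740
support_minority = 260
total = support_majority + support_minority

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = support_majority / total

# Test accuracy reported in this article
model_accuracy = 0.761

print(f"baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}, "
      f"gain: {model_accuracy - baseline_accuracy:.3f}")
```

The baseline already reaches 0.740, so the logistic regression improves on it by only about two percentage points.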

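The results show that the model reaches only about 76% accuracy on the test set, and recall for the minority class (260 of the 1,000 test samples) is just 0.15, so there is still room for improvement.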
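
One possible direction for improvement, not explored in the original article, is to reweight the classes so that the minority class counts for more during training. The sketch below uses synthetic data from `make_classification` with an assumed 80/20 class split rather than the article's dataset, so the numbers it prints say nothing about the churn data itself. It only shows how `class_weight="balanced"` is passed to LogisticRegression and how minority-class recall can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: roughly 80% class 0, 20% class 1)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Same model twice: default class weights vs. weights inversely proportional to class frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority class (label 1)
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"recall, default weights:  {recall_plain:.2f}")
print(f"recall, balanced weights: {recall_balanced:.2f}")
```

Balanced weights typically raise minority-class recall at the cost of some precision, so whether the trade is worthwhile depends on how costly a missed churner is compared with a false alarm.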

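If you found this helpful, likes and follows are much appreciated!

You are welcome to follow my official account, 阿旭算法与机器学习 (A-Xu Algorithms and Machine Learning), to learn and exchange ideas together.
More practical content is on the way…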

 
