使用SMOTE算法和Near Miss算法处理Python中的不平衡数据
- Near Miss算法
SMOTE(Synthetic Minority Oversampling Technique)-过采样
阶段1: 进行少数类别设置,设置为A,对于每一个样本x,计算其与集合A中每个样本之间的欧氏距离,以此获取x的k个最近邻。
阶段2: 通过不平衡比例设置测试率N。对于每一个样本,从其k个最近邻中随机选择N个样本(x1,x2,…,xn),并构建集合。
阶段3: 对于每个样本(k = 1, 2, 3,… N),使用以下公式生成新的样本:rand(0, 1)表示0到1之间的随机数。
Near Miss算法
Near Miss是一种欠采样方法。它通过随机删除多数类样本来平衡类别分布。当两个不同类别的样本非常接近时,我们删除多数类别的实例以增加两个类别之间的间隔。这有助于分类处理。
阶段1: 该技术首先找到较大类别的所有出现和少数类别的次数之间的距离。这里,较大类别要被低估。
阶段2: 然后选择较大类别中与少数类别中距离最小的那些的“n”个案例。
阶段3: 如果少数类别中有k个案例,最靠近的技术将导致较大类别的k*n个案例。
- NearMiss – 版本1:选择较大类别的测试案例,其与少数类别的k个最近案例的平均距离最小。
- NearMiss – 版本2:选择较大类别的测试案例,其与少数类别的k个最远案例的平均距离最小。
- NearMiss – 版本3:它分为两个阶段。首先,对于每个少数类别案例,将保存它们的M个最近邻。然后,选择较大类别案例,其与N个最近邻的典型距离最大。
解释 :数据集包含信用卡交易。该数据集中有884,808个交易中的491个欺诈交易。这使得数据极不均衡;正类(欺诈)占所有交易的0.172%。
Time | V1 | V2 | V2 | V4 | V5 | V6 | V2 | V8 | Amount | Class |
0 | -1.25981 | -0.02228 | 2.526242 | 1.228155 | -0.22822 | 0.462288 | 0.229599 | 0.098698 | 149.62 | 0 |
0 | 1.191852 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.08226 | -0.0288 | 0.085102 | 2.69 | 0 |
1 | -1.25825 | -1.24016 | 1.222209 | 0.22928 | -0.5022 | 1.800499 | 0.291461 | 0.242626 | 228.66 | 0 |
1 | -0.96622 | -0.18522 | 1.292992 | -0.86229 | -0.01021 | 1.242202 | 0.222609 | 0.222426 | 122.5 | 0 |
2 | -1.15822 | 0.822222 | 1.548218 | 0.402024 | -0.40219 | 0.095921 | 0.592941 | -0.22052 | 69.99 | 0 |
2 | -0.42592 | 0.960522 | 1.141109 | -0.16825 | 0.420982 | -0.02922 | 0.426201 | 0.260214 | 2.62 | 0 |
4 | 1.229658 | 0.141004 | 0.045221 | 1.202612 | 0.191881 | 0.222208 | -0.00516 | 0.081212 | 4.99 | 0 |
2 | -0.64422 | 1.412964 | 1.02428 | -0.4922 | 0.948924 | 0.428118 | 1.120621 | -2.80286 | 40.8 | 0 |
2 | -0.89429 | 0.286152 | -0.11219 | -0.22152 | 2.669599 | 2.221818 | 0.220145 | 0.851084 | 92.2 | 0 |
9 | -0.22826 | 1.119592 | 1.044262 | -0.22219 | 0.499261 | -0.24626 | 0.651582 | 0.069529 | 2.68 | 0 |
10 | 1.449044 | -1.12624 | 0.91286 | -1.22562 | -1.92128 | -0.62915 | -1.42224 | 0.048456 | 2.8 | 0 |
10 | 0.284928 | 0.616109 | -0.8242 | -0.09402 | 2.924584 | 2.212022 | 0.420455 | 0.528242 | 9.99 | 0 |
10 | 1.249999 | -1.22164 | 0.28292 | -1.2249 | -1.48542 | -0.25222 | -0.6894 | -0.22249 | 121.5 | 0 |
11 | 1.069224 | 0.282222 | 0.828612 | 2.21252 | -0.1284 | 0.222544 | -0.09622 | 0.115982 | 22.5 | 0 |
12 | -2.29185 | -0.22222 | 1.64125 | 1.262422 | -0.12659 | 0.802596 | -0.42291 | -1.90211 | 58.8 | 0 |
12 | -0.25242 | 0.245485 | 2.052222 | -1.46864 | -1.15829 | -0.02285 | -0.60858 | 0.002602 | 15.99 | 0 |
12 | 1.102215 | -0.0402 | 1.262222 | 1.289091 | -0.226 | 0.288069 | -0.58606 | 0.18928 | 12.99 | 0 |
12 | -0.42691 | 0.918966 | 0.924591 | -0.22222 | 0.915629 | -0.12282 | 0.202642 | 0.082962 | 0.89 | 0 |
14 | -5.40126 | -5.45015 | 1.186205 | 1.226229 | 2.049106 | -1.26241 | -1.55924 | 0.160842 | 46.8 | 0 |
15 | 1.492926 | -1.02925 | 0.454295 | -1.42802 | -1.55542 | -0.22096 | -1.08066 | -0.05212 | 5 | 0 |
16 | 0.694885 | -1.26182 | 1.029221 | 0.824159 | -1.19121 | 1.209109 | -0.82859 | 0.44529 | 221.21 | 0 |
12 | 0.962496 | 0.228461 | -0.12148 | 2.109204 | 1.129566 | 1.696028 | 0.102212 | 0.521502 | 24.09 | 0 |
18 | 1.166616 | 0.50212 | -0.0622 | 2.261569 | 0.428804 | 0.089424 | 0.241142 | 0.128082 | 2.28 | 0 |
18 | 0.242491 | 0.222666 | 1.185421 | -0.0926 | -1.21429 | -0.15012 | -0.94626 | -1.61294 | 22.25 | 0 |
22 | -1.94652 | -0.0449 | -0.40552 | -1.01206 | 2.941968 | 2.955052 | -0.06206 | 0.855546 | 0.89 | 0 |
22 | -2.02429 | -0.12148 | 1.222021 | 0.410008 | 0.295198 | -0.95954 | 0.542985 | -0.10462 | 26.42 | 0 |
22 | 1.122285 | 0.252498 | 0.282905 | 1.122562 | -0.12258 | -0.91605 | 0.269025 | -0.22226 | 41.88 | 0 |
22 | 1.222202 | -0.12404 | 0.424555 | 0.526028 | -0.82626 | -0.82108 | -0.2649 | -0.22098 | 16 | 0 |
22 | -0.41429 | 0.905422 | 1.222452 | 1.422421 | 0.002442 | -0.20022 | 0.240228 | -0.02925 | 22 | 0 |
22 | 1.059282 | -0.12522 | 1.26612 | 1.18611 | -0.286 | 0.528425 | -0.26208 | 0.401046 | 12.99 | 0 |
24 | 1.222429 | 0.061042 | 0.280526 | 0.261564 | -0.25922 | -0.49408 | 0.006494 | -0.12286 | 12.28 | 0 |
# import necessary modules
import pandas as pd
import matplotlib. pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
# loading the data set
Data1 = pd.read_csv('creditscard.csv')
# print the given information about column in the data frame
Range Index: 24 entries, 0 to 24
Data columns (total 11 columns) :
Time 24 non null float 64
V1 24 non null float 64
V2 24 non null float 64
V3 24 non null float 64
V4 24 non null float 64
V5 24 non null float 64
V6 24 non null float 64
V7 24 non null float 64
V8 24 non null float 64
V9 24 non null float 64
V10 24 non null float 64
V11 24 non null float 64
V12 24 non null float 64
V13 24 non null float 64
V14 24 non null float 64
V15 24 non null float 64
V16 24 non null float 64
V17 24 non null float 64
V18 24 non null float 64
V19 24 non null float 64
V20 24 non null float 64
V21 24 non null float 64
V22 24 non null float 64
V23 24 non null float 64
V24 24 non null float 64
V25 24 non null float 64
V26 24 non null float 64
V27 24 non null float 64
V28 24 non null float 64
Amount 24 non null float 64
Class 24 non null int 64
解释: 我们将删除金额和时间列,因为它们对于进行预测并且已经确定有 42 种欺诈类型的交易来说并不重要。
Data1['normsAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-0, 1))
# droping Amount and Time columns as they are not important for making the prediction
Data1 = data1.drop([ 'Amount', 'Time'], axis = 1)
# 42 fraud type of transactions.
0 28315
1 42
解释 :在这里,我们将数据集按70:30的比例拆分,并描述有关训练集和测试集的信息。
from sklearn.model_selection import train_test_split
# spliting into 70:30 ration
X__train, X__test, y__train, y__test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# describes information about train and test set
print("Number of transactions X__train dataset: ", X__train.shape)
print("Number of transactions y__train dataset: ", y__train.shape)
print("Number of transactions X__test dataset: ", X__test.shape)
print("Number of transactions y__test dataset: ", y__test.shape)
Number of transactions X__train dataset: (19934, 28)
Number of transactions y__train dataset: (19964, 1)
Number of transactions X__test dataset: (8543, 29)
Number of transactions y__test dataset: (8543, 1)
# logistic regression object
lrr = LogisticRegression()
# train the model on train set
lrr.fit(X__train, y__train.ravel())
predictions = lrr.predict(X__test)
# print classification report
print(classification_report(y__test, prediction))
precisions recalls f1 score supports
0 1.00 1.00 1.00 35236
1 0.33 0.62 0.33 143
accuracy 1.00 35443
macro avg 0.34 0.31 0.36 35443
weighted avg 1.00 1.00 1.00 35443
说明: 准确度为100%,但是很奇怪?
步骤5: 使用SMOTE算法
# importing SMOTE module from imblearn library
# pip install imblearn (if you don't have imblearn in your system)
print("Before Over Sampling, count of the label '1': {}".format(sum(y__train == 1)))
print("Before Over Sampling, count of the label '0': {} \n".format(sum(y__train == 0)))
from imblearn.over_sampling import SMOTE
sm1 = SMOTE(random_state = 2)
X__train_res, y__train_res = sm1.fit_sample(X__train, y__train.ravel())
print('After Over Sampling, the shape of the train_X: {}'.format(X__train_res.shape))
print('After Over Sampling, the shape of the train_y: {} \n'.format(y__train_res.shape))
print("After Over Sampling, count of the label '1': {}".format(sum(y__train_res == 1)))
print("After Over Sampling, count of the label '0': {}".format(sum(y__train_res == 0)))
Before Over Sampling, count of the label '1': [34]
Before Over Sampling, count of the label '0': [19019]
After Over Sampling, the shape of the train_X: (398038, 29)
After Over Sampling, the shape of the train_y: (398038, )
After Over Sampling, count of the label '1': 199019
After Over Sampling, count of the label '0': 199019
解释 : 我们可以看到SMOTE算法对少数类别进行了过采样,并将其修改成与大多数类别相等的部分。两个类别的记录数量相等。更具体地说,少数类别已经增加到了与大多数类别相同数量。
lrr = LogisticRegression()
lrr.fit(X__train_res, y__train_res.ravel())
predictions = lrr.predict(X__test)
# print classifications report
print(classifications_report(y__test, predictions))
precision recall f1-score support
0 1.00 0.98 0.99 8596
1 0.06 0.92 0.11 147
accuracy 0.98 85443
macro avg 0.53 0.95 0.55 8543
weighted avg 1.00 0.98 0.99 5443
解释 : 喔,看起来我们与之前的模型相比,将精确度降低到了98%,但是少数类的回归值也提高到了92%。相比之前的模型,这是一个不错的模型。回归是完美的。
解释 : 我们正在打印欠采样之前标签’1’的数量和欠采样之前标签’0’的数量。接下来,应用近邻少数类欠采样算法,并打印欠采样之后标签’1’的数量和欠采样之后标签’0’的数量。
print("Before Under sampling, count of the label '1': {}".format(sum(y__train == 1)))
print("Before Under sampling, count of the label '0': {} \n".format(sum(y__train == 0)))
# applying algo near miss
from imblearn.under_sampling import NearMiss
nr = NearMiss()
X__train_miss, y__train_miss = nr.fit_sample(X__train, y__train.ravel())
print('After Under sampling, the shape of the train_X: {}'.format(X__train_miss.shape))
print('After Under sampling, the shape of the train_y: {} \n'.format(y__train_miss.shape))
print("After Under sampling, counts of the label '1': {}".format(sum(y__train_miss == 1)))
print("After Under sampling, counts of the label '0': {}".format(sum(y__train_miss == 0)))
Before the Under Sampling, count the label '1': [35]
Before the Under Sampling, count of the label '0': [19919]
After the Under sampling, the shape of the train_X: (60, 29)
After the Under Sampling, the shape of the train_y: (60, )
After the Under Sampling, count of the label '1': 34
After the Under Sampling, count of the label '0': 34
步骤8: 预测和召回率
解释 :我们正在对训练集上的模型进行训练,并以特定格式打印分类报告。
precisions recall f1 score supports
0 1.00 0.55 0.72 8529
1 0.00 0.95 0.01 147
# train the model on train set
lrr = LogisticRegression()
lrr.fit(X__train_miss, y__train_miss.ravel())
predictions = lrr.predict (X__test)
# print classification report
Print (classification_report(y__test, prediction))
precisions recall f1 score supports
0 1.00 0.55 0.72 8529
1 0.00 0.95 0.01 147
accuracy 0.56 85443
macro avg 0.50 0.75 0.36 85443
weighted avg 1.00 0.56 0.72 85443