使用SMOTE算法和Near Miss算法处理Python中的不平衡数据|极客笔记

使用SMOTE算法和Near Miss算法处理Python中的不平衡数据

在数据科学和机器学习中，我们经常遇到一个名为不平衡数据分布的术语，通常情况下，当某一类别的观测值远高于或远低于其他类别时会出现这种情况。机器学习算法通常通过减少错误来增加准确性，因此它们不考虑类别分布。这个问题在欺诈检测、异常检测、人脸识别等模型中非常普遍。

标准机器学习技术，如决策树和逻辑回归，倾向于多数类别，并且经常忽略少数类别。它们倾向于预测多数类别，因此与多数类别相比，对少数类别具有显著的错误分类。更具体地说，如果数据集中存在不平衡的数据分布，那么我们的模型更容易出现少数类别评级不相关或极低的情况。

不平衡数据处理技术：主要有两种广泛使用的算法用于处理不平衡的类别分布。

SMOTE算法
Near Miss算法

SMOTE（Synthetic Minority Oversampling Technique）-过采样

SMOTE（合成少数类过采样技术）是解决不平衡问题最常用的过采样方法之一。

它通过随机复制少数类别模型来平衡类别分布。

SMOTE在现有的少数类样本之间生成新的少数类样本。它通过为少数类中的每个样本选择k个最近邻之一来生成虚拟训练记录。在过采样过程中，重塑数据，并可以对处理后的数据应用一些分类模型。

SMOTE算法工作过程

阶段1： 进行少数类别设置，设置为A，对于每一个样本x，计算其与集合A中每个样本之间的欧氏距离，以此获取x的k个最近邻。

阶段2： 通过不平衡比例设置测试率N。对于每一个样本，从其k个最近邻中随机选择N个样本（x1，x2，…，xn），并构建集合。

阶段3： 对于每个样本（k = 1, 2, 3，… N），使用以下公式生成新的样本：rand(0, 1)表示0到1之间的随机数。

Near Miss算法

Near Miss是一种欠采样方法。它通过随机删除多数类样本来平衡类别分布。当两个不同类别的样本非常接近时，我们删除多数类别的实例以增加两个类别之间的间隔。这有助于分类处理。

在大多数欠采样方法中，近邻方法广泛使用以防止数据损失的问题。

关于近邻方法工作原理的基本直觉如下：

阶段1： 该技术首先找到较大类别的所有出现和少数类别的次数之间的距离。这里，较大类别要被低估。

阶段2： 然后选择较大类别中与少数类别中距离最小的那些的“n”个案例。

阶段3： 如果少数类别中有k个案例，最靠近的技术将导致较大类别的k*n个案例。

为了找到较大类别中n个最近的案例，有多种方式可以应用NearMiss算法：

NearMiss – 版本1：选择较大类别的测试案例，其与少数类别的k个最近案例的平均距离最小。
NearMiss – 版本2：选择较大类别的测试案例，其与少数类别的k个最远案例的平均距离最小。
NearMiss – 版本3：它分为两个阶段。首先，对于每个少数类别案例，将保存它们的M个最近邻。然后，选择较大类别案例，其与N个最近邻的典型距离最大。

步骤1：加载数据文件和库

解释：数据集包含信用卡交易。该数据集中有884,808个交易中的491个欺诈交易。这使得数据极不均衡；正类（欺诈）占所有交易的0.172%。

Time	V1	V2	V2	V4	V5	V6	V2	V8	Amount
0	-1.25981	-0.02228	2.526242	1.228155	-0.22822	0.462288	0.229599	0.098698	149.62
0	1.191852	0.266151	0.16648	0.448154	0.060018	-0.08226	-0.0288	0.085102	2.69
1	-1.25825	-1.24016	1.222209	0.22928	-0.5022	1.800499	0.291461	0.242626	228.66
1	-0.96622	-0.18522	1.292992	-0.86229	-0.01021	1.242202	0.222609	0.222426	122.5
2	-1.15822	0.822222	1.548218	0.402024	-0.40219	0.095921	0.592941	-0.22052	69.99
2	-0.42592	0.960522	1.141109	-0.16825	0.420982	-0.02922	0.426201	0.260214	2.62
4	1.229658	0.141004	0.045221	1.202612	0.191881	0.222208	-0.00516	0.081212	4.99
2	-0.64422	1.412964	1.02428	-0.4922	0.948924	0.428118	1.120621	-2.80286	40.8
2	-0.89429	0.286152	-0.11219	-0.22152	2.669599	2.221818	0.220145	0.851084	92.2
9	-0.22826	1.119592	1.044262	-0.22219	0.499261	-0.24626	0.651582	0.069529	2.68
10	1.449044	-1.12624	0.91286	-1.22562	-1.92128	-0.62915	-1.42224	0.048456	2.8
10	0.284928	0.616109	-0.8242	-0.09402	2.924584	2.212022	0.420455	0.528242	9.99
10	1.249999	-1.22164	0.28292	-1.2249	-1.48542	-0.25222	-0.6894	-0.22249	121.5
11	1.069224	0.282222	0.828612	2.21252	-0.1284	0.222544	-0.09622	0.115982	22.5
12	-2.29185	-0.22222	1.64125	1.262422	-0.12659	0.802596	-0.42291	-1.90211	58.8
12	-0.25242	0.245485	2.052222	-1.46864	-1.15829	-0.02285	-0.60858	0.002602	15.99
12	1.102215	-0.0402	1.262222	1.289091	-0.226	0.288069	-0.58606	0.18928	12.99
12	-0.42691	0.918966	0.924591	-0.22222	0.915629	-0.12282	0.202642	0.082962	0.89
14	-5.40126	-5.45015	1.186205	1.226229	2.049106	-1.26241	-1.55924	0.160842	46.8
15	1.492926	-1.02925	0.454295	-1.42802	-1.55542	-0.22096	-1.08066	-0.05212	5
16	0.694885	-1.26182	1.029221	0.824159	-1.19121	1.209109	-0.82859	0.44529	221.21
12	0.962496	0.228461	-0.12148	2.109204	1.129566	1.696028	0.102212	0.521502	24.09
18	1.166616	0.50212	-0.0622	2.261569	0.428804	0.089424	0.241142	0.128082	2.28
18	0.242491	0.222666	1.185421	-0.0926	-1.21429	-0.15012	-0.94626	-1.61294	22.25
22	-1.94652	-0.0449	-0.40552	-1.01206	2.941968	2.955052	-0.06206	0.855546	0.89
22	-2.02429	-0.12148	1.222021	0.410008	0.295198	-0.95954	0.542985	-0.10462	26.42
22	1.122285	0.252498	0.282905	1.122562	-0.12258	-0.91605	0.269025	-0.22226	41.88
22	1.222202	-0.12404	0.424555	0.526028	-0.82626	-0.82108	-0.2649	-0.22098	16
22	-0.41429	0.905422	1.222452	1.422421	0.002442	-0.20022	0.240228	-0.02925	22
22	1.059282	-0.12522	1.26612	1.18611	-0.286	0.528425	-0.26208	0.401046	12.99
24	1.222429	0.061042	0.280526	0.261564	-0.25922	-0.49408	0.006494	-0.12286	12.28

源代码：

# import necessary modules
import pandas as pd
import matplotlib. pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
# loading the data set
Data1 = pd.read_csv('creditscard.csv')
# print the given information about column in the data frame
print(data1.info())

输出：

Range Index: 24 entries, 0 to 24
Data columns (total 11 columns) :
Time      24 non null float 64
V1        24 non null float 64
V2        24 non null float 64
V3        24 non null float 64
V4        24 non null float 64
V5        24 non null float 64
V6        24 non null float 64
V7        24 non null float 64
V8        24 non null float 64
V9        24 non null float 64
V10       24 non null float 64
V11       24 non null float 64
V12       24 non null float 64
V13       24 non null float 64
V14       24 non null float 64
V15       24 non null float 64
V16       24 non null float 64
V17       24 non null float 64
V18       24 non null float 64
V19       24 non null float 64
V20       24 non null float 64
V21       24 non null float 64
V22       24 non null float 64
V23       24 non null float 64
V24       24 non null float 64
V25       24 non null float 64
V26       24 non null float 64
V27       24 non null float 64
V28       24 non null float 64
Amount    24 non null float 64
Class     24 non null int 64

步骤2：对列进行归一化

解释： 我们将删除金额和时间列，因为它们对于进行预测并且已经确定有 42 种欺诈类型的交易来说并不重要。

源代码：

Data1['normsAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-0, 1))

# droping Amount  and Time columns as they are not important for making the prediction 
Data1 = data1.drop([ 'Amount', 'Time'], axis = 1)

# 42 fraud type of transactions.
Data1['Class'].value_counts()

输出：

       0    28315
       1       42

步骤3：将数据拆分为测试集和训练集

解释：在这里，我们将数据集按70:30的比例拆分，并描述有关训练集和测试集的信息。

输出为X__train数据集、y__train数据集、X__test数据集、y__test数据集的交易次数。

源代码：

from sklearn.model_selection import train_test_split
# spliting into 70:30 ration
X__train, X__test, y__train, y__test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# describes information about train and test set
print("Number of transactions X__train dataset: ", X__train.shape)
print("Number of transactions y__train dataset: ", y__train.shape)
print("Number of transactions X__test dataset: ", X__test.shape)
print("Number of transactions y__test dataset: ", y__test.shape)

输出：

      Number of transactions X__train dataset:  (19934, 28)
      Number of transactions y__train dataset:  (19964, 1)
      Number of transactions X__test dataset:  (8543, 29)
      Number of transactions y__test dataset:  (8543, 1)

步骤4：现在无需处理不平衡的类分布就训练模型

源代码：

# logistic regression object
lrr = LogisticRegression()
# train the model on train set
lrr.fit(X__train, y__train.ravel())
predictions = lrr.predict(X__test)
# print classification report
print(classification_report(y__test, prediction))

输出:

                precisions   recalls   f1 score  supports
           0       1.00      1.00      1.00     35236
           1       0.33      0.62      0.33       143
    accuracy                           1.00     35443
   macro avg       0.34      0.31      0.36     35443
weighted avg       1.00      1.00      1.00     35443

说明: 准确度为100%，但是很奇怪？

对于少数类别的评论极少。这表明模型更加偏向于多数类别。因此，这不是一个理想的模型。

现在，我们将应用不同的不平衡数据处理方法并查看它们的准确度和回顾结果。

步骤5: 使用SMOTE算法

源代码:

# importing SMOTE module from imblearn library
# pip install imblearn (if you don't have imblearn in your system)
print("Before Over Sampling, count of the label '1': {}".format(sum(y__train == 1)))
print("Before Over Sampling, count of the label '0': {} \n".format(sum(y__train == 0)))
from imblearn.over_sampling import SMOTE
sm1 = SMOTE(random_state = 2)
X__train_res, y__train_res = sm1.fit_sample(X__train, y__train.ravel())
print('After Over Sampling, the shape of the train_X: {}'.format(X__train_res.shape))
print('After Over Sampling, the shape of the train_y: {} \n'.format(y__train_res.shape))
print("After Over Sampling, count of the label '1': {}".format(sum(y__train_res == 1)))
print("After Over Sampling, count of the label '0': {}".format(sum(y__train_res == 0)))

输出：

Before Over Sampling, count of the label '1': [34]
Before Over Sampling, count of the label '0': [19019] 
After Over Sampling, the shape of the train_X: (398038, 29)
After Over Sampling, the shape of the train_y: (398038, ) 
After Over Sampling, count of the label '1': 199019
After Over Sampling, count of the label '0': 199019

解释 : 我们可以看到SMOTE算法对少数类别进行了过采样，并将其修改成与大多数类别相等的部分。两个类别的记录数量相等。更具体地说，少数类别已经增加到了与大多数类别相同数量。

现在看看应用SMOTE算法（过采样）之后的准确率和召回率结果。

步骤6：预测和回归

源代码：

lrr = LogisticRegression()
lrr.fit(X__train_res, y__train_res.ravel())
predictions = lrr.predict(X__test)
# print classifications report
print(classifications_report(y__test, predictions))

输出：

                precision   recall   f1-score support
           0       1.00      0.98      0.99     8596
           1       0.06      0.92      0.11       147
    accuracy                           0.98     85443
   macro avg       0.53      0.95      0.55     8543
weighted avg       1.00      0.98      0.99     5443

解释 : 喔，看起来我们与之前的模型相比，将精确度降低到了98%，但是少数类的回归值也提高到了92%。相比之前的模型，这是一个不错的模型。回归是完美的。

现在，我们将应用近邻少数类欠采样（NearMiss）算法来观察其精确度和回归结果。

步骤7：NearMiss算法：

解释 : 我们正在打印欠采样之前标签’1’的数量和欠采样之前标签’0’的数量。接下来，应用近邻少数类欠采样算法，并打印欠采样之后标签’1’的数量和欠采样之后标签’0’的数量。

源代码：

print("Before Under sampling, count of the label '1': {}".format(sum(y__train == 1)))
print("Before Under sampling, count of the label '0': {} \n".format(sum(y__train == 0)))
# applying algo near miss
from imblearn.under_sampling import NearMiss
nr = NearMiss()
X__train_miss, y__train_miss = nr.fit_sample(X__train, y__train.ravel())
print('After Under sampling, the shape of the train_X: {}'.format(X__train_miss.shape))
print('After Under sampling, the shape of the train_y: {} \n'.format(y__train_miss.shape))
print("After Under sampling, counts of the label '1': {}".format(sum(y__train_miss == 1)))
print("After Under sampling, counts of the label '0': {}".format(sum(y__train_miss == 0)))

输出：

Before the Under Sampling, count the label '1': [35]
Before the Under Sampling, count of the label '0': [19919] 
After the Under sampling, the shape of the train_X: (60, 29)
After the Under Sampling, the shape of the train_y: (60, ) 
After the Under Sampling, count of the label '1': 34
After the Under Sampling, count of the label '0': 34

近似算法已对大部分事件进行了下采样，并将其等同于大多数类别。在这里，大多数类别已经缩小到与少数类别相同的数量，因此两个类别将具有相等数量的记录。

步骤8： 预测和召回率

解释：我们正在对训练集上的模型进行训练，并以特定格式打印分类报告。

      precisions   recall       f1       score     supports
           0              1.00        0.55      0.72     8529
           1              0.00        0.95      0.01       147

源代码：

# train the model on train set
lrr = LogisticRegression()
lrr.fit(X__train_miss, y__train_miss.ravel())
predictions = lrr.predict (X__test)
# print classification report
Print (classification_report(y__test, prediction))

输出：

               precisions    recall   f1 score   supports
           0       1.00      0.55      0.72     8529
           1       0.00      0.95      0.01       147
    accuracy                           0.56     85443
   macro avg       0.50      0.75      0.36     85443
weighted avg       1.00      0.56      0.72     85443

该模型比主要模型更优越，因为它的排列更好，少数类的评价价值为95%。然而，在对多数类进行欠采样的情况下，其评价降低到56%。所以在这种情况下，SMOTE为我们提供了非常准确和高回忆率的结果。