How to Generate Simulated Data for Classification in Python

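In this tutorial, we will learn how to create simulated data for classification in Python.

Introduction

Simulated data is any data that does not record a real phenomenon but is generated synthetically from a set of parameters and constraints.

When and why we need simulated data

When prototyping a machine learning or deep learning algorithm, we often lack useful real-world data, and for some tasks no such data exists at all. In these cases we may need synthetically generated data. Such data can also come from laboratory simulations.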

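Advantages of simulated data

  • It largely reflects what the data would look like in its real form

  • It contains little noise, so it can be treated as an idealized dataset

  • It is useful for rapid prototyping and proofs of concept

Generating simulated data for classification with Python

In this demonstration, we will use scikit-learn to generate the simulated data.

Example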

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Create a simulated feature matrix and output vector
features, output = make_classification(
    # 100 samples
    n_samples=100,
    # ten features in total
    n_features=10,
    # five informative features that predict the output's classes
    n_informative=5,
    # five redundant features, generated as random linear combinations
    # of the informative features
    n_redundant=5,
    # three output classes
    n_classes=3,
    # roughly 20% of observations in the first class, 30% in the second,
    # and 50% in the third (None gives balanced classes)
    weights=[0.2, 0.3, 0.5],
)

print("Feature Dataframe: ")
feature_names = [f"Feature {i}" for i in range(1, 11)]
df_features = pd.DataFrame(features, columns=feature_names)
output_series = pd.Series(output, name="label")
df = pd.concat([df_features, output_series], axis=1)
print(df.head())

# Plot using seaborn
sns.set(rc={"figure.figsize": (16, 8)})

# Plot 'Feature 1' against the label
sns.scatterplot(data=df, x="Feature 1", y="label", s=50)
plt.show()

Output

Feature Dataframe: 
   Feature 1    Feature 2    Feature 3    Feature 4    Feature 5    Feature 6 \
0  0.849715     -0.381343    0.650106     -1.439747    -0.442026    0.785891 
1  1.841786     0.912779     2.090686     -2.220130    -0.744132    -0.116817 
2  -0.915034    -3.324696    -2.613417    0.852612     -3.908363    4.352266 
3  1.305116     -1.582905    -0.797318    -0.943912    -1.753893    1.721998 
4  0.894486     -0.130399    -0.968311    0.989773     -0.987330    -0.296457
   Feature 7    Feature 8    Feature 9    Feature 10 label 
0  0.119725     1.156633     0.794226     0.511587   2 
1  -0.064624    2.311732     0.178347     1.294978   1 
2  3.038898     -2.273558    4.194868     2.693096   2
3  0.817046     0.577196     2.651006     1.826657   2 
4 -0.280331     0.096983     1.227921     0.909471   2
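
The properties described in the code comments above can be checked directly. The following is a minimal sketch, not part of the original example: the helper name make_data, the sample size of 1000, and the settings flip_y=0.0, shuffle=False and random_state are assumptions chosen to make the checks clean. It looks at three things: the class shares implied by weights, reproducibility with a fixed seed, and the fact that redundant features are linear combinations of the informative ones.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_data(seed):
    """Hypothetical helper: same layout as the example, but seeded."""
    return make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=5,
        n_redundant=5,
        n_classes=3,
        weights=[0.2, 0.3, 0.5],
        flip_y=0.0,        # assumption: disable random label flipping
        shuffle=False,     # assumption: keep informative columns first
        random_state=seed,
    )


features, labels = make_data(seed=0)

# Share of samples in each class; should be close to the weights
class_shares = np.bincount(labels) / len(labels)
print(class_shares)

# The same seed should give exactly the same data
features_again, labels_again = make_data(seed=0)
same = np.array_equal(features, features_again) and np.array_equal(
    labels, labels_again
)
print(same)

# With shuffle=False, columns 0-4 are informative and 5-9 are redundant.
# A least-squares fit of the redundant columns on the informative ones
# should leave a residual that is numerically zero.
informative, redundant = features[:, :5], features[:, 5:]
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_residual = np.abs(informative @ coef - redundant).max()
print(max_residual)
```

Passing random_state is what makes a run repeatable; without it, each call returns different values, so the numbers printed in the output above will not match a fresh run.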

Another approach is to use the Faker Python library. Let's look at it through the example below, which starts by installing Faker.

Example

# Notebook cell syntax; from a terminal, run: pip install Faker
!pip install Faker

from random import randint
import pandas as pd
from faker import Faker
from faker.providers import DynamicProvider

# Custom provider that returns one of the listed professions at random
medical_professions_provider = DynamicProvider(
    provider_name="medical_profession",
    elements=["dr.", "doctor", "nurse", "surgeon", "clerk"],
)
fake = Faker()
fake.add_provider(medical_professions_provider)

def input_data(x):
    # Build a pandas DataFrame with x rows of fake records
    data = pd.DataFrame()
    for i in range(0, x):
        data.loc[i, 'id'] = randint(1, 100)
        data.loc[i, 'name'] = fake.name()
        data.loc[i, 'address'] = fake.address()
        data.loc[i, 'latitude'] = str(fake.latitude())
        data.loc[i, 'longitude'] = str(fake.longitude())
        # Class label drawn from the custom provider
        data.loc[i, 'target'] = str(fake.medical_profession())
    return data

print(input_data(10))

Output

   id    name               address \
0  7.0   Monique Rodriguez  481 Rebecca Landing Suite 727\nDominiquefurt, ...
1  4.0   Elizabeth Johnson  62492 Zimmerman Crest Apt. 047\nPort Jerome, W...
2  18.0  Max Rangel         4379 Obrien Curve\nDavistown, IA 02341
3  31.0  Tammie Kent        4866 Angela Turnpike Apt. 658\nNorth Sheilabor...
4  42.0  James Johnston     26827 Jeremiah Alley\nFreystad, SC 86902
5  21.0  Shawn Robles       137 Jessica Ridges Apt. 436\nWilliamburgh, AZ ...
6  13.0  Stephen Hodges     Unit 9799 Box 0625\nDPO AA 94415
7  91.0  Eric Lewis PhD     4711 Nicholas Loaf\nWest Lisa, UT 28944
8  68.0  Matthew Munoz      37836 White Crest\nGonzalezport, NC 75320
9  34.0  Lawrence Anderson  76712 Garza Mills Apt. 751\nPort Penny, CT 43042

   latitude     longitude    target
0  60.574796    109.367770   clerk
1  84.7225155   -167.216393  dr.
2  82.598649    62.961322    surgeon
3  26.9617205   89.333171    doctor
4  -37.1740195  -140.766121  dr.
5  -40.8904645  28.820918    clerk
6  88.809220    76.442779    dr.
7  35.728143    178.729120   doctor
8  -16.5669945  126.686740   dr.
9  -49.271970   160.737754   clerk

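Conclusion

In everyday machine learning work, simulated data is very useful for prototyping and small proofs of concept. Python offers convenient tools, such as scikit-learn and Faker, that create simulated data in a few lines of code.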
