How to Convert Unstructured Data to Structured Data in Python

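Unstructured data is data that does not follow a predefined data model or format; it can take the form of text, images, audio, or video. Converting it into structured data is an important step in data analysis, because structured data is much easier to query, analyze, and draw insights from. Python offers a range of libraries and tools for this conversion, which makes the data easier to manage and analyze.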

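In this article, we look at how to use Python to convert unstructured biometric data, here in the form of employee punch-clock records, into a structured format so that it can be analyzed and interpreted more meaningfully.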

There are many ways to convert unstructured data into structured data; this article covers the following two:

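Regular expressions (regex): This approach uses regular expressions to extract structured data from unstructured text. You define patterns that match specific fragments of the text and capture the relevant values.
Data processing libraries: Libraries such as pandas can clean unstructured data and reshape it into a structured format. They provide functions for data cleaning, normalization, and transformation.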
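
As a minimal sketch of both ideas, the snippet below first pulls one value out of a line of free text with a capture group, then tidies a messy column with pandas string methods. The sample line and the `department` column are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical free-text line; the capture group (\d+) isolates the numeric ID
line = "Employee ID: 1234"
match = re.search(r"Employee ID: (\d+)", line)
employee_id = match.group(1) if match else None
print(employee_id)

# Hypothetical messy column: stray whitespace and inconsistent casing
raw = pd.DataFrame({"department": ["  sales", "MARKETING ", "Sales"]})
raw["department"] = raw["department"].str.strip().str.title()
print(raw["department"].tolist())
```

Note that `group(1)` returns a string, so the ID would still need an explicit conversion (for example with `int`) if it is to be treated as a number.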

Using Regular Expressions

Consider the code shown below.

Example

import re
import pandas as pd

# sample unstructured text data
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM

Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# define regular expression patterns to extract data
id_pattern = re.compile(r'Employee ID: (\d+)')
name_pattern = re.compile(r'Name: (.+)')
dept_pattern = re.compile(r'Department: (.+)')
time_pattern = re.compile(r'Punch Time: (.+)')

# create empty lists to store extracted data
ids = []
names = []
depts = []
times = []

# iterate through each line of the text data
for line in text_data.splitlines():
    # try each pattern once and store the captured value in its list
    for pattern, values in (
        (id_pattern, ids),
        (name_pattern, names),
        (dept_pattern, depts),
        (time_pattern, times),
    ):
        match = pattern.match(line)
        if match:
            values.append(match.group(1))
            break

# create a dataframe using the extracted data
data = {'Employee ID': ids, 'Name': names, 'Department': depts, 'Punch Time': times}
df = pd.DataFrame(data)

# print the dataframe
print(df)

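Explanation

First, we define the unstructured text data as a multi-line string.
Next, we define the regular expression patterns that extract the relevant values from the text, using Python's re module.
We create empty lists to hold the extracted data.
We iterate over each line of the text and check whether it matches one of the patterns. On a match, we take the captured value and append it to the corresponding list.
Finally, we build a pandas DataFrame from the extracted data and print it.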

Output

   Employee ID        Name  Department  Punch Time
0         1234    John Doe       Sales     8:30 AM
1         2345  Jane Smith   Marketing     9:00 AM
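
The line-by-line approach fills four independent lists, so if one record is missing a field the lists end up with different lengths and `pd.DataFrame` raises a `ValueError`. One possible way around this, sketched below under the assumption that records are separated by blank lines, is to parse each record into its own dictionary. The names `FIELD_PATTERN`, `COLUMNS`, and `parse_records` are hypothetical helpers, and the sample deliberately omits one Department line.

```python
import re
import pandas as pd

# Same layout as the sample above, but the second record has no Department line
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM

Employee ID: 2345
Name: Jane Smith
Punch Time: 9:00 AM
"""

# One generic "Key: value" pattern; MULTILINE lets ^ and $ match at every line
FIELD_PATTERN = re.compile(
    r"^(Employee ID|Name|Department|Punch Time): (.+)$", re.MULTILINE
)
COLUMNS = ["Employee ID", "Name", "Department", "Punch Time"]


def parse_records(text):
    """Split the text on blank lines and turn each block into a dict of fields."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = dict(FIELD_PATTERN.findall(block))
        if fields:
            records.append(fields)
    return records


# Fields missing from a record show up as NaN instead of breaking the frame
df = pd.DataFrame(parse_records(text_data), columns=COLUMNS)
print(df)
```

Because every record becomes one dictionary, a missing field only affects its own row, and the gap stays visible as a missing value rather than silently shifting later values up.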

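Using the Pandas Library

Suppose we have raw punch-clock data that looks like this, stored in a file named unstructured_data.csv, with one row per punch event.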

employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
1002,2022-01-01,17:03:45,Punch-Out
1001,2022-01-02,09:12:43,Punch-In
1001,2022-01-02,12:37:22,Punch-Out
1002,2022-01-02,08:55:10,Punch-In
1002,2022-01-02,17:00:15,Punch-Out

Example

import pandas as pd

# Load unstructured data
unstructured_data = pd.read_csv("unstructured_data.csv")

# The file already has separate 'date' and 'time' columns (there is no
# 'date_time' column), so both can be used as loaded, as plain strings

# Pivot the table to get 'Punch-In' and 'Punch-Out' time for each employee on each date
# (passing a list as index requires pandas 1.1 or later)
structured_data = unstructured_data.pivot(index=['employee_id', 'date'], columns='type', values='time').reset_index()

# Rename column names
structured_data = structured_data.rename(columns={"Punch-In": "punch_in", "Punch-Out": "punch_out"})

# Calculate total hours worked by subtracting 'punch_in' from 'punch_out'
# (the time strings are parsed with an explicit format; the result is a Timedelta)
structured_data['hours_worked'] = pd.to_datetime(structured_data['punch_out'], format='%H:%M:%S') - pd.to_datetime(structured_data['punch_in'], format='%H:%M:%S')

# Print the structured data
print(structured_data)

Output

type  employee_id        date  punch_in  punch_out    hours_worked
0            1001  2022-01-01  09:01:22   12:35:10 0 days 03:33:48
1            1001  2022-01-02  09:12:43   12:37:22 0 days 03:24:39
2            1002  2022-01-01  08:58:30   17:03:45 0 days 08:05:15
3            1002  2022-01-02  08:55:10   17:00:15 0 days 08:05:05
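
For readers who want to try this without creating a file, here is a self-contained sketch of the same pipeline. It is one possible variant, not the only way to do it: it reads the sample rows from an in-memory string via `io.StringIO`, combines date and time into a single timestamp before pivoting, and converts the resulting Timedelta to decimal hours so it can be summed per employee. The names `punches`, `shifts`, `hours_decimal`, and `total_hours` are illustrative.

```python
import io
import pandas as pd

# The same sample rows as above, held in memory instead of a CSV file
csv_text = """employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
1002,2022-01-01,17:03:45,Punch-Out
1001,2022-01-02,09:12:43,Punch-In
1001,2022-01-02,12:37:22,Punch-Out
1002,2022-01-02,08:55:10,Punch-In
1002,2022-01-02,17:00:15,Punch-Out
"""

punches = pd.read_csv(io.StringIO(csv_text))

# Combine the date and time strings into one timestamp per punch
punches["timestamp"] = pd.to_datetime(
    punches["date"] + " " + punches["time"], format="%Y-%m-%d %H:%M:%S"
)

# One row per employee and day, one column per punch type
shifts = punches.pivot(
    index=["employee_id", "date"], columns="type", values="timestamp"
).reset_index()
shifts.columns.name = None  # drop the leftover 'type' label on the columns
shifts = shifts.rename(columns={"Punch-In": "punch_in", "Punch-Out": "punch_out"})

# Convert the Timedelta to decimal hours so it can be aggregated like any number
shifts["hours_decimal"] = (
    shifts["punch_out"] - shifts["punch_in"]
).dt.total_seconds() / 3600

# Total hours per employee across all days, rounded to two decimals
total_hours = shifts.groupby("employee_id")["hours_decimal"].sum().round(2)
print(total_hours)
```

Keeping full timestamps rather than bare time strings also means a shift that crosses midnight would still yield a positive duration, provided the punch-out row carries the next day's date.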