How to Process XML in Python: The ElementTree Library

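In this tutorial, we will learn how to parse XML files with Python and how to modify and populate them using Python's ElementTree package. To make sense of the data, we will also look at XPath expressions and XML trees.

Let's start with a brief introduction to XML. If you are already familiar with XML, you can skip this section and move on to the next one.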

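What Is XML

XML stands for "Extensible Markup Language". It is a text format for describing, storing, and exchanging structured data, and both people and programs can read it. Its tags are not predefined; each document defines the tags that fit its data.

A file written in XML is called an XML document. XML organizes data as a tree, which makes hierarchical data intuitive to represent. Let's go over a few important properties of XML.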

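An XML document is made up of elements. Each element is delimited by a start tag and an end tag, both written between < and >, and the characters between the two tags are the element's content. An element can contain markup, including other elements, which are called "child elements". The top-level element is called the root element and contains all the other elements.
A start tag or an empty-element tag can carry name-value pairs called attributes.

Below is the structure of a sample XML file.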

XML

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre> 
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   </book>
</catalog>

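As we can see in the sample XML file above:

<catalog> is the single root element; it contains all the other elements, such as <book> or <title>.
The child elements (sub-elements) sit inside <catalog>, and we can see that they are nested.
Each <book> element has an id attribute and several child elements, such as author, title, and so on.

Note: a child element can contain child elements of its own, sometimes called "sub-child elements".

Now, let's turn to the ElementTree library.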

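What Is ElementTree

The tree structure of XML lets us navigate, modify, and delete data in a simple way. Python ships with the ElementTree library, which provides functions for reading and manipulating XML. It is used for parsing, that is, reading the contents of a file and splitting them into pieces. The table below lists the properties of an ElementTree element.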

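Property	Description
Tag	A string that identifies the kind of data the element holds.
Attributes	A set of attributes, stored as a dictionary.
Text string	The text content of the element, that is, the information to display.
Tail string	An optional string holding any text that follows the element's end tag.
Child elements	Any number of child elements, stored as a sequence.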
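
The properties in the table map onto attributes of an Element object. The sketch below is only an illustration: the one-line document and the variable names `para` and `bold` are made up, while `tag`, `attrib`, `text`, `tail`, and `len()` are part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line document: <b> has an attribute, text, and a tail
para = ET.fromstring('<p>Read <b lang="en">this</b> first.</p>')
bold = para[0]            # child elements are stored as a sequence

print(para.tag)           # tag of the outer element
print(repr(para.text))    # text before the first child
print(bold.attrib)        # attributes as a dictionary
print(bold.text)          # text inside <b>...</b>
print(repr(bold.tail))    # text after </b> and before </p>
print(len(para))          # number of child elements
```

Note that the tail belongs to the child element, not to its parent, which is easy to overlook when documents mix text and markup.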

To use the ElementTree module, we import it into our program as shown below.

import xml.etree.ElementTree as ET

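Parsing XML Data

The main goal of this tutorial is to read and understand a file with Python. Our sample XML file holds the details of many books, but such data is often messy and inconsistent, because anyone can enter data into the file in their own way.

Let's look at the following example.

Example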

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print(root)

Output:

<Element 'catalog' at 0x000001FAD52C44A0>

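In the code above, we loaded the tree and printed the XML root object. Now we can print each part of the tree to get a better picture of its structure.

As discussed earlier, every part of the tree has a tag that identifies the element. Elements can also have attributes, which are useful for describing or validating the values they hold. Let's print the tag of the XML root.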

print(root.tag)

Output:

catalog

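If we look at the top level of the XML file, we can see that the document is rooted at the catalog tag. Now let's look at the attributes of the root element.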

print("Attributes are:",root.attrib)

Output:

Attributes are: {}

As we can see, the root element has no attributes.
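
The examples so far read book.xml from disk with ET.parse(). When the XML is already in memory as a string, ET.fromstring() can be used instead. The following is a minimal sketch; the tiny document and its `source` attribute are invented for illustration so that the root has an attribute to show.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose root carries one attribute
xml_text = '<catalog source="demo"><book id="bk101"/></catalog>'

root = ET.fromstring(xml_text)   # returns the root Element directly
tree = ET.ElementTree(root)      # wrap it when a tree object is needed

print(root.tag)
print("Attributes are:", root.attrib)
print(tree.getroot() is root)
```

ET.parse() returns an ElementTree object, so getroot() is needed there; ET.fromstring() skips that step and hands back the root element.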

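Parsing with a for Loop

We can use a for loop to iterate over the child elements (child nodes) of the root element. Let's look at the following example.

Example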

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

for ch in root:
    print(ch.tag, ch.attrib)

Output:

book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
book {'id': 'bk104'}
book {'id': 'bk105'}
book {'id': 'bk106'}
book {'id': 'bk107'}
book {'id': 'bk108'}
book {'id': 'bk109'}

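As we can see, every book element is a child of the root element catalog. The id attribute identifies each book, and every book has a different id.

It is often helpful to gather information about every element in the tree. For that we can use the root.iter() method in a for loop (here, a list comprehension): it walks the whole tree and yields the root and all of its descendants in document order. The resulting flat list of tags, however, shows neither the attributes nor the nesting levels of the tree.

Example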

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

print("Iterating root using for loop:")
tags = [elem.tag for elem in root.iter()]
print(tags)

Output:

Iterating root using for loop:
['catalog', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description']
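
The flat list above loses the nesting. If the hierarchy matters, one option is a small recursive walk that tracks depth. This is a sketch only: the helper name `outline` and the one-book sample are assumptions, not part of ElementTree or of book.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample, kept inline so the sketch runs without book.xml
SAMPLE = """<catalog>
   <book id="bk101">
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
</catalog>"""

def outline(elem, depth=0):
    """Return one indented line per element, children after their parent."""
    label = elem.tag + (f" {elem.attrib}" if elem.attrib else "")
    lines = ["  " * depth + label]
    for child in elem:                      # direct children only
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(SAMPLE)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting, and attributes are shown next to the tag when an element has any.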

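ElementTree can also print the whole document with the ET.tostring() function. We pass the root element to it and specify the encoding; because the function then returns bytes, we decode the result to get a string. Here we use 'utf8' for both steps.

Let's look at the following code snippet.

Example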

print(ET.tostring(root, encoding='utf8').decode('utf8'))

Output:

<?xml version='1.0' encoding='utf8'?>
<catalog>   
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
</catalog>
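
As a side note, ET.tostring() can also return a string directly when it is given encoding='unicode', which avoids the separate decode step. The sketch below compares the two calls on a made-up one-book document.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document
root = ET.fromstring('<catalog><book id="bk101"/></catalog>')

as_bytes = ET.tostring(root, encoding='utf8')     # bytes, with an XML declaration
as_text = ET.tostring(root, encoding='unicode')   # str, without a declaration

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

The bytes form is the one to use when writing to a binary file or a socket; the string form is convenient for printing and logging.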

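To find all the elements with a given tag, we can pass the tag name to the root.iter() method. It yields every element in the tree, at any depth, whose tag matches the given name. Let's look at the code below:

Example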

for book in root.iter('book'):
    print(book.attrib)

Output:

{'id': 'bk101'}
{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}
{'id': 'bk106'}
{'id': 'bk107'}
{'id': 'bk108'}
{'id': 'bk109'}
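
The same loop can collect each book into a plain Python record. The sketch below assumes a two-book sample trimmed from the catalog; `get()` and `findtext()` are ElementTree methods that this tutorial does not otherwise cover, and the `records` list is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Two books trimmed from the catalog, kept inline so the sketch is runnable
SAMPLE = """<catalog>
   <book id="bk101">
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <title>Midnight Rain</title>
      <price>5.95</price>
   </book>
</catalog>"""

root = ET.fromstring(SAMPLE)

records = []
for book in root.iter('book'):
    records.append({
        'id': book.get('id'),                    # attribute lookup
        'title': book.findtext('title'),         # text of the <title> child
        'price': float(book.findtext('price')),  # convert the text to a number
    })

print(records)
```

Turning elements into dictionaries like this makes the data easy to sort, filter, or hand to other libraries.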

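XPath Expressions

Sometimes an element has no attributes, only text content. We can use the .text attribute to print that text. Let's look at the following example.

Example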

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

print("Description Values:")
for description in root.iter('description'):
    print(description.text)

Output:

Description Values:
An in-depth look at creating applications 
      with XML.
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.
When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.
A deep sea diver finds true love twenty
      thousand leagues beneath the sea.
An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.
After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.
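
The description text printed above keeps the line breaks and indentation from the file. Here is a minimal sketch of one way to tidy it, assuming that collapsing every run of whitespace into a single space is acceptable for your data. The one-book sample is trimmed from the catalog.

```python
import xml.etree.ElementTree as ET

# One description copied from the catalog, including its line break
SAMPLE = """<book id="bk101">
      <description>An in-depth look at creating applications
      with XML.</description>
</book>"""

book = ET.fromstring(SAMPLE)
raw = book.find('description').text
clean = " ".join(raw.split())   # collapse newlines and runs of spaces

print(repr(raw))
print(clean)
```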

With the .text attribute, we can get the text content of any element.

Example 2

print("Title Values:")
for title in root.iter('title'):
    print(title.text)

Output:

Title Values:
XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost

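Looping over every element like this is not the recommended way to search an XML file. XPath is the more common and recommended approach. XPath stands for XML Path Language, a query language for searching XML quickly and easily. It uses a path-like syntax to identify and navigate the nodes of an XML document.

ElementTree provides the findall() method, which returns every element that matches a tag name or path expression, starting from the element it is called on. With a bare tag name, it searches only that element's direct children.

Let's look at the following example.

Example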

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

for val in root.findall("./book[price='5.95']"):
    print(val.attrib)

Output:

{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}

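Four books are priced at 5.95. This approach is a quick and effective way to find specific results in a large XML file. Next, let's find the books whose genre is Romance.

Example 2:

for val in root.findall("./book[genre='Romance']"):
    print(val.attrib)

Output: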

{'id': 'bk106'}
{'id': 'bk107'}
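
ElementTree supports only a subset of XPath. The sketch below shows two more forms from that subset, an attribute predicate and a descendant search, on a three-book sample trimmed from the catalog. The predicates compare text, so a numeric condition such as "cheaper than 6.0" is handled here in plain Python; that filter is one possible approach, not something the library provides.

```python
import xml.etree.ElementTree as ET

# Three books trimmed from the catalog, kept inline so the sketch is runnable
SAMPLE = """<catalog>
   <book id="bk101"><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price></book>
   <book id="bk102"><title>Midnight Rain</title><genre>Fantasy</genre><price>5.95</price></book>
   <book id="bk106"><title>Lover Birds</title><genre>Romance</genre><price>4.95</price></book>
</catalog>"""

root = ET.fromstring(SAMPLE)

# Attribute predicate: the book whose id is bk102
by_id = root.find("./book[@id='bk102']")
print(by_id.findtext('title'))

# Descendant search: every <title> at any depth below the root
all_titles = [t.text for t in root.findall(".//title")]
print(all_titles)

# Numeric comparison done in Python after converting the text to float
cheap_ids = [b.get('id') for b in root.findall("./book")
             if float(b.findtext('price')) < 6.0]
print(cheap_ids)
```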

Modifying XML

We can modify an XML file to suit our needs. Let's look at the example below.

Example

First, let's print the book titles again.

for title in root.iter('title'):
    print(title.text)

Output:

XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost

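Now we will replace the title 'Midnight Rain' with 'The Alchemist'.

# Find the <book> element whose <title> child has the text 'Midnight Rain'
book = root.find("./book[title='Midnight Rain']")
print(book)

# Change the text of the <title> child itself
title = book.find('title')
title.text = "The Alchemist"
print(title.text)

Output:

<Element 'book' at 0x0000024822762770>
The Alchemist

Once the tree has been modified, we write the change back to the XML file. Let's look at the example below.

Example

# Save the modified tree, then reload the file to confirm the change
tree.write("book.xml")

tree = ET.parse('book.xml')
root = tree.getroot()

for title in root.iter('title'):
    print(title.text)

Output: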

XML Developer's Guide
The Alchemist
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
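
The example above overwrites book.xml in place. A more cautious variant, sketched below, writes the modified tree to a separate file and includes an XML declaration. The temporary directory, the file name book_updated.xml, and the one-book sample are assumptions made so the sketch can run on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical one-book tree standing in for the parsed book.xml
root = ET.fromstring(
    '<catalog><book id="bk102"><title>Midnight Rain</title></book></catalog>'
)
tree = ET.ElementTree(root)

# Edit the text of the <title> child
root.find("./book[@id='bk102']/title").text = "The Alchemist"

# Write to a separate (temporary) file so the source file stays untouched
out_path = os.path.join(tempfile.mkdtemp(), "book_updated.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Read the copy back to confirm the change was saved
reloaded = ET.parse(out_path).getroot()
print(reloaded.findtext("./book/title"))
```

Keeping the original file intact makes it easy to compare the two versions before replacing anything.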

Example 2:

for description in root.iter('description'):
    # Append a note to the existing text and flag the element as updated
    description.text = str(description.text) + ' This is an author view'
    description.set('updated', 'yes')

tree.write('book.xml')

Output:

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description updated="yes">An in-depth look at creating applications 
      with XML. This is an author view</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>The Alchemist</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description updated="yes">A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world. This is an author view</description>
   </book>

The code above appends the new text to every description in book.xml and marks each one with an updated attribute. Only the first two books are shown here, but the change applies to the whole file.
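
Besides editing existing elements, a tree can also be populated with new ones or pruned. The sketch below uses ET.SubElement() and remove() on a made-up one-book catalog; the id bk110 and the title 'Hypothetical New Title' are placeholders, not entries from book.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-book catalog standing in for the parsed book.xml
root = ET.fromstring(
    '<catalog><book id="bk101"><title>XML Developer\'s Guide</title></book></catalog>'
)

# Add a new <book> with an id attribute and a <title> child
new_book = ET.SubElement(root, 'book', {'id': 'bk110'})
ET.SubElement(new_book, 'title').text = 'Hypothetical New Title'

# Remove the original book; remove() works on direct children only
root.remove(root.find("./book[@id='bk101']"))

remaining_ids = [book.get('id') for book in root]
print(remaining_ids)
print(ET.tostring(root, encoding='unicode'))
```

As with the earlier examples, the changes exist only in memory until the tree is written back to a file.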