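How to Process XML in Python: The ElementTree Library
In this tutorial, we will learn how to parse XML files with Python and how to modify and populate them using Python's ElementTree package. To make sense of the data, we will also look at XPath expressions and XML trees.
Let's start with a brief introduction to XML. If you are already familiar with XML, you can skip this section and go straight to the next one.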
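What is XML
XML stands for "Extensible Markup Language". It is a markup language for describing, storing, and exchanging structured data in a form that both people and programs can read.
A file written in XML is called an XML document. XML represents data as a tree, which is intuitive and naturally supports hierarchy. Let's go over some important properties of XML.
- An XML document is made up of elements delimited by start and end tags, which are written between angle brackets (< and >). The characters between the start tag and the end tag are the element's content. An element may contain markup, including other elements, which are called "child elements". The top-level element is called the root element and contains everything else in the document.
- A start tag or an empty-element tag may contain name-value pairs, which are called attributes.
Below is the sample structure of an XML file.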
XML
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
</catalog>
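As we can see in the sample XML file above:
- <catalog> is the single root element, and it contains all the other elements, such as <book> or <title>.
- The child elements sit inside <catalog>, and we can see that they are nested. Each <book> element contains several child elements, such as author, title, and so on.
Note: a child element can contain child elements of its own, also called sub-child elements.
Now, let's move on to the ElementTree library.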
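What is ElementTree
The XML tree structure lets us modify, navigate, and delete data in a simple way. Python ships with the ElementTree library, which provides several functions for reading and manipulating XML. It is used for parsing, that is, reading information from a file and splitting it into pieces. The table below lists the properties of an XML element as ElementTree represents it.
Property | Description |
---|---|
Tag | A string that identifies what kind of data the element holds. |
Attributes | A set of attributes, stored as a dictionary. |
Text string | The text content of the element, that is, the information to display. |
Tail string | An optional string holding the text that follows the element's end tag. |
Child elements | The element's child elements, stored as a sequence. |
To use the ElementTree module, we need to import it into our program as shown below.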
import xml.etree.ElementTree as ET
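To make the table above concrete, here is a minimal sketch that builds a tiny element from a string and prints each of the listed properties. The `<note>` snippet is made up purely for illustration and is not part of the tutorial's sample file.

```python
import xml.etree.ElementTree as ET

# A made-up one-line document, used only to illustrate the Element properties.
root = ET.fromstring('<note lang="en"><to>Ann</to> and <to>Bob</to></note>')

first = root[0]            # child elements behave like a sequence
print(root.tag)            # tag: the element name as a string
print(root.attrib)         # attributes: a dictionary
print(first.text)          # text: the content right after the start tag
print(repr(first.tail))    # tail: the text that follows the element's end tag
print(len(root))           # number of direct child elements
```

Note that the tail belongs to the child element it follows, not to the parent, which is why `first.tail` holds the text between the two `<to>` elements.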
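Parsing XML Data
The main goal of this tutorial is to read and understand a file using Python. Our sample XML file contains many book details, but data of this kind can be messy and inconsistent, because anyone can enter data into the file in their own way.
Let's look at the following example.
Example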
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print(root)
Output:
<Element 'catalog' at 0x000001FAD52C44A0>
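In the code above, we built the tree and printed the XML root object. Now we can print each part of the tree to get a better feel for its structure.
As discussed earlier, every part of the tree has a tag that identifies the element. An element can also carry attributes, which are often useful when validating the values that were entered. Let's print the root tag of the XML.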
print(root.tag)
Output:
catalog
If we look at the top level of the XML file, we can see that catalog is the root tag. Let's look at the attributes of the root tag.
print("Attributes are:",root.attrib)
Output:
Attributes are: {}
As we can see, the root element has no attributes.
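ET.parse() reads from a file. When the XML is already in memory as a string, ET.fromstring() returns the root element directly, and malformed input raises ET.ParseError. The sketch below assumes a shortened stand-in for book.xml, and `is_well_formed` is a hypothetical helper written for this example, not part of the library.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (assumption: same structure, one book).
xml_text = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
</catalog>"""

root = ET.fromstring(xml_text)   # returns the root Element, not an ElementTree
print(root.tag, root.attrib)


def is_well_formed(text):
    """Hypothetical helper: True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(xml_text))
print(is_well_formed('<catalog><book></catalog>'))   # mismatched tags
```

Checking for ET.ParseError early is one simple way to guard against the inconsistent input mentioned above before any further processing.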
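Parsing with a for Loop
We can use a for loop to iterate over the child elements (child nodes) of the root element. Let's understand the following example.
Example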
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print("Iterating root using for loop")
for ch in root:
    print(ch.tag, ch.attrib)
Output:
Iterating root using for loop
book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
book {'id': 'bk104'}
book {'id': 'bk105'}
book {'id': 'bk106'}
book {'id': 'bk107'}
book {'id': 'bk108'}
book {'id': 'bk109'}
As we can see, all the book elements are children of the root element catalog. The id attribute identifies each book, and every book has a different id.
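The same loop can be nested one level deeper to reach the fields of each book. The following is a minimal sketch that assumes a shortened, two-book stand-in for book.xml; the `rows` list is introduced here only to collect the results.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (assumption: same structure, two books).
xml_text = """<catalog>
  <book id="bk101"><author>Gambardella, Matthew</author><price>44.95</price></book>
  <book id="bk102"><author>Ralls, Kim</author><price>5.95</price></book>
</catalog>"""
root = ET.fromstring(xml_text)

rows = []
for book in root:              # direct children of <catalog>
    for field in book:         # direct children of each <book>
        rows.append((book.get('id'), field.tag, field.text))
        print(book.get('id'), field.tag, field.text)
```

Iterating over an element only visits its direct children, so two loops are needed to get from the catalog to the individual fields.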
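It is often helpful to get information about the elements of the whole tree. Here we use the root.iter() method, which yields every element in the tree, starting with the root, in document order. However, it does not show the attributes or the nesting level of each element.
Example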
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
# Collect the tag of every element in the tree, in document order
tags = [elem.tag for elem in root.iter()]
print(tags)
Output:
['catalog', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description']
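A long list of tags is hard to read. One possible way to summarize it is to count how often each tag occurs with collections.Counter. This is a sketch under the assumption of a shortened, two-book stand-in for book.xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened stand-in for book.xml (assumption: same structure, two books).
xml_text = """<catalog>
  <book id="bk101"><author>Gambardella, Matthew</author><title>XML Developer's Guide</title></book>
  <book id="bk102"><author>Ralls, Kim</author><title>Midnight Rain</title></book>
</catalog>"""
root = ET.fromstring(xml_text)

# root.iter() walks the root and all of its descendants.
tag_counts = Counter(elem.tag for elem in root.iter())
for tag, count in tag_counts.items():
    print(tag, count)
```

A count that differs between tags that should appear once per book (for example, fewer titles than books) is a quick hint that some records are incomplete.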
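ElementTree also lets us print the whole document with the ET.tostring() function. We pass the root element to it together with an encoding, and then decode the resulting bytes back into a string. For this XML file, we use 'utf8'.
Let's understand the following code snippet.
Example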
print(ET.tostring(root, encoding='utf8').decode('utf8'))
Output:
<?xml version='1.0' encoding='utf8'?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
</catalog>
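As an alternative to encoding and then decoding, ET.tostring() also accepts encoding='unicode', which returns a str directly and omits the XML declaration. The sketch below compares both calls on a one-book document that is assumed here only for brevity.

```python
import xml.etree.ElementTree as ET

# One-book stand-in for book.xml (assumption, for brevity).
root = ET.fromstring(
    '<catalog><book id="bk101"><title>Midnight Rain</title></book></catalog>'
)

as_bytes = ET.tostring(root, encoding='utf8')     # bytes, with an XML declaration
as_text = ET.tostring(root, encoding='unicode')   # str, without a declaration

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

Which form is preferable depends on the destination: bytes suit files and network sockets, while a str is convenient for printing or logging.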
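To find all descendants with a given tag, we can pass the tag name to root.iter(). The method then yields every element under the root that matches the specified tag. Let's look at the following code:
Example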
for book in root.iter('book'):
    print(book.attrib)
Output:
{'id': 'bk101'}
{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}
{'id': 'bk106'}
{'id': 'bk107'}
{'id': 'bk108'}
{'id': 'bk109'}
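Filtering with root.iter('book') combines well with findtext() to turn each book into a plain dictionary. The following sketch assumes a shortened stand-in for book.xml in which the price field is left out on purpose, to show the default argument of findtext().

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (assumption: no <price> elements here).
xml_text = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><genre>Computer</genre></book>
  <book id="bk102"><title>Midnight Rain</title><genre>Fantasy</genre></book>
</catalog>"""
root = ET.fromstring(xml_text)

books = []
for book in root.iter('book'):
    books.append({
        'id': book.get('id'),                             # attribute value
        'title': book.findtext('title'),                  # text of the <title> child
        'genre': book.findtext('genre'),
        'price': book.findtext('price', default='n/a'),   # child is missing here
    })

for entry in books:
    print(entry)
```

Using a default for missing children keeps the loop from failing on incomplete records.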
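XPath Expressions
Sometimes an element has no attributes and holds only text content. We can use the .text attribute to print that text. Let's understand the following example.
Example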
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print("Description Values:")
for description in root.iter('description'):
    print(description.text)
Output:
Description Values:
An in-depth look at creating applications
with XML.
A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.
The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.
When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.
A deep sea diver finds true love twenty
thousand leagues beneath the sea.
An anthology of horror stories about roaches,
centipedes, scorpions and other insects.
After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.
Using the .text attribute, we can get the text content of any element.
Example 2
print("Title Values:")
for title in root.iter('title'):
    print(title.text)
Output:
Title Values:
XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
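The .text attribute always returns a string (or None for an empty element), so numeric fields such as price need an explicit conversion before any arithmetic. A minimal sketch, assuming a three-book stand-in for book.xml that contains only prices:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (assumption: three books, prices only).
xml_text = """<catalog>
  <book id="bk101"><price>44.95</price></book>
  <book id="bk102"><price>5.95</price></book>
  <book id="bk106"><price>4.95</price></book>
</catalog>"""
root = ET.fromstring(xml_text)

# Convert each price string to a float before doing arithmetic.
prices = [float(p.text) for p in root.iter('price')]
average = sum(prices) / len(prices)

print(prices)
print(round(average, 2))
```

For exact monetary arithmetic, decimal.Decimal from the standard library may be a better fit than float.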
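Iterating over the whole file like this is not the recommended way to look up specific data. XPath is the most common and recommended approach. It stands for XML Path Language, a query language for searching XML quickly and easily. It uses a path-like syntax to identify and navigate nodes in an XML document. Note that ElementTree supports only a subset of XPath.
ElementTree provides the findall() method, which returns all elements that match a tag name or path expression, searched relative to the element it is called on.
Let's understand the following example.
Example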
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
for val in root.findall("./book[price='5.95']"):
    print(val.attrib)
Output:
{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}
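Four books are priced at 5.95. This approach is a quick and effective way to find specific results in a large XML file. Now let's find the books whose genre is Romance.
Example 2: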
for val in root.findall("./book[genre='Romance']"):
    print(val.attrib)
Output:
{'id': 'bk106'}
{'id': 'bk107'}
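Besides child-text predicates, the XPath subset in ElementTree also covers attribute predicates, positional predicates, and the `.//` descendant search. The sketch below shows a few such expressions on a shortened, three-book stand-in for book.xml; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (assumption: same structure, three books).
xml_text = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><genre>Computer</genre></book>
  <book id="bk102"><title>Midnight Rain</title><genre>Fantasy</genre></book>
  <book id="bk103"><title>Maeve Ascendant</title><genre>Fantasy</genre></book>
</catalog>"""
root = ET.fromstring(xml_text)

# Attribute predicate: the book whose id attribute equals 'bk102'.
by_id = root.find("./book[@id='bk102']")

# Descendant search: every <title> anywhere below the root.
all_titles = [t.text for t in root.findall('.//title')]

# Positional predicate: the first <book> child (positions start at 1).
first_book = root.find('./book[1]')

# Child-text predicate combined with findtext() on each match.
fantasy_titles = [b.findtext('title') for b in root.findall("./book[genre='Fantasy']")]

print(by_id.findtext('title'))
print(all_titles)
print(first_book.get('id'))
print(fantasy_titles)
```

find() returns the first match or None, while findall() always returns a list, so the result of find() is worth checking before use on real data.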
Modifying XML
We can modify an XML file as needed. Let's look at the example below.
Example: print the book titles again
for title in root.iter('title'):
    print(title.text)
Output:
XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
Now we will change the title 'Midnight Rain' to 'The Alchemist'.
book = root.find("./book[title='Midnight Rain']")
print(book)
# Change the text of the <title> child, not an attribute of <book>
title = book.find('title')
title.text = "The Alchemist"
print(title.text)
Output:
<Element 'book' at 0x0000024822762770>
The Alchemist
Once we have modified the tree, we write the change back to the XML file. Let's understand the example below.
Example
tree.write("book.xml")
tree = ET.parse('book.xml')
root = tree.getroot()
for title in root.iter('title'):
    print(title.text)
Output:
XML Developer's Guide
The Alchemist
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
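Writing back to the same file overwrites the original data. A more cautious variant, sketched below, writes the modified tree to a separate file and asks for an XML declaration. The one-book document, the temporary directory, and the file name book_modified.xml are all assumptions made for this example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# One-book stand-in for book.xml (assumption, for brevity).
root = ET.fromstring(
    '<catalog><book id="bk102"><title>Midnight Rain</title></book></catalog>'
)
tree = ET.ElementTree(root)

# Change the title text of the book with id 'bk102'.
root.find("./book[@id='bk102']/title").text = 'The Alchemist'

# Write to a new file (hypothetical name) so the source data stays untouched.
out_path = os.path.join(tempfile.mkdtemp(), 'book_modified.xml')
tree.write(out_path, encoding='utf-8', xml_declaration=True)

# Read the file back to confirm the change was saved.
reloaded = ET.parse(out_path).getroot()
print(reloaded.findtext('./book/title'))
```

Keeping the original file intact makes it easy to compare the two versions or to roll back if the modification turns out to be wrong.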
Example 2:
for description in root.iter('description'):
    # Append a note to the existing text and mark the element as updated
    new_desc = str(description.text) + ' This is an author view'
    description.text = new_desc
    description.set('updated', 'yes')
tree.write('book.xml')
Output:
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description updated="yes">An in-depth look at creating applications
with XML. This is an author view</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>The Alchemist</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description updated="yes">A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world. This is an author view</description>
</book>
The code above appends the new text to every description in book.xml and sets an updated attribute on each one. Only the first two books are shown here, but the change applies to the whole file.
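Modifying a tree can also mean adding or deleting whole elements. The sketch below uses ET.SubElement() to append a new book and remove() to delete an existing one. It assumes a one-book stand-in for book.xml, and the id bk110 and the field values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# One-book stand-in for book.xml (assumption, for brevity).
root = ET.fromstring(
    '<catalog><book id="bk101"><title>XML Developer\'s Guide</title></book></catalog>'
)

# Add a new <book>; the id and field values are invented for this example.
new_book = ET.SubElement(root, 'book', {'id': 'bk110'})
ET.SubElement(new_book, 'title').text = 'Example Title'
ET.SubElement(new_book, 'price').text = '9.95'

# Delete a book: remove() works on direct children of the element it is called on.
old_book = root.find("./book[@id='bk101']")
root.remove(old_book)

remaining_ids = [book.get('id') for book in root]
print(remaining_ids)
print(ET.tostring(root, encoding='unicode'))
```

As with the earlier examples, the changes exist only in memory until the tree is written out, for instance with ET.ElementTree(root).write().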
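Conclusion
In this tutorial, we covered several important concepts. An XML file follows a tree structure built from tags, which specify what kind of values belong where. This structure makes XML easy to read and write. Tags, written between opening and closing angle brackets, express parent-child relationships through nesting.
Attributes attach further information to a tag, for example values used for validation or boolean flags. As discussed in the tutorial, ElementTree is a powerful Python library for parsing and navigating XML documents. It breaks an XML document down into a tree structure, which gives us a simple way to work with it. We can now use this library in our own projects to parse documents.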