Extracting Attribute Values with Beautiful Soup in Python

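Beautiful Soup is a Python library for parsing HTML and XML documents. It provides several methods for searching and navigating the parse tree, which makes it easy to pull data out of a document. To extract an attribute value, we parse the HTML document and then read the attribute from the element that carries it. This article shows how to do that.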

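Steps

Follow these steps to extract an attribute value with Beautiful Soup in Python.

Parse the HTML document with the BeautifulSoup class from the bs4 library.
Locate the HTML element that carries the attribute, using a suitable Beautiful Soup method such as find() or find_all().
Check whether the attribute exists on the element, using a conditional statement or the has_attr() method.
If the attribute exists, read its value with square brackets ([]), using the attribute name as the key.
If the attribute does not exist, handle the missing value appropriately.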
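
The steps above can be sketched as one small helper. This is a minimal illustration, not a canonical implementation: the function name extract_attribute and the sample markup are hypothetical, and returning None for a missing tag or attribute is only one possible way to handle the error case.

```python
from bs4 import BeautifulSoup


def extract_attribute(html, tag_name, attr_name):
    """Return the attribute value of the first matching tag, or None."""
    soup = BeautifulSoup(html, "html.parser")  # step 1: parse the document
    element = soup.find(tag_name)              # step 2: locate the element
    if element is None:                        # the tag does not occur at all
        return None
    if element.has_attr(attr_name):            # step 3: check the attribute
        return element[attr_name]              # step 4: read it by key
    return None                                # step 5: attribute is missing


# Hypothetical sample markup used only for this sketch
sample = '<img src="logo.png" alt="Logo"><a>No link</a>'
print(extract_attribute(sample, "img", "src"))  # logo.png
print(extract_attribute(sample, "a", "href"))   # None
```

Returning None keeps the caller's code simple; raising an exception instead would be a reasonable choice when a missing attribute indicates a real problem.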

Installing Beautiful Soup

Before using Beautiful Soup, install it with pip, the Python package manager. Run the following command in a terminal or command prompt.

pip install beautifulsoup4
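
To confirm that the installation worked, a short check like the one below may help. It is a sketch that assumes the package was installed into the interpreter you are running; the test markup is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version string of the bs4 package
print(bs4.__version__)

# Parse a tiny snippet to confirm the library works end to end
soup = BeautifulSoup("<p id='check'>ok</p>", "html.parser")
print(soup.p["id"])  # check
```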

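Extracting Attribute Values

To extract an attribute value from an HTML tag, first parse the HTML document with BeautifulSoup, then use Beautiful Soup's methods to read the attribute from the specific tag.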
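
As a quick orientation, the sketch below shows how a parsed tag exposes its attributes. The markup is a made-up example. Note that Beautiful Soup treats class as a multi-valued attribute in HTML, so it comes back as a list rather than a string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this sketch
html = '<a id="home" class="nav link" href="/index.html">Home</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

# attrs is a dictionary holding every attribute of the tag
print(tag.attrs)

# A single-valued attribute is returned as a string
print(tag["id"])     # home

# A multi-valued attribute such as class is returned as a list of strings
print(tag["class"])  # ['nav', 'link']
```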

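Example 1: Extracting the href attribute with find() and square brackets

In the example below, we create an HTML document as a string and pass it to the BeautifulSoup constructor, specifying html.parser as the parser. We then call find() on the soup object to locate the 'a' tag; find() returns the first 'a' tag in the document. Finally, we use square brackets to read the href attribute, which returns its value as a string.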

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <a href="https://www.google.com">Google</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the 'a' tag
a_tag = soup.find('a')

# Extract the value of the 'href' attribute
href_value = a_tag['href']

print(href_value)

Output

https://www.google.com
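
Square brackets raise a KeyError when the attribute is absent. As an alternative, a tag's get() method returns None or a default you supply. The sketch below uses a made-up tag without an href to show both behaviors; the "#" fallback is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical tag that has a name attribute but no href
soup = BeautifulSoup('<a name="top">Top</a>', "html.parser")
a_tag = soup.find("a")

# get() returns None when the attribute is absent
print(a_tag.get("href"))       # None

# get() returns the supplied default instead, if one is given
print(a_tag.get("href", "#"))  # #

# Square brackets raise KeyError for a missing attribute
try:
    a_tag["href"]
except KeyError:
    print("href is missing")
```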

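Example 2: Finding elements that have a specific attribute with attrs

In the example below, we use find_all() to find every 'a' tag that has an href attribute. The attrs parameter specifies the attributes to match, and {'href': True} matches elements whose href attribute is present with any value.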

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <a href="https://www.google.com">Google</a>
   <a href="https://www.python.org">Python</a>
   <a>No Href</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'a' tags with an 'href' attribute
a_tags_with_href = soup.find_all('a', attrs={'href': True})
for tag in a_tags_with_href:
    print(tag['href'])

Output

https://www.google.com
https://www.python.org
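
find_all() also accepts the attribute as a keyword argument, and the filter value can be a compiled regular expression instead of True. The sketch below uses made-up links to collect href values into lists. The pattern ^https:// is only an illustrative filter.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup used only for this sketch
html_doc = """
<a href="https://www.python.org">Python</a>
<a href="/docs/local.html">Local docs</a>
<a>No Href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword form: matches the same tags as attrs={'href': True}
all_links = [tag["href"] for tag in soup.find_all("a", href=True)]

# Regular-expression filter: keep only href values that start with https://
secure_links = [
    tag["href"] for tag in soup.find_all("a", href=re.compile(r"^https://"))
]

print(all_links)     # ['https://www.python.org', '/docs/local.html']
print(secure_links)  # ['https://www.python.org']
```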

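Example 3: Finding all occurrences of an element with find_all()

Sometimes you may want to find every occurrence of an HTML element on a web page. The find_all() method does this. In the example below, we use find_all() to find all div tags with the class container. We then loop over each div tag and find the h1 and p tags inside it.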

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <div class="container">
      <h1>Heading 1</h1>
      <p>Paragraph 1</p>
   </div>
   <div class="container">
      <h1>Heading 2</h1>
      <p>Paragraph 2</p>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'div' tags with class='container'
div_tags = soup.find_all('div', class_='container')
for div in div_tags:
    h1 = div.find('h1')
    p = div.find('p')
    print(h1.text, p.text)

Output

Heading 1 Paragraph 1
Heading 2 Paragraph 2
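
The loop above assumes every container holds an h1. When a child may be missing, find() returns None, so a guard avoids an AttributeError. The sketch below also reads each div's id attribute with get(). The id values and the empty-string fallback are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second container has no h1
html_doc = """
<div class="container" id="first"><h1>Heading 1</h1><p>Paragraph 1</p></div>
<div class="container" id="second"><p>Paragraph 2</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for div in soup.find_all("div", class_="container"):
    h1 = div.find("h1")
    # find() returns None when the child is absent, so guard before reading text
    heading = h1.get_text(strip=True) if h1 is not None else ""
    rows.append((div.get("id"), heading))

print(rows)  # [('first', 'Heading 1'), ('second', '')]
```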

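Example 4: Finding elements by CSS selector with select()

In the example below, we use the select() method to find all h1 tags inside div tags with the class container. The CSS selector 'div.container h1' does this: the dot (.) denotes a class name, and the space denotes the descendant combinator.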

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <div class="container">
      <h1>Heading 1</h1>
      <p>Paragraph 1</p>
   </div>
   <div class="container">
      <h1>Heading 2</h1>
      <p>Paragraph 2</p>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'h1' tags inside a 'div' tag with class='container'
h1_tags = soup.select('div.container h1')
for h1 in h1_tags:
    print(h1.text)
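
CSS selectors can also match on attributes, which ties select() back to attribute extraction. The sketch below uses made-up markup and standard CSS attribute selectors. select_one() returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this sketch
html_doc = """
<div class="container">
    <a href="https://www.python.org" title="Python home">Python</a>
    <a href="/about.html">About</a>
    <a>No Href</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# a[href] matches any 'a' tag that has an href attribute
hrefs = [a["href"] for a in soup.select("div.container a[href]")]

# a[href^="https"] matches href values that start with "https"
first_secure = soup.select_one('a[href^="https"]')

print(hrefs)                  # ['https://www.python.org', '/about.html']
print(first_secure["title"])  # Python home
```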