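How do you use the BeautifulSoup package to parse web page data in Python?

Data acquisition is an essential part of modern data science, and most websites deliver their data as HTML. To use that data we first have to parse it, which can take considerable time and effort.
The BeautifulSoup package for Python makes this much easier. Whether you want to scrape a small set of data or data from many sites, the library is a standard tool for web scraping in Python.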

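Overview of BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It parses a document into a tree, which makes it easy to navigate the document structure and locate elements. That makes it well suited to web scraping.

Below is a simple example that uses BeautifulSoup to get the HTML content of a URL: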
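
To make the idea of a navigable tree concrete, here is a minimal sketch. The sample markup and the variable names are assumptions made for illustration only; the attribute-style access (soup.title, .parent, .children) is standard Beautiful Soup navigation.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = (
    "<html><head><title>Page Title</title></head>"
    "<body><h2>A Header</h2><p>Some text.</p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns the first matching tag in the tree.
title_tag = soup.title

# Every tag knows its parent, so we can walk upward.
parent_name = title_tag.parent.name

# We can also walk downward through nested tags.
header_text = soup.body.h2.string

# .children iterates over the direct children of a tag.
child_names = [child.name for child in soup.body.children]

print(title_tag.string)
print(parent_name)
print(header_text)
print(child_names)
```

Because the sample markup contains no whitespace between tags, the children of body are only the two tags. Real pages usually also yield whitespace strings between tags.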

from bs4 import BeautifulSoup
import requests

# Fetch the HTML content to parse from the URL
url = 'https://en.wikipedia.org/wiki/Web_scraping'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

print(soup.prettify())

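The code above imports the requests and BeautifulSoup packages and sets the URL to parse to 'https://en.wikipedia.org/wiki/Web_scraping'. The call to requests.get() fetches the URL and stores the response in the reqs variable. We then create a soup object from the HTML content in reqs.text.

How Beautiful Soup is used

The library has four kinds of objects:

BeautifulSoup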
Tag
NavigableString
Comment

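BeautifulSoup object:

This is the entry-point object and represents the whole document. When creating it, you pass in the markup to parse and the name of the parser to use:

soup = BeautifulSoup(markup, 'html.parser')

Once a parser is specified, it processes all of the markup it encounters.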
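
As a small illustration of the entry-point object, the sketch below parses a deliberately malformed fragment. The markup is a made-up example; the behavior shown relies on the built-in html.parser backend, and other parsers may repair broken markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately malformed markup: the <p> tag is never closed.
markup = "<p>Unclosed paragraph"

soup = BeautifulSoup(markup, "html.parser")

# The BeautifulSoup object itself acts as the root of the tree
# and carries a special name.
print(soup.name)

# Serializing the tree shows that the open tag was closed for us.
print(str(soup))
```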

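Tag object:

Every element in an HTML document is a tag, represented by a Tag object. Through it we can access the tag's name, attributes, and other properties.

For example, here is an HTML fragment: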

<html>
   <head>
      <title>Page Title</title>
   </head>
   <body>
      <h2>A Header</h2>
      <p class="sample">A paragraph with <a href="http://example.com">a link</a>.</p>
      <p class="sample">Another paragraph.</p>
   </body>
</html>

Now we parse this HTML with BeautifulSoup. Here is the code that parses the HTML above and prints it:

from bs4 import BeautifulSoup

html_doc = """
<html>
   <head>
      <title>Page Title</title>
   </head>
   <body>
      <h2>A Header</h2>
      <p class="sample">A paragraph with <a href="http://example.com">a link</a>.</p>
      <p class="sample">Another paragraph.</p>
   </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

The result is:

<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h2>
   A Header
  </h2>
  <p class="sample">
   A paragraph with
   <a href="http://example.com">
    a link
   </a>
   .
  </p>
  <p class="sample">
   Another paragraph.
  </p>
 </body>
</html>

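As you can see, this snippet parses the HTML document into a tree of tags, and extracting text from it is straightforward. Each tag is represented as a Tag object, which we can use to access the tag's name, attributes, and other properties. For example, the following code prints the text content of every paragraph tag in the HTML: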

from bs4 import BeautifulSoup

html_doc = """
<html>
   <head>
      <title>Page Title</title>
   </head>
   <body>
      <h2>A Header</h2>
      <p class="sample">A paragraph with <a href="http://example.com">a link</a>.</p>
      <p class="sample">Another paragraph.</p>
   </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

for p in soup.find_all('p'):
    print(p.text)

This code prints the following:

A paragraph with a link.
Another paragraph.
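
The examples above only read the text of each tag. As a complementary sketch, the following shows how the name and attributes of a Tag object can be read. It reuses one paragraph from the sample document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="sample">A paragraph with <a href="http://example.com">a link</a>.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

p_tag = soup.find("p")
a_tag = p_tag.find("a")

# The tag name is available as .name.
print(p_tag.name)

# Attributes are read like dictionary keys. "class" can hold several
# values, so Beautiful Soup returns it as a list.
print(p_tag["class"])
print(a_tag["href"])

# .get() returns None instead of raising KeyError for a missing attribute.
print(a_tag.get("title"))

# .attrs exposes all attributes of the tag as a dictionary.
print(a_tag.attrs)
```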

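NavigableString object:

This object holds the text inside a tag. After locating a tag with the find / find_all functions of the BeautifulSoup object (or with other functions), we can access the text it contains, for example through the tag's .string attribute.

For example, suppose we want to print "Some text" from the following HTML:

<p><b>Some text</b></p>

To extract the text between the <p> tags and print it, we can use code like the following, shown here on a larger sample document: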

from bs4 import BeautifulSoup

html_doc = """
<html>
   <head>
      <title>Page Title</title>
   </head>
   <body>
      <h2>A Header</h2>
      <p class="sample"><b>A paragraph with</b> <a href="http://example.com">a link</a>.</p>
      <p class="sample">Another paragraph.</p>
   </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

p_tag = soup.find('p')

if p_tag:
    text_content = p_tag.text
    print(text_content)
else:
    print("No 'p' tags found.")

This code prints the following:

A paragraph with a link.
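
The code above reads the text through .text, which returns an ordinary Python string. To see the NavigableString object itself, a minimal sketch (variable names are illustrative) can use the .string attribute instead:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p><b>Some text</b></p>", "html.parser")

# .string returns the NavigableString held by a tag with a single text child.
b_string = soup.find("b").string
print(type(b_string).__name__)
print(isinstance(b_string, NavigableString))

# Convert to a plain str when the text should outlive the parse tree.
plain = str(b_string)
print(plain)

# When a tag has several children, .string is None; get_text() joins
# all the text beneath the tag instead.
mixed = BeautifulSoup("<p>A paragraph with <a>a link</a>.</p>", "html.parser").p
print(mixed.string)
print(mixed.get_text())
```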

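Comment object:

This object represents an HTML comment. If the original HTML contains comments, they are parsed into objects of this type.

For example, the following part of an HTML file is a comment:

<!-- This is a comment. -->

To access the text of that comment, we can use the following code: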

from bs4 import BeautifulSoup, Comment

html_doc = """
<html>
   <head>
      <title>Page Title</title>
   </head>
   <body>
      <h2>A Header</h2>
      <p class="sample"><b>A paragraph with</b> <a href="http://example.com">a link</a>.</p>
      <p class="sample">Another paragraph.</p>
      <!-- This is a comment. -->
   </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

comment = soup.find(string=lambda text: isinstance(text, Comment))

if comment:
    print(comment)
else:
    print("No comments found.")

This code prints the following:

This is a comment.
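
A page often contains more than one comment, and scrapers frequently want to drop them before extracting text. The following is a minimal sketch under those assumptions; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup containing one comment.
html_doc = "<body><p>Visible text.</p><!-- This is a comment. --></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# A comment keeps the whitespace inside its delimiters, so strip it.
cleaned = [c.strip() for c in comments]
print(cleaned)

# extract() removes each comment from the tree.
for c in comments:
    c.extract()

print(str(soup))
```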

Parsing data with Beautiful Soup

Now that we have covered the Beautiful Soup library, let's look further at how to parse data with it, taking the page https://www.pythonscraping.com/pages/page3.html as an example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, 'html.parser')

for child in soup.find("table", {"id":"giftList"}).children:
    print(child)

This code prints each child of the table, producing output like the following:

<tr id="gift1" class="gift">
<td class="thumb"><img src="../img/gifts/img1.jpg"/></td>
<td class="descrip">
<h2>A Concordance of One's Life</h2>
<span class="rating">Three Stars</span>
<span class="category">Books</span>
<span class="price"> $29.99</span>
<p>
A look at the fictional diaries of a woman living in a small town in England in the late 1800s.
</p>
<div>
    <a href="http://www.pythonscraping.com" class="btn btn-primary" role="button">
        Buy Now!
    </a>
</div>
</td>
</tr>
<tr id="gift2" class="gift">
<td class="thumb"><img src="../img/gifts/img2.jpg"/></td>
<td class="descrip">
<h2>The Prynce in the Kitchen</h2>
<span class="rating">Four Stars</span>
<span class="category">Kitchen</span>
<span class="price">$ 19.99</span>
<p>
Everyone's favorite royal now has his own coloring book! This fun book will provide 
hours of entertainment and stress relief for princesses and princes everywhere.
</p>
<div>
    <a href="http://www.pythonscraping.com" class="btn btn-primary" role="button">
        Buy Now!
    </a>
</div>
</td>
</tr>
<tr id="gift3" class="gift">
<td class="thumb"><img src="../img/gifts/img3.jpg"/></td>
<td class="descrip">
<h2>How to Sharpen Pencils</h2>
<span class="rating">Three Stars</span>
<span class="category">Books</span>
<span class="price"> $6.99</span>
<p>
Learn how to sharpen pencils with this exciting new book! From the same author
who brought you <em>How to Microwave Water for Tea</em>, this exciting new book
is sure to change your life in ways you never thought possible.
</p>
<div>
    <a href="http://www.pythonscraping.com" class="btn btn-primary" role="button">
        Buy Now!
    </a>
</div>
</td>
</tr>
<tr id="gift4" class="gift">
<td class="thumb"><img src="../img/gifts/img4.jpg"/></td>
<td class="descrip">
<h2>Knitting for Gold</h2>
<span class="rating">Four Stars</span>
<span class="category">Crafts</span>
<span class="price">$ 15.50</span>
<p>
Want to knit your own gold medal? This is the book for you! An exhaustive guide to 
helping anyone understand the Olympic Games' most obscure sport.
</p>
<div>
    <a href="http://www.pythonscraping.com" class="btn btn-primary" role="button">
        Buy Now!
    </a>
</div>
</td>
</tr>
<tr id="gift5" class="gift">
<td class="thumb"><img src="../img/gifts/img3.jpg"/></td>
<td class="descrip">
<h2>How to Sharpen Pencils</h2>
<span class="rating">Three Stars</span>
<span class="category">Books</span>
<span class="price">$6.99</span>
<p>
Learn how to sharpen pencils with this exciting new book! From the same author 
who brought you <em>How to Microwave Water for Tea</em>, this exciting new book
is sure to change your life in ways you never thought possible.
</p>
<div>
    <a href="http://www.pythonscraping.com" class="btn btn-primary" role="button">
        Buy Now!
    </a>
</div>
</td>
</tr>

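The core of the code above is the find() method of the BeautifulSoup object, which we use to look up the table tag in the HTML by passing an id attribute value. In this example the table has the unique id "giftList". We can also look up tags by class name or other attributes.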

for child in soup.find("table", {"id":"giftList"}).children:
    print(child)

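In the code above, the children attribute of the table yields its direct children: each row (tr), along with any whitespace strings between the rows. To reach the individual cells (td or th) and their text content, we iterate over the children of each row in turn.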
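
To show the row and cell traversal without depending on the live page, here is a minimal offline sketch. The table contents are invented placeholders (only the id "giftList" and the class "gift" mirror the example page), and the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the real page; the item names and prices are made up.
html_doc = """
<table id="giftList">
  <tr><th>Item</th><th>Price</th></tr>
  <tr id="gift1" class="gift"><td>Sample Item A</td><td>$1.00</td></tr>
  <tr id="gift2" class="gift"><td>Sample Item B</td><td>$2.50</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table", {"id": "giftList"})

rows = []
for child in table.children:
    # .children also yields the whitespace strings between rows; skip them.
    if not isinstance(child, Tag):
        continue
    # Collect the text of every cell (td or th) in this row.
    cells = [cell.get_text(strip=True) for cell in child.find_all(["td", "th"])]
    rows.append(cells)

# Looking up rows by class name instead of walking the children.
gift_ids = [row["id"] for row in table.find_all("tr", class_="gift")]

print(rows)
print(gift_ids)
```

Filtering on Tag keeps the loop robust to the newline strings that appear between rows in most real markup. The class-based lookup is an alternative when only the data rows are of interest.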

We can also use other methods to find tags, such as find_all(). The following example uses it to find all of the price tags:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, 'html.parser')

price_list = soup.find_all("span", {"class":"price"})
for price in price_list:
    print(price.text)

This code prints the following: