BeautifulSoup – Installation

Because BeautifulSoup is not part of the Python standard library, we need to install it first. We will install the BeautifulSoup 4 library (also known as BS4), which is the latest version.

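To isolate our working environment so that it does not interfere with the existing setup, let us first create a virtual environment.

Creating a Python virtual environment (optional)

A virtual environment lets us create an isolated working copy of Python for a specific project without affecting anything outside it.

The best way to install any Python package is with pip. If pip is not installed yet (you can check by running "pip --version" at your command or shell prompt), you can install it with the following command: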

Linux environment

$sudo apt-get install python3-pip

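Windows environment

To install pip on Windows, do the following:

Download get-pip.py from https://bootstrap.pypa.io/get-pip.py or from GitHub to your computer.
Open a command prompt and navigate to the folder that contains get-pip.py.
Run the following command: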

>python get-pip.py

That is all; pip is now installed on your Windows machine.

You can verify your pip installation by running the following command:

>pip --version
pip 19.2.3 from c:\users\yadur\appdata\local\programs\python\python37\lib\site-packages\pip (python 3.7)
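The same check can also be made from inside Python. The following is a minimal sketch that uses only the standard library; the helper name pip_available is hypothetical and is not part of pip itself.

```python
import importlib.util


def pip_available() -> bool:
    """Return True if this interpreter can locate the pip package."""
    # find_spec returns None when the module cannot be found.
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("pip is available" if pip_available() else "pip is not installed")
```

This only reports whether pip is importable by the interpreter that runs the script, which may differ from the pip found on your PATH.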

Installing virtualenv

Run the following command at your command prompt:

>pip install virtualenv

The following command creates a virtual environment ("myEnv") in your current directory:

>virtualenv myEnv
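As an alternative to the virtualenv package, Python 3 ships a venv module in its standard library that does a similar job. The sketch below is not the method used in this tutorial; it assumes Python 3, and the helper name create_env is hypothetical.

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment in env_dir and return its path."""
    # venv.create builds the directory layout and writes pyvenv.cfg;
    # with_pip=True additionally bootstraps pip into the new environment.
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir)
```

Calling create_env("myEnv") should be roughly equivalent to running "python -m venv myEnv" in the current directory.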

To activate your virtual environment, run the following command:

>myEnv\Scripts\activate

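After activation, the prompt is prefixed with "(myEnv)", which tells us that we are working inside the virtual environment "myEnv".

To leave the virtual environment, run deactivate.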

(myEnv) C:\Users\yadur>deactivate
C:\Users\yadur>
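Besides looking at the prompt prefix, a script can check whether it is running inside a virtual environment. This is a minimal sketch using only the sys module; the function name in_virtual_env is hypothetical.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("inside a virtual environment:", in_virtual_env())
```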

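Now that our virtual environment is ready, let us install BeautifulSoup.

Installing BeautifulSoup

Because BeautifulSoup is not a standard library, we need to install it. We will use the BeautifulSoup 4 package (known as bs4).

Linux environment

To install bs4 on Debian or Ubuntu Linux with the system package manager, run one of the following commands: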

$sudo apt-get install python-bs4     # for Python 2.x
$sudo apt-get install python3-bs4    # for Python 3.x

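If installing through the system package manager causes problems, you can install bs4 with easy_install or pip instead. Note that easy_install has been deprecated by setuptools, so pip is usually the better choice.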

$easy_install beautifulsoup4
$pip install beautifulsoup4

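(If you are using Python 3, you may need to use easy_install3 or pip3 respectively.)

Windows environment

Installing beautifulsoup4 on Windows is very simple, especially if you already have pip installed.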

>pip install beautifulsoup4
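To confirm that the installation worked, you can try importing the package. The following is a minimal sketch; the helper name bs4_version is hypothetical, while the bs4 module name and its __version__ attribute come from the package itself.

```python
def bs4_version():
    """Return the installed bs4 version string, or None if bs4 is missing."""
    try:
        import bs4
    except ImportError:
        # The package is not installed for this interpreter.
        return None
    return bs4.__version__


print("bs4 version:", bs4_version())
```

If this prints None inside an activated virtual environment, the package was probably installed for a different interpreter.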

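beautifulsoup4 is now installed on our machine. Let us look at some problems that may come up after installation.

Problems after installation

On a Windows machine, you may run into errors caused by the wrong version being installed, mainly the following:

Error: ImportError "No module named HTMLParser" means that you are running the Python 2 version of the code under Python 3.
Error: ImportError "No module named html.parser" means that you are running the Python 3 version of the code under Python 2.

The best way out of both situations is to completely remove the existing installation and then reinstall BeautifulSoup.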
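The two module names in these errors reflect where the standard-library HTML parser lives in each major Python version. The sketch below only illustrates that difference; it is not code taken from BeautifulSoup.

```python
try:
    # Python 3 location of the standard-library HTML parser.
    from html.parser import HTMLParser
except ImportError:
    # Python 2 location; only reached on a Python 2 interpreter.
    from HTMLParser import HTMLParser

# Shows which module the class was actually loaded from.
print(HTMLParser.__module__)
```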

If you get SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python's 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4

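Installing a parser

By default, BeautifulSoup supports the HTML parser included in Python's standard library. It also supports several third-party Python parsers, such as lxml and html5lib.

To install the lxml or html5lib parser, use the following commands:

Linux environment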

$apt-get install python-lxml
$apt-get install python-html5lib

Windows environment

>pip install lxml
>pip install html5lib

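Generally, users choose lxml for speed. If you are using an older version of Python 2 (before 2.7.3) or Python 3 (before 3.2.2), lxml or html5lib is recommended, because Python's built-in HTML parser is not very good in those older versions.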
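One way to apply this advice is to pick whichever parser is installed, preferring lxml. The following is a minimal sketch using only the standard library; the helper name pick_parser is hypothetical, while the strings it returns are the parser names that BeautifulSoup accepts.

```python
import importlib.util


def pick_parser() -> str:
    """Return the name of the preferred parser that is actually installed."""
    # Check the third-party parsers in order of preference.
    for module_name in ("lxml", "html5lib"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    # Fall back to the parser from Python's standard library.
    return "html.parser"


print("parser to use:", pick_parser())
```

The result can be passed as the second argument when building a soup, for example BeautifulSoup(markup, pick_parser()). Keep in mind that different parsers may build slightly different trees from invalid HTML.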

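Running BeautifulSoup

It is now time to test our Beautiful Soup package on an HTML web page and extract some information from it. We use https://www.tutorialspoint.com/index.htm here, but you can choose any other page you like.

In the code below, we try to extract the title from the web page: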

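from bs4 import BeautifulSoup
import requests

# Fetch the page; any other URL can be used here.
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

# Parse the HTML with Python's built-in parser and print the <title> tag.
soup = BeautifulSoup(req.text, "html.parser")
print(soup.title)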

Output

<title>H2O, Colab, Theano, Flutter, KNime, Mean.js, Weka, Solidity, Org.Json, AWS QuickSight, JSON.Simple, Jackson Annotations, Passay, Boon, MuleSoft, Nagios, Matplotlib, Java NIO, PyTorch, SLF4J, Parallax Scrolling, Java Cryptography</title>
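As the output shows, soup.title returns the whole tag, including the markup. If only the text is wanted, or if the page may have no title at all, a small helper can cover both cases. This is a sketch that parses an inline HTML string, so it needs no network access; the helper name page_title and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def page_title(markup):
    """Return the text of the <title> tag, or None if the page has no title."""
    soup = BeautifulSoup(markup, "html.parser")
    if soup.title is None:
        return None
    # get_text drops the surrounding tag and keeps only its text.
    return soup.title.get_text(strip=True)


html = """<html><head><title>Example page</title></head>
<body><p>Hello</p></body></html>"""

print(page_title(html))
```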

A common task is to extract all the URLs from a web page. To do that, we only need to add the following lines of code:

for link in soup.find_all('a'):
    print(link.get('href'))

Output

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.php
https://store.tutorialspoint.com
https://www.tutorialspoint.com/gate_exams_tutorials.htm
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorix.com
https://www.tutorialspoint.com/videotutorials/top-courses.php
https://www.tutorialspoint.com/the_full_stack_web_development/index.asp
….
….
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
http://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
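The last entries in the output are relative paths such as /about/faq.htm. If absolute URLs are needed, they can be resolved against the page address with urljoin from the standard library. The following is a minimal sketch on an inline HTML sample; the helper name absolute_links and the sample markup are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(markup, base_url):
    """Return every href in markup, resolved against base_url."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True skips <a> tags that have no href attribute.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


sample = """
<a href="https://store.tutorialspoint.com">Store</a>
<a href="/about/faq.htm">FAQ</a>
<a name="top">no href</a>
"""

for url in absolute_links(sample, "https://www.tutorialspoint.com/index.htm"):
    print(url)
```

Filtering with href=True also avoids printing None for anchors without an href, which link.get('href') would return.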

In the same way, we can extract other useful information with beautifulsoup4.

Now let us take a closer look at the "soup" in the example above.