Downloading PDFs in Python with Requests and BeautifulSoup

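Requests and BeautifulSoup are Python libraries that, used together, can download files such as PDFs from the web. Requests sends HTTP requests and receives the responses. BeautifulSoup parses the HTML in a response so that the links to downloadable PDFs can be extracted. This article shows how to use Requests and BeautifulSoup in Python to download a PDF.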

Installing the dependencies

Before using BeautifulSoup and Requests in Python, install both libraries with pip. Run the following commands in a terminal.

pip install requests
pip install beautifulsoup4

Downloading a PDF file with Requests and BeautifulSoup

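To download a PDF file from the internet, first use Requests to fetch the web page that links to the PDF. BeautifulSoup then parses the HTML response and extracts the link to the PDF file. Because that link is often relative, it is combined with the URL of the page to obtain the full URL of the PDF file. Finally, a GET request sent to that URL with requests.get downloads the file.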
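
The link-extraction and URL-joining steps described above can be tried without any network access. The snippet below is a minimal sketch, not a definitive implementation: the helper name find_pdf_links, the sample HTML, and the example.com URLs are assumptions made for illustration. It uses urljoin from the standard library's urllib.parse to resolve relative links against the page URL.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def find_pdf_links(html, base_url):
    """Return absolute URLs for every <a> whose href path ends in .pdf."""
    soup = BeautifulSoup(html, 'html.parser')
    pdf_urls = []
    for anchor in soup.find_all('a', href=True):
        # Resolve relative hrefs against the URL of the page
        absolute = urljoin(base_url, anchor['href'])
        # Compare only the path, so query strings such as ?v=2 are ignored
        if urlparse(absolute).path.lower().endswith('.pdf'):
            pdf_urls.append(absolute)
    return pdf_urls


# Hypothetical page content used only for illustration
sample_html = '''
<a href="/about.html">About</a>
<a href="files/report.pdf">Report</a>
<a href="https://cdn.example.com/guide.PDF?v=2">Guide</a>
'''

for pdf_url in find_pdf_links(sample_html, 'https://example.com/docs/'):
    print(pdf_url)
```

Filtering on the .pdf extension is a heuristic. Some sites serve PDFs from URLs without that extension, in which case the Content-Type header of the response would be a more reliable signal.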

Example

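In the code below, replace the placeholder "https://example.com/" with the valid URL of a page that contains a link to a PDF file.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Step 1: Fetch the page that links to the PDF (placeholder URL)
page_url = 'https://example.com/'
response = requests.get(page_url, timeout=30)

if response.status_code == 200:
    # Step 2: Parse the HTML and find the first link whose href ends in .pdf
    soup = BeautifulSoup(response.text, 'html.parser')
    link = soup.find('a', href=lambda h: h and h.lower().endswith('.pdf'))

    if link is None:
        print('Error: no PDF link found on the page.')
    else:
        # Step 3: Resolve the link against the page URL and download the PDF
        pdf_url = urljoin(page_url, link['href'])
        pdf_response = requests.get(pdf_url, timeout=30)

        if pdf_response.status_code == 200:
            with open('document.pdf', 'wb') as f:
                f.write(pdf_response.content)
            print('PDF downloaded successfully.')
        else:
            print('Error:', pdf_response.status_code)
else:
    print('Error:', response.status_code)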
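
For large PDFs, reading the whole response into memory through .content may be wasteful. One possible variant, sketched below on the assumption that the PDF URL is already known, streams the body to disk in chunks using stream=True and iter_content from Requests. The function names download_pdf and filename_from_url, the chunk size, and the fallback filename are hypothetical choices for this sketch.

```python
import os
from urllib.parse import urlparse

import requests


def filename_from_url(pdf_url, default='document.pdf'):
    """Derive a local filename from the last path segment of the URL."""
    name = os.path.basename(urlparse(pdf_url).path)
    return name or default


def download_pdf(pdf_url, chunk_size=8192):
    """Stream the PDF to disk in chunks and return the saved filename."""
    filename = filename_from_url(pdf_url)
    with requests.get(pdf_url, stream=True, timeout=30) as response:
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return filename
```

Calling download_pdf('https://example.com/files/report.pdf') would save the file as report.pdf in the current directory, provided the URL is reachable and returns a successful response.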