Python: How to Convert Scrapy Items to JSON

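Web scraping is the process of extracting data from websites. It involves parsing HTML or XML and pulling out the relevant information. Scrapy is a popular Python web-scraping framework for building crawlers that extract structured data from websites and store it in a variety of formats.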

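A key feature of Scrapy is the ability to parse and store data using custom Item classes. An Item class defines the structure of the data to be extracted from a website: each class declares a set of fields, one for each piece of data to be extracted. Once the data has been extracted, it is stored in an instance of the Item class.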

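After extracting the data and storing it in Item instances, you may need to export it in some format for further analysis or storage. JSON is a popular data format that is both human-readable and easy to process programmatically. It is a lightweight, text-based format that is widely used for data exchange in web applications and APIs, and most programming languages support it.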
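To make the format concrete, here is a small sketch using only Python's standard json module. The record contents are made up for illustration.

```python
import json

# A made-up record shaped like the items scraped later in this article.
record = {"title": "Example Domain", "description": None}

# Serialize to JSON text; Python's None becomes JSON null.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into Python objects.
restored = json.loads(text)
```

The round trip returns an equal dictionary, which is what makes JSON convenient for passing scraped data between programs.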

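Converting Scrapy Item instances to JSON is a common requirement when building a web scraper. Scrapy provides built-in tools for this, and there are also external libraries that offer additional functionality for working with JSON data in Python. In this article, we explore how to convert Scrapy Item instances to JSON using both Scrapy's built-in tools and external libraries. We also discuss some best practices and common pitfalls to avoid when working with JSON data in Python.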

There are several ways to convert Scrapy items to JSON.

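Method 1: Using Scrapy's built-in JSON exporter

Scrapy provides a built-in JSON exporter that converts Scrapy Item instances to JSON. You can use the scrapy.exporters.JsonItemExporter class to export items to a JSON file.

Consider the following code example.

Example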

import scrapy
from scrapy.exporters import JsonItemExporter


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The stats collector only stores counters, so keep the scraped items here.
        self.items = []

    def parse(self, response):
        item = {
            'title': response.css('title::text').get(),
            'description': response.css('meta[name="description"]::attr(content)').get(),
        }
        self.items.append(item)
        yield item

    def closed(self, reason):
        # Called once when the spider finishes; export everything collected.
        filename = 'data.json'
        # JsonItemExporter writes bytes, so the file is opened in binary mode.
        with open(filename, 'wb') as file:
            exporter = JsonItemExporter(file)
            exporter.start_exporting()
            for item in self.items:
                exporter.export_item(item)
            exporter.finish_exporting()
        self.log(f'Saved file {filename}, containing {len(self.items)} items')

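Explanation

Import the required modules: scrapy for building the spider and JsonItemExporter for exporting items as JSON.
Define a spider named MySpider that uses CSS selectors to extract the page title and description and stores them in a dictionary named item.
Yield the item dictionary to Scrapy. Scrapy accepts plain dictionaries as items, so no conversion to a scrapy.Item instance is needed.
When the spider finishes crawling, Scrapy calls its closed method. In that method, we take the items collected during the crawl and save them to a JSON file with JsonItemExporter.
When you run the spider, it extracts the title and description from the website and saves the result to a JSON file named data.json.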

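Output

[{"title": "Example Domain", "description": "Example Domain. This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."}]

The description value comes from the page's meta description tag. If the page has no such tag, .get() returns None and the field is written as null.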

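Method 2: Using Python's built-in json module

The standard-library json module can serialize the collected items directly, provided they are plain dictionaries.

Example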

import json

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The stats collector only stores counters, so keep the scraped items here.
        self.items = []

    def parse(self, response):
        item = {
            'title': response.css('title::text').get(),
            'description': response.css('meta[name="description"]::attr(content)').get(),
        }
        self.items.append(item)
        yield item

    def closed(self, reason):
        filename = 'data.json'
        # json.dump writes text, so the file is opened in text mode.
        with open(filename, 'w', encoding='utf-8') as file:
            json.dump(self.items, file, indent=4)
        self.log(f'Saved file {filename}, containing {len(self.items)} items')

Output

[
    {
        "title": "Example Domain",
        "description": "Example Domain. This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
    }
]