Example: Fetching Data with POST Requests in Scrapy-Redis


Source: 中文源码网 | Views: 352 | Date: 2024-04-27 06:08:45


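Preface
When scraping a small site, Scrapy on its own is usually enough.
For larger sites, though, a single Scrapy process starts to struggle.
It would be great if several Scrapy instances could crawl the same site together; many hands make light work.
Unfortunately, Scrapy does not support several instances sharing one crawl out of the box. The official suggestion is:
**split the site into several parts and hand each part to a different Scrapy instance**
That does solve the problem, but partitioning a site is tedious.
This is where Scrapy-Redis, the star of this article, comes in.
If you are reading this, you probably already know what Scrapy and Scrapy-Redis are, so the basic concepts are skipped here. By default, Scrapy-Redis fetches data with GET requests. When a site requires POST requests, you only need to override the make_request_from_data method. Oddly, I could not find a concise answer online; perhaps it is too simple.
This article uses httpbin.org as the example. First, add the required configuration to settings.py, adjusting it to your own environment: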
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use the Redis-backed scheduler to store the request queue
SCHEDULER_PERSIST = True  # do not clear the Redis queue, so the crawl can be paused and resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # make all spiders deduplicate requests through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = "redis://127.0.0.1:6379"
The spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider


class HpbSpider(RedisSpider):
    name = 'hpb'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.

        """
        # Send the Redis message as the "data" form field of a POST request.
        return scrapy.FormRequest("http://www.httpbin.org/post",
                                  formdata={"data": data},
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)
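For simplicity the response is printed directly; in real use you could combine this with an item pipeline to write the data to a database.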
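As an illustration of the pipeline idea, here is a minimal sketch of an item pipeline that writes to SQLite. It is not from the original article: the class name SQLitePipeline, the table name responses, the default file name responses.db, and the assumption that each item is a dict with "url" and "body" keys are all hypothetical. Only the open_spider, process_item, and close_spider hook names come from Scrapy's pipeline interface.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that stores each item in a SQLite table."""

    def __init__(self, db_path="responses.db"):
        # db_path is an assumed default, not something the article specifies.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (url TEXT, body TEXT)"
        )

    def process_item(self, item, spider):
        # One row per item; the item is assumed to be a dict
        # with "url" and "body" keys.
        self.conn.execute(
            "INSERT INTO responses (url, body) VALUES (?, ?)",
            (item["url"], item["body"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.conn.close()
```

To use something like this, parse would yield {"url": response.url, "body": response.text} instead of printing, and the class would be registered under ITEM_PIPELINES in settings.py with whatever module path your project uses.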
Then start the spider with scrapy crawl hpb. Because nothing has been written to test_post_data yet, the spider waits after starting. Next, simulate writing data to the queue:
import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for i in range(1000):
    rd.lpush('test_post_data', i)
At this point you can see that the spider has started fetching data:
2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.httpbin.org/post> (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "0"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "1"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "3"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "2"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "4"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "5"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "6"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "7"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "www.httpbin.org", \n "User-Agent": "Scrapy/1.5.1 (+http://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "http://www.httpbin.org/post"\n}\n'
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
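The example pushes bare integers, which end up in a single form field. If each request needs several form fields, one possible approach is to push a JSON object per request and decode it inside make_request_from_data. The sketch below is an assumption, not part of the original article: the helper names encode_payload, decode_payload, and enqueue_payloads are hypothetical, and client stands for any object with an lpush(name, *values) method, such as a redis.Redis instance.

```python
import json


def encode_payload(fields):
    """Serialize a dict of form fields into a JSON string for Redis."""
    return json.dumps(fields, sort_keys=True)


def decode_payload(data):
    """Turn a Redis message (bytes) back into a dict of string form fields."""
    fields = json.loads(data.decode("utf-8"))
    # Form values are sent as text, so stringify keys and values.
    return {str(k): str(v) for k, v in fields.items()}


def enqueue_payloads(client, key, payloads):
    """Push each payload onto the Redis list `key`.

    `client` is anything with an lpush(name, *values) method,
    for example a redis.Redis instance.
    """
    for fields in payloads:
        client.lpush(key, encode_payload(fields))
```

In the spider, make_request_from_data could then pass formdata=decode_payload(data) to scrapy.FormRequest instead of wrapping the raw message in a single "data" field.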
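As for duplicate data: if the POST data is identical to that of an earlier request, the request is filtered out and never sent. If, in some special case, POSTing the same data can return different results, I found that adding dont_filter=True did not help: the RFPDupeFilter class itself does not look at this parameter, so it needs to be overridden. (In scrapy-redis the dont_filter flag is checked by the scheduler's enqueue_request before the filter is consulted, so the behavior may depend on the installed version.)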
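One way to override the filter is a small mixin that skips the duplicate check for requests marked dont_filter=True. This is a minimal sketch under assumptions, not the author's code: the names HonorDontFilterMixin and MyDupeFilter and the module path in the comment are hypothetical. The only API it relies on is that the filter exposes request_seen(request), which RFPDupeFilter does.

```python
class HonorDontFilterMixin:
    """Skip the duplicate check for requests marked dont_filter=True."""

    def request_seen(self, request):
        if getattr(request, "dont_filter", False):
            # Report "not seen" so the request is always scheduled.
            return False
        # Otherwise fall back to the wrapped filter's normal check.
        return super().request_seen(request)


# Assumed usage inside a Scrapy project (not executed here):
#
#   from scrapy_redis.dupefilter import RFPDupeFilter
#
#   class MyDupeFilter(HonorDontFilterMixin, RFPDupeFilter):
#       pass
#
# and in settings.py (the module path is hypothetical):
#
#   DUPEFILTER_CLASS = "myproject.dupefilter.MyDupeFilter"
```

Requests that bypass the check this way are not recorded in the Redis set, so a later identical request without dont_filter=True is still treated as new.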
Summary
That is all for this article. I hope it serves as a useful reference for your study or work. Thank you for supporting 中文源码网.
