Source: 中文源码网 · Views: 283 · Date: 2024-04-25 06:45:24
Example: implementing a web crawler in Python with RabbitMQ
Write tasks.py:

from celery import Celery
from tornado.httpclient import HTTPClient, HTTPError

app = Celery('tasks')
app.config_from_object('celeryconfig')

@app.task
def get_html(url):
    # Fetch the page body inside a Celery worker; return None on HTTP errors.
    http_client = HTTPClient()
    try:
        response = http_client.fetch(url, follow_redirects=True)
        return response.body
    except HTTPError:
        return None
    finally:
        # Always release the client, whether the fetch succeeded or failed.
        http_client.close()
Write celeryconfig.py:

CELERY_IMPORTS = ('tasks',)
BROKER_URL = 'amqp://guest@localhost:5672//'
CELERY_RESULT_BACKEND = 'amqp://'
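Before the spider is run, a RabbitMQ broker must be listening at the BROKER_URL above, and at least one Celery worker must be consuming tasks. Assuming tasks.py and celeryconfig.py sit in the current directory, a worker can be started with:

```shell
# run from the directory containing tasks.py and celeryconfig.py
celery -A tasks worker --loglevel=info
```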
Write spider.py:

from tasks import get_html
from queue import Queue
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import threading

class Spider(object):
    def __init__(self):
        self.visited = {}
        self.queue = Queue()

    def process_html(self, html):
        pass  # print(html)

    def _add_links_to_queue(self, url_base, html):
        # Queue every <a href> on the page; resolve relative links against url_base.
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            try:
                url = link['href']
            except KeyError:
                continue
            url_com = urlparse(url)
            if not url_com.netloc:
                self.queue.put(urljoin(url_base, url))
            else:
                self.queue.put(url_com.geturl())

    def start(self, url):
        self.queue.put(url)
        for i in range(20):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()
        self.queue.join()

    def _worker(self):
        while True:
            url = self.queue.get()
            if url in self.visited:
                # Already crawled; still mark the queue item done so join() can finish.
                self.queue.task_done()
                continue
            # Hand the fetch off to a Celery worker over RabbitMQ.
            result = get_html.delay(url)
            try:
                html = result.get(timeout=5)
            except Exception as e:
                print(url)
                print(e)
                self.queue.task_done()
                continue
            if html:  # get_html returns None on HTTP errors
                self.process_html(html)
                self._add_links_to_queue(url, html)
            self.visited[url] = True
            self.queue.task_done()

s = Spider()
s.start("//www.zwyuanma.com/")
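The relative-versus-absolute branch inside _add_links_to_queue can be exercised on its own; a minimal sketch of that logic (resolve_link is a hypothetical helper name, not part of the original code):

```python
from urllib.parse import urlparse, urljoin

def resolve_link(base_url, href):
    # Mirror of _add_links_to_queue's URL handling: a link without a
    # network location is relative and gets joined onto the page URL;
    # an absolute link is kept as-is.
    parts = urlparse(href)
    if not parts.netloc:
        return urljoin(base_url, href)
    return parts.geturl()

print(resolve_link("http://example.com/a/", "b.html"))
print(resolve_link("http://example.com/a/", "http://other.com/x"))
```

Relative paths such as "b.html" or "/root.html" resolve against the page that linked to them, while fully qualified URLs pass through unchanged.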
Because of various edge cases in real-world HTML, the program still needs refinement.
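One such edge case is href values that are not fetchable pages at all, e.g. mailto: or javascript: links, or bare #fragment anchors that would make one page look like many URLs. A hedged sketch of a filter that could run before queueing (normalize_link and CRAWLABLE_SCHEMES are assumed names, not part of the original code):

```python
from urllib.parse import urlparse, urldefrag

# Assumption: the crawler only fetches HTTP(S) pages; "" covers relative links.
CRAWLABLE_SCHEMES = {"", "http", "https"}

def normalize_link(href):
    # Return a cleaned URL, or None if the link should be skipped entirely.
    url, _fragment = urldefrag(href.strip())   # drop "#section" fragments
    if urlparse(url).scheme.lower() not in CRAWLABLE_SCHEMES:
        return None                            # mailto:, javascript:, tel:, ...
    return url or None                         # "" means the href was only a fragment

print(normalize_link("http://example.com/page#top"))
print(normalize_link("mailto:foo@bar.com"))
```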
