This is part of my script:
    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    print "It can't be printed out!"
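Here is how it is supposed to work: the spider visits the pages, scrapes the information I need, and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can tell from the item count in the output JSON), my script never resumes execution. It is probably not a hard problem, and the answer should be somewhere in Twisted's reactor. How can I release the thread so that execution continues?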
You need to stop the reactor once the spider has finished. You can do this by listening for the spider_closed signal:
    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.xlib.pydispatch import dispatcher

    from testspiders.spiders.followall import FollowAllSpider

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')
The command-line log output might look something like this:
    stav@maia:/srv/scrapy/testspiders$ ./api
    2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
    2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
    2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 23934,...}
    2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
    2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
    stav@maia:/srv/scrapy/testspiders$
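If it helps to see why the script hangs without the signal handler, here is a rough toy model of the mechanism in plain Python. It is only a sketch and does not use Twisted or Scrapy: ToyReactor, ToySignals and run_crawl are hypothetical names made up for illustration, and the real reactor and dispatcher are far more involved. The point is just that run() loops until something calls stop(), so code after run() only executes once a handler connected to the "spider closed" event stops the loop.

```python
import time
from collections import deque


class ToyReactor:
    """Hypothetical stand-in for an event loop. NOT Twisted's reactor."""

    def __init__(self):
        self._queue = deque()
        self._running = False

    def call_later(self, func, *args):
        # Queue a callback to be executed by the loop.
        self._queue.append((func, args))

    def run(self):
        # Blocks the caller until stop() is called from inside a callback.
        self._running = True
        while self._running:
            if self._queue:
                func, args = self._queue.popleft()
                func(*args)
            else:
                # Nothing to do: idle, but keep running (this is the "hang").
                time.sleep(0.01)

    def stop(self):
        self._running = False


class ToySignals:
    """Hypothetical minimal signal dispatcher. NOT Scrapy's dispatcher."""

    def __init__(self):
        self._handlers = {}

    def connect(self, handler, signal):
        self._handlers.setdefault(signal, []).append(handler)

    def send(self, signal):
        for handler in self._handlers.get(signal, []):
            handler()


def run_crawl(pages):
    reactor = ToyReactor()
    signals = ToySignals()
    events = []

    def crawl(remaining):
        if remaining:
            events.append("scraped " + remaining[0])
            reactor.call_later(crawl, remaining[1:])
        else:
            events.append("spider closed")
            signals.send("spider_closed")

    # Without this line, reactor.run() below would never return.
    signals.connect(reactor.stop, signal="spider_closed")

    reactor.call_later(crawl, list(pages))
    reactor.run()  # blocks here until the handler stops the loop
    events.append("reactor stopped")
    return events


if __name__ == "__main__":
    for line in run_crawl(["page-1", "page-2"]):
        print(line)
```

Removing the signals.connect(...) call in this toy reproduces the behaviour from the question: the crawl finishes, but run() keeps idling and nothing after it is reached.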