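I'm working with Scrapy, Privoxy and Tor. I have everything installed and working properly. But Tor connects with the same IP every time, so I can easily get banned. Is it possible to tell Tor to reconnect every X seconds or every X connections?
Thanks!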
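EDIT about the configuration: for the user agent pool I followed http://tangww.com/2013/06/UsingRandomAgent/ (I had to add an `__init__.py` file, as mentioned in the comments), and for Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually from the terminal). It worked.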
My spider looks like this:
    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.selector import Selector
    from scrapy.http import Request


    class YourCrawler(CrawlSpider):
        name = "spider_name"
        start_urls = [
            'https://example.com/listviews/titles.php',
        ]
        allowed_domains = ["example.com"]

        def parse(self, response):
            # go to the urls in the list
            s = Selector(response)
            page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
            for url in page_list_urls:
                yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

            # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
            next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield Request(next_page, callback=self.parse)

        # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
        def parse_following_urls(self, response):
            # Parsing rules go here
            for each_book in response.css('main#main'):
                yield {
                    'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
                }
In settings.py I have the user agent rotation and Privoxy:
    DOWNLOADER_MIDDLEWARES = {
        # user agent
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'spider_name.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
        # privoxy
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'spider_name.middlewares.ProxyMiddleware': 100,
    }
In middlewares.py I added:
    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Route every request through the local Privoxy instance
            request.meta['proxy'] = 'http://127.0.0.1:8118'
            spider.log('Proxy : %s' % request.meta['proxy'])
I think that's all…
EDIT II
OK, I changed the middlewares.py file, following the blog that @TomášLinhart pointed to, so that it now reads:
    from stem import Signal
    from stem.control import Controller


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = 'http://127.0.0.1:8118'
            spider.log('Proxy : %s' % request.meta['proxy'])

        def set_new_ip():
            with Controller.from_port(port=9051) as controller:
                controller.authenticate(password='tor_password')
                controller.signal(Signal.NEWNYM)
But now it is really slow, and it doesn't seem to change the IP… Am I doing it right, or is something wrong?
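EDIT: Depending on your concrete requirement (a new IP for each request, or after every N requests), put the appropriate call to `set_new_ip` in the `process_request` method of the middleware. Note, however, that calling `set_new_ip` is not guaranteed to give you a new IP every time (there is a link to the FAQ with an explanation).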
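As a rough illustration of the "after every N requests" variant, here is a minimal sketch. The class name `RotatingProxyMiddleware`, the `renew_every` and `renew` parameters, and the default of 50 are assumptions made for this example; they do not come from the blog post or from Scrapy. The Tor control port, password and Privoxy address are the ones used in the question.

```python
def _set_new_ip():
    # stem is imported lazily so this module still loads where stem is absent.
    from stem import Signal
    from stem.control import Controller

    # Port 9051 and 'tor_password' are the values from the question.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)


class RotatingProxyMiddleware(object):
    """Hypothetical middleware: ask Tor for a new identity every N requests."""

    def __init__(self, renew_every=50, renew=None, proxy='http://127.0.0.1:8118'):
        self.renew_every = renew_every  # N, an arbitrary illustrative default
        self.renew = renew if renew is not None else _set_new_ip
        self.proxy = proxy
        self.seen = 0  # number of requests processed so far

    def process_request(self, request, spider):
        self.seen += 1
        # Signal Tor only on every N-th request, not on each one.
        if self.seen % self.renew_every == 0:
            self.renew()
        request.meta['proxy'] = self.proxy
```

The `renew` argument exists only so the counting logic can be exercised without a running Tor instance. When Scrapy builds the middleware with no arguments, the defaults above apply.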
EDIT2: The module with the `ProxyMiddleware` class would look like this:
    from stem import Signal
    from stem.control import Controller


    def _set_new_ip():
        # Ask the local Tor control port for a new identity (new circuits)
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            _set_new_ip()
            request.meta['proxy'] = 'http://127.0.0.1:8118'
            spider.log('Proxy : %s' % request.meta['proxy'])
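Signalling Tor on every single request opens a control connection each time, which may account for part of the slowness, and Tor may also rate-limit NEWNYM signals. For the "every X seconds" case from the question, one option is a small time-based guard. The sketch below is only an illustration: `ThrottledRenewal`, `maybe_renew` and the 10-second default are hypothetical names and values, not part of stem or Scrapy.

```python
import time


class ThrottledRenewal(object):
    """Hypothetical helper: forward a renewal call at most once per interval."""

    def __init__(self, signal_newnym, min_interval=10.0, clock=time.monotonic):
        self._signal_newnym = signal_newnym  # e.g. the _set_new_ip function
        self._min_interval = min_interval    # X seconds, an assumed default
        self._clock = clock                  # injectable so it can be tested
        self._last = None                    # time of the last renewal, if any

    def maybe_renew(self):
        """Signal for a new identity if enough time has passed.

        Returns True when the signal was sent, False when it was skipped.
        """
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval:
            return False
        self._signal_newnym()
        self._last = now
        return True
```

In the middleware, a module-level `renewal = ThrottledRenewal(_set_new_ip, min_interval=60)` could replace the direct `_set_new_ip()` call in `process_request` with `renewal.maybe_renew()`.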