我的任务是使用Scrapy从网站提取pdf文件。我对Python并不陌生,但是Scrapy对我来说是一个新手。我一直在试验控制台和一些基本的spider。我找到并修改了以下代码:
import urlparse import scrapy from scrapy.http import Request class pwc_tax(scrapy.Spider): name = "pwc_tax" allowed_domains = ["www.pwc.com"] start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"] def parse(self, response): base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html" for a in response.xpath('//a[@href]/@href'): link = a.extract() if link.endswith('.pdf'): link = urlparse.urljoin(base_url, link) yield Request(link, callback=self.save_pdf) def save_pdf(self, response): path = response.url.split('/')[-1] with open(path, 'wb') as f: f.write(response.body)
我在命令行运行以下代码
scrapy crawl mySpider
我什么也没回来 我没有创建可抓取的项目,因为我想抓取并下载文件,没有元数据。我将不胜感激。
我已经更新了你的代码,这实际上是可行的:
import urlparse import scrapy from scrapy.http import Request class pwc_tax(scrapy.Spider): name = "pwc_tax" allowed_domains = ["www.pwc.com"] start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"] def parse(self, response): for href in response.css('div#all_results h3 a::attr(href)').extract(): yield Request( url=response.urljoin(href), callback=self.parse_article ) def parse_article(self, response): for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract(): yield Request( url=response.urljoin(href), callback=self.save_pdf ) def save_pdf(self, response): path = response.url.split('/')[-1] self.logger.info('Saving PDF %s', path) with open(path, 'wb') as f: f.write(response.body)