I am scraping 23770 pages with a fairly simple web scraper built with Scrapy. I am quite new to Scrapy and even to Python, but I managed to write a spider that does the job. It is, however, really slow (crawling the 23770 pages takes about 28 hours).
I have looked at the Scrapy webpage, the mailing lists, and Stack Overflow, but I can't seem to find generic advice on writing fast crawlers that a beginner can understand. Maybe the problem is not the spider itself but the way I run it. All suggestions welcome!
In case it helps, I have listed my code below.
```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
```
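As a side note, the regex in `parse()` only exists to trim the `\r\n\t...` padding around the `td[4]` text; plain `str.strip()` does the same thing without a regex. A minimal sketch (the sample value is illustrative, not the asker's exact data):

```python
# The '\r\n\t\t\t\t\t...\r\n\t\t\t\t' regex just removes surrounding
# whitespace; str.strip() handles that directly.
raw = "\r\n\t\t\t\t\tAlm. Salg\r\n\t\t\t\t"
print(raw.strip())  # -> Alm. Salg
```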
Thanks!
Here are some things to try:
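One common first step is raising Scrapy's concurrency limits in `settings.py`. A sketch of the throughput-related settings (the values shown are assumptions to experiment with, not recommendations; check the defaults for your Scrapy version):

```python
# settings.py -- throughput-related knobs for a broad crawl.
CONCURRENT_REQUESTS = 100             # total concurrent requests (default is much lower)
CONCURRENT_REQUESTS_PER_DOMAIN = 100  # all 23770 pages live on one domain
DOWNLOAD_DELAY = 0                    # no artificial pause between requests
```

Raising concurrency only helps if the bottleneck is waiting on the network rather than the target server throttling you, so increase these gradually and watch the error rate.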