Scrape Hidden Pages If Search Yields More Results Than Displayed
Some of the search queries entered under https://www.comparis.ch/carfinder/default would yield more than 1'000 results (shown dynamically on the search page). The results however o
Solution 1:
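It seems that the website loads its data dynamically while the client is browsing, so the results may not all be present in the initial HTML. There are probably a number of ways to handle this. One option is to use Scrapy Splash, which renders the page's JavaScript before the response reaches your spider.
Assuming you use Scrapy, you can do the following: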
- Start a Splash server using docker - make a note of the
- In settings.py, add SPLASH_URL = '<splash-server-ip-address>' (a full URL, e.g. http://<ip>:8050 if Splash runs on its default port)
- In settings.py, add this to the downloader middlewares:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
- Import SplashRequest in your spider.py:
    from scrapy_splash import SplashRequest
- Set start_urls in your spider.py to iterate over the pages, e.g. like this:
    # Define base_url at module level (above the spider class): a list
    # comprehension inside a class body cannot see other class attributes.
    base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
    start_urls = [
        base_url + '?page=' + str(page) for page in range(0, 100)
    ]
- Send each URL through the Splash server by overriding start_requests(self), e.g. like this:
    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page's JavaScript 0.5 s to run before rendering
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )
- Parse the response like you do now.
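To make the pagination step a bit more robust, here is a minimal sketch of a helper that builds the page URLs with the standard library instead of string concatenation. The function name build_page_urls is hypothetical. The page parameter name and the 0 to 99 range are carried over from the example above and have not been checked against the site, so adjust them to what the search page actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, first_page=0, last_page=99, page_param="page"):
    """Return one URL per page number, first_page..last_page inclusive.

    Hypothetical helper: the parameter name "page" is an assumption taken
    from the example above, not something confirmed for the target site.
    """
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    # Keep any query parameters already present, except an old page value.
    existing = [(k, v) for k, v in parse_qsl(query) if k != page_param]
    urls = []
    for page in range(first_page, last_page + 1):
        new_query = urlencode(existing + [(page_param, page)])
        urls.append(urlunsplit((scheme, netloc, path, new_query, fragment)))
    return urls
```

In the spider you could then write start_urls = build_page_urls(base_url). Because base_url is passed as a function argument, this also works when both names are defined inside the spider class.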
Let me know how that works out for you.