Skip to content Skip to sidebar Skip to footer

Scrape Hidden Pages If Search Yields More Results Than Displayed

Some of the search queries entered under https://www.comparis.ch/carfinder/default would yield more than 1'000 results (shown dynamically on the search page). The results however o

Solution 1:

It seems that your website loads data when the client is browsing. There are probably a number of ways to fix this. One option could be to utilize Scrapy Splash.

Assuming you use scrapy, you can do the following:

  1. Start a Splash server using docker - make a note of the
  2. In settings.py add SPLASH_URL = <splash-server-ip-address>
  3. In settings.py add to middlewares

this code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
  1. Import from scrapy_splash import SplashRequest in your spider.py
  2. Set start_url in your spider.py to iterate over the pages

E.g. like this

base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
start_urls = [
     base_url + str('?page=') + str(page) % page for page in range(0,100)      
    ]
  1. Redirect the url to the splash server by modifing def start_requests(self):

E.g. like this

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
            endpoint='render.html',
            args={'wait': 0.5},
        )
  1. Parse the response like you do now.

Let me know how that works out for you.


Post a Comment for "Scrape Hidden Pages If Search Yields More Results Than Displayed"