Scrape Hidden Pages If Search Yields More Results Than Displayed

Some of the search queries entered under would yield more than 1'000 results (shown dynamically on the search page). The results however o

Solution 1:

It seems that the website loads its data dynamically while the client is browsing. There are probably a number of ways to handle this. One option is to use Scrapy Splash, which renders the page's JavaScript before Scrapy parses it.

Assuming you use scrapy, you can do the following:

  1. Start a Splash server using docker - make a note of the
  2. In add SPLASH_URL = <splash-server-ip-address>
  3. In add to middlewares

this code:

    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
  4. Import from scrapy_splash import SplashRequest in your
  5. Set start_urls in your to iterate over the pages

E.g. like this

    base_url = ''
    start_urls = [
        base_url + '?page=' + str(page) for page in range(0, 100)
    ]

  6. Redirect the URLs to the Splash server by modifying def start_requests(self):

E.g. like this

    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
            args={'wait': 0.5})

  7. Parse the response like you do now.
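
For reference, the settings from steps 2 and 3 put together might look roughly like the sketch below. The file name settings.py and the localhost address are assumptions on my part (the steps above leave them out). 8050 is the port Splash listens on by default, so point SPLASH_URL at wherever your server actually runs. The middleware entries go into Scrapy's DOWNLOADER_MIDDLEWARES dict.

```python
# settings.py (assumed file name; the steps above do not name the file)

# Address of the running Splash server.
# ASSUMPTION: Splash runs locally on its default port 8050.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares from step 3, keyed by import path.
# The numbers are priorities that control the order in which Scrapy applies them.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```

If you already have a DOWNLOADER_MIDDLEWARES dict in your project, add the three entries to it rather than defining a second one.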

Let me know how that works out for you.
