
How To Accelerate Web Scraping Using The Combination Of Requests And BeautifulSoup In Python?

The objective is to scrape multiple pages with BeautifulSoup, where the input HTML comes from the requests library. The steps are: first load the HTML with requests, e.g. page = requests.get(...)
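For context, a minimal sketch of that sequential baseline (assuming the same oatd.org record URL and itemprop markup used in the answer below); each requests.get call blocks until the previous page has been fetched, which is what the solutions below speed up:

    import requests
    from bs4 import BeautifulSoup as Soup

    # Hypothetical list of record sub-hrefs; the same URL is duplicated for illustration.
    list_of_url = [r'record?record=handle\:11012\%2F16478&q=eeg'] * 100

    results = []
    for url in list_of_url:
        page = requests.get('https://oatd.org/oatd/' + url, timeout=10)  # one request at a time
        if page.status_code == 200:
            soup = Soup(page.text, 'html.parser')
            results.append(soup.find(attrs={"itemprop": "name"}).text)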

Solution 1:

realpython.com has a nice article about speeding up Python scripts with concurrency:

https://realpython.com/python-concurrency/

Using their threading example, you can set the number of workers to run multiple threads, which increases the number of requests you can make at once.

    from bs4 import BeautifulSoup as Soup
    import concurrent.futures
    import requests
    import threading
    import time

    def get_each_page(page_soup):
        # NOTE: both fields use the same selector here, as in the original answer;
        # adjust the attrs to match the actual page markup.
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

    def get_session():
        # One requests.Session per thread: Session is not guaranteed thread-safe.
        if not hasattr(thread_local, "session"):
            thread_local.session = requests.Session()
        return thread_local.session

    def download_site(url_to_pass):
        session = get_session()
        page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
        print(f"{page.status_code}: {page.reason}")
        if page.status_code == 200:
            all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

    def download_all_sites(sites):
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(download_site, sites)

    if __name__ == "__main__":
        # In practice there will be 100 different unique sub-hrefs; the same URL
        # is duplicated here purely for illustration.
        list_of_url = [r'record?record=handle\:11012\%2F16478&q=eeg'] * 100
        all_website_scrape = []
        thread_local = threading.local()
        start_time = time.time()
        download_all_sites(list_of_url)
        duration = time.time() - start_time
        print(f"Downloaded {len(all_website_scrape)} pages in {duration:.2f} seconds")

Solution 2:

You can use the threading module to make the script multi-threaded and much faster: https://docs.python.org/3/library/threading.html
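A minimal sketch of that approach (hypothetical fetch function and URL list), starting one thread per URL and joining them all at the end:

    import threading
    import requests

    results = []

    def fetch(url):
        page = requests.get('https://oatd.org/oatd/' + url, timeout=10)
        if page.status_code == 200:
            results.append(page.text)  # list.append is thread-safe in CPython

    urls = [r'record?record=handle\:11012\%2F16478&q=eeg'] * 100  # hypothetical
    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Note that one thread per URL puts no bound on concurrency; for large URL lists the ThreadPoolExecutor from Solution 1 is the safer choice.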

But if you are willing to change your approach, I'd recommend Scrapy.
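A minimal Scrapy spider for the same task might look like this (a sketch, assuming the same oatd.org record URL and itemprop markup as above); Scrapy schedules concurrent requests, retries, and throttling for you:

    import scrapy

    class OatdSpider(scrapy.Spider):
        name = "oatd"
        custom_settings = {"CONCURRENT_REQUESTS": 5}  # comparable to max_workers=5 above
        start_urls = [
            r'https://oatd.org/oatd/record?record=handle\:11012\%2F16478&q=eeg',
        ]

        def parse(self, response):
            # Same selector for both fields, mirroring the answer above.
            yield {
                "paper_author": response.css('[itemprop="name"]::text').get(),
                "paper_title": response.css('[itemprop="name"]::text').get(),
            }

Save it as, say, oatd_spider.py and run it with scrapy runspider oatd_spider.py -o results.json.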
