How To Accelerate Web Scraping Using The Combination Of Requests And BeautifulSoup In Python?
The objective is to scrape multiple pages with BeautifulSoup, where the HTML for each page is fetched with requests.get. The steps are: first, load the HTML using requests: page = requests.get(
Solution 1:
realpython.com has a nice article about speeding up Python scripts with concurrency.
https://realpython.com/python-concurrency/
Using their threading example, you can set the number of worker threads that run in parallel, which increases the number of requests you can make at once.
from bs4 import BeautifulSoup as Soup
import concurrent.futures
import requests
import threading
import time


def get_each_page(page_soup):
    # Both fields use the same selector here, so they return the same text.
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)


def get_session():
    # One requests.Session per thread, so each thread reuses its own connections.
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session


def download_site(url_to_pass):
    session = get_session()
    page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
    print(f"{page.status_code}: {page.reason}")
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))


def download_all_sites(sites):
    # max_workers sets how many requests can be in flight at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_site, sites)


if __name__ == "__main__":
    # In practice there will be 100 different sub-hrefs; for illustration
    # the same URL is duplicated. The raw string passes the backslashes
    # through unchanged.
    list_of_url = [r'record?record=handle\:11012\%2F16478&q=eeg'] * 100
    all_website_scrape = []
    thread_local = threading.local()
    start_time = time.time()
    download_all_sites(list_of_url)
    duration = time.time() - start_time
    print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")
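One possible variation on the code above, offered only as a sketch: instead of appending to a shared list, each worker can return its record and the caller can collect the values that executor.map yields. Iterating over those values also re-raises any exception from a worker, which the version above would silently drop. The names scrape_all and fake_fetch are hypothetical; fake_fetch stands in for download_site so the example runs without network access.

```python
import concurrent.futures
from typing import Callable, Iterable, List, Optional


def scrape_all(urls: Iterable[str],
               fetch_one: Callable[[str], Optional[dict]],
               max_workers: int = 5) -> List[dict]:
    """Run fetch_one over urls in a thread pool and keep the non-None results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in input order; consuming them with list()
        # re-raises any exception that occurred inside a worker.
        results = list(executor.map(fetch_one, urls))
    return [record for record in results if record is not None]


def fake_fetch(url: str) -> Optional[dict]:
    # Hypothetical stand-in for download_site: returns None for a "failed" page.
    if url.endswith("missing"):
        return None
    return {"url": url, "paper_title": f"title for {url}"}


if __name__ == "__main__":
    records = scrape_all(["a", "b", "missing"], fake_fetch)
    print(f"Collected {len(records)} records")
```

To use this with real pages, download_site would need to return the parsed dict (or None on a non-200 response) rather than appending to all_website_scrape.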
Solution 2:
You could use the threading module to make the script multi-threaded, which can run much faster. https://docs.python.org/3/library/threading.html
If you are open to a different approach, though, I'd recommend Scrapy.
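As a rough sketch of the threading-module route, a fixed number of threads can pull URLs from a queue and store results under a lock. The function name run_in_threads and the handle callback are hypothetical; in a real scraper, handle would call requests.get and parse the page with BeautifulSoup, but a stub is used in the usage line so the sketch runs offline.

```python
import queue
import threading
from typing import Callable, List


def run_in_threads(urls: List[str],
                   handle: Callable[[str], dict],
                   n_threads: int = 5) -> List[dict]:
    """Process urls with n_threads worker threads; result order is not guaranteed."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        tasks.put(url)

    results: List[dict] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            record = handle(url)
            with lock:  # guard the shared list
                results.append(record)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    out = run_in_threads(["u1", "u2", "u3"], lambda u: {"url": u}, n_threads=2)
    print(f"Processed {len(out)} urls")
```

This does by hand what ThreadPoolExecutor does in Solution 1, so the executor is usually the shorter option; the manual version mainly helps when finer control over the workers is wanted.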