
Download PDFs With Python

I am trying to download several PDFs that are located at different hyperlinks within a single URL. My approach was first to retrieve the URLs containing the 'fileEntryId' text.

Solution 1:

Create a folder anywhere and put the script in that folder. When you run the script, you should find the downloaded PDF files in the folder. If for some reason the script doesn't work for you, check whether your bs4 version is up to date, as I've used CSS pseudo selectors to target the required links.
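To check the installed version without guessing, you can query package metadata with the standard library (a quick sketch; the CSS selectors below need bs4 4.7+, which bundles the soupsieve selector engine):

```python
from importlib.metadata import version, PackageNotFoundError

# Print the installed beautifulsoup4 version, or a notice if it is missing.
try:
    print(version("beautifulsoup4"))
except PackageNotFoundError:
    print("beautifulsoup4 is not installed")
```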

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect every link in the results table whose href contains 'fileEntryId'
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        inner_soup = BeautifulSoup(resp.text, "lxml")
        # The 'Descargar' (Download) anchor on the detail page points at the PDF.
        # Note: newer soupsieve versions deprecate ':contains' in favor of ':-soup-contains'.
        pdf_link = inner_soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        # Use the last path segment, minus any query string, as the file name
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
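The file-naming step can be verified in isolation: it takes the last path segment of the download URL and strips any query string. A standalone sketch with a hypothetical URL (the real link shapes on the site may differ):

```python
# Hypothetical download URL, for illustration only
pdf_link = "https://example.com/documento?fileEntryId=12345&version=1.0"

# Last path segment, with the query string stripped off
file_name = pdf_link.split("/")[-1].split("?")[0]
print(file_name)  # documento
```

The script then appends ".pdf" when saving, which assumes the extracted segment carries no extension of its own.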
