How to Extract a URL from an HTML Anchor Element Using Python 3?

I want to extract a URL from a web page's HTML source. Example: xyz.com source code: Download XYZ I want to e

Solution 1:

You can use the built-in xml.etree.ElementTree:

>>> import xml.etree.ElementTree as ET
>>> url = '<a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

This works on this particular example, but xml.etree.ElementTree is not an HTML parser. Consider using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Or, lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Personally, I prefer BeautifulSoup - it makes HTML parsing easy, transparent and fun.
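To make the caveat about xml.etree.ElementTree concrete, here is a small sketch that uses only the standard library. The snippet string and the HrefCollector class are made up for illustration and are not part of the original answer. The unclosed <br> is fine in HTML but is not well-formed XML, so ElementTree rejects the input, while the tolerant html.parser module still finds the link.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: the unclosed <br> is legal HTML but not well-formed XML.
snippet = '<p>Get it here:<br><a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a></p>'

# ElementTree is a strict XML parser, so it raises ParseError on this input.
try:
    ET.fromstring(snippet)
    et_ok = True
except ET.ParseError:
    et_ok = False


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


collector = HrefCollector()
collector.feed(snippet)
collector.close()

print(et_ok)
print(collector.hrefs)
```

With this input, ElementTree fails with a ParseError, while the collector still returns the single href. For real-world pages, BeautifulSoup or lxml.html remain the more convenient options.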


To follow the link and download the file, you need to build a full URL including the scheme and domain (urljoin() helps here) and then call urlretrieve(). Example:

>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> BASE_URL = 'http://example.com'
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
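As a rough sketch of how urljoin() resolves an href against a base URL, the snippet below wraps the two calls in a helper. The download() function, its parameters, and the second example URL are assumptions added for illustration. The helper is only defined here, not called, so nothing is fetched when the snippet runs.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

BASE_URL = "http://example.com"  # assumption: the page the link was found on


def download(href, base_url=BASE_URL, filename=None):
    """Resolve href against base_url and save the target locally (hypothetical helper)."""
    full_url = urljoin(base_url, href)
    # urlretrieve returns (local_path, headers); with filename=None it
    # writes to a temporary file and returns that path.
    local_path, _headers = urlretrieve(full_url, filename)
    return local_path


# urljoin itself does no network I/O, so these lines are safe to run anywhere.
# A root-relative href replaces the whole path of the base URL.
absolute = urljoin(BASE_URL, "/example/hello/get/9f676bac2bb3.zip")
# A plain relative href is resolved against the directory of the base URL.
relative = urljoin("http://example.com/downloads/index.html", "files/a.zip")

print(absolute)
print(relative)
```

Passing the URL of the page itself as the base, rather than just the domain, matters when the page uses relative hrefs that do not start with a slash.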

Update (for the different HTML posted in the comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html><head><body><example><example2><a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a></example2></example></body></head></html>'
>>> BeautifulSoup(data, 'html.parser').find('a', text='XYZ').get('href')
'/example/hello/get/9f676bac2bb3.zip'
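If installing BeautifulSoup is not an option, the same lookup by link text can be approximated with the standard library. This is only a sketch: the LinkTextIndex class is a made-up name, and it assumes each link's text is unique and that anchors are not nested.

```python
from html.parser import HTMLParser


class LinkTextIndex(HTMLParser):
    """Map each link's visible text to its href (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None


data = ('<html><head><body><example><example2>'
        '<a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
        '</example2></example></body></head></html>')

index = LinkTextIndex()
index.feed(data)
index.close()

print(index.links.get("XYZ"))
```

The unknown tags such as example and example2 are simply passed through, which is why the nesting in the posted HTML does not matter here.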
