
Using Python Regex To Extract Certain Urls From Text

I have the HTML from an NPR page, and I want to use a regex to extract only certain URLs (the links to specific stories nested within the page). The actual links

Solution 1:

Use a tool that is specifically designed for parsing HTML and XML files, such as BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> s = """<a href="">
<a href="">
<a href="">
<a href="">
<a href="">"""
>>> # or pass the file directly: BeautifulSoup(open('/Users/shannonmcgregor/Desktop/npr.txt'), 'html.parser')
>>> soup = BeautifulSoup(s, 'html.parser')
>>> atag = soup.find_all('a')
>>> links = [i['href'] for i in atag]
>>> import re
>>> for i in links:
...     # keep only the hrefs that contain one of the wanted slugs
...     if re.match(r'.*(parallels|thetwo-way|a-marines).*', i):
...         print(i)
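
Because the sample hrefs in this answer are empty, the filtering step is easier to see with some made-up values. The following is only a sketch of that step: the example.com URLs and the names `STORY_PATTERN` and `filter_story_links` are assumptions for illustration, not part of the original answer, and the list stands in for what `soup.find_all('a')` would yield.

```python
import re

# Hypothetical hrefs standing in for the values extracted by BeautifulSoup.
links = [
    "https://example.com/blogs/parallels/story-one",
    "https://example.com/blogs/thetwo-way/story-two",
    "https://example.com/sections/music/story-three",
    "https://example.com/a-marines-story",
]

# Same alternation as in the answer. re.search scans the whole string,
# so the leading and trailing .* used with re.match are not needed.
STORY_PATTERN = re.compile(r"parallels|thetwo-way|a-marines")


def filter_story_links(hrefs):
    """Return only the hrefs that contain one of the wanted slugs."""
    return [href for href in hrefs if STORY_PATTERN.search(href)]


matching = filter_story_links(links)
for href in matching:
    print(href)
```

With these placeholder values, the music link is dropped and the other three are kept.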

Solution 2:

You can use the re.search function to test each line against the regex and print the line if it matches:

>>> import re
>>> file = open('/Users/shannonmcgregor/Desktop/npr.txt', 'r')
>>> for line in file:
...     if re.search(r'<a href=[^>]*(parallels|thetwo-way|a-marines)', line):
...         print(line)

This prints every line of the file that contains a matching link.
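
The original output is not preserved here. As a rough illustration of the line-by-line approach, the sketch below runs the same pattern over an in-memory stand-in for the file. The sample markup, the example.com URLs, and the names `sample_page`, `LINK_LINE`, and `matching_lines` are all assumptions made for this example.

```python
import io
import re

# Hypothetical stand-in for the saved page; a real run would use open(path).
sample_page = io.StringIO(
    '<p>intro</p>\n'
    '<a href="https://example.com/blogs/parallels/story-one">One</a>\n'
    '<a href="https://example.com/sections/music/story-two">Two</a>\n'
    '<a href="https://example.com/blogs/thetwo-way/story-three">Three</a>\n'
)

# Same pattern as in the answer: an <a href=...> whose tag contains a wanted slug.
LINK_LINE = re.compile(r'<a href=[^>]*(parallels|thetwo-way|a-marines)')


def matching_lines(lines):
    """Return the lines (without trailing newline) that contain a wanted link."""
    return [line.rstrip("\n") for line in lines if LINK_LINE.search(line)]


found = matching_lines(sample_page)
for line in found:
    print(line)
```

Note that this approach returns whole lines rather than bare URLs, so it works best when each link sits on its own line.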


Solution 3:

You can do this by using a lookahead:

<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)
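
One way this pattern might be applied from Python is with `re.findall`, which returns the contents of the single capturing group, that is, the URL itself. This is a sketch under assumptions: the sample markup, the example.com URLs, and the names `LOOKAHEAD`, `html`, and `urls` are invented for illustration.

```python
import re

# The pattern from this answer, unchanged. The lookahead checks that one of
# the slugs appears before the closing quote; the capture group then grabs
# the whole URL.
LOOKAHEAD = re.compile(
    r'<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)'
)

# Hypothetical markup with both double-quoted and single-quoted hrefs.
html = (
    '<a href="https://example.com/blogs/parallels/story-one">One</a>'
    "<a href='https://example.com/sections/music/story-two'>Two</a>"
    '<a href="https://example.com/blogs/thetwo-way/story-three">Three</a>'
)

# With exactly one capturing group, findall returns a list of that group.
urls = LOOKAHEAD.findall(html)
for url in urls:
    print(url)
```

Unlike the line-by-line approach, this yields only the matching URLs and does not depend on how the HTML is split across lines.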

