Scraping Badly Coded HTML
I scraped a website that has hundreds of pages of badly organized HTML and used BeautifulSoup to capture all the content of a div on each page. An excerpt of that list is: mylist =
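For context, here is a minimal sketch of how such a list might have been collected. The URLs and the div id "headingData" are assumptions based on the markup shown in the answers below; substitute the real ones.

import requests
from bs4 import BeautifulSoup

# hypothetical page URLs; replace with the real ones
page_urls = ['https://example.com/notices/1', 'https://example.com/notices/2']

mylist = []
for url in page_urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # keep the raw markup of each matching div, one sub-list per page
    mylist.append([str(div) for div in soup.select('div#headingData')])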
Solution 1:
Usually, when you use select on a BeautifulSoup object, you get a list of Tags, and you can call select/getText on those Tags again.
For example:
SEP = '(--*--SEP--*--)'
mylist = soup.select('div')
# split each div's text at the <br/> boundaries and drop whitespace-only pieces
between_br = [[j for j in i.getText(SEP).split(SEP) if not j.isspace()] for i in mylist]
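As a concrete illustration, here is a minimal, self-contained sketch of this approach applied to a shortened copy of one of the divs from the question (the sample markup is taken from Solution 2 below):

from bs4 import BeautifulSoup

# shortened copy of one raw div from the question's list
html = '<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/></div>'
soup = BeautifulSoup(html, 'html.parser')

SEP = '(--*--SEP--*--)'
mylist = soup.select('div')
between_br = [[j for j in i.getText(SEP).split(SEP) if not j.isspace()] for i in mylist]
print(between_br)
# [['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/30/2019']]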
Solution 2:
from bs4 import BeautifulSoup
mylist = [['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/>09:00:00 AM<br/>12/31/2019<br/>09:00:00 AM<br/>92112<br/>Initiate<br/>Capacity Constraint<br/>12/29/2019<br/>03:02:38 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/30/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/29/2019<br/>09:00:00 AM<br/>12/30/2019<br/>09:00:00 AM<br/>92086<br/>Initiate<br/>Capacity Constraint<br/>12/28/2019<br/>02:55:39 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/29/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/28/2019<br/>09:00:00 AM<br/>12/29/2019<br/>09:00:00 AM<br/>92074<br/>Initiate<br/>Capacity Constraint<br/>12/27/2019<br/>03:14:16 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/28/2019<br/></div>']]
for item in mylist:
    soup = BeautifulSoup(*item, 'html.parser')
    print(*[a.get_text(strip=True, separator="|").split("|") for a in soup])
Output:
['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/30/2019', '09:00:00 AM', '12/31/2019', '09:00:00 AM', '92112', 'Initiate', 'Capacity Constraint', '12/29/2019', '03:02:38 PM', 'No response required', 'AGT Pipeline Conditions for 12/30/2019']
['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/29/2019', '09:00:00 AM', '12/30/2019', '09:00:00 AM', '92086', 'Initiate', 'Capacity Constraint', '12/28/2019', '02:55:39 PM', 'No response required', 'AGT Pipeline Conditions for 12/29/2019']
['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/28/2019', '09:00:00 AM', '12/29/2019', '09:00:00 AM', '92074', 'Initiate', 'Capacity Constraint', '12/27/2019', '03:14:16 PM', 'No response required', 'AGT Pipeline Conditions for 12/28/2019']
Solution 3:
Without seeing the rest of your code it is hard to give an exact answer, but BeautifulSoup is a great package for this. You should be able to keep using the bs4 package to comb through the HTML with a combination of BeautifulSoup methods (e.g. find/find_all/select, etc.), as in the sketch below.
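For instance, here is a minimal sketch using find and stripped_strings on a shortened copy of one of the divs from the question; the id "headingData" comes from the markup shown in Solution 2.

from bs4 import BeautifulSoup

# one raw div string in the same shape as the question's mylist entries (shortened)
html = ('<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>'
        'Critical notice<br/> <br/><br/>No response required<br/></div>')
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='headingData')
# stripped_strings skips the whitespace-only fragments between consecutive <br/> tags
fields = list(div.stripped_strings)
print(fields)
# ['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', 'No response required']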
Solution 4:
Two solutions using the SimplifiedDoc library.
from simplified_scrapy import SimplifiedDoc,req,utils
mylist = [['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/>09:00:00 AM<br/>12/31/2019<br/>09:00:00 AM<br/>92112<br/>Initiate<br/>Capacity Constraint<br/>12/29/2019<br/>03:02:38 PM<br/><br/><br/>No response required<br/><br/><br/>AGT Pipeline Conditions for 12/30/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/29/2019<br/>09:00:00 AM<br/>12/30/2019<br/>09:00:00 AM<br/>92086<br/>Initiate<br/>Capacity Constraint<br/>12/28/2019<br/>02:55:39 PM<br/><br/><br/>No response required<br/><br/><br/>AGT Pipeline Conditions for 12/29/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/28/2019<br/>09:00:00 AM<br/>12/29/2019<br/>09:00:00 AM<br/>92074<br/>Initiate<br/>Capacity Constraint<br/>12/27/2019<br/>03:14:16 PM<br/><br/><br/>No response required<br/><br/><br/>AGT Pipeline Conditions for 12/28/2019<br/></div>']]
values = []
# First way: take the text after each <br/>, then prepend the text before the first <br/>
for item in mylist:
    doc = SimplifiedDoc(item[0])
    tmp = doc.selects('br').nextText()
    tmp.insert(0, doc.div.firstText())
    values.append(tmp)

values = []
# Second way: take the text before each <br/>
for item in mylist:
    doc = SimplifiedDoc(item[0])
    brs = doc.selects('br')
    tmp = [br.previousText() for br in brs]
    values.append(tmp)
print(values)
Result:
[['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/30/2019', '09:00:00 AM', '12/31/2019', '09:00:00 AM', '92112', 'Initiate', 'Capacity Constraint', '12/29/2019', '03:02:38 PM', '', '', 'No response required', '', '', 'AGT Pipeline Conditions for 12/30/2019'], ['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/29/2019', '09:00:00 AM', '12/30/2019', '09:00:00 AM', '92086', 'Initiate', 'Capacity Constraint', '12/28/2019', '02:55:39 PM', '', '', 'No response required', '', '', 'AGT Pipeline Conditions for 12/29/2019'], ['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/28/2019', '09:00:00 AM', '12/29/2019', '09:00:00 AM', '92074', 'Initiate', 'Capacity Constraint', '12/27/2019', '03:14:16 PM', '', '', 'No response required', '', '', 'AGT Pipeline Conditions for 12/28/2019']]
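Both ways keep empty strings where the source has consecutive <br/> tags. If those entries are unwanted, a small follow-up filter (my addition, not part of the original answer) can drop them:

cleaned = [[v for v in row if v.strip()] for row in values]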