Parsing Web Page's Search Results With Python
Solution 1:
When I wrote parsers I had problems with bs (BeautifulSoup): in some cases it didn't find what lxml found, and vice versa, because of broken HTML. Try using lxml.html.
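As a rough sketch of what that could look like, the snippet below feeds deliberately broken markup to lxml.html and pulls out the links. The sample HTML, the "results" class name, and the extract_links helper are made up for illustration; the real page will need its own XPath expression.

```python
import lxml.html

# Hypothetical, deliberately broken markup: the <li>, <ul>, <body>
# and <html> tags are never closed.
BROKEN_HTML = """
<html><body>
<ul class="results">
  <li><a href="/item/1">Primer resultado</a>
  <li><a href="/item/2">Segundo resultado</a>
"""


def extract_links(html_text):
    """Return (href, text) pairs for every link inside the results list."""
    # lxml.html.fromstring repairs the tree as far as it can while parsing.
    tree = lxml.html.fromstring(html_text)
    links = []
    for anchor in tree.xpath('//ul[@class="results"]//a'):
        links.append((anchor.get("href"), anchor.text_content().strip()))
    return links


links = extract_links(BROKEN_HTML)
for href, text in links:
    print(href, text)
```

If lxml and bs disagree on a page, comparing the output of both on the same saved HTML is a quick way to see which one copes better with that particular markup.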
Solution 2:
Your problem may be with encoding. I think that bs4 works with utf-8 and you have a different encoding set as the default on your machine (an encoding that contains Spanish letters). So urllib requests the page in your default encoding. That's okay, so the data is there in the source and it even prints out okay, but when you pass it to the utf-8 based bs4, those characters are lost. Try looking for a way to set a different encoding in bs4 and, if possible, set it to your default. This is just a guess though, take it easy.
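If the encoding guess is right, one thing to try is telling bs4 which encoding the raw bytes are in. bs4 converts documents to Unicode internally, and the BeautifulSoup constructor accepts a from_encoding argument for byte input. The sketch below assumes the page is Latin-1 (ISO-8859-1); the sample bytes stand in for what urllib's response.read() would return, and the actual encoding of your page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical page body as raw bytes in Latin-1, standing in for the
# result of response.read() before any decoding has happened.
raw_bytes = (
    "<html><body><p>Búsqueda: niño, año, canción</p></body></html>"
).encode("latin-1")

# Pass the undecoded bytes together with the encoding they are in,
# so bs4 does not have to guess.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()
print(text)
```

An alternative with the same effect is to decode the bytes yourself, raw_bytes.decode("latin-1"), and pass the resulting str to BeautifulSoup.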
I recommend using regular expressions. I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem is there even when you use bs4. You just write all your re patterns manually and let them do the magic. You would have to work with bs4 in a similar way when looking for the information you want.
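A minimal sketch of the regular-expression approach, assuming each search result is a div with class "result" wrapping a single link. That markup and the RESULT_RE name are invented for the example; the pattern has to be written by hand against the real page source and will break if the markup changes.

```python
import re

# Hypothetical search-results markup; the class name and structure
# are assumptions, not taken from any real site.
html = """
<div class="result"><a href="/item/1">First result</a></div>
<div class="result"><a href="/item/2">Second result</a></div>
"""

# Capture the href and the link text of each result.
RESULT_RE = re.compile(
    r'<div class="result">\s*<a href="([^"]+)">([^<]+)</a>'
)

results = RESULT_RE.findall(html)
for href, title in results:
    print(href, title)
```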