Parsing A Non-escaped Apostrophe In A Single-quoted Attribute Value With Beautifulsoup

Question

From a webpage, I want to get all the links and title strings. I use BeautifulSoup 4 for scraping. The links on the webpage look like this:

Solution 1:

The problem is not BeautifulSoup but your HTML, which is invalid. According to the HTML specification, a single-quoted attribute value has the following syntax:

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE character ('), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE characters ('), and finally followed by a second single U+0027 APOSTROPHE character (').

While all of the parsers supported by BeautifulSoup will try to parse the invalid HTML in your question, none of them will do what you want:

>>> BeautifulSoup(src, "html.parser")

<ahref="http://www.example1.com"title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a><a #4=""2014=""4:36:07=""9=""am'=""href="http://www.example2.com"make=""me=""t=""think=""title="Don"> Example 2</a>

>>> BeautifulSoup(src, "lxml")

<html><body><ahref="http://www.example1.com"title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a><aam=""href="http://www.example2.com"make=""me=""t=""think=""title="Don"> Example 2</a></body></html>

>>> BeautifulSoup(src, "html5lib")

<html><head></head><body><ahref="http://www.example1.com"title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a><a #4=""2014=""4:36:07=""9=""am'=""href="http://www.example2.com"make=""me=""t=""think=""title="Don"> Example 2</a></body></html>

Neither will any modern browser:

Firefox Firefox developer tools

Chrome Chrome developer tools

IE 11 IE11 developer tools

If you want to represent an apostrophe inside a single-quoted attribute value, you need to use the ' character entity reference:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title='Don&apos;t make me think #4/9/2014 4:36:07 AM'> Example 2</a>
... """)

<html><body><ahref="http://www.example1.com"title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a><ahref="http://www.example2.com"title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a></body></html>

Alternatively, you can use a double-quoted attribute value:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
... """)

<html><body><ahref="http://www.example1.com"title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a><ahref="http://www.example2.com"title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a></body></html>

Html5 Code Review

Parsing A Non-escaped Apostrophe In A Single-quoted Attribute Value With Beautifulsoup

Solution 1:

Post a Comment for "Parsing A Non-escaped Apostrophe In A Single-quoted Attribute Value With Beautifulsoup"