I need to sanitize HTML submitted by a user by closing any open tags in the correct nesting order. I've been looking for an algorithm or Python code to do this, but haven't found anything except some half-baked implementations in PHP and the like.
For example, something like
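For what it's worth, the core of the algorithm is easy to sketch with the standard library's html.parser: keep a stack of open tags while feeding the fragment, then append the missing closing tags in reverse order. This is only a minimal sketch (the TagBalancer and close_open_tags names are my own, and it does nothing about tags that are invalid where they appear):

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list).
VOID = {"br", "hr", "img", "input", "meta", "link"}

class TagBalancer(HTMLParser):
    """Tracks which tags are still open after feeding a fragment."""
    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close anything opened after this tag, then the tag itself.
            while self.stack and self.stack[-1] != tag:
                self.stack.pop()
            self.stack.pop()

def close_open_tags(html):
    parser = TagBalancer()
    parser.feed(html)
    # Whatever is still on the stack needs closing, innermost first.
    return html + "".join(f"</{t}>" for t in reversed(parser.stack))

print(close_open_tags("<p><ul><li>Foo"))
# -> <p><ul><li>Foo</li></ul></p>
```

Note this only balances tags; it does not check whether the nesting is actually valid HTML.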
This is, of course, generally a "bad idea"(TM). Fixing the tags for the user may or may not yield what he intended. I'd rather validate the input, reject the update, and tell the user as much as I could about what I think is wrong (suggesting fixes, but not making them automatically). Besides, your example shows my point: a <ul> is NOT ALLOWED inside a <p>, so your "fix" actually repairs nothing.
Aug 2 '13 at 8:27
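In that spirit, here is a minimal validation sketch, again with the standard library's html.parser, that reports problems instead of fixing them. The class name and the single hard-coded nesting rule are my own; a real validator would need the full HTML content model:

```python
from html.parser import HTMLParser

class TagValidator(HTMLParser):
    """Collects nesting problems instead of repairing them."""
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # Illustrative rule only: block-level <ul> inside <p>.
        if tag == "ul" and "p" in self.stack:
            self.errors.append("<ul> is not allowed inside <p>")
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")

def validate(html):
    v = TagValidator()
    v.feed(html)
    # Anything left open is an error to report back to the user.
    v.errors.extend(f"unclosed <{t}>" for t in v.stack)
    return v.errors

print(validate("<p><ul><li>Foo"))
# -> ['<ul> is not allowed inside <p>', 'unclosed <p>', 'unclosed <ul>', 'unclosed <li>']
```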
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
import tidy
print(tidy.parseString(html, show_body_only=True, indent=True))
The reason Tidy sees it as an empty element is that p-elements are not allowed to contain ul-elements.
Nov 16 '08 at 9:46
P-elements can only contain inline elements like a, abbr, acronym, b, bdo, big, br, button, cite, code, del, dfn, em, i, img, input, ins, kbd, label, map, object, q, samp, script, select, small, span, strong, sub, sup, textarea, tt and var.
Nov 16 '08 at 9:47
I would recommend having lxml installed when using BeautifulSoup, as this appears to greatly help with markup repair (pip install lxml). BeautifulSoup will automatically choose lxml first for HTML parsing if it is available.
– warren Apr 5 '15 at 1:23
You're the man. I've been trying to figure out why I couldn't parse a particular website. Switching from 'html.parser' to 'html5lib' in BeautifulSoup(page_html, ...) did wonders. :)
– James Dean Sep 29 '17 at 2:45
Just now I got some HTML that lxml and pyquery didn't handle well; it seems there were some errors in the markup.
Since Tidy is not easy to install on Windows, I chose BeautifulSoup.
But I found that what really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser in BeautifulSoup.
The html5lib parser seems to work much better than the others.
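A short self-contained illustration of that (assuming bs4 and html5lib are installed; the sample markup is my own):

```python
from bs4 import BeautifulSoup

broken = "<p><ul><li>Foo"

# html5lib follows the HTML5 parsing algorithm, so it repairs broken
# markup the same way a browser would: the <ul> implicitly closes the
# <p>, and the unclosed <li> and <ul> get closing tags added.
soup = BeautifulSoup(broken, "html5lib")
print(soup.body.decode_contents())
```

Note that, as with Tidy above, this turns the <p> into an empty paragraph rather than wrapping it around the list, since a ul is not valid inside a p.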