
I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<li>Foo

This is of course, generally, a "bad idea"(TM). Fixing the tags for the user may or may not yield what they intended. I'd rather validate the input, reject the update, and tell the user as much as I can about what I think is wrong (suggesting fixes, but not applying them automatically). Besides, your example proves my point: <ul> is NOT ALLOWED inside <p>, so your "fix" actually repairs nothing.
    – Tomasz Gandor
Aug 2 '13 at 8:27
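
The validate-and-reject approach suggested in the comment above could look roughly like the following. This is only a minimal sketch using the standard library's html.parser; the names TagValidator and find_problems are made up for illustration, and the check covers tag nesting only, not HTML content rules such as <ul> being disallowed inside <p>.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagValidator(HTMLParser):
    """Collect nesting problems instead of silently repairing them."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tag names that are currently open
        self.problems = []    # messages that can be shown to the user

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.open_tags or self.open_tags[-1] != tag:
            # Closing tag does not match the innermost open tag.
            line = self.getpos()[0]
            self.problems.append(f"unexpected </{tag}> at line {line}")
            return
        self.open_tags.pop()


def find_problems(html):
    """Return a list of nesting problems; an empty list means none found."""
    validator = TagValidator()
    validator.feed(html)
    validator.close()
    # Anything still on the stack was opened but never closed.
    unclosed = [f"<{tag}> is never closed" for tag in validator.open_tags]
    return validator.problems + unclosed
```

For instance, find_problems("<ul><li>Foo") returns ["<ul> is never closed", "<li> is never closed"], which can be reported back to the user instead of guessing at a fix.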

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print(tidy.parseString(html, show_body_only=True, indent=True))

Tidy sees it as an empty element because p elements are not allowed to contain ul elements.
Nov 16 '08 at 9:46

P elements can only contain inline elements like a, abbr, acronym, b, bdo, big, br, button, cite, code, del, dfn, em, i, img, input, ins, kbd, label, map, object, q, samp, script, select, small, span, strong, sub, sup, textarea, tt and var.
Nov 16 '08 at 9:47

I would recommend having lxml installed when using BeautifulSoup, as this appears to greatly help with markup repair (pip install lxml). BeautifulSoup will automatically choose lxml first for HTML parsing if it is available.
    – warren
Apr 5 '15 at 1:23

You're the man. I've been trying to figure out why I couldn't parse a particular website. Using 'html5lib' instead of 'html.parser' in soup(page_html, 'html.parser') did wonders. :)
    – James Dean
Sep 29 '17 at 2:45

I just ran into an HTML page that lxml and pyquery did not handle well; it seems the HTML contains some errors. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:

What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser in BeautifulSoup. The html5lib parser seems to work much better than the others.

Hope this helps someone.

import bs4

# html holds the raw, possibly malformed markup; the html5lib package must be installed
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')
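
If installing Tidy or html5lib is not an option, the tag-closing algorithm the question asks about can be approximated with a stack and the standard library's html.parser. The following is a rough sketch, not what Tidy or html5lib actually do: TagCloser and close_open_tags are hypothetical names, comments and doctypes are dropped, stray closing tags are discarded, and no HTML content rules are applied (so a <ul> inside a <p> is left where it is). It only balances tags; it does not strip unsafe markup.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit the input while tracking open tags on a stack."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []     # output fragments, joined at the end
        self.stack = []   # names of tags that are still open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit it, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened after the matching start tag, innermost first.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def close_open_tags(html):
    """Return html with every unclosed tag closed in reverse nesting order."""
    closer = TagCloser()
    closer.feed(html)
    closer.close()
    # Whatever is still open at the end gets closed, innermost first.
    while closer.stack:
        closer.out.append(f"</{closer.stack.pop()}>")
    return "".join(closer.out)
```

With this sketch, close_open_tags("<b><i>x</b>") gives "<b><i>x</i></b>". For real-world input, a full parser such as html5lib remains the safer choice, since it applies the browser's error-recovery rules rather than a plain stack.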
        
