Strip tags with html5lib

John Tantalo
14 Dec 2009

There are a couple of posts out there that discuss stripping tags with html5lib, but they seem intent on preserving the "acceptable elements" such as <span> and <code>.

This is fine unless you really want to friggin' strip out all the tags, as I needed to for Emend. The following is my solution.

Source code for stripping tags with html5lib and unit test.

For example,

>>> from strip_tags import strip_tags
>>> strip_tags('<p>foo</p> <script>bar</script>')
u'foo bar'
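
The linked source isn't reproduced in this post, so here is a minimal sketch of one way to get the same result. It is not necessarily the linked implementation. It assumes Python 3 and html5lib's default ElementTree tree builder: parse the input as a fragment with html5lib.parseFragment, then join every text node in the resulting tree. On Python 3 the result is a plain str, so the example above would print 'foo bar' without the u prefix.

```python
import html5lib


def strip_tags(html):
    """Return the text content of an HTML fragment with every tag removed.

    Sketch only: relies on html5lib's default "etree" tree builder, which
    returns an xml.etree.ElementTree element holding the parsed fragment.
    """
    fragment = html5lib.parseFragment(html)
    # itertext() walks the tree in document order and yields each element's
    # text and tail, so joining the pieces drops the markup but keeps the text.
    return ''.join(fragment.itertext())


if __name__ == '__main__':
    print(strip_tags('<p>foo</p> <script>bar</script>'))
```

Note that this keeps the text inside <script>, as in the example above. If you would rather drop script and style contents entirely, you would need to skip those elements while walking the tree.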

Thanks go to Edward O’Connor for pointing me towards html5lib in the first place. It's a huge improvement over HTMLParser.