At OERPUB we are starting to transform arbitrary HTML, and especially HTML5 content, into structured HTML5 for further processing into various output formats (HTML5, EPUB, PDF, XML).
The hard part of transforming HTML5 into a more structured HTML5 is the parsing. We currently do the transformation with XSLT, and for that we need XML-compatible (X)HTML5 as input.
In the past we used HTML tidy, which handles all kinds of HTML tag soup and transforms it into valid XHTML. The only problem is that it is dated: the last version is from 2008 and it does not support HTML5 (and we also want MathML support).
After some searching we found http://w3c.github.com/tidy-html5 which is a fork of tidy with HTML5 support. In my quick tests it seems compatible with the old tidy and (more important for me) also with pytidylib, so I can still use my old Python code, now with the new HTML5 tidy options. :)
Here are instructions on how to replace the old HTML tidy which is included in Ubuntu (tested on 10.04 & 12.04) with the new HTML5 tidy:
Remove all old tidy implementations
Get git, libtool and automake if you do not have them already
Clone the tidy-html5 repository into a directory of your choice
Build the libtidy shared library and install libtidy and the tidy program (copied from here)
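As a rough sketch, the four steps above could look like the following shell function. The package names (tidy, libtidy-0.99-0, git), the clone URL and the build commands (build/gnuauto/setup.sh, configure, make) are assumptions based on the Ubuntu releases mentioned and on an autotools layout of the repository, so check the project's README before running anything. The `run` helper and the `RUN` variable are hypothetical names: by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch only: package names, clone URL and build commands are assumptions.

# Print each command instead of running it, unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

install_tidy_html5() {
    # 1. Remove the old tidy packages (assumed Ubuntu package names).
    run sudo apt-get remove tidy libtidy-0.99-0 || return 1
    # 2. Install the tools needed for cloning and building.
    run sudo apt-get install git libtool automake || return 1
    # 3. Clone the repository (URL assumed from the project page).
    run git clone https://github.com/w3c/tidy-html5.git || return 1
    run cd tidy-html5 || return 1
    # 4. Build libtidy and tidy, then install them (to /usr/local by default).
    run sh build/gnuauto/setup.sh || return 1
    run ./configure || return 1
    run make || return 1
    run sudo make install
}
```

Calling `install_tidy_html5` prints the commands for review; setting `RUN=1` before the call executes them for real.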
So tidy and libtidy are now installed, but Ubuntu will not find libtidy by default because it was installed to the folder /usr/local/lib, which is normally not searched for runtime libraries. So we have to edit ldconfig’s search folders.
Open the file “/etc/ld.so.conf” with root/sudo rights in an editor of your choice and add the line /usr/local/lib to it.
Finally, run ldconfig again (with root rights) so that the library cache is rebuilt, and you are set!
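The ldconfig part can also be scripted. The sketch below assumes that the line to add is /usr/local/lib (the install folder named above); `add_lib_dir` is a hypothetical helper name, not part of any tool. It appends the folder only if the file does not already list it, so running it twice does no harm.

```shell
#!/bin/sh
# Sketch only: add_lib_dir is a hypothetical helper name.

# Append a directory to an ld.so.conf-style file unless it is already listed.
add_lib_dir() {
    dir="$1"
    conf="$2"
    # -x matches whole lines only, -F treats the path as a fixed string.
    grep -qxF "$dir" "$conf" 2>/dev/null || echo "$dir" >> "$conf"
}

# Usage (as root):
#   add_lib_dir /usr/local/lib /etc/ld.so.conf
#   ldconfig                      # rebuild the runtime library cache
#   ldconfig -p | grep libtidy    # should now list libtidy
```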
Just a side note:
To remove HTML5 tidy and install the old tidy again, go to its cloned directory, uninstall it from there and then reinstall the Ubuntu package.
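A minimal sketch of the removal, assuming the build provides the usual `make uninstall` target and that the old tidy comes from the Ubuntu package named tidy. Both are assumptions, and `run`, `RUN` and `restore_old_tidy` are hypothetical names. As written, the commands are only printed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch only: the uninstall target and the package name are assumptions.

# Print each command instead of running it, unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

restore_old_tidy() {
    src_dir="$1"    # directory the tidy-html5 repository was cloned into
    run cd "$src_dir" || return 1
    # Remove the files that "make install" copied to /usr/local.
    run sudo make uninstall || return 1
    # Put the distribution's tidy back.
    run sudo apt-get install tidy
}
```

For example, `restore_old_tidy ~/tidy-html5` prints the three commands it would run for a clone in the home directory.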