therealmarv


Installing Tidy-html5 on Ubuntu, First Step to Get Valid & Structured HTML5


At OERPUB we are starting to transform arbitrary HTML, and especially HTML5 content, into structured HTML5 for further processing into various output formats (HTML5, EPUB, PDF, XML).

The problem with transforming HTML5 into a more structured HTML5 is the parsing. We currently do this transformation with XSLT, which needs XML-compatible (X)HTML5 as input.

In the past we used HTML Tidy, which copes with any kind of HTML tag soup and transforms it into valid XHTML. The only problem is that it is dated: the last version is from 2008 and it does not support HTML5 (and we also want MathML support).
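For reference, a command-line conversion of tag soup to XHTML might look like the sketch below. The function name `to_xhtml` and the `TIDY_BIN` override are made up for illustration. The flags are standard tidy command-line options, and the sketch assumes tidy's usual exit codes (0 for clean input, 1 for warnings, 2 for errors).

```shell
# Convert an HTML file to XHTML with tidy.
# TIDY_BIN is a hypothetical override for pointing at a specific tidy build.
to_xhtml() {
    in="$1"
    out="$2"
    status=0
    # -asxhtml: emit well-formed XHTML; -numeric: numeric instead of named entities
    "${TIDY_BIN:-tidy}" -quiet -asxhtml -numeric -utf8 -output "$out" "$in" || status=$?
    # Treat warnings (exit status 1) as success; only errors (2 or higher) fail
    [ "$status" -le 1 ]
}

# Usage: to_xhtml input.html output.xhtml
```

Numeric entities are chosen here because XML parsers, and therefore XSLT processors, generally do not know HTML's named entities.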

After some searching we found http://w3c.github.com/tidy-html5 which is a fork of tidy with HTML5 support. In my quick tests it seems compatible with the old tidy and (more important for me) also compatible with pytidylib, so I can keep using my old Python code but with the new HTML5 tidy options. :)

Here are instructions on how to replace the old HTML tidy which is included in Ubuntu (tested on 10.04 & 12.04) with the new HTML5 tidy:

Remove all old tidy implementations

sudo apt-get remove libtidy-0.99-0 tidy

Get git, libtool and automake if you do not have them already

sudo apt-get install git-core automake libtool
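If you are not sure which of these tools are already present, a small check like the following can tell you. This is only a sketch: `missing_tools` is a made-up helper name, and checking for `libtoolize` assumes that this is the command the libtool package puts on your PATH.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: the libtool package is detected via its libtoolize command.
missing_tools git automake libtoolize
```

No output means all three commands were found.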

Clone the tidy-html5 repository into a directory of your choice

git clone https://github.com/w3c/tidy-html5
cd tidy-html5

Build the libtidy shared library, then install libtidy and the tidy program (copied from here)

sh build/gnuauto/setup.sh && ./configure && make
sudo make install

tidy and libtidy are now installed, but Ubuntu will not find libtidy by default: it was installed to the folder /usr/local/lib, which is normally not searched for runtime libraries. So we have to add that folder to ldconfig’s search path. Open the file “/etc/ld.so.conf” with root/sudo rights. Example:

gksu gedit /etc/ld.so.conf

and add this line to the file /etc/ld.so.conf

/usr/local/lib
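If you prefer not to open an editor (for example on a server without a desktop), the same change can be scripted. The sketch below is one possible way to do it. `add_lib_dir` is a hypothetical helper, and its defaults simply mirror the path and file used in this post. It appends the line only if it is not already there, so running it twice is harmless.

```shell
# Append a directory to a linker config file unless it is already listed.
# Defaults mirror this post: /usr/local/lib and /etc/ld.so.conf.
add_lib_dir() {
    dir="${1:-/usr/local/lib}"
    conf="${2:-/etc/ld.so.conf}"
    # -x matches whole lines only, -F treats the path as a literal string
    if grep -qxF -- "$dir" "$conf" 2>/dev/null; then
        return 0
    fi
    # Make sure a non-empty file ends with a newline before appending
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi
    printf '%s\n' "$dir" >> "$conf"
}

# Usage (as root, since /etc/ld.so.conf is not user-writable): add_lib_dir
```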

Finally, run ldconfig to rebuild the linker cache and you are set!

sudo ldconfig
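To double-check that the linker now resolves libtidy from /usr/local/lib, you can filter the cache listing printed by `ldconfig -p`. The helper below is a sketch: `lib_paths` is a made-up name, and it assumes the usual `name (flags) => path` layout of that listing.

```shell
# Read `ldconfig -p` output on stdin and print the resolved path of every
# cached library whose name starts with the given prefix.
lib_paths() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { print $NF }'
}

# Usage: ldconfig -p | lib_paths libtidy
# Running `tidy -v` afterwards prints the version of the installed tidy program.
```

If the printed path points into /usr/local/lib, programs that load libtidy at runtime should pick up the new build.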

Just a side note:
To remove HTML5 tidy and reinstall the old tidy, go to the cloned tidy-html5 directory and run

sudo make uninstall
sudo apt-get install tidy libtidy-0.99-0
