Web sites are rarely maintained by the same person over a long period of time. Much more often, web pages pass from developer to developer. And since the skill (and style) of developers can differ greatly, a site’s HTML can grow to be inconsistent, even ugly, sloppy, and non-compliant with standards.
If you inherit such dubious HTML, you may long for a tool to help clean up and validate the code. Luckily, such a tool exists: it’s called HTML Tidy and it’s available from http://tidy.sourceforge.net/. HTML Tidy can automatically fix a wide range of coding errors and can also tidy up sloppy editing into nice-looking markup (often called pretty printing). HTML Tidy can even make the (often extremely ugly) output from specialized HTML editors such as Frontpage readable by a human.
HTML Tidy was originally written by Dave Raggett and is licensed under the W3C license.
After downloading and unpacking the source tarball, run…
$ /bin/sh build/gnuauto/setup.sh
… from the top source directory. You can now use the normal ./configure && make && make install process to complete the installation, leaving you with the utility tidy.
In addition to fixing a wide range of coding problems, tidy can also highlight things that you need to work on manually. tidy lists each item with its line number and column, so you can easily see where the problem lies in your markup. To be safe, tidy won’t generate a cleaned-up version when there are problems that it’s unsure how to handle. Instead, it logs these as errors rather than warnings.
Some things that tidy can automatically fix include missing or mismatched end tags, end tags in the wrong order, and missing slashes (/) in the end tags of anchors and other elements. tidy understands and can help correct both HTML and XHTML. It has limited support for XML and doesn’t recognize CDATA sections or DTD subsets, among other features. It’s aware of and can cope with ASP, JSTE, and PHP, but does not understand the scripting languages themselves. Because of this, it can easily get confused and may report missing attributes when they appear within such code. Nested quotation marks on single lines within a scripting language can also throw off tidy.
Now that you understand a little about what tidy does, let’s start to use the program. stdin is the default input and stdout is the default output. By default, errors are sent to stderr, but you can use the -f filename.out option to redirect the errors to a file.
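As a rough sketch of those defaults, the following script writes a deliberately sloppy snippet to a scratch file, feeds it to tidy on stdin, and captures the cleaned markup from stdout. The file names sloppy.html, sloppy.out, and sloppy.err are made up for illustration, and the wording of the messages that land in sloppy.err depends on the tidy version installed.

```shell
#!/bin/sh
# Write a deliberately sloppy snippet to a scratch file (hypothetical name).
cat > sloppy.html <<'EOF'
<title>Sloppy page</title>
<h1>Heading
<p>Some <b>bold <i>and italic</b></i> text
<a href="index.html">Home<a>
EOF

# tidy reads stdin and writes stdout by default; -f sends its messages to a file.
if command -v tidy >/dev/null 2>&1; then
    status=0
    tidy -f sloppy.err < sloppy.html > sloppy.out || status=$?
    echo "tidy exit status: $status"
else
    echo "tidy is not installed; skipping" >&2
fi
```

Comparing sloppy.html with sloppy.out, and reading sloppy.err alongside them, is a quick way to see which problems tidy repaired on its own and which it only reported.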
The two main modes you can run tidy in are -m, which modifies the original input files, and -o file, which writes output to the named file. So, the command…
$ tidy -o linuxmag.out -f linuxmag.err linuxmag.html
… takes the file linuxmag.html as input and writes the tidied output to linuxmag.out and errors to linuxmag.err.
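Because -m overwrites its input, it may be worth keeping a copy of each file before running it. The helper below is only a sketch: tidy_in_place is a hypothetical name, and the .bak and .err suffixes are arbitrary choices rather than anything tidy itself requires.

```shell
#!/bin/sh
# tidy_in_place is a hypothetical helper: back up each file, then run tidy -m on it.
tidy_in_place() {
    for f in "$@"; do
        [ -f "$f" ] || { echo "skipping $f: not a regular file" >&2; continue; }
        cp -p "$f" "$f.bak" || return 1
        # An exit status of 1 means warnings only, so only 2 and above is flagged here.
        status=0
        tidy -m -f "$f.err" "$f" || status=$?
        if [ "$status" -ge 2 ]; then
            echo "tidy reported errors in $f; see $f.err" >&2
        fi
    done
}
```

A call such as tidy_in_place *.html would then tidy every HTML file in the current directory while leaving the originals available as .bak copies.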
If you don’t use the -f option to redirect errors to a file, tidy’s messages go to stderr, so simply piping tidy to a utility like less won’t paginate them. You can avoid this problem by redirecting stderr to stdout with:
$ tidy -o linuxmag.out linuxmag.html 2>&1 | less
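For unattended use, where nobody is reading paginated output, a script can branch on tidy’s exit status instead: tidy exits with 0 when it has nothing to report, 1 when there were warnings, and 2 when there were errors. The sketch below assumes the linuxmag.html file from the earlier example; describe_tidy_status is a hypothetical helper name.

```shell
#!/bin/sh
# describe_tidy_status is a hypothetical helper that names tidy's exit codes.
describe_tidy_status() {
    case "$1" in
        0) echo "clean" ;;
        1) echo "warnings" ;;
        2) echo "errors" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Only run tidy when it is installed and the example input file exists.
if command -v tidy >/dev/null 2>&1 && [ -f linuxmag.html ]; then
    status=0
    tidy -o linuxmag.out -f linuxmag.err linuxmag.html || status=$?
    echo "linuxmag.html: $(describe_tidy_status "$status")"
fi
```

Note that in a pipeline such as the one above, the shell reports the exit status of less rather than tidy, which is why this sketch writes messages to a file with -f instead of piping them.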
To make the usage of multiple advanced options easier, tidy has a -config option to specify a configuration file. Refer to the “Quick Reference” on the HTML Tidy web site for a full list of options. In addition, tidy supports many other command line options such as -clean, which replaces FONT, NOBR, and CENTER tags with CSS, and -indent, which indents element content. For a full list of command line options, type tidy -?.
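As an illustration, a configuration file is a plain list of option: value lines. The sketch below writes one that roughly mirrors -indent and -clean and adds a wrap column, then passes it to tidy. The name tidy.conf and the value 72 are arbitrary choices for this example, and the option names should be checked against the Quick Reference for the installed version.

```shell
#!/bin/sh
# Write a small configuration file (tidy.conf is a hypothetical name).
cat > tidy.conf <<'EOF'
indent: auto
clean: yes
wrap: 72
EOF

# Use the configuration file if tidy and the example input are both present.
if command -v tidy >/dev/null 2>&1 && [ -f linuxmag.html ]; then
    status=0
    tidy -config tidy.conf -o linuxmag.out -f linuxmag.err linuxmag.html || status=$?
    echo "tidy exit status: $status"
fi
```

Keeping the options in a file like this means every developer who touches the site can run tidy with the same settings.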
The tidy source also comes with TidyLib, which, as the name suggests, is a thread-safe library version of tidy. In addition to TidyLib, bindings for C++, Perl, COM, .NET, and PHP are also available. As of PHP 5, tidy is a standard extension that can be enabled with the --with-tidy compile option.
Using tidy allows you to quickly and easily clean up HTML code and validate it against the latest standards. While it can’t fix everything, it will alert you to what it cannot fix, so you can handle those cases by hand. The rest is automatic. It can now handle the often unruly Word 2000 HTML output, and in the latest version of PHP it even allows you to correct and pretty print on the fly.
Jeremy Garcia is the founder and administrator of LinuxQuestions.org, a free, friendly, and active Linux community. Please send questions and feedback to