When it’s just too hard to automate text conversion, try adding some human judgement using editing macros.
Need to convert some text from one programming language (or markup language or format) to another? It can be tempting (and fun) to write a script that converts the whole job “automagically,” or maybe you can find a converter somewhere. But if the original file was hand-coded or is so complex that parts of the conversion are devilishly hard to get right, automating the task can take more time to perfect than it saves you. And unless you have an automated validator to check that the conversion was done correctly, you may have to inspect the result line-by-line anyway.
So, if doing the job automatically makes your head hurt too much (or if you’d rather be playing a game of Freeciv, http://www.happypenguin.org/), consider automating just part of the job — or doing all of it semi-automatically.
This month, let’s look at one way to do a job like this: with a set of handwritten macros for a text editor. (A macro lets you type a key or two to perform one or more editing steps in a flash.) We’ll use vi, because your columnist’s fingers are well-trained for it, but another programmable editor, like Emacs, should be at least as good. The important thing here is the way to design and use these macros, not which editor you choose.
Let’s look at a specific example: a small part of a job to convert a web site coded in HTML (the busy Google Guide site, http://www.googleguide.com/) into DocBook XML format (see http://www.docbook.org/). Again, this article isn’t just about converting HTML into DocBook. It shows techniques for writing macros for any type of editing job.
First Step: Analyze the Job
Let’s say that you’ve tried to completely automate the conversion. You’ve written a sed or perl (or whatever) script, or you’ve tried an XSLT converter, but your solution keeps failing miserably. Or worse, it fails subtly in ways that are hard to spot.
Try breaking the job into pieces. Search through the original files. Try the Swiss Army Knife of pagers, less, and some of its handy searching and highlighting features, such as the -a and -j options. Or use grep or egrep with the handy options -C, -A, -B, and --color=always, and pipe that output to less -R.
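As a rough illustration of that grep-and-less combination, here is a minimal sketch. It assumes a grep that supports -C and --color=always (GNU grep does); the function name tag_context and the two lines of context are arbitrary choices made for this example.

```shell
#!/bin/sh
# Sketch only: tag_context is a hypothetical helper name.
# Print every line matching PATTERN (case-insensitive) in the named
# files, with line numbers, two lines of context on each side, and the
# match highlighted even when the output goes into a pipe.
tag_context() {
    pattern=$1
    shift
    grep -i -n -C 2 --color=always -e "$pattern" -- "$@"
}
```

Something like tag_context '<h2' *.html | less -R then pages through the highlighted matches; the -R option tells less to pass the color escape sequences through to the terminal.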
Decide what parts of the job are easy to automate. For instance, maybe every second-level heading in an HTML file nicely uses both <H2> and </H2> tags. Or maybe only a couple of them are missing the closing tags, which you can quickly add with a text editor. Then a quick-and-dirty script in a loop can blast through the files, finishing that part of the job in a flash and leaving the rest to do by hand.
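A quick-and-dirty script of that kind might look something like the sketch below. It is only an illustration under stated assumptions: the function names are made up, each heading is assumed to sit on a single line, and the choice of <title> as the DocBook target is a placeholder (wrapping each heading in its section element is left as hand work).

```shell
#!/bin/sh
# Sketch only: function names and the <title> target are assumptions.

# List (as file:line:text) the lines that open an <H2> heading but do
# not close it on the same line. Fix these by hand before converting.
# The extra /dev/null argument makes grep print the filename even when
# only one file is given.
unclosed_h2() {
    grep -n -i '<h2>' "$@" /dev/null | grep -v -i '</h2>'
}

# Filter: rewrite each one-line <H2>...</H2> heading as <title>...</title>.
# Lines without a complete heading pair pass through unchanged.
convert_h2() {
    sed 's|<[Hh]2>\(.*\)</[Hh]2>|<title>\1</title>|'
}

# Convert every file named on the command line, writing FILE.new beside
# each original so nothing is overwritten.
if [ "$#" -gt 0 ]; then
    for f in "$@"; do
        convert_h2 < "$f" > "$f.new"
    done
fi
```

Running unclosed_h2 *.html first shows which headings still need a closing tag. Because the sed pattern is greedy, a line holding two headings would be merged into one title, which is another reason to eyeball the result.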
Listing One shows a fragment of Google Guide’s hand-coded HTML 3.0 before conversion. Listing Two shows the same text after the semi-automated conversion.