DEMORONISER(1)
NAME
demoroniser - correct moronic and gratuitously incompatible HTML
generated by Microsoft applications
SYNOPSIS
demoroniser [ -q ] [ -u ] [ -wcols ] [ infile ] [ outfile ]
DESCRIPTION
Many slick, high-profile corporate Web sites I visited seemed to exhibit
terrible grammar completely inconsistent with the obvious investment in
graphics and design. Apostrophes and quote marks were frequently
omitted, and every couple of paragraphs words which should have been
separated by a punctuation mark of some kind were run together.
This remained a mystery to me until I wanted to convert a presentation
I'd developed in 1996 using Microsoft PowerPoint into a set of Web
pages. A friend was kind enough to run the presentation through Power‐
Point's ``Save as HTML'' feature (I have abandoned all use of Microsoft
products, so I did not have a current version of PowerPoint which in‐
cludes this feature). When I got the PowerPoint-generated HTML back
and viewed it in my browser, I discovered that it contained precisely
the same grammatical errors I'd noted on so many Web sites, and which
certainly were not present in my original presentation.
A little detective work revealed that, as is usually the case when you
encounter something shoddy in the vicinity of a computer, Microsoft in‐
competence and gratuitous incompatibility were to blame. Western lan‐
guage HTML documents are written in the ISO 8859-1 Latin-1 character
set, with a specified set of escapes for special characters. Blithely
ignoring this prescription, as usual, Microsoft use their own "exten‐
sion" to Latin-1, in which a variety of characters which do not appear
in Latin-1 are inserted in the range 0x80 through 0x9F--this having the
merit of being incompatible with both Latin-1 and Unicode, which re‐
serve this region for additional control characters.
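The clash can be seen by decoding the same bytes under both character
sets. This is a minimal Python sketch, assuming that Python's built-in
cp1252 codec is a fair stand-in for the Microsoft extension described
here; the sample bytes are made up for illustration.

```python
import unicodedata

# "Hi" wrapped in Microsoft-style opening and closing double quotes.
raw = bytes([0x93, 0x48, 0x69, 0x94])

as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# Under Latin-1 the bytes 0x93 and 0x94 fall in the reserved control
# region, so they decode to characters of Unicode category "Cc".
latin1_categories = [unicodedata.category(ch) for ch in as_latin1]

# Under the Microsoft code page the very same bytes are curly quotes.
cp1252_names = [unicodedata.name(as_cp1252[0]),
                unicodedata.name(as_cp1252[-1])]

print(latin1_categories)
print(cp1252_names)
```

A browser that honours Latin-1 therefore has nothing printable to show
for those two bytes.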
These characters include open and close single and double quotes, em
and en dashes, an ellipsis and a variety of other things you've been
dying for, such as a capital Y umlaut and a florin symbol. Well, okay,
you say, if Microsoft want to have their own little incompatible char‐
acter set, why not? Because it doesn't stop there--in their inimitable
fashion (who would want to?)--they aggressively pollute the Web pages
of unknowing and innocent victims worldwide with these characters, with
the result that the owners of these pages look like semi-literate mo‐
rons when their pages are viewed on non-Microsoft platforms (or on Mi‐
crosoft platforms, for that matter, if the user has selected as the
browser's font one of the many TrueType fonts which do not include the
incompatible Microsoft characters).
You see, ``state of the art'' Microsoft Office applications sport a
nifty feature called ``smart quotes.'' (Rule of thumb--every time Mi‐
crosoft use the word ``smart,'' be on the lookout for something dumb).
This feature is on by default in both Word and PowerPoint, and can be
disabled only by finding the little box buried among the dozens of be‐
wildering option panels these products contain. If enabled, and you
type the string,
"Halt," he cried, "this is the police!"
``smart quotes'' transforms the ASCII quote characters automatically
into the incompatible Microsoft opening and closing quotes. ASCII
single quotes are similarly transformed (even though ASCII already
contains apostrophe and single open quote characters), and double
hyphens are replaced by the incompatible em dash symbol. What other
horrors occur, I know not. If the user notices this happening at all,
their reaction might be ``Thank you Billy-boy--that looks ever so much
nicer,'' not knowing they've been set up to look like a moron to folks
all over the world.
You see, when you export a document as text for hand-editing into HTML,
or avail yourself of the ``Save as HTML'' features in newer versions of
Office applications, these incompatible, Microsoft-specific characters
remain in place. When viewed by a user on a non-Microsoft platform,
they will not be displayed properly--most browsers seem to just drop
them, as opposed to including a symbol indicating an undisplayable
character. Hence, the apparently ungrammatical text, which the author
of the page, editing on a Microsoft platform, will never be aware of.
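Undoing the damage amounts to a byte-for-byte substitution. The sketch
below is in Python rather than Perl, and its table is an illustrative
subset chosen for this example (byte values as in Python's cp1252
codec); it is not the demoroniser's actual translation table, and the
names PLAIN_EQUIVALENTS and plain_text are hypothetical.

```python
# Illustrative subset only; not the demoroniser's real table.
PLAIN_EQUIVALENTS = {
    0x85: b"...",  # ellipsis
    0x91: b"`",    # opening single quote
    0x92: b"'",    # closing single quote, also used as apostrophe
    0x93: b'"',    # opening double quote
    0x94: b'"',    # closing double quote
    0x96: b"-",    # en dash
    0x97: b"--",   # em dash
}


def plain_text(data: bytes) -> bytes:
    """Replace known Microsoft-specific bytes with ASCII stand-ins."""
    out = bytearray()
    for byte in data:
        # Bytes not in the table pass through unchanged.
        out += PLAIN_EQUIVALENTS.get(byte, bytes([byte]))
    return bytes(out)


sample = b"\x93Halt,\x94 he cried, \x93this is the police!\x94"
restored = plain_text(sample)
print(restored.decode("ascii"))
```

Working on raw bytes rather than decoded text sidesteps the question of
which character set the file claims to be in.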
Having no desire to hand-edit the HTML for a long presentation to cor‐
rect a raft of Microsoft-induced incompatibilities, I wrote a Perl pro‐
gram, the demoroniser, to transform Microsoft's ``junk HTML'' into at
least a starting point for something I'd consider presentable on my
site. In addition to replacing the incompatible characters with HTML-
compliant equivalents wherever possible (a few rarely-encountered char‐
acters which can't be translated result in warning messages if encoun‐
tered), the following sloppy or downright wrong HTML is corrected.
· The missing semicolon at the end of numeric character escapes
(such as &#61) is supplied.
· Numeric renderings of special characters (< > &) are
replaced with readable equivalents.
· Unquoted <table> tags containing non-alphanumeric characters
are quoted.
· PowerPoint's mis-nesting of <font> and <strong> tags is cor‐
rected.
· PowerPoint's boneheaded use of <ul> and </ul> tags to accom‐
plish paragraph breaks is corrected and the proper <p> tags
inserted.
· Missing <tr> tags in text-only slides are inserted.
· Nugatory </p> tags are removed.
· Unmatched <li> tags in headings are removed.
· Idiot ``paragraph-long lines'' are broken into something
suitable for editing with a normal text editor.
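Two of the corrections listed above, the missing semicolon and the
numeric renderings of special characters, can be sketched with regular
expressions. This is a Python illustration under stated assumptions,
not the Perl the demoroniser actually uses: the function name
fix_escapes is hypothetical, and taking ``readable equivalents'' to
mean the named entities &lt; &gt; &amp; is a guess.

```python
import re

# Assumed mapping from numeric code to a readable named entity.
NAMED = {"60": "&lt;", "62": "&gt;", "38": "&amp;"}


def fix_escapes(html: str) -> str:
    # Supply the semicolon when a numeric escape lacks one. The
    # lookahead refuses to match if a digit or semicolon follows, so
    # escapes that are already well formed are left alone.
    html = re.sub(r"&#(\d+)(?![\d;])", r"&#\1;", html)
    # Swap numeric renderings of < > & for their named equivalents.
    return re.sub(r"&#(60|62|38);", lambda m: NAMED[m.group(1)], html)


print(fix_escapes("x &#61 y"))
print(fix_escapes("a &#60 b &#62; c &#38"))
```

Running the semicolon repair first means the second substitution only
ever has to recognise the well-formed spelling.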
OPTIONS
-q Quiet: don't print warnings for untranslated characters.
-u Print how-to-call information and a summary of options.
-wcols Wrap output lines at column cols. By default, lines are
wrapped at column 72. A cols specification of 0 disables
line wrapping. demoroniser attempts to wrap lines so as to
preserve their meaning. Lines are broken at white space
whenever possible. If this cannot be done, a line longer
than the cols specification will remain in the output HTML.
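The wrapping behaviour described for -w can be approximated as follows.
This is a Python sketch using the standard textwrap module, not the
demoroniser's own algorithm; the function name wrap_line is
hypothetical, and it makes no attempt to protect HTML tags or quoted
attribute values from being split at white space.

```python
import textwrap


def wrap_line(line: str, cols: int = 72) -> list[str]:
    """Break a line at white space so each piece fits in cols columns."""
    if cols == 0:
        # A cols value of 0 disables wrapping entirely.
        return [line]
    # Never split inside a word: an unbreakable run longer than cols
    # is emitted as an over-long line, as the text above describes.
    return textwrap.wrap(line, width=cols,
                         break_long_words=False,
                         break_on_hyphens=False)


print(wrap_line("aaa bbb ccc", 7))
print(wrap_line("x " + "y" * 20, 10))
```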
BUGS
demoroniser is a Perl script. In order to use it, you must have Perl
installed on your system. demoroniser was developed using Perl 4.0,
patch level 36.
FILES
If no outfile is specified, output is written to standard output. If
no infile is specified, input is read from standard input.
SEE ALSO
perl(1)
AUTHOR
John Walker
WWW: http://www.fourmilab.ch/
This program is in the public domain.
4th Berkeley Distribution 16 SEP 2003 DEMORONISER(1)