html2xhtml(1)
NAME
html2xhtml - Converts HTML files to XHTML
SYNTAX
html2xhtml [ filename ] [ options ]
DESCRIPTION
Html2xhtml is a command-line tool that converts HTML files to XHTML
files. The path of the HTML input file can be provided as a command-line
argument. If it is not, the input is read from stdin.
Html2xhtml always tries to generate valid XHTML files. It can correct
many common errors in input HTML files without loss of information.
However, for some errors, html2xhtml may discard some information in
order to generate valid XHTML output. This can be avoided with the -e
option, which allows html2xhtml to generate non-valid output in these
cases.
Html2xhtml can generate XHTML output that complies with one of the
following document types: XHTML 1.0 (Transitional, Strict and Frameset),
XHTML 1.1, XHTML Basic and XHTML Mobile Profile.
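The file-or-stdin behaviour and the -e switch described above can be driven from a script. The following is a minimal Python sketch, not part of html2xhtml itself: the helper names build_command and convert_bytes are hypothetical, and the sketch assumes the html2xhtml binary is available on PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(input_path: Optional[str] = None,
                  output_path: Optional[str] = None,
                  allow_invalid: bool = False) -> List[str]:
    """Build an html2xhtml argument list (hypothetical helper)."""
    cmd = ["html2xhtml"]
    if input_path is not None:
        cmd.append(input_path)        # when omitted, html2xhtml reads stdin
    if output_path is not None:
        cmd += ["-o", output_path]    # when omitted, html2xhtml writes stdout
    if allow_invalid:
        cmd.append("-e")              # keep input data even if output is non-valid
    return cmd


def convert_bytes(html: bytes, allow_invalid: bool = False) -> bytes:
    """Pipe HTML through html2xhtml using stdin and stdout."""
    if shutil.which("html2xhtml") is None:
        raise FileNotFoundError("html2xhtml is not on PATH")
    result = subprocess.run(build_command(allow_invalid=allow_invalid),
                            input=html, stdout=subprocess.PIPE, check=True)
    return result.stdout
```

Bytes are used rather than text so that the script does not impose its own character encoding on the document.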
OPTIONS
The command line options/arguments are:
filename Read the HTML input from filename (optional argument).
If this argument is not provided, the HTML
input is read from standard input.
-o filename Output XHTML file. The file is overwritten if it
exists. If not provided, the output is written to
standard output.
-e Instructs the program to propagate input chunks to
the output even if it is unable to adapt them to
the output XHTML doctype. With this option, the
XHTML output may be non-valid. Without it, some
input data may be removed from the output in some
(rare) cases.
-t output-doctype Doctype of the output XHTML file. If not specified,
the program automatically selects either XHTML 1.0
Transitional or XHTML 1.0 Frameset depending on the
input. Currently available doctypes are:
o transitional XHTML 1.0 Transitional
o frameset XHTML 1.0 Frameset
o strict XHTML 1.0 Strict
o 1.1 XHTML 1.1
o basic-1.0 XHTML Basic 1.0
o basic-1.1 XHTML Basic 1.1
o mp XHTML Mobile Profile
o print-1.0 XHTML Print 1.0
--ics input_charset Character set of the input document. This option
overrides the default input character set detection
mechanism.
--ocs output_charset
Character set for the output XHTML document. If
this option is not present, the character set of
the input is used as default.
--lcs Dump the list of available character set aliases
and exit html2xhtml. No conversion is performed
when this option is present.
-l line_length Number of characters per line. The default value is
80. It must be greater than or equal to 40, otherwise
the parameter is ignored.
-b tab_length Tab length in number of characters. It must be a
number between 0 and 16, otherwise the parameter is
ignored. Use 0 to avoid indentation in the output.
--preserve-space-comments
Use this option to preserve white spaces, tabs and
ends of lines in HTML comments. The default, if not
provided, is to rearrange spacing.
--no-protect-cdata Enclose CDATA sections in "script" and "style"
following the XHTML 1.0 specification (using
"<![CDATA[" and "]]>"). This might be incompatible
with some browsers. The default in this version is
to enclose CDATA sections using "//<![CDATA[" and
"//]]>", because major browsers handle it properly.
--compact-block-elements
No white spaces or line breaks are written between
the start tag of a block element and the start tag
of its first enclosed inline element (or character
data) and between the end tag of its last enclosed
inline element (or character data) and the end tag
of the block element. By default, if this option is
not set, a new line character and indentation are
written between them.
--compact-empty-elm-tags
Do not write whitespace before the slash in
empty element tags (i.e. write "<br/>" instead of
the default "<br />"). Note that although both
notations are correct in XML, the XHTML 1.0 standard
recommends the latter to improve compatibility
with old browsers.
--empty-elm-tags-always
By default, empty element tags are written only for
elements declared as empty in the DTD. With this
option, any element that has no content is written
with the empty element tag, even if it is not
declared as empty in the DTD. This option may cause
problems when the XHTML document is opened by
browsers in HTML (tag soup) mode.
--dos-eol Write the output XHTML file with DOS-style (CRLF)
end of line, instead of the default UNIX-style end
of line. Both end of line styles are allowed by
the XML recommendation.
--generate-snippet Treat the input as an HTML fragment instead of a
full document. The output will also be a snippet
and will contain neither the XML and doctype
declarations nor the html, head and body elements.
--help Show a brief help message and exit.
--version Show the version number and exit.
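Because out-of-range values for -l and -b are silently ignored, a wrapper script may prefer to reject them before calling the program. The following Python sketch illustrates one way to do that. The function name format_options is hypothetical, the doctype keywords are copied from the -t list above, and treating 16 as an allowed tab length is an assumption about what "between 0 and 16" means.

```python
from typing import List, Optional

# Doctype keywords accepted by -t, copied from the option list above.
DOCTYPES = {"transitional", "frameset", "strict", "1.1",
            "basic-1.0", "basic-1.1", "mp", "print-1.0"}


def format_options(doctype: Optional[str] = None,
                   line_length: Optional[int] = None,
                   tab_length: Optional[int] = None,
                   snippet: bool = False) -> List[str]:
    """Return html2xhtml option arguments, rejecting values the tool would ignore."""
    args: List[str] = []
    if doctype is not None:
        if doctype not in DOCTYPES:
            raise ValueError(f"unknown doctype keyword: {doctype!r}")
        args += ["-t", doctype]
    if line_length is not None:
        if line_length < 40:
            raise ValueError("line length must be at least 40")
        args += ["-l", str(line_length)]
    if tab_length is not None:
        # Assumption: both ends of the 0..16 range are accepted.
        if not 0 <= tab_length <= 16:
            raise ValueError("tab length must be between 0 and 16")
        args += ["-b", str(tab_length)]   # 0 disables indentation
    if snippet:
        args.append("--generate-snippet")
    return args
```

For example, format_options("strict", line_length=100, tab_length=0) yields the arguments for XHTML 1.0 Strict output with 100-character lines and no indentation.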
NOTE ON CHARACTER SETS
Since version 1.1.2, html2xhtml is able to parse and write HTML and
XHTML documents using the most popular character sets / encodings. It
is also able to read the input document using a given character set and
generate an output that uses a different character set. The iconv
implementation in the GNU C library is used for that purpose.
Any IANA-registered character set that is supported by the iconv
library may be used. When naming a character set, any IANA-approved
alias for it may be used. The full list of aliases recognised by
html2xhtml can be obtained with the --lcs command-line option.
If the character set of the input document is not specified, html2xhtml
tries to guess it automatically. If the character set of the output
document is not specified, html2xhtml writes the output using the same
character set as the input document.
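As an illustration of the character set options, the Python sketch below builds the arguments for reading a document in one character set and writing it in another. The helper names are hypothetical, and the names ISO-8859-1 and UTF-8 in the usage note are only examples; whether a given alias is recognised should be checked against the output of --lcs.

```python
from typing import List, Optional


def charset_options(input_charset: Optional[str] = None,
                    output_charset: Optional[str] = None) -> List[str]:
    """Return --ics/--ocs arguments; omitted values keep html2xhtml's defaults."""
    args: List[str] = []
    if input_charset is not None:
        args += ["--ics", input_charset]    # overrides automatic detection
    if output_charset is not None:
        args += ["--ocs", output_charset]   # default: same as the input
    return args


def list_charsets_command() -> List[str]:
    """Command that prints the recognised aliases and performs no conversion."""
    return ["html2xhtml", "--lcs"]
```

A transcoding call could then be assembled as ["html2xhtml", "legacy.html", "-o", "out.xhtml"] + charset_options("ISO-8859-1", "UTF-8"), where the file names are placeholders.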
NOTE ON END OF LINE CHARACTERS
By default, the UNIX-style one-byte end of line is used. It can be
changed to DOS-style CRLF end of line with the --dos-eol command-line
option.
However, when the program is compiled in the MinGW environment and the
output is sent to standard output, the environment automatically
converts the output to CRLF by default. Do not use the --dos-eol
command-line option in that situation. When the output is sent to a
file with the -o command-line option, the output is as expected
(UNIX-style by default), and the --dos-eol option may be used.
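The rule above can be captured in a small decision helper. This Python sketch is only an illustration: the function name eol_options and its mingw_build flag are hypothetical, and the caller is assumed to know whether the html2xhtml binary in use was compiled in the MinGW environment.

```python
from typing import List, Optional


def eol_options(crlf: bool, output_path: Optional[str],
                mingw_build: bool) -> List[str]:
    """Decide whether to pass --dos-eol to html2xhtml."""
    if not crlf:
        return []                 # UNIX-style end of line is the default
    if mingw_build and output_path is None:
        # Standard output of a MinGW build is already converted to CRLF
        # by the environment, so --dos-eol must not be added.
        return []
    return ["--dos-eol"]
```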
ACKNOWLEDGMENTS
Program developer up to current version:
Jesus Arias Fisteus <jaf@it.uc3m.es>
The first working version of this program was developed as
a Master's thesis at the University of Vigo (Spain) [http://www.uvigo.es],
advised by:
Rebeca Diaz Redondo
Ana Fernandez Vilas
Copyright 2000-2001 by Jesus Arias Fisteus, Rebeca Diaz Redondo, Ana
Fernandez Vilas.
Copyright 2002-2009 by Jesus Arias Fisteus