UNICHARAMBIGS(5)
NAME
unicharambigs - Tesseract unicharset ambiguities
DESCRIPTION
The unicharambigs file (a component of traineddata, see
combine_tessdata(1)) is used by Tesseract to represent possible
ambiguities between characters or groups of characters.
The file contains a number of lines, laid out as follows:
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
Field one      the number of characters contained in field two
Field two      the character sequence to be replaced
Field three    the number of characters contained in field four
Field four     the character sequence used to replace field two
Field five     contains either 1 or 0. 1 denotes a mandatory
               replacement, 0 denotes an optional replacement.
Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the number of
unichars (not bytes).
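The layout above can be illustrated with a minimal Python sketch that
parses a single line. It is not Tesseract's own reader: the names
Ambiguity and parse_ambig_line are hypothetical, and the sketch assumes
that the unichars inside fields two and four are separated by single
spaces, as in the example below, and that each unichar is a single
Unicode character.

```python
from typing import NamedTuple, Tuple


class Ambiguity(NamedTuple):
    """One unicharambigs line (hypothetical container, for illustration)."""
    source: Tuple[str, ...]   # field two: unichars to be replaced
    target: Tuple[str, ...]   # field four: replacement unichars
    mandatory: bool           # field five: 1 = mandatory, 0 = optional


def parse_ambig_line(line: str) -> Ambiguity:
    """Parse one tab-separated line of the five-field layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    n_source, source_text, n_target, target_text, flag = fields

    # Assumption: unichars within a field are separated by spaces.
    source = tuple(source_text.split())
    target = tuple(target_text.split())

    # Fields one and three count unichars, not bytes.
    if int(n_source) != len(source):
        raise ValueError("field one does not match field two")
    if int(n_target) != len(target):
        raise ValueError("field three does not match field four")
    if flag not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")

    return Ambiguity(source, target, flag == "1")
```

Checking the counts in fields one and three against the parsed
sequences catches lines in which the fields were split incorrectly.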
EXAMPLE
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
In this example, all instances of the 2 character sequence '' will
always be replaced by the 1 character sequence "; a 1 character
sequence m may be replaced by the 2 character sequence rn, and the 3
character sequence iii may be replaced by the 1 character sequence m.
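The difference between mandatory and optional entries can be sketched
in Python as follows. This is only an illustration of the semantics
described above, not the way Tesseract applies ambiguities internally:
the function apply_mandatory, the rule tuples, and the greedy
left-to-right matching are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# (source unichars, target unichars, mandatory flag), mirroring the
# three example lines; the tuple layout is hypothetical.
Rule = Tuple[Tuple[str, ...], Tuple[str, ...], bool]

EXAMPLE_RULES: List[Rule] = [
    (("'", "'"), ('"',), True),        # 2 ' ' 1 " 1
    (("m",), ("r", "n"), False),       # 1 m 2 r n 0
    (("i", "i", "i"), ("m",), False),  # 3 i i i 1 m 0
]


def apply_mandatory(unichars: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Rewrite a unichar sequence using only the mandatory rules.

    Optional rules are skipped: they describe alternatives that may be
    considered, not replacements that are always made.
    """
    out: List[str] = []
    i = 0
    while i < len(unichars):
        for source, target, mandatory in rules:
            # Skip optional rules and guard against empty sources.
            if not mandatory or not source:
                continue
            if tuple(unichars[i:i + len(source)]) == source:
                out.extend(target)
                i += len(source)
                break
        else:
            # No mandatory rule matched at this position.
            out.append(unichars[i])
            i += 1
    return out
```

With these rules, a sequence containing '' has that pair rewritten to
", while m and iii are left untouched because their entries are
optional.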
HISTORY
The unicharambigs file first appeared in Tesseract 3.00. Prior to that,
a similar format, called DangAmbigs (dangerous ambiguities), was used.
That format was almost identical, except that only mandatory
replacements could be specified and field 5 was absent.
BUGS
This is a documentation "bug": it’s not currently clear what should be
done in the case of ligatures (such as fi) which may also appear as
regular letters in the unicharset.
SEE ALSO
tesseract(1), unicharset(5)
AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research
groups at Hewlett Packard (1985-1995) and Google (2006-present).
02/09/2012 UNICHARAMBIGS(5)