Jaak Ristioja
2014-05-19 22:09:26 UTC
Hi!
1) According to http://www.crosswire.org/wiki/DevTools:conf_Files the
\u control word should be followed by a 16-bit signed integer. The
wiki page doesn't say so, but I assume the integer is written in ASCII
decimal form.
The RTFHTML filter code appears to incorrectly parse the following
strings:
"\u-999999" -> getUTF8FromUniChar(48577)
"\u-99999" -> getUTF8FromUniChar(31073)
"\u-0001" -> getUTF8FromUniChar(65535)
"\u-00" -> getUTF8FromUniChar(0)
"\u-0" -> getUTF8FromUniChar(0)
"\u00" -> getUTF8FromUniChar(0)
"\u001" -> getUTF8FromUniChar(1)
"\u99999" -> getUTF8FromUniChar(34463)
"\u-" -> getUTF8FromUniChar(0)
"\u--" -> getUTF8FromUniChar(0)
"\u--2" -> getUTF8FromUniChar(0)
"\u-a" -> getUTF8FromUniChar(0)
I think all these should instead fail.
2) In case an exception is thrown, text might contain a partial result
or the original value.
3) For control word \pard (and similarly for \par and \qc) it
incorrectly parses \pardx as \pard and "x", where it should instead
fail due to an invalid control word \pardx.
4) \par incorrectly appends a newline.
5) "a\qc b" is converted to "a<center> b", but should instead be
"a<center>b</center>" (the ' ' RTF delimiter is output, and the HTML
</center> tag is missing).
6) "a\par b" is converted to "a<p/> b", but should probably be
"<p>a</p><p>b</p>" (the ' ' RTF delimiter is output, and the HTML <p>
and </p> tags are missing).
7) Weird combinations of \par, \pard and \qc result in broken HTML
fragments or HTML fragments with unbalanced start and end tags.
8) Unsupported control sequences do not cause the function to fail,
but are passed to output as plain text (including the backslash).
9) Unescaped '{', '}' and '\' characters are not handled properly (to
pass these from RTF one would need to use the control symbols "\{",
"\}" and "\\" respectively).
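
To make points 1) and 3) concrete, here is a minimal sketch of a strict
control word parser. It is not the existing RTFHTML filter code: the names
ControlWord and parseControlWord are hypothetical, and the strictness rules
(no leading zeros, no "-0", no '-' without digits, 16-bit signed range) are
assumptions taken from the list of inputs above that I think should fail,
not from the RTF specification.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type; not part of the Sword API.
struct ControlWord {
    std::string name;          // letters after the backslash, e.g. "u" or "pard"
    std::optional<int> param;  // numeric parameter, if one was present
    std::size_t length;        // characters consumed, including one delimiter space
};

static bool isAsciiLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Parses one control word at the start of `in`.
// Returns std::nullopt for malformed input instead of guessing a value.
std::optional<ControlWord> parseControlWord(std::string_view in) {
    if (in.empty() || in[0] != '\\') return std::nullopt;

    // The name is the longest run of letters, so "\pardx" yields "pardx"
    // (never "pard" followed by "x"); the caller rejects unknown names.
    std::size_t i = 1;
    while (i < in.size() && isAsciiLetter(in[i])) ++i;
    if (i == 1) return std::nullopt;  // control symbols such as "\{" are not handled here
    ControlWord cw{std::string(in.substr(1, i - 1)), std::nullopt, 0};

    bool negative = false;
    if (i < in.size() && in[i] == '-') { negative = true; ++i; }

    const std::size_t digitsStart = i;
    long value = 0;
    while (i < in.size() && isAsciiDigit(in[i])) {
        value = value * 10 + (in[i] - '0');
        if (value > 32768) return std::nullopt;  // cannot fit in 16 bits; stop before overflow
        ++i;
    }
    const std::size_t digitCount = i - digitsStart;

    if (digitCount == 0) {
        if (negative) return std::nullopt;  // "\u-", "\u--2", "\u-a"
    } else {
        if (digitCount > 1 && in[digitsStart] == '0') return std::nullopt;  // "\u00", "\u-0001"
        if (negative && value == 0) return std::nullopt;                    // "\u-0"
        if (!negative && value > 32767) return std::nullopt;                // "\u32768"
        cw.param = static_cast<int>(negative ? -value : value);
    }

    // A single space after a control word is a delimiter, not output text.
    if (i < in.size() && in[i] == ' ') ++i;
    cw.length = i;
    return cw;
}
```

With this approach the filter would look up cw.name in its table of
supported control words and fail on anything else, and it would skip
cw.length characters so that the delimiter space in "a\par b" never reaches
the output.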
Maybe I'll get around to fixing this someday during the daytime. To save me
extra work, I'd appreciate any comments on this before I start any
coding, especially if the Sword library needs to deviate from the RTF
specifications.
Blessings,
Jaak
PS: I'm glad there are no memory errors in this function. :)
PPS: Please forgive me for having studied formal languages.