Discussion:
[sword-devel] RTFHTML filter bugs
Jaak Ristioja
2014-05-19 22:09:26 UTC
Hi!

1) According to http://www.crosswire.org/wiki/DevTools:conf_Files the
\u control word should be followed by a 16-bit signed integer. The
wiki page doesn't mention this, but I assume it is in ASCII in decimal
form.

The RTFHTML filter code appears to incorrectly parse the following
strings:

"\u-999999" -> getUTF8FromUniChar(48577)
"\u-99999" -> getUTF8FromUniChar(31073)
"\u-0001" -> getUTF8FromUniChar(65535)
"\u-00" -> getUTF8FromUniChar(0)
"\u-0" -> getUTF8FromUniChar(0)
"\u00" -> getUTF8FromUniChar(0)
"\u001" -> getUTF8FromUniChar(1)
"\u99999" -> getUTF8FromUniChar(34463)
"\u-" -> getUTF8FromUniChar(0)
"\u--" -> getUTF8FromUniChar(0)
"\u--2" -> getUTF8FromUniChar(0)
"\u-a" -> getUTF8FromUniChar(0)

I think all these should instead fail.

2) In case an exception is thrown, text might contain a partial result
or the original value.

3) For control word \pard (and similarly for \par and \qc) it
incorrectly parses \pardx as \pard and "x", where it should instead
fail due to an invalid control word \pardx.

4) \par incorrectly appends a newline.

5) "a\qc b" is converted to "a<center> b", but should instead be
"a<center>b</center>" (' ' RTF delimiter output, missing HTML
</center> tag)

6) "a\par b" is converted to "a<p/> b", but should probably be
"<p>a</p><p>b</p>" (' ' RTF delimiter output, missing HTML <p> and
</p> tags).

7) Weird combinations of \par, \pard and \qc result in broken HTML
fragments or HTML fragments with unbalanced start and end tags.

8) Unsupported control sequences do not cause the function to fail,
but are passed to output as plain text (including the backslash).

9) Unescaped '{', '}' and '\' characters are not handled properly (to
pass these from RTF one would need to use the control symbols "\{",
"\}" and "\\" respectively).

Maybe I'll get around to fixing this someday during daytime. To save me
extra work, I'd appreciate any comments on this before I start any
coding, especially if the Sword library needs to deviate from the RTF
specifications.
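
As a minimal sketch of the strict behaviour proposed in point 1, the
following Python function accepts only a canonical decimal 16-bit signed
integer after \u. The function name parse_u_escape, the ban on leading
zeros and on "-0", and the mapping of negative values onto the range
32768..65535 are assumptions made for illustration; this is not the Sword
implementation.

```python
import re

# A backslash, "u", an optional minus sign, then a canonical decimal number:
# either a single "0" or digits without a leading zero. The lookahead rejects
# trailing digits, so inputs with leading zeros do not match at all.
_U_ESCAPE = re.compile(r"\\u(-?(?:0|[1-9][0-9]*))(?![0-9])")


def parse_u_escape(text, pos=0):
    # Returns (code_unit, end_offset); raises ValueError on malformed input.
    match = _U_ESCAPE.match(text, pos)
    if match is None:
        raise ValueError("malformed Unicode escape at offset %d" % pos)
    literal = match.group(1)
    if literal == "-0":
        raise ValueError("negative zero is not a canonical value")
    value = int(literal)
    if not -32768 <= value <= 32767:
        raise ValueError("value %d does not fit in 16 signed bits" % value)
    # Assumption: negative values stand for code units above 32767.
    return value & 0xFFFF, match.end()
```

Under these assumptions, parse_u_escape(r"\u8364") yields the code unit
8364 (the Euro sign), while each of the twelve strings listed in point 1
raises ValueError.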


Blessings,
Jaak

PS: I'm glad there are no memory errors in this function. :)
PPS: Please forgive me for having studied formal languages.
David Haslam
2014-05-20 18:01:41 UTC
Take care with Right to Left languages such as Hebrew.

i.e. After any patches to the filter, please include some testing for BiDi
text in the About= field and others.

David



--
View this message in context: http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html
Sent from the SWORD Dev mailing list archive at Nabble.com.
Jaak Ristioja
2014-05-20 22:27:36 UTC
I've never done BiDi, but I'm not sure I need to take that into account
while fixing the RTF parsing. As I currently understand it, this
particular piece of code does not support any part from the RTF spec
dealing with bidirectional text handling. Hence all BiDi information
contained in the configuration file strings (e.g. About=) is contained
either in the plain ASCII text or the \u<num> Unicode escapes which this
algorithm should pass through unmodified.

...except for characters that are special in HTML, which should actually
be escaped as entities. I previously failed to notice this bug in the
algorithm. Additionally, I forgot that non-ASCII characters in the input
string should also lead to a parsing failure.
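
A minimal Python sketch of those two rules for a run of plain text,
assuming a hypothetical helper named escape_text_run and using the
standard html.escape function for the entity escaping:

```python
import html


def escape_text_run(text):
    # Assumption from this message: any non-ASCII character is a parse error.
    if not text.isascii():
        raise ValueError("non-ASCII character in RTF input")
    # Replace &, <, > and both quote characters with HTML entities.
    return html.escape(text, quote=True)
```

For example, escape_text_run("a<b & c") gives "a&lt;b &amp; c".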

Jaak
_______________________________________________
sword-devel mailing list: sword-devel at crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
Chris Burrell
2014-05-21 12:19:49 UTC
I believe some conf files have direct Unicode (rather than escaped
sequences) in them, and that is preferred.
Jaak Ristioja
2014-05-21 12:59:45 UTC
So this means that actually we want non-standard RTF (someone should
update the wiki). Should we assume UTF-8? Are you sure we don't have any
modules with ISO-8859-something encoded values?

If we choose any ASCII superset encoding, we have to consider at least
these two points:

* Since the RTF control words and delimiters are specified in ASCII
only, we need to decide how the bytes of the superset act as
delimiters and as parts of "RTF" control words. For example, whether
Unicode letter, number, spacing, punctuation, control etc. characters
constitute parts of RTF control words or act as delimiters.

* In case of encodings where characters may consist of multiple bytes
(e.g. the variable-length UTF-8) we must consider the character
boundaries. We can't just pass through any non-ASCII byte values. For
example, the following bit sequence wouldn't make sense:

11100010 01011100 10000010 01110001 10101100 01100011

which is a UTF-8 encoded Euro sign, €, interleaved with bytes of the
ASCII string "\qc". It just doesn't make sense, whereas the following
sequences would be correct:

11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)
01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)

So depending on the encoding, it would be correct to detect such cases;
otherwise we end up with invalid Unicode output.
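
The three bit sequences above can be checked with a short Python sketch.
The helper names bits_to_bytes and is_valid_utf8 are made up for this
example; the check itself relies only on Python's strict UTF-8 decoder.

```python
def bits_to_bytes(bits):
    # Turn a string of space-separated octets in binary into a bytes object.
    return bytes(int(octet, 2) for octet in bits.split())


def is_valid_utf8(data):
    # Strict decoding fails on broken or interleaved multi-byte sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


interleaved = bits_to_bytes(
    "11100010 01011100 10000010 01110001 10101100 01100011")
euro_then_qc = bits_to_bytes(
    "11100010 10000010 10101100 01011100 01110001 01100011")
qc_then_euro = bits_to_bytes(
    "01011100 01110001 01100011 11100010 10000010 10101100")

print(is_valid_utf8(interleaved))    # the interleaved sequence
print(is_valid_utf8(euro_then_qc))   # Euro sign followed by \qc
print(is_valid_utf8(qc_then_euro))   # \qc followed by Euro sign
```

Only the interleaved sequence is rejected. Validating the whole input
this way before any RTF parsing is one possible way to detect such cases.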

Blessings,
Jaak
Jaak Ristioja
2014-05-21 13:16:17 UTC
To sum up, we would need to agree on and specify an RTF subset which is
Unicode-aware (UTF-8 only?), and implement a Unicode-aware transducer
for it.
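
As a rough illustration of what such a subset could look like, here is a
Python sketch of a tokenizer that works on already-decoded text. In it,
only ASCII lowercase letters form control words, any other character ends
the word, unknown control words fail, and '{', '}' and '\' are accepted
only as control symbols. The names tokenize and KNOWN_WORDS, the tuple
layout and the set of supported words are hypothetical choices, not an
agreed specification.

```python
import re

# Control words mentioned in this thread; anything else is rejected.
KNOWN_WORDS = {"par", "pard", "qc", "u"}

_TOKEN = re.compile(
    # Control word: ASCII letters, optional numeric parameter, and one
    # optional space that acts as the delimiter and is not output.
    r"\\(?P<word>[a-z]+)(?P<param>-?[0-9]+)? ?"
    # Control symbol: an escaped backslash or brace.
    r"|\\(?P<symbol>[\\{}])"
    # Plain text: anything except backslash and braces, non-ASCII included.
    r"|(?P<text>[^\\{}]+)"
)


def tokenize(rtf):
    tokens = []
    pos = 0
    while pos < len(rtf):
        match = _TOKEN.match(rtf, pos)
        if match is None:
            raise ValueError("unexpected %r at offset %d" % (rtf[pos], pos))
        if match.group("word") is not None:
            word = match.group("word")
            if word not in KNOWN_WORDS:
                raise ValueError("unsupported control word: " + word)
            # The parameter is kept as text; range checks happen elsewhere.
            tokens.append(("word", word, match.group("param")))
        elif match.group("symbol") is not None:
            tokens.append(("text", match.group("symbol"), None))
        else:
            tokens.append(("text", match.group("text"), None))
        pos = match.end()
    return tokens
```

With this sketch, "a\qc b" becomes a text token "a", the control word qc
and a text token "b", while "\pardx" and an unescaped "{" both raise
ValueError. Because it operates on decoded characters rather than bytes,
control words cannot split a multi-byte character.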
DM Smith
2014-05-21 14:45:05 UTC
The encoding of the conf is either CP1252 (the default, but called Latin-1) or UTF-8. The encoding of the conf matches that of the module. This may cause the conf to be read twice, once for the default and once for UTF-8, if the module encoding is set to UTF-8.

There have been confs that are incorrect with regard to this rule.

In Him,
DM
Jaak Ristioja
2014-05-22 08:13:13 UTC
I think I don't understand what you're saying. The frontend should
read the configuration twice? Sword should? Huh? I don't understand.
Why this complexity?! Do you mean that:

* Sword reads the entire configuration file as CP1252 encoded
* On failure, re-read the configuration file as UTF-8 encoded

???

If this is the case, then this is error prone (even when reading only
parts of the configuration), because CP1252 and UTF-8 overlap. Hence
data encoded as UTF-8 might be parsed correctly as valid CP1252, even
though it was intended to be UTF-8. I mean I find it likely that valid
UTF-8 strings might be accepted by a perfectly correct CP1252 encoding
checker as valid CP1252.
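
A small Python sketch of that overlap, using Python's own cp1252 codec
(other CP1252 decoders may treat the unassigned bytes differently). The
helper name try_decode is made up for this example.

```python
def try_decode(data, codec):
    # Return the decoded text, or None if the bytes are invalid in codec.
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


euro_utf8 = "\u20ac".encode("utf-8")      # Euro sign, bytes E2 82 AC
a_acute_utf8 = "\u00c1".encode("utf-8")   # A with acute, bytes C3 81

# Accepted as CP1252, but decoded to three unrelated characters.
print(try_decode(euro_utf8, "cp1252"))
# Rejected here, because byte 0x81 is unassigned in Python's cp1252 codec.
print(try_decode(a_acute_utf8, "cp1252"))
```

So a CP1252 check accepts some UTF-8 input silently and rejects other
UTF-8 input, which is why a decode failure alone is not a reliable signal
for choosing between the two encodings.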

Jaak
DM Smith
2014-05-22 11:18:53 UTC
The Encoding field drives the encoding of the file. When not present use the default.

The front end should never read the file. It is the engine's responsibility to do the reading. It is not the reading of the file that may need to be done twice but rather the byte stream/buffer from the file. How it gets the byte stream/buffer for the second (failure) case is its business.

It could *always* read it twice. First time as binary to read the ASCII content of the Encoding= field. The second time to do the charset conversion. But I'm not recommending that.

Btw I work on JSword which parses as it reads the stream from the file. It rewinds the stream if it is UTF-8 and rereads. It is not error prone.

This complexity exists because that is the way things are and because we need to support legacy confs.
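
The "read it twice" variant described above could be sketched in Python roughly as follows. This is only an illustration, not how Sword or JSword is implemented: the function name decode_conf and the exact matching of the Encoding= line are assumptions, and any value other than UTF-8 falls back to the CP1252 default.

```python
import re

# First pass: look for an ASCII "Encoding=" line in the raw bytes.
_ENCODING_LINE = re.compile(
    rb"^Encoding[ \t]*=[ \t]*([A-Za-z0-9_-]+)[ \t]*\r?$", re.MULTILINE)


def decode_conf(raw):
    match = _ENCODING_LINE.search(raw)
    declared = match.group(1).decode("ascii").upper() if match else ""
    # Assumption: only UTF-8 is recognised; everything else is the default.
    codec = "utf-8" if declared == "UTF-8" else "cp1252"
    # Second pass: convert the same buffer with the chosen charset.
    return raw.decode(codec)
```

Here the Encoding field alone selects the charset, so the result does not depend on whether a CP1252 decode happens to fail.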
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQgcBAEBAgAGBQJTfbGVAAoJELozJlbjIn79zn4//3Jx81Qgjoj22zshBizjqjrM
Liky9QigioZFvoqTSdCp3E51S7ruYhK0CdKl44OL+/66RbeflTbvu/YPUkJswB8Y
lb/7e5HKUrVTVB2/pIU0OeRBFK0YLZl8JyupsHg6oidBTHt1yt5TMJMv1TeXaJYs
cYh4QwPH7Cn5yH2EzfVW9rSeUKyOwDSAWM4f3DyvsAKyIIHkZyZf3DtxhY6T81/4
FB8jCYq3Jrj3jihVOe9rjRafBmIGDXuQWmT4zlwmoZrXa7MrPdx2Cxmaa4rUu98c
AK5HDS7sD/LJslxYCmsMV3VXxdG4UMeM+/oLrl237Uh1vRjALtAx9rads1j/brtV
eNAoWfSNJDf3AHZW3CrHF5yiO8bTPUh6AdpNsQtfwg2FK4kF1EfZTW6lwRH/7HES
Z2TUYRATwpTUinRZxlF3CUQCdhldNQXFk2yEBmWr1ZtziPRd+3bqZBOmg1qSjN1/
PmqOS7Vxfsw1f7OvFdnFN03KAt2C0Rqo0OBSFgujJbb08PdvdZFIfUldnBXL5Slf
AQgOQpMpP4nX0V8S+GA4k+oQBxMYg7Ow3BWyj2ugc9PZ3wR07oeB91Mi+uEQIUK4
fdhIE3POwoeGYMuQoq6CvcGQ+fq4piNETnwGEKU2Gxi8yrGmLwbUl861Nx4VW6ar
y91D9n0Yiror3ziuAqmfp3PwIQjBcxsFev4HAZw+N7uXSR8WUGpPhmW+Fv5ulhHy
fkzNe8dTvY7qYebjLbD73nLLleyLp1CC+MnJ/pPvV59WyqxOT2s37ar97u5Ktqan
3NUvq9DxNB2A9W7PN20v61kxSbFvaWjKMvbXfpN+qvvLqHf0wfAS2o6Y8/JzuHrO
wsQNNgCXyugzRv1nIyP5ZjPTo9fcOUNxp+JmC60HpbKtElYD8e5DQQjNovcj7iTu
1zZgux2tSnc++pILLdu0XLeFOM0YO10wsYUt3uyKW6ldmpfKOzwYDZK1/2IIc40F
Y4wGZLTGayOV/H5LWbFszdyTIee678YJIT/rz9nxxxZMDO9F6ZfvBTZ3zolyE9/7
/lO4VOy7vSZZRsy5ecfSsApYVugNgYBy7KED2zAl/65DwPPLOw3y9OUhAWxxJ1hl
WOetXDilRCrlHrHQx88f5fhtYwNga1+Qv9rMJy6/gsQclSNs7AQ/bweGil8o4jqN
e59YGRgOou5k9eW9wY+RAGz6QvKN2qtq3djIn/5UudHI9NDi9lvkvGttURceOYCM
Is3r21LZvgKQorAtOumxienhauK31QmmO1qQcoKE07N+/4CiMCAPfSUE/E75mA2B
j81+hPt5/R4FLfa42hN6evL3286Al+7zYcB4VEfAWHzHUT4psNqJG5B5PdtkA+zA
TbmOgqkrgYmfA37PBLvAxpps0Zn2EZ+JtH/dcznijOMeiUmk59L+rxM9nzjXsJ2B
RzuhklK2h68Y/9G0CAki917l8UWz/S113+IsYCkfvo++EZHMmjLjktkKrkMGYhlQ
eppDE3cYKEEsLKHquMj4dMJdrjc7GOpYyUd8JETlWyHF13Zy7m7MgyWihDJf3Mre
g1axaEueASaA+MU3VPV2e/uiWphBRWmo07Ye8mnIC2O0Fnxzx5/YwYKFJK8bjVDy
iEH4rDohPoJENBJKV7hUyU3D89+pzUlOGKRTqWY2HQpOc9Hhd4GBfvvfbB3HAhYg
miWImi7Itx7h3VuuVbCCcZr6EucHD8uKPFsUjN1eqkEq9GyV4hj37MxN+1taGyZi
8yIYoHBa/OcHMWq+Wg85XC+IAYyNYxGEq0D07Ap3SabASw3B8D1FpjhfXi/ZqLMr
cgLIDNF6Gecm8Gq+Fdd4mA/Rhukavu8Kh1l1QUSTvdK6iV6a2RvWVW9WmEdrIpmK
Ko++rRUdCXBVpg8m9Wx6U16+6k2heYyvWeE4iqiuAWxM6d6SDMMOZpWGF1EJwzVP
bScm+PuiJi88CMcIBnap4YYzJc9BDpORz6ca/S9s0Z6Q53kdzc3pK2AJ2W2lIpJL
jFxAEdRBZBIHT+93clejyA3TXeSHUNvF6w+CBjcgDf4f+HOeB3KrcyjwEzpKZZjG
D5IxfoxQyR2oHp8JfFb65YFvRJ8Tm1U3SsrtODDxReHqZ9WTaH1DjScLpuOe0K87
ikK/CU9M0ipMLcdjn/VU312Qz+qSze1vRJz2J58GX/gjVyi773ccm7mhzdZ+EzbD
e6XsGH0poUXyyNSL4R2YGyDlegacZbAd5J+HlLFmN+9Ln8JAviP5lCMr/D1QokmU
BlW9WiKxVU72FxwO6Ohu432iFhLhhsGGVzkxvaiRzcIzf/b3A0neTp3qvKtZWeOG
v+XjxWw1Pz5ZzVp202t5jDZ/9CGl/wLbpVwdp4OUo5L+VMUXoXXApiEfpAA2mfBC
0J5CrKc5ywMMoOAiHyi6ZDQ3d51P4YT0fZyqgZIBSNrVUIGgf6bgTEEVB1e1uXkY
Ht4JoSVEmVNT60V2mMurJSGvFbYgMNmakCktv4i+P/tHDF05oXx1gmh2td1/Xqxz
pFe2PWPKEITsDr8MkpzZ/evDKfZcfxnx/HI6GSd1joXEiqcI8DMwfI8TUMRVXppy
EsyOxOGFdlex1WzCqXTH3HHja3Dm+IC2ery9ohcyTY4LYEYSVkfsJEtz5zOamzUy
P/FztoIp0sO7vKDOxMso8YIESMly/6wOjd9zvuUGtrsgtKd32WvpizaQK3uuNS3x
5bAjQAWdEcD9uL5JF9zl
=wEfl
-----END PGP SIGNATURE-----
_______________________________________________
sword-devel mailing list: sword-devel at crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
Greg Hellings
2014-05-21 16:44:34 UTC
Permalink
So this means that actually we want non-standard RTF (someone should
update the wiki). Should we assume UTF-8? Are you sure we don't have any
modules with ISO-8859-something encoded values?
The wiki states that the Unicode character is preferred, at least for conf
files, over the RTF escaped value. Specifically it must be Unicode encoded
as UTF 8 or CP1252.
If we choose any ASCII superset encoding we have to consider at least:
* Since the RTF control words and delimiters are specified in ASCII
only, we need to decide how the bytes of the superset act as
delimiters and parts of "RTF" control words. For example, whether the
Unicode letter, number, spacing, punctuation, control etc. characters
constitute parts of RTF control words or act as delimiters.
* In case of encodings where characters may consist of multiple bytes
(e.g. the variable-length UTF-8) we must consider the character
boundaries. We can't just pass through any non-ASCII byte values. For
example:
11100010 01011100 10000010 01110001 10101100 01100011
Did you literally split the individual bytes of the euro character around
the other bytes? What possibly valid encoding permits that? Is that a
valid UTF 8 sequence? If not, then the file fails to be UTF 8 encoded and
the engine either will error or otherwise behave in undefined ways due to
invalid input.

--Greg
which is a UTF-8 encoded Euro sign, €, interleaved with bytes of the
ASCII string "\qc". It just doesn't make sense, whereas the following do:
11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)
01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)
So, depending on the encoding, it would be correct to detect such cases;
otherwise we end up with invalid Unicode output.
Blessings,
Jaak
Post by Chris Burrell
I believe some conf files have direct unicode (rather than escaped
sequences) in them and that is preferred.
On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee
I've never done BiDi, but I'm not sure I need to take that into account
while fixing the RTF parsing. As I currently understand it, this
particular piece of code does not support any part from the RTF spec
dealing with bidirectional text handling. Hence all BiDi information
contained in the configuration file strings (e.g. About=) is contained
either in the plain ASCII text or the \u<num> Unicode escapes which this
algorithm should pass through unmodified.
...except for HTML entities which should actually be escaped. This bug
in the algorithm I previously failed to notice. Additionally I forgot
that non-ASCII characters in the input string should also lead to
parsing failure.
Jaak
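A minimal sketch of the two checks mentioned above, escaping HTML-special characters and rejecting non-ASCII input. Both function names are hypothetical and are not part of the SWORD API; whether non-ASCII input should really be rejected is exactly what the thread is debating.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: replaces the characters that are special in HTML
// text and attribute values with the corresponding entities.
std::string escapeHtml(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;        break;
        }
    }
    return out;
}

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
bool isAscii(std::string_view in) {
    for (char c : in) {
        if (static_cast<unsigned char>(c) >= 0x80) return false;
    }
    return true;
}
```

With this sketch a strict filter would call isAscii() first and fail on a false result, then run escapeHtml() on every piece of plain text it copies to the output.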
Post by David Haslam
Take care with Right to Left languages such as Hebrew.
i.e. After any patches to the filter, please include some testing for BiDi
text in the About= field and others.
David
Jaak Ristioja
2014-05-22 09:39:01 UTC
Permalink
Post by Greg Hellings
Post by Jaak Ristioja
So this means that actually we want non-standard RTF (someone
should update the wiki). Should we assume UTF-8? Are you sure we
don't have any modules with ISO-8859-something encoded values?
The wiki states that the Unicode character is preferred, at least
for conf files, over the RTF escaped value. Specifically it must be
Unicode encoded as UTF 8 or CP1252.
Do I get this right, that before parsing any (possibly RTF)
configuration fields, we must parse the Encoding= field to detect the
encoding for all other fields?

IMHO most (!!!), but not all, valid UTF-8 is valid CP1252. For example,
11000010 10000001
is a valid UTF-8 bytestream, but not a valid CP1252 bytestream, because
the last byte (0x81) is not defined in CP1252. Additionally,
10000000 00100001
is a valid CP1252 bytestream (euro sign € and exclamation mark !), but
not a valid UTF-8 bytestream, because UTF-8 characters CAN NOT begin
with 10xxxxxx. However,
11010101 10100001
is both a valid UTF-8 bytestream (1 character) and a CP1252 bytestream
(2 characters), but
10000001 10000001
is neither valid CP1252 nor valid UTF-8.
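The distinction can be checked mechanically. The following is a sketch only, with hypothetical function names: isValidCp1252() rejects the five byte values that Windows-1252 leaves undefined, and isValidUtf8() follows the well-formedness rules of RFC 3629 (correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// True if no byte is one of the five values undefined in Windows-1252.
bool isValidCp1252(std::string_view bytes) {
    for (unsigned char b : bytes) {
        if (b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D)
            return false;
    }
    return true;
}

// True if `bytes` is well-formed UTF-8 in the sense of RFC 3629.
bool isValidUtf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        std::uint32_t minimum = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minimum = 0x10000; }
        else return false;              // stray continuation byte or invalid lead byte
        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a 10xxxxxx byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minimum) return false;                  // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

Under these rules the bytes 0x80 0x21 pass only the CP1252 check, 0x81 0x81 pass neither, and the interleaved sequence E2 5C 82 71 AC 63 from earlier in the thread fails the UTF-8 check because 0x5C is not a continuation byte.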
Post by Greg Hellings
Did you literally split the individual bytes of the euro character
around the other bytes? What possibly valid encoding permits that?
Is that a valid UTF 8 sequence? If not, then the file fails to be
UTF 8 encoded and the engine either will error or otherwise behave
in undefined ways due to invalid input.
Yes, I did literally split that. No valid encoding permits that. But of
course we should not assume all user input is valid; we must prevent
undefined behaviour, crashes, exploits etc. If the Sword project
wants to allow code with "undefined behaviour" (with respect to the
C++ standard), I do not want to be part of this project.

I suggest we be strict in all parsing, because lax parsing could result in
security issues, as I presented in another thread on this list. If we
want to allow non-conforming user input, we should at minimum output a
warning, but still do parsing in a secure manner which does not cause
undefined behaviour or provide an attack vector.

Jaak
Greg Hellings
2014-05-21 17:05:28 UTC
Permalink
Greg
Hi!
1) According to http://www.crosswire.org/wiki/DevTools:conf_Files the
\u control word should be followed by a 16-bit signed integer. The
wiki page doesn't mention this, but I assume it is in ASCII in decimal
form.
It would be either CP1252 or UTF 8 like the rest of the file.
The RTFHTML filter code appears to incorrectly parse the following
strings:
"\u-999999" -> getUTF8FromUniChar(48577)
"\u-99999" -> getUTF8FromUniChar(31073)
"\u-0001" -> getUTF8FromUniChar(65535)
"\u-00" -> getUTF8FromUniChar(0)
"\u-0" -> getUTF8FromUniChar(0)
"\u00" -> getUTF8FromUniChar(0)
"\u001" -> getUTF8FromUniChar(1)
"\u99999" -> getUTF8FromUniChar(34463)
"\u-" -> getUTF8FromUniChar(0)
"\u--" -> getUTF8FromUniChar(0)
"\u--2" -> getUTF8FromUniChar(0)
"\u-a" -> getUTF8FromUniChar(0)
I think all these should instead fail.
The last three should return -, 2, and a respectively if I read the wiki
page correctly that allows a final character to use when the conversion
otherwise won't work.

Why you think the signed values that are zero-prefixed should fail I don't
understand. Those which fall beyond the range of a sixteen-bit integer are
the only ones I might agree should fail. However, since Unicode now
exceeds sixteen bits, I think it is our limitation that ought to change.
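As a side note on the out-of-range cases: the values reported in the list above are all consistent with the parsed number being reduced modulo 65536. That is an observation about the numbers, not a statement about how the filter is actually written. The helper below is purely illustrative.

```cpp
#include <cstdint>

// Illustrative helper: reduces an integer modulo 2^16, which is what a
// conversion to an unsigned 16-bit type does in C++.
std::uint16_t wrapTo16Bits(long long value) {
    return static_cast<std::uint16_t>(value);
}
```

For instance, 99999 - 65536 = 34463 and 2 * 65536 - 99999 = 31073, matching the values reported for "\u99999" and "\u-99999".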
2) In case an exception is thrown, text might contain a partial result
or the original value.
3) For control word \pard (and similarly for \par and \qc) it
incorrectly parses \pardx as \pard and "x", where it should instead
fail due to an invalid control word \pardx.
4) \par incorrectly appends a newline.
Why is a newline incorrect? Newlines are mostly ignored in HTML.
5) "a\qc b" is converted to "a<center> b", but should instead be
"a<center>b</center>" (' ' RTF delimiter output, missing HTML
</center> tag)
6) "a\par b" is converted to "a<p/> b", but should probably be
"<p>a</p><p>b</p>" (' ' RTF delimiter output, missing HTML <p> and
</p> tags).
7) Weird combinations of \par, \pard and \qc result in broken HTML
fragments or HTML fragments with unbalanced start and end tags.
I don't believe the contract of this filter guarantees valid HTML, and HTML
allows unbalanced tags. In fact it is preferred in some older HTML specs
for certain tags, p being a prominent example of such tags.
8) Unsupported control sequences do not cause the function to fail,
but are passed to output as plain text (including the backslash).
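Items 3 and 8 both come down to how a control word is delimited. In RTF a control word is a backslash followed by a run of ASCII letters, so "\pardx" is one (unknown) word rather than "\pard" plus "x". The sketch below illustrates that rule; the function names are hypothetical, and the list of supported words is only the set named in this thread, not a definitive inventory of what the filter handles.

```cpp
#include <cstddef>
#include <string_view>

// Returns the control word name at the start of `in`: the longest run of
// ASCII letters after the backslash. Returns an empty view if `in` does not
// start with a backslash followed by a letter.
std::string_view controlWordName(std::string_view in) {
    auto isLetter = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    if (in.empty() || in[0] != '\\') return {};
    std::size_t end = 1;
    while (end < in.size() && isLetter(in[end])) ++end;
    return in.substr(1, end - 1);
}

// Assumed set of supported words, taken from the ones discussed here.
bool isSupportedControlWord(std::string_view name) {
    static constexpr std::string_view supported[] = {"par", "pard", "qc", "u"};
    for (std::string_view s : supported) {
        if (s == name) return true;
    }
    return false;
}

// True if `in` starts with one of the control symbols \{ \} or \\ that
// stand for a literal brace or backslash.
bool startsWithEscapedLiteral(std::string_view in) {
    return in.size() >= 2 && in[0] == '\\' &&
           (in[1] == '{' || in[1] == '}' || in[1] == '\\');
}
```

Under this reading, controlWordName() yields "pardx" for the input "\pardx", which isSupportedControlWord() rejects. Whether that should be an error or a pass-through is the policy question argued below.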
8) Unescaped '{', '}' and '\' characters are not handled properly (to
pass these from RTF one would need to use the control symbols "\{",
"\}" and "\\" respectively).
The rest of your objections seem to be based on a different objective than
SWORD filter objectives. The purpose is not to force compliance to a strict
spec but instead to give a "best effort" attempt at conversion. The same
way that most browsers will accept invalid input but make a best effort to
display (unescaped & characters will usually display as is and invalid
nesting such as having a div inside of a p tag still works out somewhat
reasonably) the SWORD engine is lax in what it accepts.

It follows the general maxim "be strict in what you produce but lenient in
what you accept." Crosswire produced content should not include such
invalid input, but the engine is intentionally written to make a best
attempt to handle innocuous invalid input. This is because we want to
encourage as many people as possible to use the engine even if they are not
strict in what they produce.

If there are existing modules with bad content or in places where the
filters are producing invalid output we should fix it, but we don't need to
go and get stringent about the conversion throwing errors or the like
because of an invalid control sequence or an unknown Unicode character.

--Greg
Maybe I'll get around to fix this someday during daytime. To save me
extra work, I'd appreciate any comments on this before I start any
coding, especially if the Sword library needs to deviate from the RTF
specifications.
Blessings,
Jaak
PS: I'm glad there are no memory errors in this function. :)
PPS: Please forgive me for having studied formal languages.
Jaak Ristioja
2014-05-22 10:33:32 UTC
Permalink
Post by Greg Hellings
Post by Jaak Ristioja
The RTFHTML filter code appears to incorrectly parse the following
strings:
"\u-999999" -> getUTF8FromUniChar(48577)
"\u-99999" -> getUTF8FromUniChar(31073)
"\u-0001" -> getUTF8FromUniChar(65535)
"\u-00" -> getUTF8FromUniChar(0)
"\u-0" -> getUTF8FromUniChar(0)
"\u00" -> getUTF8FromUniChar(0)
"\u001" -> getUTF8FromUniChar(1)
"\u99999" -> getUTF8FromUniChar(34463)
"\u-" -> getUTF8FromUniChar(0)
"\u--" -> getUTF8FromUniChar(0)
"\u--2" -> getUTF8FromUniChar(0)
"\u-a" -> getUTF8FromUniChar(0)
I think all these should instead fail.
The last three should return -, 2, and a respectively if I read the
wiki page correctly that allows a final character to use when the
conversion otherwise won't work.
Ok, I missed the final character thing in both the wiki and the RTF
spec. With respect to our context, the RTF spec points out that \u{num}
should be immediately followed by a CP1252 character with, optionally, a
keyword-delimiting space between the control word and that character.
If another control word or control symbol follows, it is considered to
be that character.
But I think you are incorrect. "\u--", "\u--2" and "\u-a" should fail
because there is NO signed number. Just "-" is NOT a number. Neither
is "--2", which is an expression at best. I don't see any good reason
why we should allow stuff like "\u------2" or "\u++-+-++-0".
Post by Greg Hellings
Why you think the signed values that are zero prefixed should fail
I don't understand.
Because 0-prefixed representations are in most contexts considered to
be octal.
Post by Greg Hellings
Those which fall beyond the range of a sixteen bit integer are the
only ones I might agree should fall. However, since Unicode now
exceeds sixteen bits, think it is our limitation that ought to
change.
We can't change our limitation, because then we don't have RTF any
more. And as I understand you want backwards compatibility.
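To make the strict reading concrete, here is a sketch of a parser for the text that follows "\u". It accepts at most one leading '-', requires at least one digit, rejects zero-prefixed forms and "-0", and rejects values outside the signed 16-bit range. The function name is hypothetical, and the zero-prefix rule reflects the position argued above, which is disputed in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Strictly parses a signed 16-bit decimal number. Returns std::nullopt for
// anything else, including "-", "--2", "-a", "00", "-0" and "99999".
std::optional<std::int16_t> parseStrictInt16(std::string_view text) {
    std::size_t i = 0;
    bool negative = false;
    if (i < text.size() && text[i] == '-') {
        negative = true;
        ++i;
    }
    if (i == text.size()) return std::nullopt;  // "" or "-"
    // A leading zero is only accepted for the single string "0".
    if (text[i] == '0' && (negative || text.size() - i > 1)) return std::nullopt;
    long value = 0;
    for (; i < text.size(); ++i) {
        const char c = text[i];
        if (c < '0' || c > '9') return std::nullopt;  // e.g. "--2" or "-a"
        value = value * 10 + (c - '0');
        if (value > 32768) return std::nullopt;       // out of range, stop early
    }
    if (negative) value = -value;
    if (value > 32767) return std::nullopt;           // "32768" without a sign
    return static_cast<std::int16_t>(value);
}
```

With this sketch every string in the list at the top of this mail is rejected, while "8364" and "-32768" are accepted.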
Post by Greg Hellings
Post by Jaak Ristioja
4) \par incorrectly appends a newline.
Why is a newline incorrect? Newlines are mostly ignored in HTML.
There is no good reason to append one. It's useless extra data, even
for debugging purposes. Or, as a counterexample: why not instead
append a pattern of 3 spaces followed by a newline, repeated 4 times? It
would be ignored anyway. IMHO, it lacks usefulness.
Post by Greg Hellings
Post by Jaak Ristioja
5) "a\qc b" is converted to "a<center> b", but should instead be
"a<center>b</center>" (' ' RTF delimiter output, missing HTML
</center> tag)
6) "a\par b" is converted to "a<p/> b", but should probably be
"<p>a</p><p>b</p>" (' ' RTF delimiter output, missing HTML <p>
and </p> tags).
7) Weird combinations of \par, \pard and \qc result in broken
HTML fragments or HTML fragments with unbalanced start and end
tags.
I don't believe the contract of this filter guarantees valid HTML,
and HTML allows unbalanced tags. In fact it is preferred in some
older HTML specs for certain tags, p a prominent example of such
tags.
IMHO you're wrong. At least it is not a valid XHTML (XML) or HTML 5
balanced fragment. I'm not completely sure about earlier HTML
standards. The HTML 5 draft provides a guide on how to handle such
invalid cases, but these are not considered "valid". And as such, one
of two things should happen:
1) we should output valid HTML,
2) users of RTFHTML must fix the output or at least ensure the
output is placed properly, so it doesn't interfere with any HTML
placed after the output.
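For option 1, balanced output for \par is not hard to produce. The sketch below splits the input at each \par control word, consumes the delimiting space, and wraps every non-empty piece in <p>...</p>. It is a simplification under stated assumptions: the function name is hypothetical, all other control words (including \pard and \qc) are copied through untouched, and the text is not HTML-escaped here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Splits `rtf` at \par and emits one balanced <p> element per non-empty piece.
std::string paragraphsToHtml(std::string_view rtf) {
    auto isLetter = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    std::string out;
    std::string current;
    auto flush = [&] {
        if (!current.empty()) {
            out += "<p>" + current + "</p>";
            current.clear();
        }
    };
    std::size_t i = 0;
    while (i < rtf.size()) {
        // "\par" only counts if no further letter follows, so "\pard" is
        // not mistaken for it.
        const bool isPar = rtf.substr(i, 4) == "\\par" &&
                           (i + 4 == rtf.size() || !isLetter(rtf[i + 4]));
        if (isPar) {
            flush();
            i += 4;
            if (i < rtf.size() && rtf[i] == ' ') ++i;  // delimiter is not output
        } else {
            current += rtf[i++];
        }
    }
    flush();
    return out;
}
```

With this sketch "a\par b" becomes "<p>a</p><p>b</p>", which is the output suggested in item 6.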
Post by Greg Hellings
Post by Jaak Ristioja
8) Unsupported control sequences do not cause the function to
fail, but are passed to output as plain text (including the
backslash).
8) Unescaped '{', '}' and '\' characters are not handled properly
(to pass these from RTF one would need to use the control symbols
"\{", "\}" and "\\" respectively).
The rest of your objections seem to be based on a different
objective than SWORD filter objectives. The purpose is not to force
compliance to a strict spec but instead to give a "best effort"
attempt at conversion. The same way that most browsers will accept
invalid input but make a best effort to display (unescaped &
characters will usually display as is and invalid nesting such as
having a div inside of a p tag still works out somewhat reasonably)
the SWORD engine is lax in what it accepts.
It follows the general maxim "be strict in what you produce but
lenient in what you accept." Crosswire produced content should not
include such invalid input, but the engine is intentionally written
to make a best attempt to handle innocuous invalid input. This is
because we want to encourage as many people as possible to use the
engine even if they are not strict in what they produce.
If there are existing modules with bad content or in places where
the filters are producing invalid output we should fix it, but we
don't need to go and get stringent about the conversion throwing
errors or the like because of an invalid control sequence or an
unknown Unicode character.
I have two big problems with these arguments. First, we have the current
implementation of RTFHTML and the wiki page documenting its behaviour,
and these don't match. The current implementation does NOT properly
parse valid RTF. For example, it fails to take into account RTF
delimiters, e.g. "\u8364 ab" should output "€b" if the Unicode
character can be displayed, and "ab" otherwise. But currently (I
think) it outputs getUTF8FromUniChar(8364) followed by " ab", skipping
neither the delimiter nor the fallback character that follows it.
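A sketch of the behaviour I would expect for \u, under simplifying assumptions: one fallback character per escape, the fallback is a single plain byte rather than a control word, and only code points in the Basic Multilingual Plane occur. Both function names are hypothetical; this is not the existing getUTF8FromUniChar() code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of a code point below U+10000.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replaces each \uN escape by the UTF-8 form of N, consuming the optional
// delimiting space and one fallback byte. Everything else is copied as is.
std::string expandUnicodeEscapes(std::string_view in) {
    auto isDigit = [](char c) { return c >= '0' && c <= '9'; };
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = in.substr(i, 2) == "\\u";
        std::size_t j = i + 2;  // first byte after "\u"
        bool negative = false;
        if (matched && j < in.size() && in[j] == '-') { negative = true; ++j; }
        const std::size_t digitsBegin = j;
        long value = 0;
        while (matched && j < in.size() && j - digitsBegin < 5 && isDigit(in[j])) {
            value = value * 10 + (in[j] - '0');
            ++j;
        }
        if (matched) {
            if (j == digitsBegin) matched = false;                   // no digits
            else if (j < in.size() && isDigit(in[j])) matched = false;  // too long
            else if (negative) {
                matched = value >= 1 && value <= 32768;
                value = 65536 - value;  // RTF stores values above 32767 as negative
            } else {
                matched = value <= 65535;
            }
            if (value >= 0xD800 && value <= 0xDFFF) matched = false;  // surrogate
        }
        if (!matched) {
            out += in[i++];  // not a usable escape: copy one byte
            continue;
        }
        if (j < in.size() && in[j] == ' ') ++j;   // delimiter belongs to the word
        if (j < in.size() && in[j] != '\\') ++j;  // skip the fallback byte
        appendUtf8(out, static_cast<std::uint32_t>(value));
        i = j;
    }
    return out;
}
```

Given "\u8364 ab" this produces the three bytes E2 82 AC followed by "b". Note that it copies malformed escapes through unchanged instead of failing; a strict filter would report an error at that point.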

Secondly, I've not seen a piece of documentation about accepting lax
behaviour. Even if we do accept invalid input, we should do it in a
safe and secure manner which does not result in invalid output.
Additionally, there should be a switch that controls whether "best effort"
processing on lax input is allowed. Currently RTFHTML always succeeds
without a warning.
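Such a switch could look roughly like the sketch below. All names here (ParseMode, ConversionResult, handleUnsupportedControlWord) are hypothetical and are not part of the SWORD API; the point is only that both modes report the problem, and that strict mode never hands back a partial result.

```cpp
#include <string>
#include <string_view>
#include <vector>

enum class ParseMode { Strict, Lenient };

struct ConversionResult {
    bool ok = true;
    std::string html;
    std::vector<std::string> warnings;
};

// Handles one unsupported control word `name` (letters only, no backslash).
void handleUnsupportedControlWord(std::string_view name, ParseMode mode,
                                  ConversionResult& result) {
    result.warnings.push_back("unsupported control word: \\" + std::string(name));
    if (mode == ParseMode::Strict) {
        result.ok = false;
        result.html.clear();  // do not expose a half-converted string
        return;
    }
    // Best effort: pass the control word through as plain text. The name
    // holds only ASCII letters, so no HTML escaping is needed here.
    result.html += '\\';
    result.html += name;
}
```

In lenient mode the caller still gets the warning list, so a frontend or module validator can surface the problem even though conversion succeeded.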

Blessings,
Jaak