A major reason UTF-8 use is widespread is that it has special properties facilitating internet use: English text is represented identically to the old ASCII, i.e. it requires only one byte per letter, and a decoder can resynchronize unambiguously when recovering from a garbled sequence.

The Unicode Consortium's Unicode Technical Note #12, "UTF-16 for Processing" (http://www.unicode.org/notes/tn12), makes a case for the use of UTF-16 in programming systems. NetRexx seems to use this encoding for, I guess, the reasons outlined in the note.

2. A Useful Unicode Viewer

SIL ViewGlyph: http://scripts.sil.org/cms/scripts/page.php?item_id=ViewGlyph_home

--
"One can live magnificently in this world if one knows how to work and how to love." -- Leo Tolstoy

_______________________________________________
Ibm-netrexx mailing list
[hidden email]
Online Archive : http://ibm-netrexx.215625.n3.nabble.com/
My vote is for UTF-8, because UTF-16 introduces the byte-order problem of big-endian versus little-endian: a UTF-16 file must start with a byte order mark (BOM) to indicate which order it contains.
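To make the endianness problem concrete, here is a small Java sketch (Java, since NetRexx runs on the JVM) that encodes the same string with the generic UTF-16 charset and with an explicit byte order. The class and variable names are mine, not part of NetRexx:

```java
import java.nio.charset.StandardCharsets;

// Sketch: why UTF-16 needs a byte order mark (BOM) while UTF-8 does not.
public class BomDemo {
    public static void main(String[] args) {
        // Java's generic UTF-16 encoder writes a big-endian BOM first: FE FF
        byte[] utf16 = "say".getBytes(StandardCharsets.UTF_16);
        System.out.printf("UTF-16 start:   %02X %02X%n", utf16[0] & 0xFF, utf16[1] & 0xFF);

        // With an explicit endianness there is no BOM; 's' is 0x73 0x00
        byte[] utf16le = "say".getBytes(StandardCharsets.UTF_16LE);
        System.out.printf("UTF-16LE start: %02X %02X%n", utf16le[0] & 0xFF, utf16le[1] & 0xFF);

        // UTF-8 is byte-order independent: 's' is simply the single byte 0x73
        byte[] utf8 = "say".getBytes(StandardCharsets.UTF_8);
        System.out.printf("UTF-8 start:    %02X%n", utf8[0] & 0xFF);
    }
}
```

Without the BOM, a reader of the UTF-16 file could not tell whether 0x73 0x00 means 's' (little-endian) or a completely different character (big-endian).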
This looks like this:

fifi:test rvjansen$ java HexPrint bigendianUTF16.nrx
--------------------------------------------------------------------------
HexPrint Version 0.50
--------------------------------------------------------------------------
File name: bigendianUTF16.nrx
File Date: Fri Jul 26 22:25:59 CEST 2013
File size: 72 bytes
--------------------------------------------------------------------------
Offset <Hex> <dec>    +0       +4       +8       +C
--------------------------------------------------------------------------
000000 (00000000) FEFF0073 00610079 00200022 00680065 [þÿ.s.a.y. .".h.e]
000010 (00000016) 006C006C 006F0020 00620069 0067002D [.l.l.o. .b.i.g.-]
000020 (00000032) 0065006E 00640069 0061006E 00200055 [.e.n.d.i.a.n. .U]
000030 (00000048) 00540046 002D0031 00360020 00660069 [.T.F.-.1.6. .f.i]
000040 (00000064) 006C0065 0022000A [.l.e."..]

fifi:test rvjansen$ java HexPrint littleEndianUTF16.nrx
--------------------------------------------------------------------------
HexPrint Version 0.50
--------------------------------------------------------------------------
File name: littleEndianUTF16.nrx
File Date: Fri Jul 26 22:32:25 CEST 2013
File size: 78 bytes
--------------------------------------------------------------------------
Offset <Hex> <dec>    +0       +4       +8       +C
--------------------------------------------------------------------------
000000 (00000000) FFFE7300 61007900 20002200 68006500 [ÿþs.a.y. .".h.e.]
000010 (00000016) 6C006C00 6F002000 6C006900 74007400 [l.l.o. .l.i.t.t.]
000020 (00000032) 6C006500 2D006500 6E006400 69006100 [l.e.-.e.n.d.i.a.]
000030 (00000048) 6E002000 55005400 46002D00 31003600 [n. .U.T.F.-.1.6.]
000040 (00000064) 20006600 69006C00 65002200 0A00 [ .f.i.l.e."...]

fifi:test rvjansen$

The short-term goal for NetRexx 3.03 is transparent UTF-8 support, for environments and shells that have complete (including font/glyph) support, for UTF-8 files without a BOM.
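For the curious, a much-simplified hex dumper in the same spirit can be sketched in Java. This is my own sketch, not the actual Redbook HexPrint utility; the class name and exact output layout are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of a hex dumper (NOT the Redbook HexPrint): 16 bytes per line,
// grouped in fours, with a printable-ASCII gutter in brackets.
public class MiniHexDump {
    public static String dump(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += 16) {
            out.append(String.format("%06X  ", off));
            StringBuilder text = new StringBuilder();
            for (int i = off; i < off + 16; i++) {
                if (i < data.length) {
                    out.append(String.format("%02X", data[i]));
                    char c = (char) (data[i] & 0xFF);
                    // show printable ASCII, replace everything else with '.'
                    text.append(c >= 0x20 && c < 0x7F ? c : '.');
                } else {
                    out.append("  "); // pad a short final line
                }
                if ((i - off) % 4 == 3) out.append(' ');
            }
            out.append('[').append(text).append("]\n");
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(dump(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Running it over a file immediately reveals a BOM (FEFF or FFFE) and the interleaved zero bytes that mark UTF-16 text.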
Even when not necessary, some editors insist on inserting a BOM in every UTF-8 file; don't use these. I did some work on this already, and it would be much appreciated if people would experiment with the 3.03 preview (a.k.a. the automatic trunk build) that is on http://www.netrexx.org/downloads.nsp . This has the following changes regarding Unicode support:

(1) The same rules as Java are used for accepting and rejecting characters in identifiers, to eliminate the risk that some Java methods cannot be called.
(2) The datatype('s') (symbol) method on type Rexx performs the same check as the RxClauser does, and as such gives a good indication of whether a string is suitable as a symbol (name, identifier).
(3) The unicode-escaping of indirect property names (bean signatures) has been removed, so they can actually pass the Java compilers javac and ecj.
(4) Option utf8 is the default, and now enables compiling and interpreting program sources containing UTF-8 encoded Unicode symbols, regardless of shell support and codepage.

Experimentation has shown that on shells with complete Unicode support, option -utf8 was superfluous (Mac OS X and Linux terminals with bash shells defaulted to Unicode - this means the NRL remark on the euro symbol is incorrect), while on Windows and OS/2 the source was rejected as containing invalid characters. This has now been amended to enable the translation of valid Unicode source everywhere, regardless of shell and glyph support. (This means you can now compile a valid Unicode program on OS/2 and Windows without specifying -utf8, and the methods will be called correctly, but there is a big chance that some output will be gibberish if it contains characters that the codepage or font does not support. But - these will be the correct gibberish values now.) I still have to test on z/Linux and z/OS, but I am interested in everyone's experiences using this prototype. For Notepad users on Windows, we might need to add support for removing the UTF-8 BOM that is not needed.
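The identifier rule in change (1) can be illustrated with the standard Java checks that javac itself applies. This is a sketch of the idea only; the method name isJavaIdentifier is mine and not a NetRexx API:

```java
// Sketch of the Java identifier rule: a symbol is acceptable iff the first
// character passes isJavaIdentifierStart and the rest pass isJavaIdentifierPart.
// (For simplicity this ignores supplementary code points above U+FFFF.)
public class SymbolCheck {
    public static boolean isJavaIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0)))
            return false;
        for (int i = 1; i < s.length(); i++)
            if (!Character.isJavaIdentifierPart(s.charAt(i)))
                return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifier("héllo"));   // true: accented letters are letters
        System.out.println(isJavaIdentifier("2start"));  // false: a digit cannot start a name
        System.out.println(isJavaIdentifier("a-b"));     // false: '-' is not an identifier part
    }
}
```

Note how this accepts non-ASCII letters: that is exactly why a Unicode symbol like héllo can name a callable Java method, while a hyphenated name cannot.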
Please let me know. Always use HexPrint (from the Redbook examples, included in the post-3.00 packages) to find out what really is in the file - very few shells and editors can be trusted (e.g. on Windows, observe the difference between 'type' and 'copy con:' ...).

best regards,

René.

On 26 jul. 2013, at 19:33, George Hovey <[hidden email]> wrote: