ibm-netrexx

Unicode issues


George Hovey-2

Unicode issues

1. What should NetRexx's default encoding be?

A major reason for UTF-8's widespread use is that it has properties that suit the internet: English text is encoded exactly as in the old ASCII, i.e. at one byte per letter, and a decoder can resynchronize unambiguously when recovering from a garbled sequence.
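A minimal Java sketch of those two properties, for anyone who wants to poke at them. The class and method names (Utf8Properties, sameAsAscii, resync) are made up for this illustration and are not part of NetRexx.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Properties {
    // True when the UTF-8 and US-ASCII encodings of the text are byte-for-byte identical.
    static boolean sameAsAscii(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.US_ASCII));
    }

    // UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Index of the first byte at or after 'from' that starts a character.
    static int resync(byte[] utf8, int from) {
        int i = from;
        while (i < utf8.length && isContinuation(utf8[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(sameAsAscii("say \"hello\""));
        // "a", euro sign, "b" encodes as 61 E2 82 AC 62
        byte[] bytes = "a\u20ACb".getBytes(StandardCharsets.UTF_8);
        // Start reading in the middle of the euro sign and skip to the next character.
        System.out.println(resync(bytes, 2));
    }
}
```

Because lead bytes and continuation bytes never share a bit pattern, a reader dropped anywhere into the stream only has to skip continuation bytes to find the next character boundary.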

The Unicode Consortium's Unicode Technical Note #12 "UTF-16 for Processing"

http://www.unicode.org/notes/tn12

makes a case for the use of UTF-16 in programming systems. NetRexx seems to use this encoding, I guess for the reasons outlined in the note.
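For reference, a small Java sketch of what UTF-16 processing looks like on the JVM, where java.lang.String is a sequence of 16-bit code units. It assumes NetRexx strings behave like Java strings in this respect; the class name Utf16Units is invented for the example.

```java
class Utf16Units {
    // Number of 16-bit UTF-16 code units in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters as Unicode counts them).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual Plane.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits("say") + " " + codePoints("say"));
        System.out.println(codeUnits(clef) + " " + codePoints(clef));
    }
}
```

Most characters occupy a single code unit, which is the convenience the note argues for; characters beyond U+FFFF need a surrogate pair, so the two counts differ for them.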

The fact that Western communications can be carried out with single-byte characters matters on the web, where everything is presented in natural languages, but it seems of little consequence from a programming standpoint: source code typically represents only a small fraction of a project's storage requirements, and is easily compressed when internet transmission is needed.

BTW, Unicode.org has various references to languages with Unicode support, but I didn't see NetRexx listed. Perhaps after we have our Unicode house in order we can bring this to their attention.

2. A Useful Unicode Viewer

A vexing issue with Unicode is that having "Unicode support" does not mean that you can display any Unicode character: that requires a font with "glyphs" (pictures) for all Unicode characters, and these are thin on the ground. So it is useful to have a viewer that can explore this issue for any specified font.
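The same question can be asked programmatically. A rough Java sketch using java.awt.Font.canDisplay follows; the class GlyphCoverage and its helper are made up for this example, and it assumes a JVM with the desktop (AWT) classes and fonts available. Note that Java falls back to a default font when the requested name is not installed, so check the printed font name before trusting the count.

```java
import java.awt.Font;
import java.util.function.IntPredicate;

class GlyphCoverage {
    // Counts the defined code points in [first, last] for which hasGlyph reports a glyph.
    static int countDisplayable(IntPredicate hasGlyph, int first, int last) {
        int n = 0;
        for (int cp = first; cp <= last; cp++) {
            if (Character.isDefined(cp) && hasGlyph.test(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The font name is an assumption; pass e.g. "Consolas" as the first argument.
        String name = args.length > 0 ? args[0] : "Monospaced";
        Font font = new Font(name, Font.PLAIN, 12);
        // U+2500..U+257F is the Box Drawing block (128 code points).
        int shown = countDisplayable(font::canDisplay, 0x2500, 0x257F);
        System.out.println(font.getFontName() + ": " + shown + " of 128 box drawing characters");
    }
}
```

This only reports what the Java font machinery believes it can render; a dedicated viewer remains the better tool for inspecting a single font file.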

SIL Viewglyph

http://scripts.sil.org/cms/scripts/page.php?item_id=ViewGlyph_home

does this for Windows.

For example, examining Microsoft's Consolas font (the font of choice for coding because of its extraordinary legibility at small sizes, and also available on Apple systems) shows that it covers the entire set of languages encompassed by ISO-8859-?, as well as Greek, Cyrillic, and a grab bag of graphics (including box drawing).

--
"One can live magnificently in this world if one knows how to work and how to love." -- Leo Tolstoy

_______________________________________________
Ibm-netrexx mailing list
[hidden email]
Online Archive : http://ibm-netrexx.215625.n3.nabble.com/

rvjansen

Re: Unicode issues

My vote is for UTF-8, because UTF-16 introduces the byte order problem of big-endian versus little-endian. UTF-16 files need to start with a byte order mark (BOM) to indicate which byte order they use.

This looks like this:

fifi:test rvjansen$ java HexPrint bigendianUTF16.nrx
--------------------------------------------------------------------------
HexPrint Version 0.50
--------------------------------------------------------------------------
File name: bigendianUTF16.nrx
File Date: Fri Jul 26 22:25:59 CEST 2013
File size: 72 bytes
--------------------------------------------------------------------------
Offset
<Hex>  <dec>      +0       +4       +8       +C
--------------------------------------------------------------------------
000000 (00000000) FEFF0073 00610079 00200022 00680065 [þÿ.s.a.y. .".h.e]
000010 (00000016) 006C006C 006F0020 00620069 0067002D [.l.l.o. .b.i.g.-]
000020 (00000032) 0065006E 00640069 0061006E 00200055 [.e.n.d.i.a.n. .U]
000030 (00000048) 00540046 002D0031 00360020 00660069 [.T.F.-.1.6. .f.i]
000040 (00000064) 006C0065 0022000A                   [.l.e."..]

fifi:test rvjansen$ java HexPrint littleEndianUTF16.nrx
--------------------------------------------------------------------------
HexPrint Version 0.50
--------------------------------------------------------------------------
File name: littleEndianUTF16.nrx
File Date: Fri Jul 26 22:32:25 CEST 2013
File size: 78 bytes
--------------------------------------------------------------------------
Offset
<Hex>  <dec>      +0       +4       +8       +C
--------------------------------------------------------------------------
000000 (00000000) FFFE7300 61007900 20002200 68006500 [ÿþs.a.y. .".h.e.]
000010 (00000016) 6C006C00 6F002000 6C006900 74007400 [l.l.o. .l.i.t.t.]
000020 (00000032) 6C006500 2D006500 6E006400 69006100 [l.e.-.e.n.d.i.a.]
000030 (00000048) 6E002000 55005400 46002D00 31003600 [n. .U.T.F.-.1.6.]
000040 (00000064) 20006600 69006C00 65002200 0A00     [ .f.i.l.e."...]

fifi:test rvjansen$
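The first eight bytes of each dump can be reproduced from Java, which makes the byte order issue easy to see in isolation. This is only a sketch: the class Utf16ByteOrder and its methods are invented for the example, and the BOM is written explicitly as the character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

class Utf16ByteOrder {
    // Renders bytes as upper-case hex without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Names the UTF-16 byte order announced by a leading BOM, or "unknown" without one.
    static String byteOrder(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "big-endian";
            if (b0 == 0xFF && b1 == 0xFE) return "little-endian";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String text = "\uFEFFsay"; // U+FEFF written first acts as the BOM
        byte[] be = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] le = text.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(hex(be) + " " + byteOrder(be));
        System.out.println(hex(le) + " " + byteOrder(le));
    }
}
```

The same three letters produce two different byte sequences, and without the leading FEFF or FFFE a reader has to guess which one it is looking at.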

The short-term goal for NetRexx 3.03 is transparent UTF-8 support, in environments and shells that have complete (including font-glyph) support, for UTF-8 files without a BOM. Even though it is not necessary, some editors insist on inserting a BOM in every UTF-8 file; don't use these.
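For files that already carry a UTF-8 BOM (the three bytes EF BB BF), stripping it is straightforward. A sketch in Java, with a made-up class name (Utf8Bom); this is not how NetRexx itself handles the matter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Bom {
    // True when the data starts with the UTF-8 encoding of U+FEFF (EF BB BF).
    static boolean hasBom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading BOM; data without a BOM is returned unchanged.
    static byte[] stripBom(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'a', 'y'};
        System.out.println(new String(stripBom(withBom), StandardCharsets.UTF_8));
    }
}
```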

I did some work on this already, and it would be much appreciated if people experimented with the 3.03 preview (a.k.a. automatic trunk build) that is on http://www.netrexx.org/downloads.nsp.

This has the following changes regarding Unicode support:

(1) the same rules as Java are used for accepting and rejecting characters in identifiers, to eliminate the risk that some Java methods cannot be called

(2) the datatype('s') (symbol) method on type Rexx performs the same check as the RxClauser does, and as such gives a good indication of whether a string is suitable as a symbol (name, identifier)

(3) the Unicode escaping of indirect property names (bean signatures) has been removed so they can actually pass the Java compilers javac and ecj

(4) option utf8 is now the default, and enables compiling and interpreting program sources containing UTF-8 encoded Unicode symbols, regardless of shell support and codepage. Experimentation has shown that on shells with complete Unicode support, option -utf8 was superfluous (MacOSX and Linux terminals with bash shells defaulted to Unicode, which means the NRL remark on the euro symbol is incorrect), while on Windows and OS/2 the source was rejected as containing invalid characters. This has now been amended to enable the translation of valid Unicode source everywhere, regardless of shell and glyph support. (This means you can now compile a valid Unicode program on OS/2 and Windows without specifying -utf8. The methods will be called correctly, but there is a big chance that some output will be gibberish if it contains characters that the codepage or font does not support. However, these will now be the correct gibberish values.)
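To make change (1) concrete, here is a sketch of the Java rules for identifier characters, using the standard Character.isJavaIdentifierStart and Character.isJavaIdentifierPart methods. It illustrates the Java side only and is not the RxClauser code; the class IdentifierCheck is invented for the example, and reserved words are not excluded.

```java
class IdentifierCheck {
    // True when every code point of s is acceptable in a Java identifier
    // (first code point: identifier start; the rest: identifier part).
    static boolean isJavaIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifier("gr\u00F6\u00DFe")); // letters outside ASCII
        System.out.println(isJavaIdentifier("2fast"));            // starts with a digit
        System.out.println(isJavaIdentifier("a-b"));              // contains a hyphen
    }
}
```

Walking the string by code point rather than by char keeps the check correct for characters that need a surrogate pair.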

I still have to test on z/Linux and z/OS, but I am interested in everyone's experiences using this prototype. For Notepad users on Windows, we might need to add support for removing the unneeded UTF-8 BOM. Please let me know. Always use HexPrint (from the Redbook examples, included in the post-3.00 packages) to find out what really is in the file; very few shells and editors can be trusted. (e.g. in Windows, observe the difference between 'type' and copy con: ...
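If HexPrint is not at hand, a few lines of Java give a comparable view of the raw bytes. This is a simplified stand-in written for this message, not the HexPrint tool from the Redbook examples; the class MiniHexDump and its output layout are assumptions of the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class MiniHexDump {
    // One row per 16 bytes: six-digit hex offset, then the bytes in groups of four.
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int row = 0; row < data.length; row += 16) {
            sb.append(String.format("%06X", row));
            int end = Math.min(row + 16, data.length);
            for (int i = row; i < end; i++) {
                if ((i - row) % 4 == 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", data[i] & 0xFF));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MiniHexDump <file>
        if (args.length == 1) {
            System.out.print(dump(Files.readAllBytes(Paths.get(args[0]))));
        }
    }
}
```

Reading the file as bytes, never as text, is the point: any decoding step in between would hide exactly the BOM and codepage details you are trying to see.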

best regards,

René.

(In reply to George Hovey's message of 26 Jul. 2013, 19:33.)