puzzled by the -utf8 option

puzzled by the -utf8 option

rvjansen
While working on making UTF-8 the default encoding, I noticed that I could not make the translator fail by specifying -noutf8 and then using UTF-8 characters in variable and method names. Did something change, and is UTF-8 now always accepted? The value on the options statement must match the command-line setting (which is already a bit awkward), but it seems the option has made itself obsolete?

Please let me know your experience with this, before I dig further into the code.

best regards,

René.
_______________________________________________
Ibm-netrexx mailing list
[hidden email]
Online Archive : http://ibm-netrexx.215625.n3.nabble.com/


Re: puzzled by the -utf8 option

George Hovey-2
Rene, can you be more specific about the characters you expected to fail?


On Wed, Jul 3, 2013 at 8:46 PM, René Jansen <[hidden email]> wrote:
[...]

--
"One can live magnificently in this world if one knows how to work and how to love."  --  Leo Tolstoy


Re: puzzled by the -utf8 option

rvjansen
I expected this to fail:

options noutf8

class testUTF8default

  method héhé() static
    
    René = '42'
    say René'€'
    €='euro'
    say €
    
    
    
  method main(args=String[]) static
    héhé()


but it did not.

René.

On 4 jul. 2013, at 06:11, George Hovey <[hidden email]> wrote:
[...]

Re: puzzled by the -utf8 option

rvjansen
And there is progress: now I have a source file that contains Unicode but does not compile using the -utf8 option:

options utf8
class testUTF8Default

properties indirect
π = '3.1415926585979'

  method héhé() static
    René = '42'
    say René'€'
    €='euro'
    say €
    
    /* sum it up */
  method ∑() static
    return π
    
  method main(args=String[]) static
    héhé()
    say ∑()


It trips over the ∑, which looks like a sigma but is the Unicode "N-ary summation" character (U+2211).
This is the file in hex:

--------------------------------------------------------------------------
HexPrint    Version 0.50
--------------------------------------------------------------------------
File name: testUTF8Default.nrx
File Date: Thu Jul 04 14:23:35 CEST 2013
File size: 297 bytes
--------------------------------------------------------------------------
    Offset
<Hex>    <dec>     +0       +4       +8       +C
--------------------------------------------------------------------------
000000 (00000000)  6F707469 6F6E7320 75746638 0A636C61  [options utf8.cla]
000010 (00000016)  73732074 65737455 54463844 65666175  [ss testUTF8Defau]
000020 (00000032)  6C740A0A 70726F70 65727469 65732069  [lt..properties i]
000030 (00000048)  6E646972 6563740A CF80203D 2027332E  [ndirect.π = '3.]
000040 (00000064)  31343135 39323635 38353937 39270A0A  [1415926585979'..]
000050 (00000080)  20206D65 74686F64 2068C3A9 68C3A928  [  method héhé(]
000060 (00000096)  29207374 61746963 0A202020 2052656E  [) static.    Ren]
000070 (00000112)  C3A9203D 20273432 270A2020 20207361  [é = '42'.    sa]
000080 (00000128)  79205265 6EC3A927 E282AC27 0A202020  [y René'€'.   ]
000090 (00000144)  20E282AC 3D276575 726F270A 20202020  [ €='euro'.    ]
0000A0 (00000160)  73617920 E282AC0A 20202020 0A202020  [say €.    .   ]
0000B0 (00000176)  202F2A20 73756D20 69742075 70202A2F  [ /* sum it up */]
0000C0 (00000192)  0A20206D 6574686F 6420E288 91282920  [.  method ∑() ]
0000D0 (00000208)  73746174 69630A20 20202072 65747572  [static.    retur]
0000E0 (00000224)  6E20CF80 0A202020 200A2020 6D657468  [n π.    .  meth]
0000F0 (00000240)  6F64206D 61696E28 61726773 3D537472  [od main(args=Str]
000100 (00000256)  696E675B 5D292073 74617469 630A2020  [ing[]) static.  ]
000110 (00000272)  202068C3 A968C3A9 28290A20 20202073  [  héhé().    s]
000120 (00000288)  617920E2 88912829 0A                 [ay ∑().]

The file does not start with a UTF-8 byte order mark, but the characters appear to be bona fide UTF-8; that is also what Emacs and TextMate tell me.
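As a cross-check on the hex dump, the multi-byte sequences can be decoded independently of any editor. The Java sketch below is only an illustration: the class name Utf8BytesDemo and the helper codePoints are made up for this example and are not part of NetRexx. It decodes the four non-ASCII byte sequences that appear in the dump and prints the resulting code points.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Decode a UTF-8 byte sequence and list its code points as "U+XXXX".
    static String codePoints(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Byte sequences copied from the hex dump of testUTF8Default.nrx.
        System.out.println(codePoints(new byte[] {(byte) 0xCF, (byte) 0x80}));              // small pi
        System.out.println(codePoints(new byte[] {(byte) 0xC3, (byte) 0xA9}));              // e with acute
        System.out.println(codePoints(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC})); // euro sign
        System.out.println(codePoints(new byte[] {(byte) 0xE2, (byte) 0x88, (byte) 0x91})); // n-ary summation
    }
}
```

Run as an ordinary Java program, this prints U+03C0, U+00E9, U+20AC and U+2211, one per line, so the bytes in the file are well-formed UTF-8 for the characters shown in the source.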

I am probably missing something.

best regards,

René.


On 4 jul. 2013, at 10:25, René Jansen <[hidden email]> wrote:
[...]

Re: puzzled by the -utf8 option

rvjansen
And for completeness' sake:

fifi:test rvjansen$ nrc -keepasjava -replace -utf8 testUTF8Default.nrx 
NetRexx portable processor, version NetRexx 3.03, build 198-20130704-0151
Copyright (c) RexxLA, 2011,2013.  All rights reserved.
Parts Copyright (c) IBM Corporation, 1995,2008.
Program testUTF8Default.nrx
 14 +++   method ∑() static
    +++          ^
    +++ Error: Unexpected character found in source: '∑' (hexadecimal encoding: 2211)
Compilation of 'testUTF8Default.nrx' failed [one error]

René.

On 4 jul. 2013, at 14:34, René Jansen <[hidden email]> wrote:
[...]

Re: puzzled by the -utf8 option

rvjansen
I am going to move this to the development list. I have done some more testing, and I think I get the picture now, but there is some work to be done. This is the latest test:

options utf8
class testUTF8Default

properties indirect
π = '3.1415926585979'
pi = '3.1415926585979'
  method héhé() static
    René = '42'
    say René'€'
    €='euro'
    say €

    
    /* sum it up */
  -- method ∑() static
  --   return π
    
  method main(args=String[]) static
    héhé()
    u = testUTF8Default()
    -- say ∑()
    -- say u.getπ()
    say u.getPi()
    say '€ isLetter:' Character.isLetter('€')
    say 'π isLetter:' Character.isLetter('π')
    say '∑ isLetter:' Character.isLetter('∑')
    
NRL states that two character sets are involved: the one used to write the program ("expressing the NetRexx program itself") and the one for the data. Its remark that Unicode has 65536 characters, each encoded in 16 bits, is dated and will be updated for version 3.03. For the first set, A-Z, a-z, the digits, and any character for which Character.isLetter() is true may be used. It is recommended that the dollar and euro signs be used in symbols only in mechanically generated programs or where otherwise essential.
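To see why Character.isLetter() separates these characters the way it does, it can help to look at their Unicode general category. The following Java sketch is only an illustration (the class SymbolCharDemo and its helper methods are invented for this example); it uses the standard Character.isLetter(int) and Character.getType(int) calls and says nothing about how the NetRexx translator itself classifies characters.

```java
class SymbolCharDemo {
    // Short label for the Unicode general category reported by Character.getType.
    static String category(int cp) {
        switch (Character.getType(cp)) {
            case Character.LOWERCASE_LETTER: return "Ll (lowercase letter)";
            case Character.UPPERCASE_LETTER: return "Lu (uppercase letter)";
            case Character.CURRENCY_SYMBOL:  return "Sc (currency symbol)";
            case Character.MATH_SYMBOL:      return "Sm (math symbol)";
            default:                         return "other";
        }
    }

    static String describe(int cp) {
        return String.format("U+%04X isLetter=%b category=%s",
                cp, Character.isLetter(cp), category(cp));
    }

    public static void main(String[] args) {
        // Euro sign, small pi, n-ary summation, e with acute, dollar sign.
        int[] samples = {0x20AC, 0x03C0, 0x2211, 0x00E9, '$'};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

The euro and dollar signs are currency symbols and the summation sign is a math symbol, so none of them is a letter, while pi and é are lowercase letters.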

Interestingly, the euro character is not a letter (see below) but can be used in a symbol. The Unicode pi character is a letter, but using it as an indirect property raises two problems: the bean naming pattern requires folding to uppercase (in Java), which is problematic for math symbols, and the program fails when interpreted but succeeds when run from the class file. We should make sure that datatype('S') (symbol) returns the same result as the test the translator uses. It would also be good to decide whether or not to allow all of Unicode in symbols. Finally, the status of the euro is problematic because it is not a letter but can be part of a symbol; this has been fixed in 3.03 so that the datatype('S') test returns 1 for it.
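The uppercase-folding issue can be made concrete with a small Java sketch. It assumes the usual JavaBeans convention of a prefix followed by the property name with its first character uppercased. The class BeanNameDemo and the method accessorName are hypothetical names for this illustration, and the code is not taken from the NetRexx translator, whose listing in this thread shows the getter for pi with the lowercase letter unchanged.

```java
class BeanNameDemo {
    // Prefix plus the property name with its first code point uppercased,
    // following the common JavaBeans accessor convention.
    static String accessorName(String prefix, String property) {
        int first = property.codePointAt(0);
        int restStart = Character.charCount(first);
        String folded = new String(Character.toChars(Character.toUpperCase(first)));
        return prefix + folded + property.substring(restStart);
    }

    public static void main(String[] args) {
        // Plain ASCII property name.
        System.out.println(accessorName("get", "pi"));

        // Property named with GREEK SMALL LETTER PI: folding yields the capital letter.
        String greek = accessorName("get", "\u03C0");
        System.out.printf("get + U+%04X%n", greek.codePointAt(3));

        // N-ARY SUMMATION has no uppercase form, so the name is left as it was.
        String sum = accessorName("get", "\u2211");
        System.out.printf("get + U+%04X%n", sum.codePointAt(3));
    }
}
```

Folding pi gives U+03A0 (capital pi), a different character from the one the programmer wrote, whereas a math symbol such as U+2211 cannot be folded at all. Either way the accessor name needs an explicit rule.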


fifi:test rvjansen$ nrc -keepasjava -replace -exec -utf8 testUTF8Default.nrx 
NetRexx portable processor, version NetRexx 3.03, build 198-20130704-0151
Copyright (c) RexxLA, 2011,2013.  All rights reserved.
Parts Copyright (c) IBM Corporation, 1995,2008.
Program testUTF8Default.nrx
  === class testUTF8Default ===
    function héhé
    function main(String[])
    method get\u03C0
    method set\u03C0(Rexx)
    method getPi
    method setPi(Rexx)
Exception in thread "main" java.lang.ClassFormatError: Illegal field name "\u03C0" in class testUTF8Default
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at org.netrexx.process.RxProxyLoader.loadClass(RxProxyLoader.nrx:135)
at org.netrexx.process.RxTranslator.startexec(RxTranslator.nrx:489)
at org.netrexx.process.RxTranslator.exec(RxTranslator.nrx:505)
at org.netrexx.process.NetRexxC.process(NetRexxC.nrx:246)
at org.netrexx.process.NetRexxC.main2(NetRexxC.nrx:171)
at org.netrexx.process.NetRexxC.main2(NetRexxC.nrx:160)
at org.netrexx.process.NetRexxC.main2(NetRexxC.nrx:158)
at org.netrexx.process.NetRexxC.main(NetRexxC.nrx:97)
fifi:test rvjansen$ java testUTF8Default
42€
euro
3.1415926585979
€ isLetter: 0
π isLetter: 1
∑ isLetter: 0


So this needs an ARB discussion first, and I will discuss this further on the developers list.

best regards,

René.




Re: puzzled by the -utf8 option

George Hovey-2
Rene,
I think the problems may be much deeper than you have exposed so far.

Our Language Reference, dated June 15th, 2013, seems to state that the characters that may be used in 'symbols' (i.e. identifiers) are limited to 'A-Z,a-z', digits and a few additional characters you are studying.

But the Java Language Specification allows a much broader range of characters in identifiers: (http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.8)

Since Class Names are identifiers, this would seem to imply that there can be Java class names that NetRexx cannot reference.  That is, NetRexx and Java may not be interoperable at the class level.
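A rough Java sketch of the gap being described: the first method applies Java's own per-character identifier test (keywords are not checked), and the second applies a deliberately narrow ASCII-only rule. That second rule is a hypothetical stand-in for "A-Z, a-z, digits and a few extras" and is not the actual NetRexx symbol rule. Both method names and the class IdentifierCheckDemo are invented for this illustration.

```java
class IdentifierCheckDemo {
    // True if every code point is acceptable in a Java identifier.
    // Reserved words are not checked here.
    static boolean isJavaIdentifier(String name) {
        if (name.isEmpty()) return false;
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    // Hypothetical narrow rule: ASCII letters, digits and underscore only,
    // not starting with a digit. An assumption for illustration only.
    static boolean isAsciiOnlyIdentifier(String name) {
        return name.matches("[A-Za-z_][A-Za-z0-9_]*");
    }

    public static void main(String[] args) {
        // "Cafe" with e-acute, a name starting with the euro sign,
        // a lone n-ary summation sign, and a plain ASCII name.
        String[] names = {"Caf\u00E9", "\u20ACRate", "\u2211", "Plain_1"};
        for (String name : names) {
            System.out.println(isJavaIdentifier(name) + " " + isAsciiOnlyIdentifier(name));
        }
    }
}
```

Any name for which the first test is true and the second is false is a legal Java class name that a language restricted to the narrow rule could not spell directly.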

A couple of observations relating to your investigation.

   -- The NetRexx Language Reference (footnote 15) references a book that is both obsolete and out of print.
  
   -- Java's treatment of Unicode issues is based on the publications of the Unicode Consortium (which, I think, publishes the code that classifies Unicode code points as letters, digits, etc.). It seems we should be firmly on board with this effort, as Java is (http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax).
   All the major players are members of this organization.
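Java exposes both its own identifier rules and a Unicode-style variant through java.lang.Character, so the two can be compared side by side. The sketch below is only an illustration (the class UnicodeIdDemo and the method compare are made-up names). It prints what each test reports for a few of the characters discussed in this thread and makes no claim about which rule NetRexx should adopt.

```java
class UnicodeIdDemo {
    // Report the Unicode-style and the Java identifier tests for one code point.
    static String compare(int cp) {
        return String.format("U+%04X unicodeStart=%b unicodePart=%b javaStart=%b javaPart=%b",
                cp,
                Character.isUnicodeIdentifierStart(cp),
                Character.isUnicodeIdentifierPart(cp),
                Character.isJavaIdentifierStart(cp),
                Character.isJavaIdentifierPart(cp));
    }

    public static void main(String[] args) {
        // Latin A, small pi, underscore, dollar sign, euro sign, n-ary summation.
        int[] samples = {'A', 0x03C0, '_', '$', 0x20AC, 0x2211};
        for (int cp : samples) {
            System.out.println(compare(cp));
        }
    }
}
```

Letters such as pi pass every test and the summation sign passes none. Currency signs such as the dollar and the euro are accepted only by the Java-specific tests, which matches the special status of those two characters in this discussion.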


On Thu, Jul 4, 2013 at 9:43 AM, René Jansen <[hidden email]> wrote:
[...]

Re: puzzled by the -utf8 option

rvjansen
George,

Thanks; you stated an important requirement that we need to work on. I am going to look into this for 3.03.

best regards,

René.

On 4 jul. 2013, at 17:35, George Hovey <[hidden email]> wrote:
[...]