extended ascii, wrong values with atoi

jmmmp's icon

Hi, I wanted to convert German text to ASCII using atoi. The extended German characters like ö, Ö, etc. don't work very well. E.g. instead of getting 246 for ö, I get 195 182 (two values??).

Is there a more reliable ascii symbol-int converter? Or am I doing something wrong?
Funnily, the text is displayed correctly by the text object.
Also, itoa doesn't display the same characters.

System: XP, Max 5

Peter Castine's icon

Suggest reading .

Nothing wrong with the values you're getting from atoi. You're just expecting the ready-made character while text (and message, and everything else) is using composite characters.

jmmmp's icon

Hi Peter, thanks for the reply.

That could make sense, but the thing is that the character 246 - ö - gave the numbers 195 and 182, Ã and ¶ [don't know if the characters will be correctly displayed].
Following the composite logic, it should give out o and trema, 111 + 168.

I tried with Pd with a similar object, and it did work with 246.

Btw, do you think this is an issue only of text, but not of the system? I got a Windows computer to work on, but the patch will also be used on a Mac.

Peter Castine's icon

I got the same results here on Mac OS. The letter ö is decomposed as {195 182}. In fact, I know this has been discussed many times and cross-platform isn't a problem.

I had to think about the decomposition for a minute… then I recalled that Max is using UTF-8. The anchor I linked to previously is a bit of a red herring, sorry. However, the Wikipedia article links to the article on UTF-8, if you want the gory details. It ain't pretty, but what atoi is doing is correct UTF-8.
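
To see where 195 and 182 come from, here is the arithmetic for ö as a minimal JavaScript sketch (the variable names are just for illustration). Code points from 128 to 2047 are split across two bytes: the top bits go into a lead byte of the form 110xxxxx, and the low six bits go into a continuation byte of the form 10xxxxxx.

```javascript
// Worked example: UTF-8 encoding of ö (code point 246, binary 11110110).
var codePoint = 246;

// Lead byte: 110xxxxx, carrying the bits above the low six.
var lead = 0xC0 | (codePoint >> 6);   // 192 | 3  = 195

// Continuation byte: 10xxxxxx, carrying the low six bits.
var cont = 0x80 | (codePoint & 0x3F); // 128 | 54 = 182
```

Those are the same two numbers atoi reports, and read as Latin-1 characters they are Ã and ¶.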

One could argue that it would make more sense for itoa and atoi to work directly with Unicode code point values. This appears to be what Pd is doing. But the Max behavior has now been with us since v5.0, so we're sort of stuck with it.

It would not be very hard to write an abstraction that can convert atoi's list output to the equivalent Unicode code points.
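
As a rough idea of what such a conversion involves, here is a minimal sketch in JavaScript. It is not the posted abstraction, and the function name `utf8ToCodePoints` is made up for this example. It assumes the input is a plain array of byte values, as in atoi's list output, and it does not reject overlong encodings or surrogate values.

```javascript
// Decode a list of UTF-8 byte values into Unicode code points.
// Hypothetical helper for illustration only.
function utf8ToCodePoints(bytes) {
    var out = [];
    var i = 0;
    while (i < bytes.length) {
        var b = bytes[i];
        var extra; // number of continuation bytes that follow the lead byte
        var cp;    // code point being assembled
        if (b < 0x80) {                      // 0xxxxxxx: plain ASCII
            extra = 0; cp = b;
        } else if (b >= 0xC0 && b < 0xE0) {  // 110xxxxx: 2-byte sequence
            extra = 1; cp = b & 0x1F;
        } else if (b >= 0xE0 && b < 0xF0) {  // 1110xxxx: 3-byte sequence
            extra = 2; cp = b & 0x0F;
        } else if (b >= 0xF0 && b < 0xF8) {  // 11110xxx: 4-byte sequence
            extra = 3; cp = b & 0x07;
        } else {
            throw new Error("unexpected byte " + b + " at index " + i);
        }
        for (var k = 1; k <= extra; k++) {
            var c = bytes[i + k];
            // Every continuation byte must look like 10xxxxxx.
            if (c === undefined || (c & 0xC0) !== 0x80) {
                throw new Error("bad continuation byte at index " + (i + k));
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push(cp);
        i += extra + 1;
    }
    return out;
}
```

With this, `utf8ToCodePoints([195, 182])` gives `[246]`, the value expected for ö.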

jmmmp's icon

Yes, I forgot to say that I had already noticed that Max uses UTF-8 (got there by trial and error).
I think an abstraction to correct the values is a good idea. May I just ask, do you know where I can find reliable comparison charts between UTF-8, ANSI, etc.? I'll google myself, but just wanted to know if you have a resource at hand.

João

Peter Castine's icon

The Wikipedia articles include tables of codepoints, and there's an article comparing coding systems. I haven't read it, but probably a good start.

Hope this helps -- P.

Peter Castine's icon

Hi João,

OK, here's one way of converting UTF-8 sequences to their Unicode values.

The "help" file has examples from the lowest and highest Unicode characters for single-byte, double-byte, and triple-byte UTF-8 sequences (well, close to the extreme values, the absolute extremes are mostly not printable characters or even used by Unicode). I've checked the calculated values against the values given in the Character View palette from the Input menu on Mac OS.

I tried some 4-byte sequences, but itoa seems to be generating incorrect UTF-8 sequences for Unicode code points U+10000 and above. So now someone's got to write a bug report :-(

Hope this helps,
— P.

(PS: nice seeing you guys at the weekend.)

1743.utf8toint.zip
dimbels's icon

hey there, I am currently stuck on the same matter - parsing longer sentences to ascii in order to send them via tcp. However, the same problem with Umlaute/extended ascii occurs. I tried to fiddle around with regexp, without much success so far. Any new tips and pointers on the matter?
best, jost

Peter Castine's icon

You didn't say… did you try the abstraction posted? It sorted out the problems I had. You may be after a different solution, but I'm not sure what that is.

It's been a gazillion years since I looked at the abstraction (well, two-and-a-half), but I think it should still do what it says on the tin.

dimbels's icon

hey peter, thanks for the answer. I used js charCodeAt() and fromCharCode for a better conversion, which works great. Yet, I get wrong values once sending them over mxj net.tcp.send. It does not seem to swallow them so far.
I posted on that matter in the java forum, in hope for a solution: https://cycling74.com/forums/mxj-net-tcp-send-and-extended-ascii/
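
For reference, a minimal sketch of that kind of conversion. The function names `stringToCodes` and `codesToString` are invented for this example, and nothing here touches mxj net.tcp.send. Note that `charCodeAt()` returns UTF-16 code units, so ö comes out as the single value 246 rather than the two UTF-8 bytes 195 182; whatever sits at the other end of the connection has to expect the same representation.

```javascript
// Convert a string to a list of UTF-16 code unit values.
// Hypothetical helper for illustration only.
function stringToCodes(s) {
    var codes = [];
    for (var i = 0; i < s.length; i++) {
        codes.push(s.charCodeAt(i));
    }
    return codes;
}

// Convert a list of code unit values back to a string.
function codesToString(codes) {
    return String.fromCharCode.apply(null, codes);
}

// "\u00F6" is ö written as an escape, so the result does not depend
// on how the source file itself is encoded.
var codes = stringToCodes("sch\u00F6n");  // [115, 99, 104, 246, 110]
var text = codesToString(codes);          // "schön"
```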

elby's icon

Hi,

I have a similar problem, and this post is not so old, so I'm asking: is it possible to invert utf8toint so I can get the UTF-8 values from a Unicode value?

Regards

LR
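
One way the inverse direction could look, as a minimal JavaScript sketch rather than an inverted version of the utf8toint abstraction itself. The function name `codePointToUtf8` is made up for this example, and it does not reject surrogate values (U+D800 to U+DFFF).

```javascript
// Encode one Unicode code point as a list of UTF-8 byte values.
// Hypothetical helper for illustration only.
function codePointToUtf8(cp) {
    if (cp < 0 || cp > 0x10FFFF) {
        throw new Error("code point out of range: " + cp);
    }
    if (cp < 0x80) {            // 1 byte:  0xxxxxxx
        return [cp];
    }
    if (cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)];
    }
    if (cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)];
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
}
```

For ö, `codePointToUtf8(246)` gives `[195, 182]`, matching what atoi outputs.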

Brahma Gupta's icon

Hi,

I had the same problem with JavaScript code in an empty HTML page:

'ö'.charCodeAt(0); // return 165

Add this meta-tag in the head of the html page

...
'ö'.charCodeAt(0); // return 246