extended ascii, wrong values with atoi

Jan 27, 2011 at 1:35am

Hi, I wanted to convert German text to ASCII using atoi. The extended German characters like ö, Ö, etc. don’t work very well. E.g. instead of getting 246 for ö, I get 195 182 (two values??).

Is there a more reliable ascii symbol-int converter? Or am I doing something wrong?
Funnily, the text is displayed correctly by the text object.
Also, itoa doesn’t display the same characters.

System: XP, Max 5

#54623
Jan 27, 2011 at 11:07am
#196692
Jan 27, 2011 at 2:17pm

Hi Peter, thanks for the reply.

That could make sense, but the thing is that the character 246 (ö) gave the numbers 195 and 182, Ã and ¶ [don't know if the characters will be correctly displayed].
Following the composite logic, it should give out o and trema, 111 + 168.
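
For comparison, here is a small sketch of what a genuinely decomposed ö looks like. It is written in Python purely for illustration (it is not part of Max) and assumes only the standard library's unicodedata module. The decomposed form is o (111) followed by the combining diaeresis U+0308 (776), not the spacing diaeresis 168, and neither form yields the pair 195 182 as code points.

```python
import unicodedata

precomposed = "\u00f6"  # ö as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # o + combining diaeresis

# Code points of each form
precomposed_points = [ord(ch) for ch in precomposed]
decomposed_points = [ord(ch) for ch in decomposed]

# UTF-8 byte values of each form
precomposed_bytes = list(precomposed.encode("utf-8"))
decomposed_bytes = list(decomposed.encode("utf-8"))

print(precomposed_points, precomposed_bytes)  # [246] [195, 182]
print(decomposed_points, decomposed_bytes)    # [111, 776] [111, 204, 136]
```

So 195 182 is the byte-level encoding of the single code point 246, not a decomposition into base letter plus accent.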

I tried with Pd with a similar object, and it did work with 246.

Btw, do you think this is an issue only of text, but not of the system? I got a windows computer to work on, but the patch will also be used on a mac.

#196693
Jan 27, 2011 at 4:39pm

I got the same results here on Mac OS. The letter ö is decomposed as {195 182}. In fact, I know this has been discussed many times and cross-platform isn’t a problem.

I had to think about the decomposition for a minute… then I recalled that Max is using UTF-8. The anchor I linked to previously is a bit of a red herring, sorry. However, the Wikipedia article links to the article on UTF-8, if you want the gory details. It ain’t pretty, but what atoi is doing is correct UTF-8.
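
The arithmetic behind 195 182 can be checked outside Max. The following is a minimal Python sketch, used only as an illustration: 246 is 11110110 in binary, and the two-byte UTF-8 layout puts the top bits after a 110 prefix and the low six bits after a 10 prefix.

```python
code_point = ord("\u00f6")  # ö, Unicode (and Latin-1) value 246

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
lead = 0b11000000 | (code_point >> 6)            # 192 + 3  = 195
continuation = 0b10000000 | (code_point & 0x3F)  # 128 + 54 = 182

manual = [lead, continuation]
builtin = list("\u00f6".encode("utf-8"))  # Python's own UTF-8 encoder

print(manual, builtin)  # [195, 182] [195, 182]
```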

One could argue that it would make more sense for itoa and atoi to work directly with Unicode code points. This appears to be what Pd is doing. But the Max behavior has now been with us since v5.0, so we’re sort of stuck with it.

It would not be very hard to write an abstraction that can convert atoi’s list output to the equivalent Unicode code points.
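
To make the logic of such a conversion concrete, here is a rough sketch in Python rather than as a Max abstraction. The function name utf8_to_code_points is hypothetical, and the input is assumed to be a flat list of byte values like the one atoi outputs. It does not reject overlong encodings, so treat it as an illustration rather than a validating decoder.

```python
def utf8_to_code_points(values):
    """Convert a flat list of UTF-8 byte values (0-255) to Unicode code points."""
    code_points = []
    i = 0
    while i < len(values):
        first = values[i]
        if first < 0x80:               # 0xxxxxxx: single byte (ASCII)
            extra, point = 0, first
        elif first >> 5 == 0b110:      # 110xxxxx: lead byte of a 2-byte sequence
            extra, point = 1, first & 0x1F
        elif first >> 4 == 0b1110:     # 1110xxxx: lead byte of a 3-byte sequence
            extra, point = 2, first & 0x0F
        elif first >> 3 == 0b11110:    # 11110xxx: lead byte of a 4-byte sequence
            extra, point = 3, first & 0x07
        else:
            raise ValueError(f"unexpected lead byte {first} at index {i}")

        tail = values[i + 1 : i + 1 + extra]
        # Every continuation byte must look like 10xxxxxx
        if len(tail) != extra or any(b >> 6 != 0b10 for b in tail):
            raise ValueError(f"bad continuation after index {i}")

        for b in tail:
            point = (point << 6) | (b & 0x3F)  # append 6 payload bits
        code_points.append(point)
        i += 1 + extra
    return code_points


print(utf8_to_code_points([195, 182]))                     # [246]
print(utf8_to_code_points([72, 195, 182, 114, 101, 110]))  # [72, 246, 114, 101, 110]
```

The second call is the word "Hören": the ASCII letters pass through unchanged and only the 195 182 pair collapses into 246.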

#196694
Jan 27, 2011 at 4:52pm

Yes, I forgot to say that I had already noticed that Max uses UTF-8 (got there by trial and error).
I think an abstraction to correct the values is a good idea. May I just ask, do you know where I can find reliable comparison charts between UTF-8, ANSI, etc.? I’ll google it myself, but just wanted to know if you have a resource at hand.

João

#196695
Jan 27, 2011 at 8:03pm

The Wikipedia articles include tables of codepoints, and there’s an article comparing coding systems. I haven’t read it, but probably a good start.

Hope this helps — P.

#196696
Jan 31, 2011 at 1:01pm

Hi João,

OK, here’s one way of converting UTF-8 sequences to their Unicode values.

The “help” file has examples from the lowest and highest Unicode characters for single-byte, double-byte, and triple-byte UTF-8 sequences (well, close to the extreme values, the absolute extremes are mostly not printable characters or even used by Unicode). I’ve checked the calculated values against the values given in the Character View palette from the Input menu on Mac OS.

I tried some 4-byte sequences, but itoa seems to be generating incorrect UTF-8 sequences for Unicode code points above U+10000. So now someone’s got to write a bug report :-(
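
For anyone comparing itoa's output against a reference, the expected UTF-8 byte values around the 4-byte boundary can be generated independently of Max. This is a small Python sketch for illustration; the sample code points are an arbitrary choice and are not taken from the help file.

```python
# Sample code points just below, at, and above the 3-byte/4-byte boundary
samples = [0xFFFF, 0x10000, 0x1D11E, 0x10FFFF]

# Map each code point to the byte values Python's UTF-8 encoder produces
expected = {cp: list(chr(cp).encode("utf-8")) for cp in samples}

for cp, byte_values in expected.items():
    print(f"U+{cp:04X} -> {byte_values}")
```

U+FFFF is the last code point that fits in three bytes; everything from U+10000 up to U+10FFFF needs four.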

Hope this helps,
— P.

(PS: nice seeing you guys at the weekend.)

Attachments:
  1. utf8toint.zip
#196697
Aug 30, 2013 at 3:14am

hey there, I am currently stuck on the same matter – parsing longer sentences to ascii in order to send them via tcp. However, the same problem with Umlaute/extended ascii occurs. I tried to fiddle around with regexp, without much success so far. Any new tips and pointers on the matter?
best, jost

#263733
Aug 31, 2013 at 9:44am

You didn’t say… did you try the abstraction posted? It sorted out the problems I had. You may be after a different solution, but I’m not sure what that is.

It’s been a gazillion years since I looked at the abstraction (well, two-and-a-half), but I think it should still do what it says on the tin.

#263837
Aug 31, 2013 at 3:12pm

hey peter, thanks for the answer. I used js charCodeAt() and fromCharCode for a better conversion, which works great. However, I get wrong values when sending them over mxj net.tcp.send. It does not seem to swallow them so far.
I posted on that matter in the java forum, in the hope of a solution: http://cycling74.com/forums/topic/mxj-net-tcp-send-and-extended-ascii/

#263860
Oct 1, 2013 at 11:16am

Hi,

I have a similar problem, and this post is not so old, so I’m asking: is it possible to invert utf8toint so I can get the UTF-8 values from a Unicode code point?
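
The inverse direction is the same bit-shuffling run backwards. As a rough illustration in Python (not a Max patch, and not the contents of utf8toint.zip), the hypothetical function code_point_to_utf8 below splits one code point into one to four byte values:

```python
def code_point_to_utf8(cp):
    """Return the UTF-8 byte values (as ints) for one Unicode code point."""
    # Surrogates (U+D800..U+DFFF) and values past U+10FFFF have no UTF-8 form
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


print(code_point_to_utf8(246))   # [195, 182]
print(code_point_to_utf8(8364))  # [226, 130, 172]
```

In a Max patch the same shifts and masks could be done with the usual bitwise integer objects; how that maps onto a particular abstraction is left open here.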

Regards

LR

#266745
