XML to plain text

Dec 13, 2008 at 5:31am

Basically, I’m looking for a way to convert XML to plain text. Specifically, I’m working on a patch that prompts the user for a URL (for example, a cnn.com news story) and then returns just the main text block (the article itself) to Max for parsing.

Right now, I’m using the jit.uldl object to download the web page into a jit.textfile in XML format. From everything I’ve read, it seems I’ll have to use JavaScript to convert this to plain text. I’m wondering if there is another way to do this, as I have no experience with js or using it with Max. Does anyone at least know of a precedent or a sample patch that I could learn from?

Another specific of the project is that I would like to divide the text block itself into an array of sentences that can be accessed individually. For example, given the word “economy”, locate and print every sentence in the text block that contains the word “economy”. Should I be looking at pattr objects for this?

Thanks, I am obviously very new to this and sincerely appreciate any suggestions, hints, etc.

#41329
Dec 13, 2008 at 8:09am

You should be able to do this with the [regexp] or [jit.regexp] objects. I’m about to leave for work, but when I get back I can post some examples of how you might want to start with this project.

lh
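
For anyone reading later, here is a rough sketch of the regular-expression idea written as plain JavaScript, which could be pasted into a file for Max's [js] object. It is not the example promised above, and the function name xmlToPlainText is made up for illustration. It assumes the markup is simple enough that tags never contain a literal ">" inside attribute values, and it only decodes the five predefined XML entities.

```javascript
// Hypothetical helper: reduce an XML/HTML string to plain text using
// regular expressions only. A sketch, not a real XML parser.
function xmlToPlainText(xml) {
  var entities = {
    "&lt;": "<", "&gt;": ">", "&quot;": "\"", "&apos;": "'", "&amp;": "&"
  };
  return xml
    // drop comments
    .replace(/<!--[\s\S]*?-->/g, " ")
    // drop script and style blocks together with their contents
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // drop every remaining tag, leaving a space so words do not fuse
    .replace(/<[^>]+>/g, " ")
    // decode the five predefined XML entities in a single pass
    .replace(/&(lt|gt|quot|apos|amp);/g, function (m) { return entities[m]; })
    // collapse runs of whitespace, then trim the ends
    .replace(/\s+/g, " ")
    .replace(/^ | $/g, "");
}
```

Inside [js], a message handler could call this function and pass the result on with outlet(0, ...). The same patterns may need small adjustments for the [regexp] object, whose syntax is close to but not identical with JavaScript's.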

#147082
Dec 13, 2008 at 12:26pm

Hi Nathan,

you might have a look at the [regexp] object:
some useful info concerning the syntax can be found here:

http://www.python.org/doc/2.5.2/lib/re-syntax.html

Best,

Martijn
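
Regular expressions can also cover the second half of the question (splitting the text into sentences and finding the ones that contain a given word). The sketch below is plain JavaScript with hypothetical function names. It assumes a sentence ends at ".", "!" or "?", so abbreviations such as "U.S." will be split incorrectly.

```javascript
// Hypothetical helper: split plain text into an array of sentences.
// Assumption: a sentence ends at one or more of . ! ? (optionally
// followed by a closing quote or bracket); trailing text with no
// terminator is kept as a final sentence.
function splitSentences(text) {
  var matches = text.match(/[^.!?]+[.!?]+["')\]]*|[^.!?]+$/g) || [];
  var sentences = [];
  for (var i = 0; i < matches.length; i++) {
    var s = matches[i].replace(/^\s+|\s+$/g, ""); // trim each piece
    if (s.length > 0) {
      sentences.push(s);
    }
  }
  return sentences;
}

// Hypothetical helper: return every sentence containing `word` as a
// whole word, ignoring case.
function sentencesContaining(sentences, word) {
  // escape characters that have a special meaning in a regular expression
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var re = new RegExp("\\b" + escaped + "\\b", "i");
  var hits = [];
  for (var i = 0; i < sentences.length; i++) {
    if (re.test(sentences[i])) {
      hits.push(sentences[i]);
    }
  }
  return hits;
}
```

For example, splitting "The economy grew. Markets rose! Is the Economy stable?" and searching for "economy" returns the first and third sentences. In a patch, the resulting array could be stored in a [coll] or simply sent out one sentence at a time.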


#147083
Dec 13, 2008 at 8:11pm

have a look at [detox] in my collection. it’s a basic xml-parser
external.

various people have had success in parsing web-content based xml files
with it.

it is freely available at this location http://www.jasch.ch/dl/

cheers

/*j

#147084
Dec 14, 2008 at 5:25am

Thanks everyone for your help! This has been very informative already. Thanks jasch, these are great tools!

Right now I’m downloading to a jit.textfile matrix using jit.uldl. I can’t seem to figure out how to convert the matrix into a symbol (or symbols) to use with the detox object. Really, I can’t figure out how to get any form of the text out of the jit.textfile object. Any suggestions? Thanks!

#147085
Dec 14, 2008 at 8:49am

> Really, I can’t figure out how to get any form of the text out of
> the jit.textfile object. Any suggestions? Thanks!

connect [jit.textfile]’s middle outlet to [jit.spill] then to [itoa]
which gives you a symbol

/*j
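
To show what that chain does conceptually: [jit.spill] turns the matrix into a list of integer character codes, and [itoa] turns those codes into text. A minimal JavaScript sketch of the same conversion follows. The function name is hypothetical, and it assumes single-byte (ASCII or Latin-1) characters and that unused matrix cells are filled with 0.

```javascript
// Hypothetical helper: convert a list of integer character codes
// (such as the list [jit.spill] produces from a text matrix) into a string.
// Assumptions: one code per character, and 0 marks an unused cell.
function charCodesToString(codes) {
  var out = "";
  for (var i = 0; i < codes.length; i++) {
    if (codes[i] === 0) {
      continue; // skip padding
    }
    out += String.fromCharCode(codes[i]);
  }
  return out;
}
```

For instance, the list 60 112 62 72 105 60 47 112 62 becomes the text "<p>Hi</p>". Pages encoded as multi-byte UTF-8 would need real decoding rather than this one-code-per-character mapping.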

#147086
Dec 14, 2008 at 10:31am

thanks again jasch!

#147087
Dec 14, 2008 at 11:40am

also have a look at mzed’s weather-patch in this thread. it uses the
exact combination of objects you mentioned.
message #78731 Thu, 31 August 2006 20:39

http://www.cycling74.com/forums/index.php?t=msg&th=21535&rid=0&S=22e2f02a52001ed90e73030499bb6175

/*j

#147088
Dec 14, 2008 at 4:50pm

What about tap.xml.sax from the Tap tools – from electrotap.com:
“tap.xml.sax is a streaming XML file parser that allows you use any of a
myriad XML-based formats including music-xml, xhtml, and SVG (Scalable
Vector Graphics) files.”
Any feedback?

More fun would be to make your own lisp program within maxlispj (should be
there: http://music.columbia.edu/~brad/maxlispj )

J-F.
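
For readers unfamiliar with the streaming (SAX-style) idea: the parser walks the markup once and reports each start tag, end tag, and text run to callbacks, so you can keep only the text inside the elements you care about. The sketch below is a simplified illustration in plain JavaScript, not how tap.xml.sax or [detox] work internally. The names scanXml and collectText are hypothetical, and it assumes well-formed markup with no ">" inside comments or attribute values.

```javascript
// Hypothetical event-style scanner: reports start tags, end tags and
// text runs to optional callbacks. Comments, declarations and other
// markup are skipped. A simplification of SAX, for illustration only.
function scanXml(xml, handlers) {
  var token = /<\/([^\s>]+)\s*>|<([A-Za-z_][^\s\/>]*)[^>]*>|<[^>]*>|([^<]+)/g;
  var m;
  while ((m = token.exec(xml)) !== null) {
    if (m[1] !== undefined) {
      // end tag such as </p>
      if (handlers.endElement) handlers.endElement(m[1]);
    } else if (m[2] !== undefined) {
      // start tag such as <p class="x">
      if (handlers.startElement) handlers.startElement(m[2]);
      // self-closing tag such as <br/> also ends immediately
      if (/\/>$/.test(m[0]) && handlers.endElement) handlers.endElement(m[2]);
    } else if (m[3] !== undefined) {
      // text between tags
      if (handlers.characters) handlers.characters(m[3]);
    }
  }
}

// Hypothetical usage: gather only the text found inside <tagName> elements.
function collectText(xml, tagName) {
  var depth = 0;
  var parts = [];
  scanXml(xml, {
    startElement: function (name) {
      if (name === tagName) depth++;
    },
    endElement: function (name) {
      if (name === tagName && depth > 0) {
        depth--;
        if (depth === 0) parts.push(" "); // separate consecutive blocks
      }
    },
    characters: function (text) {
      if (depth > 0) parts.push(text);
    }
  });
  return parts.join("").replace(/\s+/g, " ").replace(/^ | $/g, "");
}
```

Calling collectText(page, "p") on a news page would keep paragraph text and drop headings, menus and scripts, which is one possible way to approximate "just the article". Which element actually holds the article varies by site, so the tag name here is only a guess.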


#147089
Dec 16, 2008 at 8:04am

thanks for all the help!

#147090
