How to parse text from a website in Max
Hi folks,
So I'd like to be able to pull the text from a website (say, for example, an article on a magazine site) for use in a Max patch. It needs to be flexible enough to work on any site it might target, and I need to be reasonably sure it will pull the main body of the page text without also pulling in things like menus and other surrounding text.
Is there a way to do this which doesn't download the site to the local computer? I just want it to scrape data from sites basically, not download it.
Well, you will always need to download the page one way or another, but you don't need to load images or style sheets, or save it locally, e.g. if you fetch it with [maxurl].
I can't think of any way to reliably detect the "main" text without looking at the markup beforehand and deciding on a way to parse out the text. The HTML markup structure of pages is very diverse. In theory HTML5 introduced a semantic markup structure that is intended to make these tasks easier (e.g. the <article> tag).
But first of all these tags are currently hardly in use, and secondly there is no strict rule in HTML5 for how and when they have to be used.
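As an illustration, here is a minimal sketch for the [js] object, assuming the page source arrives as one string (the function name and the bare-bones regex approach are just my choices, nothing official):

// paste into a [js] object; call as "parse <page source>"
outlets = 1;

function parse(html) {
    // look for the first <article>...</article> block (HTML5 semantic markup)
    var m = html.match(/<article[^>]*>([\s\S]*?)<\/article>/i);
    if (!m) {
        post("no <article> element found\n");
        return;
    }
    // strip the tags left inside the article and collapse whitespace
    var text = m[1].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ");
    outlet(0, text);
}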
You might try something like looking for long text blocks within certain tags. But this won't be 100% reliable.
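A sketch of that heuristic, with the same assumptions as above - it simply keeps the longest <p> block it finds, on the theory that the article body is where most of the prose lives:

// heuristic fallback: output the longest <p> block on the page
function longest(html) {
    var re = /<p[^>]*>([\s\S]*?)<\/p>/gi;
    var best = "";
    var m;
    while ((m = re.exec(html)) !== null) {
        // strip nested tags before comparing lengths
        var text = m[1].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ");
        if (text.length > best.length) best = text;
    }
    outlet(0, best);
}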
j
Hi Jan, thanks for your reply - that's as I thought, thanks for confirming it. Saving it locally isn't going to work, so I'll have to go down the route of dragging the text in manually. The only issue with that is that there appears to be a character limit on a symbol after it is passed out of [textedit] - I don't suppose you know a way around that?
i'd probably do the parsing with dict/javascript. i believe there are some examples in the helpfile of [maxurl].
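just as a rough sketch of the glue (i don't have the helpfile at hand, so the dict layout - a "body" entry holding the page source - is my assumption, check the real key names there):

// handles a dict arriving at [js] from [maxurl]'s output
// the "body" key is an assumption -- see the maxurl helpfile for the real layout
function dictionary(name) {
    var d = new Dict(name);
    var html = d.get("body");
    if (html) parse(html); // hand off to a parse() function like jan's sketch above
}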