Maxurl and jit.uldl are great for API requests, parsing JSON or XML responses, and downloading files. But what happens when you want to scrape web pages that have no uniform response type and just return plain HTML from a GET request?
I tried my luck with some XML and HTML parsers from the forums but found them confusing, unsupported, or outdated. Regexp only got me so far as well. I had to do some Python programming for a Raspberry Pi project and quickly found that scraping with Python's BeautifulSoup module was the way to go. Python was really easy to get into, even with no serious coding background besides some very basic HTML/CSS and even less JavaScript experience. I've been working with Max for 7 years or so and found the programming concepts I learned there easily transferable to this new environment.
Here is a Python program that grabs headlines from guardian.co.uk and spits them out via OSC to Max. The Max patch is also attached. I hope this saves people some time and gives them some new perspectives on integrating other programming languages with Max. As we know, Max is really good for visual and sonic applications, but other programming languages are better built for parsing HTML. ;)
your fren,
T
Here is the Python:
    import argparse
    import time
    import urllib.request

    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    from pythonosc import osc_message_builder  # pip install python-osc
    from pythonosc import udp_client

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--ip", default="127.0.0.1",
                            help="The ip of the OSC server")
        parser.add_argument("--port", type=int, default=8000,
                            help="The port the OSC server is listening on")
        args = parser.parse_args()
        client = udp_client.UDPClient(args.ip, args.port)

        # Fetch the front page and parse the raw HTML.
        request = urllib.request.Request('http://www.guardian.co.uk/')
        resp = urllib.request.urlopen(request)
        respData = resp.read()
        soup = BeautifulSoup(respData, 'html.parser')

        # Send each headline to Max as its own /headlines message.
        displayText = []
        for interesting in soup.find_all(class_='js-headline-text'):
            displayText.append(interesting.text)
            msg = osc_message_builder.OscMessageBuilder(address="/headlines")
            msg.add_arg(interesting.text)
            msg = msg.build()
            client.send(msg)
            time.sleep(.5)  # pace the messages half a second apart
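
If installing BeautifulSoup isn't an option, roughly the same "grab the text of everything with a given class" idea can be sketched with Python's built-in html.parser module. This is not the method used above, just a rough standard-library stand-in. The names HeadlineParser and extract_headlines are made up for this example, and SAMPLE is invented markup so it runs without hitting the network. Only the class name js-headline-text comes from the program above.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HeadlineParser(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # > 0 while we are inside a matching element
        self.headlines = []
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # nested tag inside a headline
        elif self.css_class in classes:
            self.depth = 1    # entering a headline element
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:   # leaving the headline element
            text = "".join(self._chunks).strip()
            if text:
                self.headlines.append(text)

    def handle_data(self, data):
        if self.depth:
            self._chunks.append(data)


def extract_headlines(html, css_class="js-headline-text"):
    parser = HeadlineParser(css_class)
    parser.feed(html)
    return parser.headlines


# Invented sample markup, standing in for a downloaded page.
SAMPLE = """
<div><a href="#"><span class="js-headline-text">First headline</span></a>
<span class="other">ignore me</span>
<span class="fc-item js-headline-text">Second <em>headline</em></span></div>
"""

if __name__ == "__main__":
    print(extract_headlines(SAMPLE))
```

Each string in the returned list could then be sent to Max the same way as in the program above. The sketch assumes reasonably well-formed HTML. Coping with messy real-world markup is exactly what BeautifulSoup is better at.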