Sharing - Web Scraping with Python and Python to Max via OSC
maxurl and jit.uldl are great for API requests, for parsing JSON or XML responses, and for downloading files. But what happens when you want to scrape web pages that have no uniform response type and just return plain HTML from a GET request?
I tried my luck with some XML and HTML parsers from the forums but found them confusing, unsupported, or outdated. Regexp only went so far for me as well. I had to do some Python programming for a Raspberry Pi project and quickly found that scraping with the BeautifulSoup module for Python was the way to go. Python was really easy to get into with no serious coding background besides some very basic HTML/CSS and even less JavaScript experience. I've been working with Max for 7 years or so and found the programming concepts I learned there easily transferable to this new environment.
Here is a Python program that grabs headlines from guardian.co.uk and spits them out via OSC to Max. The Max patch is also attached. I hope that this saves people some time and gives people some new perspectives on integrating other programming languages with Max. As we know, Max is really good for visual and sonic applications, but other programming languages are better suited to parsing HTML. ;)
your fren,
T
Here is the Python:
import argparse
import time
import urllib.request

from bs4 import BeautifulSoup
from pythonosc import osc_message_builder
from pythonosc import udp_client

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", default="127.0.0.1",
                        help="The ip of the OSC server")
    parser.add_argument("--port", type=int, default=8000,
                        help="The port the OSC server is listening on")
    args = parser.parse_args()
    client = udp_client.UDPClient(args.ip, args.port)

    # Fetch the front page and parse the HTML.
    newstext = urllib.request.Request('http://www.guardian.co.uk/')
    resp = urllib.request.urlopen(newstext)
    respData = resp.read()
    soup = BeautifulSoup(respData, 'html.parser')

    # Send each headline to Max as its own OSC message, half a second apart.
    displayText = []
    for interesting in soup.find_all(class_='js-headline-text'):
        displayText.append(interesting.text)
        msg = osc_message_builder.OscMessageBuilder(address="/headlines")
        msg.add_arg(interesting.text)
        msg = msg.build()
        client.send(msg)
        time.sleep(.5)
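If the loop above sends nothing, one possible cause is that the class name no longer matches the page's markup. A rough way to try the "collect text by class" step offline is sketched below. Note the swap: it uses the standard library's html.parser instead of BeautifulSoup so it runs with nothing installed, and it only approximates what find_all(class_=...) does. The sample HTML, HeadlineParser and extract_headlines are made up for illustration; only the js-headline-text class name comes from the post.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

# Hypothetical sample markup, not copied from the real site.
SAMPLE_HTML = """
<ul>
  <li><a><span class="js-headline-text">First <em>test</em> headline</span></a></li>
  <li><span class="other">Not a headline</span></li>
  <li><span class="fc-item js-headline-text">Second test headline</span></li>
</ul>
"""


class HeadlineParser(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching element
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1
            self.headlines.append("")    # start collecting a new headline

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.headlines[-1] += data


def extract_headlines(html, css_class="js-headline-text"):
    """Return the stripped text of each element with the given class."""
    parser = HeadlineParser(css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.headlines]


print(extract_headlines(SAMPLE_HTML))
```

Saving the real page to a file and feeding it to the same function is a quick way to see whether the class still exists, before involving OSC or Max at all.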
Oh yeah, the Python requires that you install the beautifulsoup4 and python-osc modules. I'm using the 3.x branch of Python as well.
Would this work with Python 2.7? I'm having lots of errors.
The copied code gave me problems with:
import urllib.request
newstext = urllib.request.Request('http://www.guardian.co.uk/')
resp = urllib.request.urlopen(newstext)
I'm really struggling to get Python talking to Max, i.e. a web scraper feeding Max. Any help would be very much appreciated!
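One way to narrow this kind of problem down is to check whether the Python side is sending anything at all, without Max in the loop. The sketch below is a throwaway UDP listener that decodes the simple case the script produces: an OSC address followed by string arguments. The function names are invented for this example, and the decoding is a hand-rolled reading of the OSC 1.0 layout (null-terminated strings padded to 4 bytes), not python-osc's own parser. The defaults of 127.0.0.1 and 8000 are the ones from the script.

```python
import socket


def osc_pad(raw):
    """Null-terminate and pad bytes to a multiple of 4, as OSC strings require."""
    return raw + b"\x00" * (4 - len(raw) % 4)


def encode_string_message(address, text):
    """Build a one-string OSC message by hand (only used to test the decoder)."""
    return osc_pad(address.encode("utf-8")) + osc_pad(b",s") + osc_pad(text.encode("utf-8"))


def read_osc_string(data, offset):
    """Read one OSC string starting at offset; return (text, next_offset)."""
    end = data.index(b"\x00", offset)
    text = data[offset:end].decode("utf-8")
    # Skip the terminator and padding: the next field starts on a 4-byte boundary.
    return text, (end + 4) & ~3


def decode_string_message(data):
    """Decode an OSC message whose arguments are all strings."""
    address, offset = read_osc_string(data, 0)
    type_tags, offset = read_osc_string(data, offset)
    args = []
    for tag in type_tags[1:]:          # type_tags looks like ",s" or ",ss"
        if tag != "s":
            raise ValueError("this sketch only handles string arguments")
        value, offset = read_osc_string(data, offset)
        args.append(value)
    return address, args


def listen(ip="127.0.0.1", port=8000, count=5):
    """Print the next `count` OSC messages arriving on ip:port, then stop."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((ip, port))
        for _ in range(count):
            data, _sender = sock.recvfrom(4096)
            print(decode_string_message(data))


# Self-check of the decoder; call listen() yourself to watch real traffic.
print(decode_string_message(encode_string_message("/headlines", "Test headline")))
```

To use it, call listen() in one terminal and run the scraper in another. Max usually needs to be closed or set to a different port while doing this, since normally only one program can bind a given UDP port. If headlines show up here but not in Max, the problem is likely on the Max side (port number or routing of /headlines).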
I'm using 3.5, so that would explain the errors: urllib.request only exists in Python 3. I don't have any experience with doing this in Python 2. Happy to ease you through the process if you're still having trouble getting it done in 3 though.
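For anyone hitting the same import error, here is a small sketch of the usual workaround for the urllib difference between Python 2 and 3. It prints the running version and falls back to urllib2, which is where Python 2 keeps Request and urlopen. This only gets past the urllib import; as far as I know python-osc targets Python 3 only, so moving to Python 3 is probably still the simpler route.

```python
import sys

# Show which Python version is actually executing this file.
print("Python %d.%d.%d" % sys.version_info[:3])

try:
    # Python 3: urllib is a package and request is one of its submodules.
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2: the same two names live in urllib2.
    from urllib2 import Request, urlopen

# Request and urlopen can now be used the same way under either version.
print(Request.__name__, urlopen.__name__)
```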
OK, via virtualenv I'm using Python 3.5.0.
I installed python-osc, requests, request.
But when running python testosc.py
I keep getting:
File "testosc.py", line 19, in <module>
import urllib.request
ImportError: No module named request
Though when I check which python in pyenv, I still get Python 3.5.0.
And am I also correct that these are two Python files?
import argparse
import random
import time
from pythonosc import osc_message_builder
from pythonosc import udp_client
#
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", default="127.0.0.1",
                        help="The ip of the OSC server")
    parser.add_argument("--port", type=int, default=8000,
                        help="The port the OSC server is listening on")
    args = parser.parse_args()
    client = udp_client.UDPClient(args.ip, args.port)
and
import urllib.request
import time
from bs4 import BeautifulSoup
newstext = urllib.request.Request('http://www.guardian.co.uk/')
resp = urllib.request.urlopen(newstext)
respData = resp.read()
soup = BeautifulSoup(respData, 'html.parser')
displayText = []
i = 0
for interesting in soup.findAll(class_='js-headline-text'):displayText.append(interesting.text)
msg = osc_message_builder.OscMessageBuilder(address="/headlines")
msg.add_arg(displayText[i])
msg = msg.build()
client.send(msg)
i += 1
time.sleep(.5)
Nope, that's one Python file, not two. Also, if you're getting the module error, it could be that the module isn't installed.
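A quick way to check the "not installed" theory is to ask the interpreter that actually runs the script which modules it can find. The sketch below does that; missing_modules is a name made up for this example, and bs4 and pythonosc are the import names used in the script. One hedged observation: urllib.request ships with Python 3, and the wording "No module named request" looks like Python 2's error style, so the script may be running under a different interpreter than the virtualenv's. Another possibility is a local file named urllib.py shadowing the standard library.

```python
import sys

# The interpreter really running this file; compare it with the virtualenv path.
print(sys.executable)

# Python 3 only. If this import itself fails, the file is running under Python 2.
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both third-party modules are visible to this interpreter.
print(missing_modules(["bs4", "pythonosc"]))
```

If the printed path is not inside the virtualenv, installing the modules there will not help until the script is started with that environment's python.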