Sharing - Web Scraping with Python and Python to Max via OSC

Tommy Martinez's icon

maxurl and jit.uldl are great for API requests and for parsing JSON or XML responses, as well as downloading files. But what happens when you want to scrape web pages that have no uniform response type and just return plain HTML from a GET request?

I tried my luck with some XML and HTML parsers from the forums but found them confusing, unsupported, or outdated. Regexp only got me so far as well. I had to do some Python programming for a Raspberry Pi project and quickly found that scraping with Python's BeautifulSoup module was the way to go. Python was really easy to get into, even with no serious coding background beyond some very basic HTML/CSS and even less JavaScript. I've been working with Max for 7 years or so and found that the programming concepts I learned there transferred easily to this new environment.

Here is a Python program that grabs headlines from guardian.co.uk and spits them out via OSC to Max. The Max patch is also attached. I hope this saves people some time and offers some new perspectives on integrating other programming languages with Max. As we know, Max is really good for visual and sonic applications, but other programming languages are better suited to parsing HTML. ;)

your fren,
T

here is the python
import argparse
import time
import urllib.request

from bs4 import BeautifulSoup
from pythonosc import osc_message_builder
from pythonosc import udp_client

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", default="127.0.0.1",
                        help="The ip of the OSC server")
    parser.add_argument("--port", type=int, default=8000,
                        help="The port the OSC server is listening on")
    args = parser.parse_args()

    # UDP client that sends OSC messages to Max
    client = udp_client.UDPClient(args.ip, args.port)

    # Fetch the Guardian front page and parse the returned HTML
    newstext = urllib.request.Request('http://www.guardian.co.uk/')
    resp = urllib.request.urlopen(newstext)
    respData = resp.read()
    soup = BeautifulSoup(respData, 'html.parser')

    displayText = []

    # Send each headline to Max as its own /headlines message
    for interesting in soup.find_all(class_='js-headline-text'):
        displayText.append(interesting.text)
        msg = osc_message_builder.OscMessageBuilder(address="/headlines")
        msg.add_arg(displayText[-1])
        client.send(msg.build())
        # Pause so the headlines arrive in Max one at a time
        time.sleep(.5)
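
In case it helps to see what actually travels over UDP, here is a rough sketch of a single-string OSC message built with only the standard library. It is not what python-osc does internally, just my reading of the OSC 1.0 layout: a padded address, a padded type tag string (",s" for one string argument), then the padded string. The function names (osc_string, encode_message, send_headline) are made up for this example, and I'm assuming the Max side listens on port 8000 with something like [udpreceive 8000].

```python
import socket


def osc_string(text):
    """Encode text as an OSC string: the bytes, a null terminator,
    then extra nulls so the total length is a multiple of 4."""
    # OSC 1.0 specifies ASCII; UTF-8 is used here as an assumption
    # so that non-ASCII headlines do not raise an error.
    data = text.encode("utf-8") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def encode_message(address, text):
    """Build an OSC message carrying exactly one string argument."""
    return osc_string(address) + osc_string(",s") + osc_string(text)


def send_headline(text, ip="127.0.0.1", port=8000, address="/headlines"):
    """Send one headline as a UDP datagram and return the raw packet."""
    packet = encode_message(address, text)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (ip, port))
    return packet
```

Calling send_headline("Hello Max") should show up in Max the same way one message from the script above does. For anything beyond a single string argument I would stick with python-osc.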

python2max.maxpat
Max Patch
Tommy Martinez's icon

Oh yeah, the Python script requires the BeautifulSoup and python-osc modules (on pip these are the beautifulsoup4 and python-osc packages). I'm using the 3.x branch of Python as well.
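
If installing BeautifulSoup is a hurdle, a rough version of the same class-based extraction can be done with the standard library's html.parser instead. To be clear, this swaps in a different technique from the script above, and it is only a sketch: HeadlineParser, extract_headlines, and the SAMPLE markup are invented for illustration, and the real Guardian page may be structured differently.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HeadlineParser(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.headlines = []
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            # Nested tag inside a headline: track it so we know when we leave
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._buffer).strip()
            if text:
                self.headlines.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_headlines(html, target_class="js-headline-text"):
    parser = HeadlineParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.headlines


# Made-up markup for testing offline; not copied from the real site
SAMPLE = (
    '<div><span class="js-headline-text">First <b>story</b></span>'
    '<p class="other">skip</p>'
    '<span class="fc-item js-headline-text">Second story</span></div>'
)

print(extract_headlines(SAMPLE))
```

BeautifulSoup copes much better with messy real-world markup, so I would treat this as a fallback only.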

Qbrick's icon

Would this work with Python 2.7? I'm having lots of errors.

The copied code gave me problems with

import urllib.request

newstext = urllib.request.Request('http://www.guardian.co.uk/')
resp = urllib.request.urlopen(newstext)

I'm really struggling to get Python talking to Max, i.e. a web scraper feeding Max. Any help would be very much appreciated!!

Tommy Martinez's icon

I'm using 3.5 so that would explain the errors. I don't have any experience with doing this in Python 2. Happy to ease you through the process if you're still having trouble getting it done in 3 though.
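
For reference, the likely reason those lines fail on 2.7 is that urllib.request only exists in Python 3; in Python 2 the equivalent functions live in urllib2. A minimal sketch of an import that works on either is below. The fetch helper is just a name for this example, and it only covers the fetching part: as far as I know python-osc targets Python 3, so the rest of the script may still need 3.x.

```python
try:
    # Python 3 location
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2 fallback
    from urllib2 import Request, urlopen


def fetch(url):
    """Return the raw bytes of the page at url."""
    return urlopen(Request(url)).read()
```

With that in place, fetch('http://www.guardian.co.uk/') would stand in for the Request/urlopen/read lines in the original script.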

Qbrick's icon

OK, via virtualenv I'm now using Python 3.5.0.

I installed python-osc, requests, and request.

But when running python testosc.py

I keep getting

File "testosc.py", line 19, in <module>
import urllib.request
ImportError: No module named request

Though when I run which python in pyenv, I still get Python 3.5.0.

Qbrick's icon

And am I also correct that these are two Python files?

import argparse
import random
import time

from pythonosc import osc_message_builder
from pythonosc import udp_client

#

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", default="127.0.0.1",
                        help="The ip of the OSC server")
    parser.add_argument("--port", type=int, default=8000,
                        help="The port the OSC server is listening on")
    args = parser.parse_args()

    client = udp_client.UDPClient(args.ip, args.port)

and

import urllib.request
import time
from bs4 import BeautifulSoup

newstext = urllib.request.Request('http://www.guardian.co.uk/')
resp = urllib.request.urlopen(newstext)
respData = resp.read()
soup = BeautifulSoup(respData, 'html.parser')

displayText = []

i = 0

for interesting in soup.findAll(class_='js-headline-text'):
    displayText.append(interesting.text)
    msg = osc_message_builder.OscMessageBuilder(address="/headlines")
    msg.add_arg(displayText[i])
    msg = msg.build()
    client.send(msg)
    i += 1
    time.sleep(.5)

Tommy Martinez's icon

Nope, that's one Python file, not two. Also, if you're getting the module error, it could be that it's not installed.
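
One way to narrow that down is to ask the interpreter itself which Python is running and which modules it can see. The wording "No module named request" looks like how Python 2 phrases that error, so it may be that python is resolving to a 2.x interpreter outside the virtualenv, but that is a guess. Below is a small diagnostic sketch; check_modules is a name invented for this example, and it needs Python 3.4 or later for importlib.util.find_spec.

```python
import importlib.util
import sys


def check_modules(names):
    """Map each module name to True if this interpreter can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Which interpreter is actually running this file?
    print(sys.executable)
    print(sys.version)
    # bs4 and pythonosc are the import names used by the script above
    for name, found in check_modules(["urllib.request", "bs4", "pythonosc"]).items():
        print(name, "found" if found else "MISSING")
```

Run it the same way as the scraper (python followed by the file name). If the version printed is 2.x, or if this check itself fails on the importlib.util import, the virtualenv is probably not the interpreter being used.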