Sharing - Web Scraping with Python and Python to Max via OSC


    Nov 03 2016 | 3:19 am
    Maxurl and jit.uldl are great for API requests, for parsing JSON or XML responses, and for downloading files - but what happens when you want to scrape pages that have no uniform response type and just return plain HTML from a GET request?
    I tried my luck with some XML and HTML parsers from the forums but found them confusing and/or unsupported/outdated, and regexp only got me so far as well. I had to do some Python programming for a Raspberry Pi project and quickly found that scraping with the BeautifulSoup module was the way to go. Python was really easy to get into with no serious coding background besides some very basic HTML/CSS and even less JavaScript experience. I've been working with Max for 7 years or so and found that the programming concepts I learned there transfer easily to this new environment.
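For comparison, here is a rough sketch of the kind of class-based extraction BeautifulSoup's findAll(class_=...) does, built with only the standard library's html.parser. The HTML snippet is just illustrative; this is meant to show why bs4 is worth installing, not to replace it:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Rough stdlib-only version of soup.findAll(class_='js-headline-text')."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # flag any element whose class list contains the target class
        classes = dict(attrs).get('class', '')
        if 'js-headline-text' in classes.split():
            self.in_headline = True

    def handle_data(self, data):
        # collect the text directly inside a flagged element
        if self.in_headline:
            self.headlines.append(data.strip())
            self.in_headline = False

parser = HeadlineParser()
parser.feed('<span class="js-headline-text">Hello</span>'
            '<span class="js-headline-text">World</span>')
print(parser.headlines)  # ['Hello', 'World']
```

BeautifulSoup handles nested tags, broken markup, and attribute quirks that this sketch does not.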
    Here is a Python program that grabs headlines from guardian.co.uk and spits them out via OSC to Max. The Max patch is also attached. I hope this saves people some time and gives people some new perspectives on integrating other programming languages with Max. As we know, Max is really good for visual and sonic applications, but other programming languages are built better for parsing HTML. ;)
    your fren, T
    here is the python:

        import argparse
        import random
        import time

        from pythonosc import osc_message_builder
        from pythonosc import udp_client

        import urllib.request
        from bs4 import BeautifulSoup

        if __name__ == "__main__":
            parser = argparse.ArgumentParser()
            parser.add_argument("--ip", default="127.0.0.1",
                                help="The ip of the OSC server")
            parser.add_argument("--port", type=int, default=8000,
                                help="The port the OSC server is listening on")
            args = parser.parse_args()

            client = udp_client.UDPClient(args.ip, args.port)

            newstext = urllib.request.Request('http://www.guardian.co.uk/')
            resp = urllib.request.urlopen(newstext)
            respData = resp.read()
            soup = BeautifulSoup(respData, 'html.parser')

            displayText = []
            i = 0
            for interesting in soup.findAll(class_='js-headline-text'):
                displayText.append(interesting.text)
                msg = osc_message_builder.OscMessageBuilder(address="/headlines")
                msg.add_arg(displayText[i])
                msg = msg.build()
                client.send(msg)
                i += 1
                time.sleep(.5)
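For anyone curious what python-osc is actually putting on the wire for each headline, here is a hedged sketch of the same message built by hand from the OSC 1.0 layout (null-padded address string, a ",s" type tag, then the null-padded string argument). python-osc does all of this for you; the address and port here just mirror the script's defaults:

```python
import socket

def osc_pad(b: bytes) -> bytes:
    """Null-terminate and pad to a 4-byte boundary, per OSC 1.0."""
    b += b'\x00'
    while len(b) % 4:
        b += b'\x00'
    return b

def build_headline_message(text: str) -> bytes:
    """A minimal /headlines message carrying a single string argument."""
    return osc_pad(b'/headlines') + osc_pad(b',s') + osc_pad(text.encode('utf-8'))

# send the raw datagram to the same default destination as the script
packet = build_headline_message('Hi')
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ('127.0.0.1', 8000))
```

Max's udpreceive on port 8000 decodes this the same way it decodes python-osc's output.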

    • Nov 03 2016 | 3:27 am
      Oh yeah, the python requires that you install beautifulsoup and python-osc modules. I'm using the 3.x branch of python as well.
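For reference, the PyPI package names differ slightly from the import names (assuming pip for a Python 3 install):

```shell
# bs4 is published as beautifulsoup4; pythonosc as python-osc
pip3 install beautifulsoup4 python-osc
```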
    • Feb 04 2017 | 10:06 pm
      Would this work with Python 2.7? I'm having lots of errors...
      The copied code gave me problems with
      import urllib.request
      newstext = urllib.request.Request('http://www.guardian.co.uk/')
      resp = urllib.request.urlopen(newstext)
      I'm really struggling to get python on max i.e. a webscraper to max. Any help would be very much appreciated!!
    • Feb 04 2017 | 10:09 pm
      I'm using 3.5 so that would explain the errors. I don't have any experience with doing this in Python 2. Happy to ease you through the process if you're still having trouble getting it done in 3 though.
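If you do want to try it under 2.7 anyway, the main stumbling block is that urllib was reorganized in Python 3 - urllib.request doesn't exist in Python 2, where the same names live in urllib2. A hedged sketch of a version-agnostic import (I haven't tested this on 2.7):

```python
try:
    # Python 3: urllib was split into submodules
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2: the equivalent names live in urllib2
    from urllib2 import Request, urlopen

req = Request('http://www.guardian.co.uk/')
# resp = urlopen(req)  # the actual network call is unchanged either way
```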
    • Feb 07 2017 | 10:55 am
      Ok, via virtualenv I'm using Python 3.5.0.
      I import python-osc, requests, request.
      But when running python testosc.py
      I keep getting
      File "testosc.py", line 19, in <module>
        import urllib.request
      ImportError: No module named request
      Though when I run which python in pyenv, I still get Python 3.5.0.
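That traceback is a Python 2 error ("No module named request" only happens where urllib has no request submodule), so one quick check - an assumption about the cause - is to print the version from inside the failing script itself rather than trusting which:

```python
import sys

# if this reports major=2, the script is not actually running under the 3.5 env
print(sys.version_info)
# the interpreter binary that is really executing this file
print(sys.executable)
```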
    • Feb 07 2017 | 11:48 am
      And am I also correct that these are two python files?
      import argparse
      import random
      import time
      from pythonosc import osc_message_builder
      from pythonosc import udp_client

      #

      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--ip", default="127.0.0.1", help="The ip of the OSC server")
          parser.add_argument("--port", type=int, default=8000, help="The port the OSC server is listening on")
          args = parser.parse_args()

          client = udp_client.UDPClient(args.ip, args.port)
      and
      import urllib.request
      import time
      from bs4 import BeautifulSoup

      newstext = urllib.request.Request('http://www.guardian.co.uk/')
      resp = urllib.request.urlopen(newstext)
      respData = resp.read()
      soup = BeautifulSoup(respData, 'html.parser')

      displayText = []
      i = 0
      for interesting in soup.findAll(class_='js-headline-text'):
          displayText.append(interesting.text)
          msg = osc_message_builder.OscMessageBuilder(address="/headlines")
          msg.add_arg(displayText[i])
          msg = msg.build()
          client.send(msg)
          i += 1
          time.sleep(.5)
    • Feb 07 2017 | 4:37 pm
      Nope, that's one Python file, not two. Also, if you're getting the module error, it could be that the module isn't installed.