Beautiful soup for ugly html

by Sue

Beautiful soup: You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.
Neither does this parser.

I am taking a class in data visualization and our second assignment was to take any website and strip data from it using python and the beautiful soup parser.  Now, being a jQuery lover I thought “scraping a website will be easy *COCKY GRIN*” because I am used to traversing the DOM with jQuery’s selector engine.

Then I looked at the sites we had to use and my jaw dropped.  EW. No tags around text, random iframes, random links,  just a whole lot of ugly horrible crap.

I am still working on the script I used to scrape the site.  I still get some errant div tags. I know I should use re.sub to get rid of them, but so far it breaks :/

import re, util
from BeautifulSoup import BeautifulSoup, Comment
 
# list
list=[]
 
# base url
baseurl = "http://www.plyrics.com/w/weezer.html"
 
# use cs171-util to get a soup object that represents a webpage
soup = util.mysoupopen(baseurl)
#I started with the dido lyric scraper I saw on the forums. I have never written
#in python before. I played with it for a while and came up with my own way of
#grabbing urls.
 
# titleCols- grab the HREF values of all links
titleCols = soup.findAll("a", href=True)
 
# if there are no href values, stop :)
# program!
if(len(titleCols) == 0):
    exit;
 
# for each entry
for url in titleCols:
    mc = str(url['href'])
    #print mc
    # find the urls
    m = re.search("weezer/(.*)", mc);
    if(m != None):
        test=(m.groups()[0].strip())
        list.append(test)
        #print test  just checking what it's grabbing  I use print a LOT
 
#start the scraping of the other urls
# this creates a tab-delimited file
delim = "\t"
 
# base url 2- this is the first part of my path
baseurl2 = "http://www.plyrics.com/lyrics/weezer/"
 
# while loop
c = 0
while c < len(list):    
 
# add the file name to the plyrics path
    baseurl3 = baseurl2 + list[c]
    soup2 = util.mysoupopen(baseurl3)
 
    #just checking my urls - Accidentally created infinite loop. funtimes.
    #print baseurl3
 
# grab lyrics
    lyrics = soup2.findAll("div",{"class":"body_lyr"})
 
# No lyrics? stop program!
    if(len(lyrics) == 0):
        exit;
# for each article...
    for entry in lyrics:
        mc2 = str(entry)
    # find the name of the product
        mosoup = BeautifulSoup(mc2)
        #remove comments & all other stuff
        comments = mosoup.findAll(text=lambda text:isinstance(text, Comment))
        [comment.extract() for comment in comments]
 
        for script in mosoup("script"):
            mosoup.script.extract()
        for style in mosoup("style"):
            mosoup.style.extract()
        for iframe in mosoup("iframe"):
            mosoup.iframe.extract()
        for h4 in mosoup("h4"):
            mosoup.h4.extract()
        for h5 in mosoup("h5"):
            mosoup.h5.extract()
        for h2 in mosoup("h2"):
            mosoup.h2.extract()
        for a in mosoup("a"):
            mosoup.a.extract()
        for br in mosoup("br"):
            mosoup.br.extract()
        for img in mosoup("img"):
            mosoup.img.extract()
        for h1 in mosoup("h1"):
            mosoup.h1.extract()
        for h3 in mosoup("h3"):
            mosoup.h3.extract()
        for h2 in mosoup("h2"):
            mosoup.h2.extract()
 
        f = open('output.txt','a')
        f.write(str(mosoup))
        f.close()
 
    c = c + 1

Dave says: re.sub(regex, textlookingat, replaceitem)