Tuesday, November 12, 2013

Cooking with Python: homemade alphabet soup

"Do something useful!" said the teacher in the old days.

A huge number of programming and software books have the word "Cookbook" in the title. You need a book to learn Microsoft Excel (tm)? Search for Excel Cookbook. A book about Java (tm)? Search for Java Cookbook.

The wide use of Cookbook has two interesting aspects. One, it becomes much harder to find a good cookbook that actually deals with preparing food. Two, just as with food cookbooks, you cannot be quite sure that the recipes work. Some programming manuals are like that food cookbook that makes you go out and buy a KitchenAid mixer, steadfastly refusing any other brand or the "deprecated" manual work with knives, spoons, cups and cutting boards.

We try to read a few of the large German online media sites every day, and we remembered an ancient school project for which we had to compare the headlines of several newspapers for a week or so and say something intelligent about the differences we observed.

Hey, this is the 21st century, so why not grab the headlines without lifting a single paper? We wrote a short Python program that fetches the headlines of Der Spiegel online for us. Do the same for several other sites, perform a bit of analysis, and you are golden.

The tools our recipe uses: Python, BeautifulSoup 4.3.2, and the requests library by Kenneth R. The requests library is not an absolute requirement, but we wanted to tell the visited website that we are Firefox, and to show that we like a great external library. BeautifulSoup is pretty much necessary, because otherwise you would have to handle "broken" web pages yourself, which is no fun. A "broken" web page is not to be confused with a kaput one. In general it means that some tag or bit of formatting is not quite right, which can trip up a parser. A bit like a forgotten full stop at the end of a sentence: it usually does not make the page unreadable, but it takes extra effort to figure things out.

If you have done any programming before, you won't be surprised that setting up these components took more time than writing our Spiegel visitor. In our case, BeautifulSoup was the tricky one. They are not kidding on the documentation page when they warn you that it was written for Python 2 and that there may be some issues.

Now, a few hours later, we run our program against the Spiegel site and get a list of headlines like this:

Öffnung in Richtung Linkspartei: Die Kehrtwende der SPD

Extremwetter-Index 2014: Die Hochrisikozonen der Erde

Sturmkatastrophe auf den Philippinen: Rebellen überfallen Hilfskonvoi

Transplantationen: Zahl der Organspender sinkt dramatisch

Bewertungssystem: Yahoo-Chefin Mayer knöpft sich "Minderleister" vor

...and so on.

Here is the code, and be grateful -- no more paper cuts:
# import the requests library by Kenneth R.,
# see http://docs.python-requests.org/en/latest/
# import BeautifulSoup v. 4


import requests
from bs4 import BeautifulSoup

# just wait a little, make us enter s + ENTER to start
while True:
    print('To start type s')
    n = input("?: ")
    if n == 's':
        break

# set two extra headers: a user-agent that does not announce
# us as Python, and an x-forwarded-for that masks the IP as localhost
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.2; \
           WOW64; rv:25.0) Gecko/20100101 Firefox/25.0',
           'x-forwarded-for': '127.0.0.1'}

# go visit Der Spiegel Online
r = requests.get('http://www.spiegel.de', headers=headers)
print(r.status_code)
# print our outgoing request headers to see if we still like them
print(r.request.headers)
print("do something useful")
print()

# grab the web page and close the request
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
r.close()

# get article headlines, they have them as a class 'article-title'
heads = soup.select('h2[class^="article-title"]')

for links in heads:
    # go down in the h2 snippet to get the attributes 'title' and 'href'
    attributes = links.next.next.attrs
    # ignore the href, just get the title
    print(attributes.get('title'))
    print()
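The "do the same for several others, perform a bit of analysis" part of the recipe could start like this rough sketch. Everything here is an assumption except the Spiegel selector from the program above: headline_word_counts is our own hypothetical helper, and every other site would need its own, verified selector.

```python
import collections
from bs4 import BeautifulSoup

def headline_word_counts(html, selector):
    """Count how often each word occurs in the selected headlines."""
    soup = BeautifulSoup(html, 'html.parser')
    counter = collections.Counter()
    for tag in soup.select(selector):
        counter.update(tag.get_text().split())
    return counter

# a made-up page standing in for requests.get(url).text
page = ('<h2 class="article-title">Wetter heute</h2>'
        '<h2 class="article-title">Wetter morgen</h2>')
print(headline_word_counts(page, 'h2[class^="article-title"]'))
```

Run this once per paper, keep the counters, and comparing the most common words across sites is the old school project, minus the paper cuts.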

