Lecture 9: Data Wrangling

This is a guest lecture and post by John Myles White.

In this lecture, we talked about methods for getting data. We ranked methods in terms of their ease of use. For example, the easiest method was getting a bulk download of an entire data set. We noted that there are several clearinghouses of data that link to many publicly available data sets, including data from Wikipedia, IMDB, Last.fm, and others. When these bulk downloads are not available, we noted that many web sites (e.g., NYTimes, Twitter, Google, etc.) offer API access with which you can download chunks of data at a time and slowly accumulate a large body of data. When even this is not possible, we noted that one can scrape data from sites so long as the Terms of Service allow automated access to the site using tools such as BeautifulSoup or Nokogiri.

In the second half of the lecture, we discussed how to work with the data that one acquires from websites using any of the three methods above. This data is often structured in formats like JSON and XML that must be parsed by the user using formal parsing libraries available in many popular languages like Python. Sometimes the data is in an unstructured format in which we simply want to extract basic information like phone numbers: we described the use of regular expressions as a mechanism for extracting this information. We worked through an extended example of building a regular expression that would match phone numbers.