Lectures 7 & 8: Complexity of Counting

So far in the course we have learned that careful counting is behind many social science results. But how hard is it to count? For example, suppose we have two (unsorted) lists of numbers, each with 100 million unique items. How long will it take to find the size of the intersection of these two lists? The most straightforward idea is to check every element in list one against every element in list two. This is simple and correct; however, the number of comparisons required grows as the product of the sizes of the two lists. Thus for two lists of 100M numbers each, even if we could do 1B comparisons per second, the calculation would take about 10 million seconds, or more than three months!
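To make the cost concrete, here is a minimal Python sketch of the brute-force approach (the function name and the tiny example lists are just for illustration; the lists discussed in lecture have 100M items):

    # Brute-force intersection: compare every item of one list against
    # every item of the other. For lists of sizes n and m this performs
    # up to n * m comparisons, which is what makes it quadratic.
    def intersection_size_naive(list_a, list_b):
        count = 0
        for x in list_a:
            for y in list_b:
                if x == y:
                    count += 1
                    break  # items are unique, so we can stop after a match
        return count

    # Tiny example; the same code on two 100M-item lists would need on the
    # order of 10^16 comparisons (about 10 million seconds at 10^9 per second).
    print(intersection_size_naive([1, 5, 9, 12], [5, 12, 20]))  # prints 2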

The reason this approach is infeasible is that the underlying algorithm is quadratic: if the size of each set were to triple, we would need to perform nine (three squared) times as many comparisons, and for large datasets quadratic algorithms don't scale. Instead, we often look for linear time algorithms, for which tripling the size of the input increases the running time by only a factor of three as well. At worst, we settle for algorithms whose running time is slightly above linear; for example, just sorting a list of 100M items takes a few billion comparisons. More generally, sorting an input of size n takes approximately n log n comparisons. Although this is more than linear (tripling the input increases the running time by more than a factor of three), it is only slightly more so.
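As an illustration of this slightly-above-linear regime, one alternative (not necessarily the approach emphasized in lecture) is to sort both lists and then count matches in a single merged pass; a rough Python sketch, with an illustrative function name, might look like:

    # Sort both lists (roughly n log n comparisons each), then count common
    # items with a single linear merge-style pass over the sorted copies.
    def intersection_size_sorted(list_a, list_b):
        a, b = sorted(list_a), sorted(list_b)
        i = j = count = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                count += 1
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return count

    print(intersection_size_sorted([9, 1, 12, 5], [20, 5, 12]))  # prints 2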

Coming back to the intersection problem, there are many possible algorithms for finding the intersection of the two lists. To beat the quadratic time barrier it helps to preprocess the input before doing the comparisons. Without preprocessing, the only way to check whether an item from the first list appears in the second list is to scan the second list and compare the item against all of its elements, which leads to the quadratic time algorithm. Suppose instead that such a membership test (checking whether a specific item is in the list) took only constant time; then the overall algorithm would run in linear time! A data structure known as a hashtable does exactly this: it is a way of preprocessing the input to answer only membership queries, but to answer them very quickly, in constant time on average.
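Here is a minimal Python sketch of this idea, using Python's built-in set as the hashtable (the function name and example data are illustrative):

    # Build a hashtable (Python's set) from one list, then test membership
    # for every item of the other. Each membership test takes constant time
    # on average, so the whole computation is linear in the size of the input.
    def intersection_size_hashed(list_a, list_b):
        seen = set(list_a)                          # one pass over list_a
        return sum(1 for x in list_b if x in seen)  # one pass over list_b

    print(intersection_size_hashed([9, 1, 12, 5], [20, 5, 12]))  # prints 2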

In the second of these lectures we saw that, in addition to the overall running time, an important consideration is the skew in the data. We have already seen how skew makes some statistics (for example, the average) misleading when describing a dataset. In real-world networks, for instance, some nodes have much higher degrees (e.g., numbers of friends or followers) than others. Taking this skew into account is crucial, especially when parallelizing the computation across many machines: naive parallelizations may result in most machines finishing very quickly, in a matter of seconds, while the last few take hours to run.
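A toy Python sketch of why skew causes stragglers, assuming (as in naive triangle counting) that the work at a node grows with the square of its degree; the degree sequence here is made up purely for illustration:

    # Suppose the work at each node is proportional to the square of its
    # degree. With 999 low-degree nodes and a single hub, a naive
    # round-robin split across 10 machines leaves one machine doing
    # essentially all of the work.
    degrees = [10] * 999 + [100_000]        # made-up skewed degree sequence
    work = [d * d for d in degrees]

    machines = [0] * 10
    for i, w in enumerate(work):            # naive round-robin assignment
        machines[i % 10] += w

    print(max(machines) / sum(work))        # busiest machine does >99% of the work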

We saw this demonstrated with triangle counting algorithms. The most straightforward approaches run in time quadratic in the degree of each node, which is fine for the majority of the nodes but causes problems for the very high degree nodes (quadratic algorithms can't handle very large instances). There are many ways to rectify this, for example treating the very high degree nodes differently by splitting them into smaller subsets and carefully adjusting the computation. An alternative approach is to assign each triangle to be counted from the perspective of its lowest degree node. Although this approach is also quadratic, it is quadratic only in the number of neighbors that have even higher degree, which makes for a much more balanced computation.
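Here is a rough Python sketch of the second idea, counting each triangle at its lowest degree vertex (ties broken by node id); the function name and the toy graph are illustrative, not taken from the lecture:

    from itertools import combinations

    # Count each triangle exactly once, from the perspective of its lowest
    # degree vertex. A node only pairs up neighbors that rank higher than
    # itself, which keeps per-node work balanced even when degrees are skewed.
    def count_triangles(adj):
        # adj: dict mapping each node to the set of its neighbors
        def rank(v):
            return (len(adj[v]), v)         # order by degree, break ties by id

        count = 0
        for v in adj:
            higher = [u for u in adj[v] if rank(u) > rank(v)]
            for u, w in combinations(higher, 2):
                if w in adj[u]:             # is (u, w) also an edge?
                    count += 1
        return count

    # A 4-clique contains exactly 4 triangles.
    adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
    print(count_triangles(adj))             # prints 4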

Lecture 7:

Lecture 8:

Lectures 5 & 6: Networks

These two lectures introduced network theory, following the first part of “Networks, Crowds, and Markets” by Easley & Kleinberg. We discussed basic representations of relational data, basic algorithms for computations on graphs, and common features of real-world networks.

Slides are below. See "Predicting Individual Behavior with Social Networks" and "The Structure of Online Diffusion Networks" for more details of the case studies we covered, and the references that follow for information on using Pig and Amazon's computing services.

References:

  • Hadoop is an open source framework for scalable storage and computing, including a distributed filesystem and scalable implementation of MapReduce.
  • Pig is a high-level language that converts sequences of common data analysis operations (e.g., group-bys, joins, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster.
  • Pig is easy to download and try on Linux, Mac OS X, or Windows (requires Java). See RELEASE_NOTES.txt for information on trying Pig in local mode.
  • The data types in Pig are slightly different from those in most languages, and include bags, tuples, and fields.
  • The diagnostic operators in Pig are useful for understanding code and debugging. In particular, the DUMP operator shows the contents of relations and the ILLUSTRATE operator shows how relations are transformed through a Pig program.
  • Programming Pig is a freely available book with an accompanying GitHub page of examples.
  • In addition to running Hadoop and Pig locally, Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) provide pay-per-use computation and storage that can be used to construct rentable clusters for large-scale data analysis.
  • Elastic MapReduce (EMR) facilitates this process, allowing one to easily submit MapReduce jobs without needing to configure low-level cluster details.
  • Amazon’s getting started guide walks through the details of creating an account and launching/logging into a simple Linux instance. Be sure to create a keypair for this last step with the AWS console. You can use the ssh command to log in to machines on Linux, Mac OS X, or in Cygwin on Windows; alternatively, you can use the AWS console or PuTTY to access remote machines.
  • The process for running a Hadoop/Pig job is similar, using the EMR Console. You can alternatively select to run an interactive session and log into the cluster as opposed to submitting just a single job. More details on interactive sessions and individual job flows are available through Amazon’s tutorial articles.
  • Amazon also provides a number of public data sets for analysis on EMR, including a copy of the Common Crawl Corpus of several billion web pages.