So far in the course we learned that careful counting is behind many social science results. But how hard is it to count? For example, suppose we have two (unsorted) lists of numbers, each with 100 million unique items. How long will it take to find the size of the intersection of these two sets? The most straightforward idea is to check every element in set one against every element in set two. This is simple and correct, however the number of comparisons required grows as the product of the size of two sets. Thus for two lists of 100M numbers each, even if we could do 1B comparisons per second, the calculation will take about 10 million seconds, or more than three months!
The reason this approach is infeasible is that the underlying algorithm is quadratic—if the size of each of the sets were to triple, we would now need to perform nine (three squared) times as many comparisons—and for large datasets quadratic algorithms don’t scale. Instead, we often look for linear time algorithms. For such approaches if the size of the input were to triple, the running time would increase by a factor of three as well. In the worst case, we consider algorithms with running time slightly above linear, for example just sorting a list of 100M items takes about a billion comparisons. More generally for an input of size n, sorting the input takes approximately n log n comparisons. Although this is more than linear—tripling the input increases the running time by more than a factor of three, it is only slightly more so.
Coming back to the intersection problem, there are many possible algorithms for finding the intersection of the two lists. To beat the quadratic time barrier it helps to preprocess the input before doing the comparisons. Without preprocessing, the only way to check whether an item from the first list appears in the second list is to scan and compare it to all of the elements, which leads to the quadratic time algorithm. Suppose instead, that such a membership test (testing whether a specific item is in the list) were faster, taking only constant time, then the overall algorithm would run in linear time! A data structure known as a hashtable does exactly this—it’s a way of preprocessing the input to answer only membership queries, but to answer them very quickly, on average in constant time.
In the second lecture we saw that in addition to the overall running time, an important consideration is the skew in the data. We already saw how the skew makes some statistics (for example the average) somewhat misleading when describing a dataset. For example, in real world networks, some nodes are going to have much higher degrees (e.g. friends or followers) than others. Taking this skew into account is crucial, especially when parallelizing the computation across many machines, as naive parallelizations may result in many machines finishing very quickly, in a matter of seconds, whereas the last few take hours to run.
We saw this demonstrated with triangle counting algorithms. The most straightforward approaches run in time quadratic in the degree of each node, which is fine for the majority of the nodes, but causes problems for the very high degree nodes (quadratic algorithms can’t handle very large instances). There are many ways to rectify this, for example treating the very high degree nodes differently by splitting them smaller subsets, and carefully adjusting the computation. An alternative approach is to assign each triangle to be counted from the perspective of the lowest degree node. Although this approach is also quadratic, it is only quadratic in the number of neighbors who have even higher degree, which makes for much more balanced computation.