Linear-time selection

Subpage of Algorithms

Recurrences, asymptotics, and design techniques.

We will look at two approaches. We start with expected \(O(n)\) time and then try to achieve worst-case \(O(n)\) time.

The problem a median-finding algorithm solves is the following:

Given an array \(A=[1,...,n]\) of \(n\) numbers and an index \(i\), where \(1≤i≤n\), find the \(ith\) smallest element of \(A\).

This problem can certainly be solved using a sorting algorithm to sort a list of numbers and return the value at the \(i^{th}\) index. However, many sorting algorithms can’t go faster than \(n \log n\) time. A median-finding algorithm can find the \(i^{th}\) smallest element in a list in \(O(n)\) time.

Quickselect

This algorithm uses partitioning but in a different way from quicksort.

It is still a recursive algorithm but it has fewer recursive calls than quicksort. Here is how it works:

  • Partition the array
  • If the pivot ends up at position k, you're done
  • If k is smaller, recurse on the left part
  • If k is larger, recurse on the right part

Unlike quicksort where you recurse on both sides, you only recurse on one side. In the best case, this makes the recurrence \(T(n) = T(n/2) + O(n)\), which solves to \(O(n)\).

The recursion tree can be seen as a geometric series which we can sum.

Note that this assumes that we are lucky and partition roughly in half. In reality, the size of the partition depends on the pivot element so the overall efficiency of the algorithm depends on how we choose the pivot element.

The Median-of-medians Algorithm

The median-of-medians algorithm is a deterministic linear-time selection algorithm.

The algorithm works by:

  1. Divide a list into sublists and then determine the approximate median in each of the sublists.
  2. Take those medians and puts them into a list and finds the median of that list. We use that median value as a pivot and compares other elements of the list against the pivot.
  3. If an element is less than the pivot value, the element is placed to the left of the pivot, and if the element has a value greater than the pivot, it is placed to the right.
  4. The algorithm recurses on the list, honing in on the value it is looking for.

This implementation works on lists that have unique elements (no repeated elements).

def median_of_medians(A, i):

    #divide A into sublists of len 5
    sublists = [A[j:j+5] for j in range(0, len(A), 5)]
    medians = [sorted(sublist)[len(sublist)/2] for sublist in sublists]
    if len(medians) <= 5:
        pivot = sorted(medians)[len(medians)/2]
    else:
        #the pivot is the median of the medians
        pivot = median_of_medians(medians, len(medians)/2)

    #partitioning step
    low = [j for j in A if j < pivot]
    high = [j for j in A if j > pivot]

    k = len(low)
    if i < k:
        return median_of_medians(low,i)
    elif i > k:
        return median_of_medians(high,i-k-1)
    else: #pivot = k
        return pivot

The time for dividing lists, finding the medians of the sublists, and partitioning takes \(T(n)=T(5n)+O(n)\) time, and with the recursion factored in, the overall recurrence to describe the median-of-medians algorithm is

\[T(n)≤T(\frac{n}{5})+T(\frac{7n}{10})+O(n)\]

The master theorem can be used to show that this recurrence equals \(O(n)\).

Why 5?

The median-of-medians divides a list into sublists of length five to get an optimal running time. Remember, finding the median of small lists by brute force (sorting) takes a small amount of time, so the length of the sublists must be fairly small.

If the algorithm divided the list into sublists of length three, \(p\) would be greater than approximately \(\frac{n}{3}\)elements and it would be smaller than approximately \(\frac{n}{3}\) elements. This would cause a worst case \(\frac{2n}{3}\) recursions, yielding the recurrence \(T(n)=T(\frac{n}{3})+T(\frac{2n}{3})+O(n)\) which by the master theorem is \(O(n \log⁡ n)\), which is slower than linear time.

The algorithm needs a sublist size greater than 5 but we need to keep the sublist size as small as we can so that sorting the sublists can be done in what is effectively constant time.