Advanced Dynamic Programming 👹

Subpage of Algorithms

Recurrences, asymptotics, and design techniques.

Here are some advanced DP concepts that haven’t been covered in the introduction page. We seek to figure out how we can make out DP solutions work better.

Going Faster

There are some techniques we can use to make our DP solution better. We will consider some strategies here.

Change the Recurrence

The idea is to come up with a recurrence where the number of sub-problems is minimized.

The idea is often to let the recurrence capture multiple references to sub-problems instead of hard-coding them. Refer to the Unbounded Knapsack Problem for an example: Going Faster.

To be able to do this, you need to grasp the problem at a high level. Once you understand the structure of the problem, see what can be changed and what is fixed. This will bring you towards a solution.

Another way to see how we can change the recurrence may come from examining the solution space. Then we determine what the most efficient way to traverse the solution space is.

Here are some questions you can ask to come up with improvements:

  • Ask whether the state is too detailed.
  • Ask whether two dimensions can be collapsed into one.
  • Ask whether the transition can be re-expressed as a window, a monotone choice, a line, or a divide-and-conquer split.
  • Ask whether a hidden invariant lets you avoid checking every previous state.

Here is how we can address these issues once we have identified them:

Compress State Space

If there is information that is redundant, we can capture it using one variable instead of needing two. To apply this, identify dependent variables that can be derived or aggregated; replace them with smaller descriptors (e.g., parity, min/max, counts).

If we cannot reduce the number of variables at least we can reduce their ranges. Here is an example:

Consider this problem:

You have n items, each with a weight w[i] and value v[i]. Your knapsack has capacity W. Maximize total value without exceeding W. Constraints: n ≤ 100, v[i] ≤ 100, but W ≤ 10^9.

The standard formulation is:

dp[i][w] = maximum value using the first i items, with weight capacity w

However, when w is big, we can have up to 10^11 states. It is not possible to store so much so we need a better solution. But paying attention to the problem, we can get a better approach. We know that the total possible value is relatively small (n * max(v)) so we instead formulate the state as:

dp[i][v] = minimum weight needed to achieve exactly value v using the first i items

Prefix‑sums

The core idea is that DP transitions often contain a range aggregate — a sum, min, or max over a subarray — that gets recomputed from scratch for every (i, j) pair.

If you precompute the answer to all such range queries in one O(n) pass, each transition drops from O(n) to O(1), collapsing the overall complexity by a full factor of n.

Consider this problem:

You have an array A[1..n]. Partition it into any number of contiguous subarrays. The cost of a subarray [l..r] is the square of its element sum. Minimize total cost. Constraints: n ≤ 5000.

The naive approach would be as follows:

dp[i] = minimum cost to partition A[1..i]

dp[i] = min over j from 0 to i-1 of:
            dp[j] + (A[j+1] + A[j+2] + ... + A[i])²

However, this is an \(O(n^3)\) algorithm because one state takes \(O(n^2)\) time to compute. There is a simple way to speed things up. We pre-compute the prefix sum array P and then we can write the recursive step as:

dp[i] = min over j from 0 to i-1 of:
            dp[j] + (P[i] - P[j])²

This leads to a \(O(n^2)\) solution.

Divide‑and‑conquer optimization

If the argmin/argmax for DP[i] is nondecreasing in i, you can either move a pointer (monotone queue) or apply divide‑and‑conquer to compute ranges using narrowed candidate intervals.

If this is satisfied, we first prove monotonicity (often via convexity or quadrangle inequality), then implement D&C recursion that queries only the reduced range. Consider the following example:

You have an array of n elements. Split it into exactly k contiguous subarrays. The cost of a subarray of length L is . Minimize total cost. Constraints: n ≤ 10⁵, k ≤ 100.

This problem can be mathematically solved but for the sake of example, consider the following naive DP solution:

dp[t][i] = min cost to split A[1..i] into exactly t subarrays

dp[t][i] = min over j from t-1 to i-1 of:
                dp[t-1][j] + (i - j)²

This problem has \(O(nk)\) sub-problems and takes \(O(n)\) time per sub-problem. This leads to a \(O(n^2 k)\) algorithm.

Define opt[t][i] as the index j that achieves the minimum for dp[t][i]. The key observation for our cost function (i - j)² is:

opt[t][i] ≤ opt[t][i+1]     for all i

The optimal split point is monotone non-decreasing. This means: if the best place to split before position 5 is at index 3, the best place to split before position 6 is at index 3 or later — never earlier.

We exploit this with divide-and-conquer over the states:

def solve(lo, hi, opt_lo, opt_hi):
    if lo > hi: return
    mid = (lo + hi) // 2

    # Find opt[mid] by scanning only [opt_lo, opt_hi]
    best_j, best_val = opt_lo, infinity
    for j in range(opt_lo, min(opt_hi, mid - 1) + 1):
        val = dp_prev[j] + (mid - j) ** 2
        if val < best_val:
            best_val, best_j = val, j
    dp_curr[mid] = best_val

    # Left half: opt is in [opt_lo, best_j]
    solve(lo, mid - 1, opt_lo, best_j)
    # Right half: opt is in [best_j, opt_hi]
    solve(mid + 1, hi, best_j, opt_hi)

At each recursion level, the scanned ranges across all calls partition [0, n] — so each level costs O(n) total. With O(log n) levels, each layer of the DP costs O(n log n) and the overall algorithm takes O(kn log n) time.

In general, this optimization can be applied when we have a recurrence of the form:

\[dp(i, j) = \min_{0 \leq k \leq j} { dp(i - 1, k - 1) + C(k, j) }\]

where the cost function satisfies the quadrangle inequality:

\[C(a,c) + C(b,d) ≥ C(a,d) + C(b,c) \text{ for } a ≤ b ≤ c ≤ d\]

Many natural cost functions (squared lengths, absolute differences, certain weighted sums) satisfy this.

Knuth optimization

For interval DP with quadrangle inequality and monotone optima, restrict search for optimal split k of interval [i,j] to [opt[i][j−1], opt[i+1][j]].

Here is an trivial example on a toy example (but the method can be used on problems like optimal binary search tree or certain matrix chain variants to cut search drastically):

You have n piles of stones in a row with sizes A[1..n]. You can only merge adjacent piles. The cost of any merge is the total size of stones in the two piles being merged. Find the minimum total cost to merge everything into one pile. Constraints: n ≤ 5000.

The naive approach would formulate the state and recurrence as follows:

dp[i][j] = minimum cost to merge piles i through j into one pile

Base case:
    dp[i][i] = 0

Transition (try every split point k):
    dp[i][j] = min over k from i to j-1 of:
                   dp[i][k] + dp[k+1][j] + sum(A[i..j])

(sum(A[i..j]) is paid every time we do a final merge of the two halves)

This yields a formulation with \(O(n^2)\) states and \(O(n)\) time per state (if we use prefix sums to calculate the sum) to yield a \(O(n^3)\) algorithm.

Similar to the divide and conquer optimization, define opt[i][j] as the split point k achieving the minimum for dp[i][j]. The cost function C(i, j) = P[j] - P[i-1] satisfies the quadrangle inequality.

This inequality, when combined with the DP structure, gives Knuth's key theorem:

\[opt[i][j-1] ≤ opt[i][j] ≤ opt[i+1][j]\]

The optimal split for [i, j] is bounded below by the optimal split for [i, j-1] and above by the optimal split for [i+1, j]. The implementation change is minimal — just restrict the scan:

# Fill by increasing interval length
for length in range(2, n + 1):
    for i in range(1, n - length + 2):
        j = i + length - 1
        dp[i][j] = infinity

        # KEY: restrict k to [opt[i][j-1], opt[i+1][j]]
        for k in range(opt[i][j-1], opt[i+1][j] + 1):
            val = dp[i][k] + dp[k+1][j] + P[j] - P[i-1]
            if val < dp[i][j]:
                dp[i][j] = val
                opt[i][j] = k

Fix a diagonal (all intervals of the same length L). The ranges [opt[i][j-1], opt[i+1][j]] for adjacent intervals on that diagonal telescope — consecutive intervals share a bound, so the total work across the entire diagonal is O(n). Summing over all O(n) diagonals gives O(n²) total.

Convex hull trick

When transitions are of form dp[i]=min⁡j(mjxi+bj), maintain lower/upper envelope of lines and query by xi. Use deque for monotone slopes or balanced tree for arbitrary slopes.

Use for linearized cost functions to get O(log⁡n) or amortized O(1) per query.

Changing how you Compute Values

This method relies on observations that we make when we look at the state table. We observe patterns in how the values are updated and accordingly come up with an efficient way to fill up the state table.

The approach used to achieve the efficiency gain may be non-trivial and unique to a certain set of problems.

A good example is the Bounded Knapsack Problem: Going Faster

Here are some questions you can ask to come up with improvements:

  • Ask whether you really need the whole DP table.
  • Ask whether you can compute states in a different order.
  • Ask whether repeated subexpressions can be cached or precomputed.
  • Ask whether a data structure can maintain candidates faster than rescanning them.

Here is how you can implement some of these optimizations:

Computing States in a Different Order

By identifying structure in how states relate to one another, you can often find an order where each state is computed exactly once from already-known neighbours, eliminating all redundant recomputation.

Consider the problem:

Given an unweighted tree of n nodes, find for every node v the sum of distances from v to all other nodes. Constraints: n ≤ 10⁵.

The naive solution would be to run BFS/DFS starting from each node then add up the distances. This is a \(O(n^2)\) algorithm. We can do better.

The key observation is: if you already know ans[v] for a parent, you can derive ans[child] in O(1) by reasoning about what changes when you shift the root downward by one edge.

Pass 1 — Bottom-up (leaves → root): Fix an arbitrary root (say, node 0). Compute two values per node via a single post-order DFS:

sz[v]   = number of nodes in v's subtree (including v)
down[v] = sum of distances from v to every node in its subtree

Base case (leaf):
    sz[v] = 1,  down[v] = 0

Transition (v with children c₁, c₂, ...):
    sz[v]   = 1 + Σ sz[c]
    down[v] = Σ (down[c] + sz[c])
              ↑ each node in c's subtree is 1 edge farther from v than from c

After Pass 1: ans[root] = down[root] — the root sees all nodes below it.

Pass 2 — Top-down (root → leaves): Now propagate ans downward. When shifting the root from v to a child c:

  • The sz[c] nodes inside c's subtree each get 1 closer to the new root.
  • The n - sz[c] nodes outside c's subtree each get 1 farther.
ans[root] = down[root]

For each node v processed top-down, for each child c:
    ans[c] = ans[v] - sz[c] + (n - sz[c])

Each node is visited exactly once in each pass.

Total cost: O(n) for Pass 1 + O(n) for Pass 2 = O(n).

The general idea is to find one "canonical" perspective and compute it fully. Then, ask what changes by exactly O(1) when you shift perspective by one step? Express the new answer as a correction to the old one. Propagating corrections this way leads to \(O(n)\) computation.

This pattern appears in rerooting DPs on trees, sliding window DPs (shifting a window one step), and contribution reversal (computing the effect of adding/removing one element from a DP without rerunning it).

Using a better data structure

The core idea is that a DP transition of the form dp[i] = best over some set of previous dp[j] values doesn't require scanning all candidates naively. If the set of relevant j's has exploitable structure — being ordered by value, indexed by position, or satisfying a range constraint — then the right data structure can answer each transition query in O(log n) instead of O(n), reducing the whole DP by a logarithmic factor.

A simple example is the Longest Increase Subsequence (LIS) problem:

Given an array A[1..n], find the length of its longest strictly increasing subsequence. Constraints: n ≤ 10⁵.

The naive approach that we have seen before is \(O(n^2)\) but we can do better.

First, re-frame what the recurrence does. It asks "what is the maximum dp[j] among all j where A[j] < A[i]?".

This is a prefix maximum query over the value domain — exactly what a Binary Indexed Tree (BIT) supporting point-update and prefix-max answers in O(log n). The BIT stores, at each value v, the best dp value seen so far for any element with compressed value v. query_max(x) returns the maximum over positions [1..x].

Each of the n elements triggers one O(log n) query and one O(log n) update. This leads to a \(O(n \log n)\) algorithm.

Here is a list of data structures we could use for different contexts:

  • j < i, A[j] < A[i] → prefix-max on value domain → BIT
  • j in [i-k, i-1] → sliding window max → Monotone deque
  • j < i, arbitrary order → range max over indices → Segment tree
  • dp[j] + linear cost → convex hull of linear funcs → CHT / Li Chao tree

DP + bitmask