KMP vs Naive String Matching

A benchmark comparing the naive (brute force) and Knuth-Morris-Pratt (KMP) string searching algorithms across several scenarios to show when and why KMP wins.

Running it

```
pip install matplotlib
python benchmark.py
```

Results are printed to the terminal and also written to results.md, which you can regenerate any time by re-running the script.

What is string pattern matching?

Given a text of length n and a pattern of length m, find every position in the text where the pattern appears.

```
text:    ABABCABABAB
pattern: ABAB
matches: [0, 5, 7]
```

Used in text editors (Ctrl+F), DNA sequence analysis, log searching, network packet inspection, compilers, and basically anywhere you need to find one string inside another.
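In Python, finding every (possibly overlapping) match can be sketched with the built-in `str.find`. This is just an illustration of the problem statement; the benchmark implements its own algorithms rather than using `str.find`:

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every index where pattern occurs in text, overlaps included."""
    matches = []
    i = text.find(pattern)
    while i != -1:
        matches.append(i)
        i = text.find(pattern, i + 1)  # resume one past the last hit
    return matches

print(find_all("ABABCABABAB", "ABAB"))  # [0, 5, 7]
```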

The naive algorithm

Slide the pattern along the text one position at a time. At each position compare characters one by one. If anything mismatches, give up and try the next position.

```
text:    A B A B C ...
pattern: A B A B D
         v v v v x  -> mismatch, restart at position 1
```

Complexity: O(n x m)

The problem is backtracking. When a mismatch happens after several characters have already matched, the algorithm throws away everything it learned and starts over. In the worst case (searching for AAAAB in AAAAAAA...) it scans nearly the whole pattern at every single position.
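A minimal sketch of the naive scan (illustrative, not necessarily the exact code in benchmark.py):

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Try every alignment; compare character by character at each one."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):            # slide the pattern one step at a time
        for j in range(m):
            if text[i + j] != pattern[j]:
                break                     # mismatch: discard progress, shift by 1
        else:
            matches.append(i)             # all m characters matched
    return matches

print(naive_search("ABABCABABAB", "ABAB"))  # [0, 5, 7]
```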

KMP

KMP (Knuth-Morris-Pratt, 1977) avoids redundant comparisons by precomputing a jump table from the pattern before searching. When a mismatch happens, instead of resetting to position 0, it jumps back to the furthest point where a valid partial match could still exist. The text pointer never moves backwards.

Complexity: O(n + m)

The LPS table

The jump table is called the LPS table (Longest Proper Prefix which is also a Suffix). For each position i in the pattern, lps[i] stores the length of the longest proper prefix of pattern[0..i] that is also a suffix.

```
pattern:  A  B  A  B  C
lps:      0  0  1  2  0
```

- lps[2] = 1 because "A" is both a prefix and suffix of "ABA"
- lps[3] = 2 because "AB" is both a prefix and suffix of "ABAB"

When a mismatch occurs at pattern position j, KMP sets j = lps[j-1] and tries again without re-reading any text characters.
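Putting the two pieces together, here is a compact sketch of the table construction and the search (a hypothetical implementation of the standard algorithm, not necessarily what benchmark.py does):

```python
def build_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0                              # length of the current matched prefix
    for i in range(1, len(pattern)):
        while length and pattern[i] != pattern[length]:
            length = lps[length - 1]        # fall back to a shorter prefix
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length
    return lps

def kmp_search(text: str, pattern: str) -> list[int]:
    lps, matches, j = build_lps(pattern), [], 0
    for i, c in enumerate(text):            # the text pointer i never moves backwards
        while j and c != pattern[j]:
            j = lps[j - 1]                  # reuse the partial match instead of restarting
        if c == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]                  # keep going to find overlapping matches
    return matches

print(build_lps("ABABC"))                   # [0, 0, 1, 2, 0]
print(kmp_search("ABABCABABAB", "ABAB"))    # [0, 5, 7]
```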

How the test data is generated

Most scenarios use randomly generated text and patterns. Characters are sampled uniformly at random from the relevant alphabet (lowercase letters for the general cases, ACGT for the DNA scenario). Lengths range from 1,000 to 100,000 characters to show how both algorithms scale.

The script calls random.seed(42) at startup, so the generated inputs are identical on every run.

Timings are harder to control: they depend on your machine, CPU load, caching, and whatever else is running at the time. The script takes the best of 3 runs per test to reduce noise, but some variance between machines is normal.
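The best-of-N idea can be sketched like this (illustrative only; benchmark.py's actual timing harness may differ):

```python
import time

def best_of(fn, runs=3):
    """Time fn() several times and keep the fastest run to reduce noise."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a throwaway workload.
elapsed = best_of(lambda: sum(range(100_000)))
```

Taking the minimum (rather than the mean) is a common choice because interference from the OS can only ever make a run slower, never faster.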

Scenario 2 is the exception and doesn't use random text at all. It uses a constructed AAAA...B string specifically designed to maximise naive backtracking. This represents any situation where the text and pattern share a long common prefix (repetitive binary data, run-length encoded sequences, certain file formats, etc.). It's included because random inputs over a 26-character alphabet actually flatter the naive algorithm: mismatches happen so early that it rarely has to backtrack much. The constructed case forces genuine worst-case behaviour.
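The constructed input can be reproduced in a couple of lines (a sketch of the idea; the function name is hypothetical, not taken from benchmark.py):

```python
def worst_case_input(n: int, m: int) -> tuple[str, str]:
    """All-'A' text plus a pattern that mismatches only on its final character."""
    return "A" * n, "A" * (m - 1) + "B"

text, pattern = worst_case_input(1_000, 50)
# Naive compares nearly all 50 pattern characters at each of the
# ~1,000 alignments before failing, so its work grows as O(n x m);
# KMP's text pointer sweeps forward exactly once.
```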

Results

Full timing tables are in results.md. The graphs are regenerated every time you run benchmark.py and saved to produced_graphs/.

Speedup across all tests

*(graph: speedup summary)*

Almost every bar sits near zero except for the three in the middle; those are the worst-case scenario tests. The fact that the worst-case bars tower over everything else shows just how input-dependent the performance difference is. On typical random text the two algorithms are close; on adversarial input, naive falls apart completely.

Scenario 1: random text, random pattern

*(graph: scenario 1)*

With a 26-character alphabet, the chance of a partial match at any given position is low, so naive rarely has to backtrack far before giving up and moving on. The speedup is 1.0x at small sizes and grows to 1.4x at 100,000 characters; the gap widens slightly with text length but stays modest. This is the most common real-world case, and neither algorithm struggles here.

Scenario 2: worst case for naive (AAAA...B)

*(graph: scenario 2)*

This is where the theory becomes undeniable. At text=1,000 with pattern=50 naive is already 14.8x slower. At text=5,000 with pattern=250 it is 80.7x slower. At text=10,000 with pattern=500 the gap reaches 131.5x. The KMP bar is barely visible at the bottom of the chart because it processes each character once and moves on. Naive has to scan nearly the entire pattern at every single position in the text before failing, so its time grows as O(n x m) while KMP stays O(n + m).

Scenario 3: DNA-style sequences (ACGT only)

*(graph: scenario 3)*

With only 4 characters to choose from, the text and pattern share partial matches more often, which means naive backtracks more before finding a mismatch. The speedup is 1.1x at small sizes and reaches 1.2x at 100,000 characters. The gap is more consistent here than in scenario 1 because the small alphabet keeps causing partial matches regardless of scale.

Scenario 4: varying pattern length (text = 50k)

*(graph: scenario 4)*

With the text fixed at 50,000 characters and the pattern length varying from 5 to 500, the speedup actually decreases as the pattern gets longer (1.4x and 1.5x for short patterns, dropping to 1.0x for longer ones). Short patterns cause more comparison attempts per character of text, which is where KMP saves the most work. Longer patterns with a large alphabet mismatch so quickly that naive barely backtracks at all, closing the gap.

Complexity comparison

|       | Worst case | Average  | Space |
|-------|------------|----------|-------|
| Naive | O(n x m)   | O(n)     | O(1)  |
| KMP   | O(n + m)   | O(n + m) | O(m)  |

Naive's average case is actually O(n) with a large random alphabet because early mismatches are common. KMP's O(n + m) is a worst-case guarantee that holds regardless of input. That's the real difference: average vs guaranteed performance.

Files

```
benchmark.py        - both algorithms + benchmark runner; writes results.md and produced_graphs/
results.md          - timing output and graphs from the last run (auto-generated)
produced_graphs/    - graphs shown in README.md
graphs/             - png charts generated by benchmark.py
README.md           - this file
```
