String Matching
- Problem definition:
- Find a Pattern (length m) within a Text (length n)
- Can find first match or all occurrences
- An important problem area:
- Editors
- Language String API
- DNA Sequence Matching
- Many interesting algorithms - generally use preprocessing
- Typical Algorithm Design Technique - Transform and Conquer:
- Build tables
- Tables use information based on incorrect matches
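As a baseline for the problem defined above, before any tables are built, a brute-force matcher can be sketched as follows. This is a minimal illustration, not one of the named algorithms; the function names `brute_force_all` and `brute_force_first` are made up for this example.

```python
def brute_force_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # every possible alignment
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # extend the match one character
        if j == m:                      # all m characters matched
            matches.append(i)
    return matches


def brute_force_first(text, pattern):
    """Return the index of the first occurrence, or -1 if there is none."""
    found = brute_force_all(text, pattern)
    return found[0] if found else -1


print(brute_force_all("abracadabra", "abra"))   # [0, 7]
print(brute_force_first("abracadabra", "cad"))  # 4
```

Note that no information is reused after a mismatch: the pattern always slides by exactly one position, which is what the table-based algorithms improve on.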
Finite State Automata
- Finite State Automata (early 1970s)
- Create an FSA that recognizes the pattern and run it with the text as input
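A minimal sketch of the automaton idea, assuming a non-empty pattern and the straightforward (slow) table construction: state q means "the last q characters read match the first q characters of the pattern". The names `build_automaton` and `fsa_search` are hypothetical.

```python
def build_automaton(pattern):
    """Transition table: delta[q][a] = next state from state q on character a.

    Built naively: the next state is the longest prefix of the pattern
    that is a suffix of (pattern[:q] + a).
    """
    m = len(pattern)
    alphabet = set(pattern)             # other characters always lead to state 0
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            seen = pattern[:q] + a
            k = min(m, q + 1)
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def fsa_search(text, pattern):
    """Feed the text through the automaton; report each time the final state is hit."""
    delta = build_automaton(pattern)
    m = len(pattern)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        state = delta[state].get(ch, 0)  # characters not in the pattern reset to 0
        if state == m:
            matches.append(i - m + 1)
    return matches


print(fsa_search("abababa", "aba"))  # [0, 2, 4]
```

Once the table exists, each text character costs one table lookup, so the scan itself never backs up in the text.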
Various Algorithms
- Knuth-Morris-Pratt (1977)
- Discovered by Knuth and Pratt and, independently, by Morris
- Published jointly
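The slides give no details of KMP itself, so the following is only a sketch of the usual textbook formulation, included as an example of a table built from mismatch information. The names `failure_table` and `kmp_search` are assumptions for this example.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]             # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return all match positions; the text index never moves backwards."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0                               # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # reuse what the mismatch tells us
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]             # continue looking for overlapping matches
    return matches


print(failure_table("ababaca"))        # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("abababa", "aba"))    # [0, 2, 4]
```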
Rabin-Karp
- Rabin-Karp (1987)
- If text and pattern are digits
- Convert pattern into an m-digit number
- Convert first m digits of text into a number (Horner's
method) and compare with pattern
- If no match, then slide one digit to right and repeat
- Sliding one digit right is fast
- Use modulo arithmetic to keep numbers small
- If numbers are equal, then verify the pattern against the text character by character (with modulo arithmetic, equal numbers do not guarantee a match)
- If characters are not digits, consider them to be digits in
some appropriate base
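The steps above can be sketched as follows. The base (256) and the modulus (1,000,003, a prime) are arbitrary choices made for this example, and the function name `rabin_karp` is hypothetical; characters are treated as digits via `ord`.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_003):
    """Return all match positions using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)    # weight of the leading digit

    # Horner's method: hash the pattern and the first m characters of the text.
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus

    matches = []
    for s in range(n - m + 1):
        # Equal hashes may be a false match, so verify character by character.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Slide one position right: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % modulus
    return matches


print(rabin_karp("3141592653589793", "26"))      # [6]
print(rabin_karp("abababa", "aba", modulus=3))   # [0, 2, 4]
```

The second call uses a deliberately tiny modulus so that many windows collide; the verification step keeps the result correct, at the cost of extra comparisons.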
Boyer-Moore-Horspool
- Boyer-Moore-Horspool (1980)
- Variant of Boyer-Moore with one table
- Algorithm: while not matched: shift the pattern to its next occurrence of T_e (the text character aligned with the end of the pattern)
- See text
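A minimal sketch of Horspool's single-table idea, assuming the table stores, for each character among the first m-1 pattern characters, its distance from the rightmost occurrence to the end of the pattern (any other character shifts by m). The names `horspool_table` and `horspool_search` are hypothetical.

```python
def horspool_table(pattern):
    """Shift for each character in pattern[:-1]; characters absent from
    the table shift by the full pattern length."""
    m = len(pattern)
    table = {}
    for i in range(m - 1):              # later occurrences overwrite earlier ones
        table[pattern[i]] = m - 1 - i
    return table


def horspool_search(text, pattern):
    """Return all match positions."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    table = horspool_table(pattern)
    matches = []
    i = m - 1                           # text index aligned with the pattern's last character
    while i < n:
        k = 0                           # compare right to left
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
        i += table.get(text[i], m)      # shift depends only on text[i]
    return matches


print(horspool_table("BARBER"))
# {'B': 2, 'A': 4, 'R': 3, 'E': 1}
print(horspool_search("JIM_SAW_ME_IN_A_BARBERSHOP", "BARBER"))  # [16]
```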
Think about Horspool
- What other shift could be used?
- while not matched: Shift to next occurrence of ???
- Could we combine them?
Boyer-Moore (1977)
- Preprocess and make 2 tables
- Process text from left to right, process pattern from right to left
- On a matching suffix of length k, shift max of these 2:
- H(x) - k where H is Horspool table, x is first unmatched character from right
- Value from one of these two cases:
- Distance to first earlier substring that is identical
to matching suffix, but preceded by different character
- Distance to longest suffix that's also a prefix
- This is the entire length, if there is no such suffix
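The two-table rule above can be sketched as follows. This is a simplified illustration with a naive (not linear-time) construction of the second table; the names `bad_symbol_table`, `good_suffix_table`, and `boyer_moore_search` are hypothetical, and shifting by one after a full match is a simplification chosen here so that all occurrences are reported.

```python
def bad_symbol_table(pattern):
    """Horspool-style table H: distance from the rightmost occurrence
    (among the first m-1 characters) to the end of the pattern."""
    m = len(pattern)
    return {pattern[i]: m - 1 - i for i in range(m - 1)}


def good_suffix_table(pattern):
    """d2[k] = shift after matching a suffix of length k (1 <= k < m)."""
    m = len(pattern)
    d2 = [0] * m                        # index 0 is unused
    for k in range(1, m):
        suffix = pattern[m - k:]
        before = pattern[m - k - 1]     # character preceding the matched suffix
        shift = None
        # Case 1: nearest earlier copy of the suffix preceded by a different
        # character (or by nothing, at the start of the pattern).
        for start in range(m - k - 1, -1, -1):
            if pattern[start:start + k] == suffix and (
                    start == 0 or pattern[start - 1] != before):
                shift = (m - k) - start
                break
        if shift is None:
            # Case 2: longest shorter suffix that is also a prefix;
            # the entire length m if there is none.
            shift = m
            for l in range(k - 1, 0, -1):
                if pattern[:l] == pattern[m - l:]:
                    shift = m - l
                    break
        d2[k] = shift
    return d2


def boyer_moore_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = bad_symbol_table(pattern)
    d2 = good_suffix_table(pattern)
    matches = []
    i = m - 1                           # text index aligned with the pattern's end
    while i < n:
        k = 0                           # scan the pattern right to left
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            i += 1                      # simplification: minimal safe shift
        else:
            x = text[i - k]             # first unmatched character from the right
            d1 = max(h.get(x, m) - k, 1)
            i += d1 if k == 0 else max(d1, d2[k])
    return matches


print(good_suffix_table("BAOBAB"))                               # [0, 2, 5, 5, 5, 5]
print(boyer_moore_search("BESS_KNEW_ABOUT_BAOBABS", "BAOBAB"))   # [16]
```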
Performance
- Best/Worst?
- Brute Force: O(...)
- Finite automata: Θ(n) matching [plus O(m^3 * number of alphabet symbols) preprocess, if the table is built naively]
- KMP: Θ(n) [plus Θ(m) preprocess]
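- Rabin-Karp: O(n + m) expected [if number of occurrences is small]; O(mn) worst case
- Boyer-Moore: at most 3n comparisons in the worst case [when the pattern does not occur in the text]
- Boyer-Moore-Horspool: O(mn) worst case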
Other Notes
- Best algorithm depends on relative values of m and n and the
alphabet size (what are alphabet sizes for characters? For DNA?)
and on expected number of matches
- A linear algorithm with O(1) extra space exists (Galil and
Seiferas, 1983)
- Someday I should add notes on the metaphor discussion of KMP