э ь е ь р ыр я рь п хж тжс вв р р жвь п сж ь яв

advertisement
Chapter 1
Exat Mathing:
Fundamental Preproessing
and First Algorithms
1.1
The Naive method
Almost all disussions of exat mathing begin with the Naive Method, and
we follow this tradition. The naive method aligns the left end of P with the
left end of T , then ompares the haraters of P and T left to right until
either two unequal haraters are found or until P is exhausted, in whih
ase an ourrene of P is reported. In either ase, P is then shifted one
plae to the right, and the omparisons are restarted from the left end of P .
This proess repeats until the right end of P shifts past the right end of T .
Using n to denote the length of P and m to denote the length of T ,
the worst ase number of omparisons made by this method is (nm). In
partiular, if both P and T onsist of the same repeated harater, then
there is an ourrene of P at eah of the rst m n + 1 positions of T
and the method performs exatly n(m n + 1) omparisons. For example if
P = aaa and T = aaaaaaaaaa then n = 3; m = 10 and 24 omparisons are
made.
The naive method is ertainly simple to understand and program, but
its worst ase running time of (nm) may be unsatisfatory and an be
improved. Even the pratial running time of the naive method may be
too slow for larger texts and patterns. Early on, there were several related
ideas to improve the naive method, both in pratie and in worst ase. The
result is that the O(n m) worst-ase bound an be redued to O(n + m).
Changing \" to \+" in the bound is extremely signiant (try n = 1000
and m = 10; 000; 000, whih are realisti numbers in some appliations.
1
2CHAPTER 1.
1.2
EXACT MATCHING: FUNDAMENTAL PREPROCESSING AND FIRST AL
The preproessing approah
Many string mathing and analysis algorithms are able to eÆiently skip
omparisons by rst spending \modest" time learning about the internal
struture of either the pattern P or the text T . During that time, the other
string may not even be known to the algorithm. This part of the overall
algorithm is alled the preproessing stage. Preproessing is followed by a
searh stage, where the information found during the preproessing stage
is used to redue the work done while searhing for ourrenes of P in
T . In the above example, the smarter method was assumed to know that
harater a did not our again until position 5, and the even smarter method
was assumed to know that the pattern abx was repeated again starting at
position 5. This assumed knowledge is obtained in the preproessing stage.
For the exat mathing problem, all of the algorithms mentioned in the
previous setion preproess pattern P . (The opposite approah of preproessing text T is used in other algorithms, suh as those based on suÆx
trees. Those methods will be explained later in the book.) These preproessing methods, as originally developed, are \similar in spirit" but often
quite dierent in detail and oneptual diÆulty. In this book we take a
dierent approah and do not initially explain the originally developed preproessing methods. Rather, we highlight the similarity of the preproessing
tasks needed for several dierent mathing algorithms, by rst dening a
fundamental preproessing of P that is independent of any partiular mathing algorithm. Then we show how eah spei mathing algorithm uses the
information omputed by the fundamental preproessing of P . The result
is a simpler more uniform exposition of the preproessing needed by several
lassial mathing methods, and a simple linear time algorithm for exat
mathing based only on this preproessing (disussed in Setion 1.5). This
approah to linear-time pattern mathing was developed in [?℄.
1.3
Fundamental preproessing of the pattern
Fundamental preproessing will be desribed for a general string denoted
In spei appliations of fundamental preproessing, S will often be the
pattern P , but here we use S instead of P beause fundamental preproessing
will also be applied to strings other than P .
The following denition gives the key values omputed during the fundamental preproessing of a string.
Denition Given a string S and a position i > 1, let Z (S ) be the length
of the longest substring of S that starts at i and mathes a prex of S .
In other words, Z (S ) is the length of the longest prex of S [i::s℄ whih
S.
i
i
1.3.
3
FUNDAMENTAL PREPROCESSING OF THE PATTERN
mathes a prex of S . For example, when S = aabaabxaaz then
Z (S ) = 3 (aab:::aabx:::);
Z (S ) = 1 (aa:::ab:::);
Z (S ) = Z (S ) = 0;
Z (S ) = 2 (aab:::aaz ):
When S is lear by ontext, we will use Z in plae of Z (S ).
To introdue the next onept, onsider the boxes drawn in Figure 1.1.
Eah box starts at some position j > 1 suh that Z is greater than zero.
The length of the box starting at j is meant to represent Z . Therefore, eah
box in the gure represents a maximal-length substring of S that mathes a
prex of S , and that doesn't start at position one. Eah suh box is alled
a Z-box. More formally,
Denition For any position i > 1 where Z is greater than zero, the
Z-box at i is dened as the interval starting at i and ending at position
i + Z 1.
Denition For every i > 1, r is the rightmost endpoint of the Z -boxes
that begin at or before position i. Another way to state this is: r is the
largest value of j + Z 1 over all 1 < j i suh that Z > 0. (See Figure
1.1.
We use the term l for the value of j speied in the above denition.
That is, l is the position of left end of Z-box that ends at r . In ase there is
more than one Z -box ending at r , then l an be hosen to be the left end
of any of those Z -boxes. As an example, suppose S = aabaabaxaabaaby.
Then Z = 7, r = 16 and l = 10.
5
6
7
8
9
i
i
j
j
i
i
i
i
j
j
i
i
i
i
10
15
i
15
α
S
α
Zl
l
i
i
i
Figure 1.1: Eah solid box represents a substring of S that mathes a prex
of S and that starts between positions 2 and i. Eah box is alled a Z -box.
We use r to denote the rightmost end of any Z -box that begins at or to the
left of position i, and to denote the substring in the Z -box ending at r .
Then l denotes the left end of . The opy of that ours as a prex of
S is also shown in the gure.
The linear time omputation of Z values from S is the fundamental
preproessing task that we will use in all the lassial linear-time mathing
i
i
i
ri
4CHAPTER 1.
EXACT MATCHING: FUNDAMENTAL PREPROCESSING AND FIRST AL
algorithms that preproess P . But before detailing those uses, we show how
to do the fundamental preproessing in linear time.
1.4
Fundamental preproessing in linear time
The task of this setion is to show how to ompute all the Z values for S in
linear time, i.e., in O(jS j) time. A diret approah based on the denition
would take (jS j ) time. The method we will present was developed in [?℄
for a dierent purpose.
The preproessing algorithm omputes Z ; r and l for eah suessive
position i, starting from i = 2. All the Z values omputed will be kept by
the algorithm, but in any iteration i, the algorithm only needs the r and l
values for j = i 1. No earlier r or l values are needed. Hene the algorithm
only uses a single variable, r, to refer to the most reently omputed r
value; similarly it only uses a single variable l. Therefore, in eah iteration
i, if the algorithm disovers a new Z -box (starting at i), variable r will be
inremented to end of that Z -box, whih is the rightmost position of any
Z -box disovered so far.
To begin, the algorithm nds Z by expliitly omparing, left to right,
the haraters of S [2::jS j℄ and S [1::jS j℄ until a mismath is found. Z is the
length of the mathing string. If Z > 0, then r = r is set to Z + 1 and
l = l is set to 2. Otherwise r and l are set to zero. Now assume indutively
that the algorithm has orretly omputed Z for i up to k 1 > 1, and
assume that the algorithm knows the urrent r = r and l = l . The
algorithm next omputes Z , r = r , and l = l .
The main idea is to use the already omputed Z values to aelerate
the omputation of Z . In fat, in some ases, Z an be dedued from the
previous Z values without doing any additional harater omparisons. As
a onrete example, suppose k = 121, all the values Z through Z have
already been omputed, and r = 130 and l = 100. That means that
there is a substring of length 31 that starts at position 100 and that mathes
a prex of S (of length 31). It follows that the substring of length 10 starting
at position 121 must math the substring of length 10 starting at position
22 of S , and so Z may be very helpful in omputing Z . As one ase, if
Z is three, say, then a little reasoning shows that Z must also be three.
So in this illustration, Z an be dedued without any additional harater
omparisons. This ase, along with the others, will be formalized and proven
orret below.
i
2
i
i
i
j
j
j
2
2
2
2
2
2
i
k
k
k
k
1
k
1
k
k
2
120
120
120
22
121
22
121
121
1.4.
5
FUNDAMENTAL PREPROCESSING IN LINEAR TIME
The Z Algorithm
Given Z for all 1 < i k 1 and the urrent values of r and l, Z and the
updated r and l are omputed as follows:
Begin
1. If k > r, then nd Z by expliitly omparing the haraters starting
at position k to the haraters starting at position 1 of S , until a mismath
is found. The length of the math is Z . If Z > 0, then set r to k + Z 1
and set l to k.
2. If k r, then position k is ontained in a Z -box, hene S (k) is
ontained substring S [l::r℄ (all it ) suh that l > 1 and mathes a prex
of S . Therefore harater S (k) also appears in position k0 = k l + 1 of S .
By the same reasoning, substring S [k::r℄ (all it ) must math substring
S [k0::Z ℄. It follows that the substring beginning at position k must math a
prex of S of length at least the minimum of Z and jj (whih is r k +1).
See Figure 1.2.
We onsider two subases based on what that minimum is.
2a. If Z < jj then Z = Z and r; l remain unhanged (see Figure
1.3).
2b. If Z jj then the entire substring S [k::r℄ must be a prex of S
and Z jj = r k + 1. However, Z might be stritly larger than jj,
so ompare the haraters starting at position r + 1 of S to the haraters
starting a position jj + 1 of S until a mismath ours. Say the mismath
ours at harater q r + 1. Then Z is set to q k, r is set to q 1 and
l is set to k (see Figure 1.4).
End
i
k
k
k
k
k
l
k0
k0
k
k0
k0
k
k
k
α
S
α
β
k’
Z
l
l
Figure 1.2: String S [k::r℄ is labeled and also ours starting at position k0
of S .
Theorem 1.4.1 Using Algorithm Z , value Zk is orretly omputed and
variables r and l are orretly updated.
In ase 1, Z is set orretly sine it is omputed by expliit
omparisons. Also (sine k > r in ase 1), before Z is omputed, no Z - box
has been found that starts between positions 2 and k 1 and that ends at or
after position k. Therefore when Z > 0 in ase 1, the algorithm does nd a
Proof
k
k
k
β
k
r
6CHAPTER 1.
S
α
γ
Z k’
EXACT MATCHING: FUNDAMENTAL PREPROCESSING AND FIRST AL
α
β
γ
k’
β
γ
k
l
r
k’+ Zk’ - 1
k+Zk - 1
Figure 1.3: Case 2a. The longest string starting at k0 that mathes a prex
of S is shorter than jj. In this ase, Z = Z :
k
β
S
α
k0
α
β
k’
k
l
k’ + Zk’ - 1
β
Figure 1.4: Case 2b. The longest string starting at k0 that mathes a prex
of S is at least jj.
new Z -box ending at or after k, and it is orret to hange r to k + Z 1.
Hene the algorithm works orretly in ase 1.
In ase 2a, the substring beginning at position k an math a prex of
S only for length Z < j j. If not, then the next harater to the right,
harater k + Z , must math harater 1 + Z . But harater k + Z
mathes harater k0 + Z (sine Z < jj) so harater k0 + Z must
math harater 1+ Z . But that would be a ontradition to the denition
of Z , for it would establish a substring longer than Z that starts at k0 and
mathes a prex of S . Hene Z = Z in this ase. Further, k + Z 1 < r,
so r and l remain orretly unhanged.
In ase 2b, must be a prex of S (as argued in the body of the algorithm) and sine any extension of this math is expliitly veried by omparing haraters beyond r to haraters beyond the prex , the full extent
of the math is orretly omputed. Hene Z is orretly obtained in this
ase. Furthermore, sine k + Z 1 r, the algorithm orretly hanges r
and l. 2
k
k0
k0
k0
k0
k0
k0
k0
k0
k0
k0
k
k0
k
k
k
Corollary 1.4.1 Repeating algorithm Z for eah position i >
yields all the Zi values.
2
orretly
Theorem 1.4.2 All the Zi (S ) values are omputed by the algorithm in O(jS j)
time.
The time is proportional to the number of iterations, jS j, plus the
number of harater omparisons. Eah omparison results in either a math
Proof
?
r
1.5. THE SIMPLEST LINEAR-TIME EXACT MATCHING ALGORITHM
or a mismath, so we next bound the number of mathes and mismathes
that an our.
Eah iteration that does any harater omparisons at all ends the rst
time it nds a mismath, hene there are at most jS j mismathes during the
entire algorithm. To bound the number of mathes, note rst that r r
for every iteration k. Now, let k be an iteration where q > 0 mathes our.
Then r is set to r + q at least. Finally, r jS j, so the total number of
mathes that our during any exeution of the algorithm is at most jS j. 2
k
k
1.5
k
1
k
1
k
The simplest linear-time exat mathing algorithm
Before disussing the more omplex (lassial) exat mathing methods, we
show that fundamental preproessing alone provides a simple linear time exat mathing algorithm. This is the simplest linear-time mathing algorithm
we know of.
Let S = P $T be the string onsisting of P followed by the symbol \$"
followed by T , where \$" is a harater appearing in neither P nor T . Reall
that P has length n and T has length m, and n m. So, S = P $T has
length n + m + 1 = O(m). Compute Z (S ) for i from 1 to n + m + 1. Sine
\$" does not appear in P or T , Z n for every i. Any value of i > n + 1
suh that Z (S ) = n identies an ourrene of P in T starting at position
i (n +1) of T . Conversely, if P ours in T starting at position j of T , then
Z
must be equal to n. Sine all the Z (S ) values an be omputed in
O(n + m) = O(m) time, this approah identies all the ourrenes of P in
T in O(m) time.
The method an be implemented to use only O(n) spae (in addition
to the spae needed for pattern and text) independent of the size of the
alphabet. Sine Z n for all i, position k0 (determined in step 2) will
always fall inside P . Therefore there is no need to reord the Z values for
haraters in T . Instead, we only need to reord the Z values for the n
haraters in P , and also maintain the urrent l and r. Those values are
suÆient to ompute (but not store) the Z value of eah harater in T and
hene to identify and output any position i where Z = n.
There is another harateristi of this method that is worth introduing
here. The method is onsidered an alphabet independent linear time method.
That is, we never had to assume that the alphabet size was nite, or that
we knew the alphabet ahead of time { a harater omparison only determines whether the two haraters math or mismath, it needs no further
information about the alphabet. We will see that this harateristi is also
true of the Knuth-Morris-Pratt and Boyer-Moore algorithms, but not of the
Aho-Corasik algorithm or methods based on suÆx trees.
i
i
i
(n+1)+j
i
i
i
7
Download