Chapter 1 Exat Mathing: Fundamental Preproessing and First Algorithms 1.1 The Naive method Almost all disussions of exat mathing begin with the Naive Method, and we follow this tradition. The naive method aligns the left end of P with the left end of T , then ompares the haraters of P and T left to right until either two unequal haraters are found or until P is exhausted, in whih ase an ourrene of P is reported. In either ase, P is then shifted one plae to the right, and the omparisons are restarted from the left end of P . This proess repeats until the right end of P shifts past the right end of T . Using n to denote the length of P and m to denote the length of T , the worst ase number of omparisons made by this method is (nm). In partiular, if both P and T onsist of the same repeated harater, then there is an ourrene of P at eah of the rst m n + 1 positions of T and the method performs exatly n(m n + 1) omparisons. For example if P = aaa and T = aaaaaaaaaa then n = 3; m = 10 and 24 omparisons are made. The naive method is ertainly simple to understand and program, but its worst ase running time of (nm) may be unsatisfatory and an be improved. Even the pratial running time of the naive method may be too slow for larger texts and patterns. Early on, there were several related ideas to improve the naive method, both in pratie and in worst ase. The result is that the O(n m) worst-ase bound an be redued to O(n + m). Changing \" to \+" in the bound is extremely signiant (try n = 1000 and m = 10; 000; 000, whih are realisti numbers in some appliations. 1 2CHAPTER 1. 1.2 EXACT MATCHING: FUNDAMENTAL PREPROCESSING AND FIRST AL The preproessing approah Many string mathing and analysis algorithms are able to eÆiently skip omparisons by rst spending \modest" time learning about the internal struture of either the pattern P or the text T . During that time, the other string may not even be known to the algorithm. This part of the overall algorithm is alled the preproessing stage. Preproessing is followed by a searh stage, where the information found during the preproessing stage is used to redue the work done while searhing for ourrenes of P in T . In the above example, the smarter method was assumed to know that harater a did not our again until position 5, and the even smarter method was assumed to know that the pattern abx was repeated again starting at position 5. This assumed knowledge is obtained in the preproessing stage. For the exat mathing problem, all of the algorithms mentioned in the previous setion preproess pattern P . (The opposite approah of preproessing text T is used in other algorithms, suh as those based on suÆx trees. Those methods will be explained later in the book.) These preproessing methods, as originally developed, are \similar in spirit" but often quite dierent in detail and oneptual diÆulty. In this book we take a dierent approah and do not initially explain the originally developed preproessing methods. Rather, we highlight the similarity of the preproessing tasks needed for several dierent mathing algorithms, by rst dening a fundamental preproessing of P that is independent of any partiular mathing algorithm. Then we show how eah spei mathing algorithm uses the information omputed by the fundamental preproessing of P . The result is a simpler more uniform exposition of the preproessing needed by several lassial mathing methods, and a simple linear time algorithm for exat mathing based only on this preproessing (disussed in Setion 1.5). This approah to linear-time pattern mathing was developed in [?℄. 1.3 Fundamental preproessing of the pattern Fundamental preproessing will be desribed for a general string denoted In spei appliations of fundamental preproessing, S will often be the pattern P , but here we use S instead of P beause fundamental preproessing will also be applied to strings other than P . The following denition gives the key values omputed during the fundamental preproessing of a string. Denition Given a string S and a position i > 1, let Z (S ) be the length of the longest substring of S that starts at i and mathes a prex of S . In other words, Z (S ) is the length of the longest prex of S [i::s℄ whih S. i i 1.3. 3 FUNDAMENTAL PREPROCESSING OF THE PATTERN mathes a prex of S . For example, when S = aabaabxaaz then Z (S ) = 3 (aab:::aabx:::); Z (S ) = 1 (aa:::ab:::); Z (S ) = Z (S ) = 0; Z (S ) = 2 (aab:::aaz ): When S is lear by ontext, we will use Z in plae of Z (S ). To introdue the next onept, onsider the boxes drawn in Figure 1.1. Eah box starts at some position j > 1 suh that Z is greater than zero. The length of the box starting at j is meant to represent Z . Therefore, eah box in the gure represents a maximal-length substring of S that mathes a prex of S , and that doesn't start at position one. Eah suh box is alled a Z-box. More formally, Denition For any position i > 1 where Z is greater than zero, the Z-box at i is dened as the interval starting at i and ending at position i + Z 1. Denition For every i > 1, r is the rightmost endpoint of the Z -boxes that begin at or before position i. Another way to state this is: r is the largest value of j + Z 1 over all 1 < j i suh that Z > 0. (See Figure 1.1. We use the term l for the value of j speied in the above denition. That is, l is the position of left end of Z-box that ends at r . In ase there is more than one Z -box ending at r , then l an be hosen to be the left end of any of those Z -boxes. As an example, suppose S = aabaabaxaabaaby. Then Z = 7, r = 16 and l = 10. 5 6 7 8 9 i i j j i i i i j j i i i i 10 15 i 15 α S α Zl l i i i Figure 1.1: Eah solid box represents a substring of S that mathes a prex of S and that starts between positions 2 and i. Eah box is alled a Z -box. We use r to denote the rightmost end of any Z -box that begins at or to the left of position i, and to denote the substring in the Z -box ending at r . Then l denotes the left end of . The opy of that ours as a prex of S is also shown in the gure. The linear time omputation of Z values from S is the fundamental preproessing task that we will use in all the lassial linear-time mathing i i i ri 4CHAPTER 1. EXACT MATCHING: FUNDAMENTAL PREPROCESSING AND FIRST AL algorithms that preproess P . But before detailing those uses, we show how to do the fundamental preproessing in linear time. 1.4 Fundamental preproessing in linear time The task of this setion is to show how to ompute all the Z values for S in linear time, i.e., in O(jS j) time. A diret approah based on the denition would take (jS j ) time. The method we will present was developed in [?℄ for a dierent purpose. The preproessing algorithm omputes Z ; r and l for eah suessive position i, starting from i = 2. All the Z values omputed will be kept by the algorithm, but in any iteration i, the algorithm only needs the r and l values for j = i 1. No earlier r or l values are needed. Hene the algorithm only uses a single variable, r, to refer to the most reently omputed r value; similarly it only uses a single variable l. Therefore, in eah iteration i, if the algorithm disovers a new Z -box (starting at i), variable r will be inremented to end of that Z -box, whih is the rightmost position of any Z -box disovered so far. To begin, the algorithm nds Z by expliitly omparing, left to right, the haraters of S [2::jS j℄ and S [1::jS j℄ until a mismath is found. Z is the length of the mathing string. If Z > 0, then r = r is set to Z + 1 and l = l is set to 2. Otherwise r and l are set to zero. Now assume indutively that the algorithm has orretly omputed Z for i up to k 1 > 1, and assume that the algorithm knows the urrent r = r and l = l . The algorithm next omputes Z , r = r , and l = l . The main idea is to use the already omputed Z values to aelerate the omputation of Z . In fat, in some ases, Z an be dedued from the previous Z values without doing any additional harater omparisons. As a onrete example, suppose k = 121, all the values Z through Z have already been omputed, and r = 130 and l = 100. That means that there is a substring of length 31 that starts at position 100 and that mathes a prex of S (of length 31). It follows that the substring of length 10 starting at position 121 must math the substring of length 10 starting at position 22 of S , and so Z may be very helpful in omputing Z . As one ase, if Z is three, say, then a little reasoning shows that Z must also be three. So in this illustration, Z an be dedued without any additional harater omparisons. This ase, along with the others, will be formalized and proven orret below. i 2 i i i j j j 2 2 2 2 2 2 i k k k k 1 k 1 k k 2 120 120 120 22 121 22 121 121 1.4. 5 FUNDAMENTAL PREPROCESSING IN LINEAR TIME The Z Algorithm Given Z for all 1 < i k 1 and the urrent values of r and l, Z and the updated r and l are omputed as follows: Begin 1. If k > r, then nd Z by expliitly omparing the haraters starting at position k to the haraters starting at position 1 of S , until a mismath is found. The length of the math is Z . If Z > 0, then set r to k + Z 1 and set l to k. 2. If k r, then position k is ontained in a Z -box, hene S (k) is ontained substring S [l::r℄ (all it ) suh that l > 1 and mathes a prex of S . Therefore harater S (k) also appears in position k0 = k l + 1 of S . By the same reasoning, substring S [k::r℄ (all it ) must math substring S [k0::Z ℄. It follows that the substring beginning at position k must math a prex of S of length at least the minimum of Z and jj (whih is r k +1). See Figure 1.2. We onsider two subases based on what that minimum is. 2a. If Z < jj then Z = Z and r; l remain unhanged (see Figure 1.3). 2b. If Z jj then the entire substring S [k::r℄ must be a prex of S and Z jj = r k + 1. However, Z might be stritly larger than jj, so ompare the haraters starting at position r + 1 of S to the haraters starting a position jj + 1 of S until a mismath ours. Say the mismath ours at harater q r + 1. Then Z is set to q k, r is set to q 1 and l is set to k (see Figure 1.4). End i k k k k k l k0 k0 k k0 k0 k k k α S α β k’ Z l l Figure 1.2: String S [k::r℄ is labeled and also ours starting at position k0 of S . Theorem 1.4.1 Using Algorithm Z , value Zk is orretly omputed and variables r and l are orretly updated. In ase 1, Z is set orretly sine it is omputed by expliit omparisons. Also (sine k > r in ase 1), before Z is omputed, no Z - box has been found that starts between positions 2 and k 1 and that ends at or after position k. Therefore when Z > 0 in ase 1, the algorithm does nd a Proof k k k β k r 6CHAPTER 1. S α γ Z k’ EXACT MATCHING: FUNDAMENTAL PREPROCESSING AND FIRST AL α β γ k’ β γ k l r k’+ Zk’ - 1 k+Zk - 1 Figure 1.3: Case 2a. The longest string starting at k0 that mathes a prex of S is shorter than jj. In this ase, Z = Z : k β S α k0 α β k’ k l k’ + Zk’ - 1 β Figure 1.4: Case 2b. The longest string starting at k0 that mathes a prex of S is at least jj. new Z -box ending at or after k, and it is orret to hange r to k + Z 1. Hene the algorithm works orretly in ase 1. In ase 2a, the substring beginning at position k an math a prex of S only for length Z < j j. If not, then the next harater to the right, harater k + Z , must math harater 1 + Z . But harater k + Z mathes harater k0 + Z (sine Z < jj) so harater k0 + Z must math harater 1+ Z . But that would be a ontradition to the denition of Z , for it would establish a substring longer than Z that starts at k0 and mathes a prex of S . Hene Z = Z in this ase. Further, k + Z 1 < r, so r and l remain orretly unhanged. In ase 2b, must be a prex of S (as argued in the body of the algorithm) and sine any extension of this math is expliitly veried by omparing haraters beyond r to haraters beyond the prex , the full extent of the math is orretly omputed. Hene Z is orretly obtained in this ase. Furthermore, sine k + Z 1 r, the algorithm orretly hanges r and l. 2 k k0 k0 k0 k0 k0 k0 k0 k0 k0 k0 k k0 k k k Corollary 1.4.1 Repeating algorithm Z for eah position i > yields all the Zi values. 2 orretly Theorem 1.4.2 All the Zi (S ) values are omputed by the algorithm in O(jS j) time. The time is proportional to the number of iterations, jS j, plus the number of harater omparisons. Eah omparison results in either a math Proof ? r 1.5. THE SIMPLEST LINEAR-TIME EXACT MATCHING ALGORITHM or a mismath, so we next bound the number of mathes and mismathes that an our. Eah iteration that does any harater omparisons at all ends the rst time it nds a mismath, hene there are at most jS j mismathes during the entire algorithm. To bound the number of mathes, note rst that r r for every iteration k. Now, let k be an iteration where q > 0 mathes our. Then r is set to r + q at least. Finally, r jS j, so the total number of mathes that our during any exeution of the algorithm is at most jS j. 2 k k 1.5 k 1 k 1 k The simplest linear-time exat mathing algorithm Before disussing the more omplex (lassial) exat mathing methods, we show that fundamental preproessing alone provides a simple linear time exat mathing algorithm. This is the simplest linear-time mathing algorithm we know of. Let S = P $T be the string onsisting of P followed by the symbol \$" followed by T , where \$" is a harater appearing in neither P nor T . Reall that P has length n and T has length m, and n m. So, S = P $T has length n + m + 1 = O(m). Compute Z (S ) for i from 1 to n + m + 1. Sine \$" does not appear in P or T , Z n for every i. Any value of i > n + 1 suh that Z (S ) = n identies an ourrene of P in T starting at position i (n +1) of T . Conversely, if P ours in T starting at position j of T , then Z must be equal to n. Sine all the Z (S ) values an be omputed in O(n + m) = O(m) time, this approah identies all the ourrenes of P in T in O(m) time. The method an be implemented to use only O(n) spae (in addition to the spae needed for pattern and text) independent of the size of the alphabet. Sine Z n for all i, position k0 (determined in step 2) will always fall inside P . Therefore there is no need to reord the Z values for haraters in T . Instead, we only need to reord the Z values for the n haraters in P , and also maintain the urrent l and r. Those values are suÆient to ompute (but not store) the Z value of eah harater in T and hene to identify and output any position i where Z = n. There is another harateristi of this method that is worth introduing here. The method is onsidered an alphabet independent linear time method. That is, we never had to assume that the alphabet size was nite, or that we knew the alphabet ahead of time { a harater omparison only determines whether the two haraters math or mismath, it needs no further information about the alphabet. We will see that this harateristi is also true of the Knuth-Morris-Pratt and Boyer-Moore algorithms, but not of the Aho-Corasik algorithm or methods based on suÆx trees. i i i (n+1)+j i i i 7