Sorting and Searching Algorithms:
A Cookbook
Thomas Niemann
Preface
This is a collection of algorithms for sorting and searching. Descriptions are brief and intuitive,
with just enough theory thrown in to make you nervous. I assume you know C, and that you are
familiar with concepts such as arrays and pointers.
The first section introduces basic data structures and notation. The next section presents
several sorting algorithms. This is followed by techniques for implementing dictionaries,
structures that allow efficient search, insert, and delete operations. The last section illustrates
algorithms that sort data and implement dictionaries for very large files. Source code for each
algorithm, in ANSI C, is available at the site listed below.
Permission to reproduce this document, in whole or in part, is given provided the original
web site listed below is referenced, and no additional restrictions apply. Source code, when part
of a software project, may be used freely without reference to the author.
THOMAS NIEMANN
Portland, Oregon
email: [email protected]
home: http://members.xoom.com/thomasn/s_man.htm
By the same author:
A Guide to Lex and Yacc, at http://members.xoom.com/thomasn/y_man.htm.
CONTENTS
1. INTRODUCTION
2. SORTING
   2.1 Insertion Sort
   2.2 Shell Sort
   2.3 Quicksort
   2.4 Comparison
3. DICTIONARIES
   3.1 Hash Tables
   3.2 Binary Search Trees
   3.3 Red-Black Trees
   3.4 Skip Lists
   3.5 Comparison
4. VERY LARGE FILES
   4.1 External Sorting
   4.2 B-Trees
5. BIBLIOGRAPHY
1. Introduction
Arrays and linked lists are two basic data structures used to store information. We may wish to
search, insert or delete records in a database based on a key value. This section examines the
performance of these operations on arrays and linked lists.
Arrays
Figure 1-1 shows an array, seven elements long, containing numeric values. To search the array
sequentially, we may use the algorithm in Figure 1-2. The maximum number of comparisons is
7, and occurs when the key we are searching for is in A[6].
index   value
  0       4     ← Lb
  1       7
  2      16
  3      20     ← M
  4      37
  5      38
  6      43     ← Ub
Figure 1-1: An Array
int function SequentialSearch (Array A, int Lb, int Ub, int Key);
begin
    for i = Lb to Ub do
        if A[i] = Key then
            return i;
    return -1;
end;
Figure 1-2: Sequential Search
int function BinarySearch (Array A, int Lb, int Ub, int Key);
begin
    do forever
        M = (Lb + Ub) / 2;
        if (Key < A[M]) then
            Ub = M - 1;
        else if (Key > A[M]) then
            Lb = M + 1;
        else
            return M;
        if (Lb > Ub) then
            return -1;
end;
Figure 1-3: Binary Search
If the data is sorted, a binary search may be done (Figure 1-3). Variables Lb and Ub keep
track of the lower bound and upper bound of the array, respectively. We begin by examining the
middle element of the array. If the key we are searching for is less than the middle element, then
it must reside in the lower half of the array. Thus, we set Ub to (M - 1). This restricts our next
iteration through the loop to the lower half of the array. In this way, each iteration halves the size
of the array to be searched. For example, the first iteration will leave at most 3 items to test; after
the second iteration, at most one item is left. Therefore it takes only three iterations to find
any number.
This is a powerful method. Given an array of 1,023 elements, we can narrow the search to
511 elements with one comparison. After another comparison we're looking at only 255
elements. In fact, we can search the entire array in only 10 comparisons.
In addition to searching, we may wish to insert or delete entries. Unfortunately, an array is
not a good arrangement for these operations. For example, to insert the number 18 in Figure 1-1,
we would need to shift A[3]…A[6] down by one slot. Then we could copy number 18 into A[3].
A similar problem arises when deleting numbers. To improve the efficiency of insert and delete
operations, linked lists may be used.
Linked Lists
      X → [18]
[4] → [7] → [16] → [20] → [37] → [38] → [43] → #
             ↑
             P
Figure 1-4: A Linked List
In Figure 1-4 we have the same values stored in a linked list. Assuming pointers X and P, as
shown in the figure, value 18 may be inserted as follows:
X->Next = P->Next;
P->Next = X;
Insertion and deletion operations are very efficient using linked lists. You may be wondering
how pointer P was set in the first place. Well, we had to do a sequential search to find the
insertion point. Although we improved our performance for insertion/deletion, it was done at
the expense of search time.
Timing Estimates
Several methods may be used to compare the performance of algorithms. One way is simply to
run several tests for each algorithm and compare the timings. Another way is to estimate the
time required. For example, we may state that search time is O(n) (big-oh of n). This means that
search time, for large n, is proportional to the number of items n in the list. Consequently, we
would expect search time to triple if our list increased in size by a factor of three. The big-O
notation does not describe the exact time that an algorithm takes, but only indicates an upper
bound on execution time within a constant factor. If an algorithm takes O(n^2) time, then
execution time grows no worse than the square of the size of the list.
         n    lg n        n lg n         n^1.25                  n^2
         1       0             0              1                    1
        16       4            64             32                  256
       256       8         2,048          1,024               65,536
     4,096      12        49,152         32,768           16,777,216
    65,536      16     1,048,576      1,048,576        4,294,967,296
 1,048,576      20    20,971,520     33,554,432    1,099,511,627,776
16,777,216      24   402,653,184  1,073,741,824  281,474,976,710,656
Table 1-1: Growth Rates
Table 1-1 illustrates growth rates for various functions. A growth rate of O(lg n) occurs for
algorithms similar to the binary search. The lg (logarithm, base 2) function increases by one
when n is doubled. Recall that we can search twice as many items with one more comparison in
the binary search. Thus the binary search is an O(lg n) algorithm.
If the values in Table 1-1 represented microseconds, then an O(lg n) algorithm may take 20
microseconds to process 1,048,576 items, an O(n^1.25) algorithm might take 33 seconds, and an
O(n^2) algorithm might take up to 12 days! In the following chapters a timing estimate for each
algorithm, using big-O notation, will be included. For a more formal derivation of these
formulas you may wish to consult the references.
Summary
As we have seen, sorted arrays may be searched efficiently using a binary search. However, we
must have a sorted array to start with. In the next section various ways to sort arrays will be
examined. It turns out that this is computationally expensive, and considerable research has been
done to make sorting algorithms as efficient as possible.
Linked lists improved the efficiency of insert and delete operations, but searches were
sequential and time-consuming. Algorithms exist that do all three operations efficiently, and
they will be discussed in the section on dictionaries.
2. Sorting
Several algorithms are presented, including insertion sort, shell sort, and quicksort. Sorting by
insertion is the simplest method, and doesn’t require any additional storage. Shell sort is a
simple modification that improves performance significantly. Probably the most efficient and
popular method is quicksort; it is the method of choice for large arrays.
2.1 Insertion Sort
One of the simplest methods to sort an array is an insertion sort. An example of an insertion sort
occurs in everyday life while playing cards. To sort the cards in your hand you extract a card,
shift the remaining cards, and then insert the extracted card in the correct place. This process is
repeated until all the cards are in the correct sequence. Both average and worst-case time is
O(n^2). For further reading, consult Knuth [1998].
Theory
Starting near the top of the array in Figure 2-1(a), we extract the 3. Then the above elements are
shifted down until we find the correct place to insert the 3. This process repeats in Figure 2-1(b)
with the next number. Finally, in Figure 2-1(c), we complete the sort by inserting 2 in the
correct place.
(a) extract 3; shift 4 down; insert 3:        4 3 1 2  →  3 4 1 2
(b) extract 1; shift 4 and 3 down; insert 1:  3 4 1 2  →  1 3 4 2
(c) extract 2; shift 4 and 3 down; insert 2:  1 3 4 2  →  1 2 3 4
Figure 2-1: Insertion Sort
Assuming there are n elements in the array, we must index through n - 1 entries. For each
entry, we may need to examine and shift up to n - 1 other entries, resulting in an O(n^2) algorithm.
The insertion sort is an in-place sort: the array is sorted in place, and no extra memory is
required. The insertion sort is also a stable sort. Stable sorts retain the original ordering of keys
when identical keys are present in the input data.
Implementation
Source for the insertion sort algorithm may be found in file ins.c. Typedef T and comparison
operator compGT should be altered to reflect the data stored in the table.
2.2 Shell Sort
Shell sort, developed by Donald L. Shell, is a non-stable in-place sort. Shell sort improves on
the efficiency of insertion sort by quickly shifting values to their destination. Average sort time
is O(n^1.25), while worst-case time is O(n^1.5). For further reading, consult Knuth [1998].
Theory
In Figure 2-2(a) we have an example of sorting by insertion. First we extract 1, shift 3 and 5
down one slot, and then insert the 1, for a count of 2 shifts. In the next frame, two shifts are
required before we can insert the 2. The process continues until the last frame, where a total of 2
+ 2 + 1 = 5 shifts have been made.
In Figure 2-2(b) an example of shell sort is illustrated. We begin by doing an insertion sort
using a spacing of two. In the first frame we examine numbers 3-1. Extracting 1, we shift 3
down one slot for a shift count of 1. Next we examine numbers 5-2. We extract 2, shift 5 down,
and then insert 2. After sorting with a spacing of two, a final pass is made with a spacing of one.
This is simply the traditional insertion sort. The total shift count using shell sort is 1+1+1 = 3.
By using an initial spacing larger than one, we were able to quickly shift values to their proper
destination.
(a) insertion sort (2 + 2 + 1 = 5 shifts):

        3    1    1    1
        5    3    2    2
        1    5    3    3
        2    2    5    4
        4    4    4    5
            2s   2s   1s

(b) shell sort, spacing 2 then spacing 1 (1 + 1 + 1 = 3 shifts):

        3    1    1    1
        5    5    2    2
        1    3    3    3
        2    2    5    4
        4    4    4    5
            1s   1s   1s

Figure 2-2: Shell Sort
Various spacings may be used to implement shell sort. Typically the array is sorted with a
large spacing, the spacing reduced, and the array sorted again. On the final sort, spacing is one.
Although the shell sort is easy to comprehend, formal analysis is difficult. In particular, optimal
spacing values elude theoreticians. Knuth has experimented with several values and recommends
that spacing h for an array of size N be based on the following formula:
Let h_1 = 1, h_(s+1) = 3·h_s + 1, and stop with h_t when h_(t+2) ≥ N
Thus, values of h are computed as follows:
h_1 = 1
h_2 = (3 × 1) + 1 = 4
h_3 = (3 × 4) + 1 = 13
h_4 = (3 × 13) + 1 = 40
h_5 = (3 × 40) + 1 = 121
To sort 100 items we first find the smallest h_s such that h_s ≥ 100. For 100 items, h_5 is
selected. Our final value (h_t) is two steps lower, or h_3. Therefore our sequence of h values will
be 13-4-1. Once the initial h value has been determined, subsequent values may be calculated
using the formula
h_(s-1) = ⌊h_s / 3⌋
Implementation
Source for the shell sort algorithm may be found in file shl.c. Typedef T and comparison
operator compGT should be altered to reflect the data stored in the array. The central portion of
the algorithm is an insertion sort with a spacing of h.
2.3 Quicksort
Although the shell sort algorithm is significantly better than insertion sort, there is still room for
improvement. One of the most popular sorting algorithms is quicksort. Quicksort executes in
O(n lg n) on average, and O(n^2) in the worst case. However, with proper precautions, worst-case
behavior is very unlikely. Quicksort is a non-stable sort. It is not an in-place sort, as stack space
is required. For further reading, consult Cormen [1990].
Theory
The quicksort algorithm works by partitioning the array to be sorted, then recursively sorting
each partition. In Partition (Figure 2-3), one of the array elements is selected as a pivot value.
Values smaller than the pivot value are placed to the left of the pivot, while larger values are
placed to the right.
int function Partition (Array A, int Lb, int Ub);
begin
    select a pivot from A[Lb]…A[Ub];
    reorder A[Lb]…A[Ub] such that:
        all values to the left of the pivot are ≤ pivot
        all values to the right of the pivot are ≥ pivot
    return pivot position;
end;

procedure QuickSort (Array A, int Lb, int Ub);
begin
    if Lb < Ub then
        M = Partition (A, Lb, Ub);
        QuickSort (A, Lb, M - 1);
        QuickSort (A, M + 1, Ub);
end;
Figure 2-3: Quicksort Algorithm
In Figure 2-4(a), the pivot selected is 3. Indices are run starting at both ends of the array.
One index starts on the left and selects an element that is larger than the pivot, while another
index starts on the right and selects an element that is smaller than the pivot. In this case,
numbers 4 and 1 are selected. These elements are then exchanged, as is shown in Figure 2-4(b).
This process repeats until all elements to the left of the pivot are ≤ the pivot, and all items to the
right of the pivot are ≥ the pivot. QuickSort recursively sorts the two sub-arrays, resulting in the
array shown in Figure 2-4(c).
(a)  4 2 3 5 1     Lb and Ub mark the array bounds; 3 is selected as the pivot
(b)  1 2 3 5 4     after partitioning; M marks the pivot position
(c)  1 2 3 4 5     after recursively sorting both partitions
Figure 2-4: Quicksort Example
As the process proceeds, it may be necessary to move the pivot so that correct ordering is
maintained. In this manner, QuickSort succeeds in sorting the array. If we’re lucky the pivot
selected will be the median of all values, equally dividing the array. For a moment, let’s assume