A piece of
inspired thinking
at the Haifa
Research Laboratory has led to a
dramatic increase
in the speed at which computers can sort data.
The result:
a significant saving in computing time and cost.
In Brief:
Sorting data is an essential, time-consuming and costly task of computers. Until recently, computer scientists were happy with the existing sorting programs, as they had reached the theoretical speed limit for sorting by comparison. However, a team at IBM's Haifa Research Laboratory has developed an algorithm that permits sorting speeds more than 10 times faster than present methods. The algorithm is the basis of IBM's SMARTsort data-sorting product, introduced
last year.
One of the most mundane but essential tasks of computers is sorting and arranging numbers from lowest to highest, putting text entries in alphabetical order and organizing fields in databases, for example. That type of work often takes up 20 to 30 percent of a typical computer's time.
Not surprisingly, computer scientists want to speed up the process, and thereby reduce the costs of computing. For many years, they faced an apparent limit to the amount of improvement they could achieve. However, a team at IBM's Haifa Research Laboratory has found a way around that roadblock. The group has developed a new algorithm that enables users to sort data at least 10 times as fast as they can using traditional methods.
The algorithm forms the basis of SMARTsort, a data sorting product that IBM's Storage System Division (SSD) introduced last October. In May, the division followed up with a second release. This includes support for Windows and OS/2, as well as portability to Hewlett-Packard and Sun workstations, in addition to AIX® and COBOL® supported by the initial release. Beyond sorting, the product copies, merges and filters information. "It's a data manipulation utility," says Alan Hartman, who oversees research on the algorithm in his role as manager of applied mathematics at Haifa.
Sorting systems
Computerized sorting literally occurs by numbers. Nonnumerical items can be converted to numerical values to permit them to be sorted.The simplest way of arranging a list of numbers into a sequence begins by going through the list and keeping a record of the smallest number. That number is placed at the top, and the process is repeated over and over until all the numbers are in order. The method is inherently slow, particularly as the length of the list increases, because the time depends on the square of the total of numbers to be sorted.
More sophisticated methods stem from the "heap approach," in which the smallest number repeatedly pops to the top of a heap of numbers. The time needed to carry out such "comparison-based sorting" is proportional to n (the total of numbers to be sorted) times the logarithm of n, expressed mathematically as nlogn. Efficient implementations can speed up comparison sorts somewhat, but not by much.
In February 1993, SSD asked Haifa to develop a method of sorting for workstations, as opposed to mainframes, that would work at least twice as fast as available sorting procedures. For about 18 months, a team led by Zvi Yehudai worked to develop a prototype system based on improvements in comparison sorting.
Then came an inspiration. Applied probability expert Shmuel Gal and Zvi Yehudai decided to try a version of a generally neglected sorting technique. Called "distribution sorting," it was ignored because many believed that it was impractical to apply it to large quantities of real-life data. Nevertheless, Gal and Yehudai decided to try it. "I looked at the problem with a different view from the traditional computer science approach," explains Gal.
The program starts by picking a small random sample of the numbers to be sorted; for a total of 10 billion numbers, the sample might consist of 1,000. Using standard techniques, the program estimates the smallest and largest numbers in the random sample. Next, it allocates a large amount of memory, which it divides into a series of "bins," each corresponding to equal size intervals of numbers, for example, 1.01 to 1.02, 1.02 to 1.03 and so on. The total number of bins and the size of the interval depend on the high and low numbers in the random sample.
The program then sorts all the numbers to be sorted, by placing each in the appropriate bin. For instance, the numbers 123.3567 and 123.3571 would both fit into a bin that covers the range from 123.35 to 123.36. When two or more numbers fall in the same bin, the program uses the heap approach to sort them. That's not difficult as long as bins contain
only a few numbers. So the goal in setting up the bins is to ensure that most contain just a single number or none at all. That means selecting perhaps three or four times as many bins as the total of numbers to be sorted. "This takes a lot of memory," explains
Hartman, "but memory is cheap."
How well does distribution sorting work? As soon as Gal came up with the idea, Yehudai implemented it on a mainframe, and compared it with a comparison sort. To his delight, it worked 10 to 20 times faster. "That was a surprise," he admits. So spectacularly did the algorithm behave, Yehudai recalls, that SSD dropped the algorithm on which Haifa was working and introduced the new one.
Then came another tough task. "There's a large
difference between taking an algorithm and making a product," comments Hartman. Yehudai and the Haifa team implemented the algorithm and incorporated it into SMARTsort for last fall's release.
Along the way they refined the algorithm to deal with two situations in which the bins become too crowded. A poor initial selection of limits can cause the bins for the highest and lowest numbers - which theoretically can range down to zero and up to infinity - to fill up with more numbers than heap sorting can deal with easily. And when the numbers in the whole sample are unevenly distributed, a poor random
selection can lead to overcrowding in central bins. In both cases, the algorithm is used recursively for the interval in question.
The Haifa team is now working on a variation of the algorithm for parallel computers, such as IBM's SP2. The group has also developed procedures - such as overlapping reading, writing and sorting - that will give its program even more impressive speed.