Craig S. Mullins 

Return to Home Page

May 2002

 


The “Top Ten” Problem

by Craig S. Mullins

A common problem faced by application developers is the requirement to retrieve a limited number of qualifying rows from the database – for example, to select the ten largest deposits or the five smallest balances. The first reaction of most programmers is to simply use the WHERE clause to eliminate non-qualifying rows. But this is simplistic, and often is not sufficient to produce the results desired in an optimal manner.

For example, what if the program requires that only the top ten results be returned? This can be a difficult request to formulate using SQL alone. Consider, for example, an application that needs to retrieve only the top ten highest paid employees from the EMP table in the DB2 sample database. You could simply issue a SQL request that retrieves all of the books in order by price, but only use the first ten retrieved. That is easy, for example:

SELECT    EMPNO, LASTNAME, FIRSTNME, SALARY

FROM        DSN8710.EMP

ORDER BY SALARY DESC;

But that does not satisfy the requirements of retrieving only the top ten. It merely returns every row with the results sorted into descending sequence. So the results would still be all employees in the table, but in the correct order so you can view the “top ten” salaries easily. When using this approach be sure to specify the ORDER BY clause with the DESC key word. This sorts the results into descending order, instead of the default, which is ascending. Without the DESC key word, the top of the list would contain the lowest paid employees and the “top ten” would be at the very end of the results set.

For many users this approach of ordering the desired results to the top may be sufficient. But it is not a “complete” solution that meets the specifications of returning only the “top ten” using only SQL. The ideal solution should return only the top ten employees with the highest salaries. There are several DB2 solutions that can be used to produce this result. Of course, you could implement the SQL in a cursor and programmatically return only the first ten rows. But this would require host language programming skills and an application program would need to be run each time that the results are required.

DB2 Version 7 provides an easy way to limit the results of a SELECT statement using a new clause – the FETCH FIRST n ROWS clause. When the FETCH FIRST n ROWS clause is specified, DB2 will limit the number of rows that are fetched and returned by a SELECT statement. This Version 7 approach requires SQL only and is quite simple and efficient. The FETCH FIRST n ROWS ONLY clause is appended right to the end of the SELECT statement. It is used as follows:

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM    DSN8710.EMP

ORDER BY SALARY DESC

FETCH FIRST 10 ROWS ONLY;

Of course, the value can be any number – not just 10. For example, to retrieve only the top 4 salaries you would code:

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM    DSN8710.EMP

ORDER BY SALARY DESC

FETCH FIRST 4 ROWS ONLY;

This is the simplest and probably the most elegant solution for limiting the number of rows returned by a DB2 query. This clause is different from the OPTIMIZE FOR n ROWS clause that has been available for several releases of DB2 now. The FETCH FIRST n ROWS ONLY clause will limit the number of rows returned to the specified number, n. Remember that the OPTIMIZE FOR n ROWS clause does not impact the number of rows returned, but is used only by the optimizer for optimizing SQL.

But, the FETCH FIRST n ROWS ONLY clause requires you to be running at least DB2 Version 7 and you might not be running at that level. In that case, another approach is required. One approach is to use the COUNT function and some “tricky” SQL. The following SQL will also return the top ten employees by salary:

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM     DSN8710.EMP A

WHERE 10 > (SELECT COUNT(*)

            FROM   DSN8710.EMP B

            WHERE  A.SALARY < B.SALARY

            AND    B.SALARY IS NOT NULL)

ORDER BY SALARY DESC;

Let's break this query down into components to understand how it works. This SQL statement uses a correlated subquery. A correlated subquery is an inner query that must be evaluated for each row of the outer query. The query uses aliases to identify the table references. The alias “A” identifies the table in the outer query, so that in the subquery, the A.SALARY identifies that column as belonging to the outer query's table. The alias “B” is used for the subquery table (though it is not required).

So, for each row of the outer query, the subquery counts the number of rows with a larger score than that of the outer row under consideration. If there are fewer than 10 rows with a larger score, then the outer row satisfies the outer WHERE clause – in other words, it belongs in the top ten.

The ORDER BY clause is required to sort the results in the right order. If it is removed from the query, the results will still contain the top ten, but they may be in no particular order. Additionally, this query works for all DB2 versions and platforms (mainframe, Unix, and NT) and it is portable from DB2 to other database servers, such as SQL Server and Oracle. And, of course, you can change the constant 10 to any number you wish, thereby retrieving the top 20, or top 5, as deemed necessary by the needs of your application.

That does not mean to suggest that this query is without problems – indeed, it can be quite inefficient. This particular SQL statement uses a correlated subquery. The number of rows counted will grow exponentially as the number of rows in the table grows. It can be quite inefficient when the number of rows grows to as little as a thousand. For very small amounts of data though, this query usually performs quite well. 

But there is another difference between this query and the previous one – and that is the way that “ties” are handled. A tie occurs when more than one row contains the same value. This query may return more than 10 rows if there are multiple rows with the same value for price within the top ten. The previous query that used the FETCH FIRST n ROWS ONLY clause always will limit the number of rows returned to n, even if there are other rows with the same value for price as the number n row in the results set. The needs of your application will dictate whether ties are to be ignored or included in the result set. If ties should not be included in the results set, do not use the last SQL formulation because it will include ties.

One final approach to the “top ten” problem is the brute force method. This method relies on systematically identifying the largest value yet to be returned. It works kind of like peeling back the layers of an onion. The following example uses the brute force method to return the top ten salaries from the EMP table:

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM     DSN8710.EMP

WHERE    SALARY = (SELECT MAX(SALARY) 
                   FROM DSN8710.EMP)

UNION ALL

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM     DSN8710.EMP

WHERE    SALARY =

 (SELECT MAX(SALARY) FROM DSN8710.EMP

  WHERE  SALARY < (SELECT MAX(SALARY) 
                   FROM DSN8710.EMP))

UNION ALL

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM     DSN8710.EMP

WHERE    SALARY =

 (SELECT MAX(SALARY) FROM DSN8710.EMP

  WHERE  SALARY < SELECT MAX(SALARY) 
                  FROM DSN8710.EMP

                  WHERE  SALARY <
                  (SELECT MAX(SALARY) 
                   FROM DSN8710.EMP))

UNION ALL

. . .

ORDER BY SALARY DESC;

You get the picture. The first query in the UNION ALL statement simply retrieves the largest salary. The second query retrieves the largest salary that is less than the first salary retrieved. The third query retrieves the largest salary that is less than the second salary retrieved… and so on until you have retrieved the n largest salaries. Be sure to include the ORDER BY clause to return the rows in descending order by salary.

Actually, this method of obtaining the top ten values is a little bit different than the other methods we have discussed, too. It actually returns the top ten distinct values – no duplicate salary values will be returned when using the brute force method. When multiple salary values fall within the requested range the results will show only one of the employees that qualify. This can be confusing because the query may return different employees each time it is run depending on the access path chosen.

Additionally, the brute force method is not generally recommended because it can quickly become quite cumbersome to code; and making changes to such an unwieldy SQL statement can be very confusing. Furthermore, it is likely that the brute force method will not perform optimally due to all of the MAX functions and subselects needed in the multiple UNION statements.

Want to Return the Bottom Ten?

Any of these queries can be modified to return the bottom ten instead of the top ten. For the first query, simply remove the DESC from the ORDER BY clause. This will cause the rows to be sorting into ascending sequence, which is the default. Then the FETCH FIRST n ROWS ONLY will cause the bottom ten results to be returned.

For the middle query, using standard SQL alone, simply reverse the less than sign (>) to a greater than sign (<) in the subquery, and remove the DESC from the ORDER BY clause, as follows:

SELECT   EMPNO, LASTNAME, FIRSTNME, SALARY

FROM     DSN8710.EMP A

WHERE 10 > (SELECT COUNT(*)

            FROM   DSN8710.EMP B

            WHERE  A.SALARY > B.SALARY

            AND    B.SALARY IS NOT NULL)

ORDER BY SALARY;

And with the brute force method, you simply can deploy the same method but using the MIN function and greater than predicates in place of the MAX function and less than predicates.

Bottom Line

The “top ten” request is a common application requirement. Any application that needs to return an ordered subset of a given table can take advantage of one of these “top ten” queries. Consider using SQL to return only the results you need instead of writing an application program that reads query results to limit the results set. “SQL only” solutions can be easier to use than bulky application programs. But be aware that the “SQL only” approach, depending on the method deployed, also may be less efficient.

 

 

From DB2 Update (Xephon) May 2002.

© 2002 Craig S. Mullins, All rights reserved.
Home.