Friday, August 22, 2008

Be Smart. Get A Row Number !

Imagine there is a table with no primary key or unique constraint and volumes of duplicated rows keep coming in. The only piece of information that differentiates these rows is a timstamp column that stores the date and time where record is inserted.








ID (VARCHAR)NAME (VARCHAR)AMOUNT (INT)LAST_UPDATED (TIMESTAMP)
1Eddy80000.002006-03-14-00.10.31.999999
1Eddy90000.002006-03-14-00.09.31.999999
1Eddy90000.002006-03-14-00.11.31.999999
2Lee Sin Ti100.002006-03-14-00.10.31.999999
2Lee Sin Ti200.002006-03-14-00.09.31.999999

Table1


Based on this, your DB2 query is to remove all the outdated records while maintaining only the latest entries.


An idiotic first attempt:

DELETE FROM TABLE1
WHERE
ID || NAME || CAST (AMOUNT AS VARCHAR(32)) || CAST(LAST_UPDATED AS VARCHAR(64))
NOT IN
(
SELECT (ID || NAME || CAST(AMOUNT AS VARCHAR(32)) || CAST(MAX(LAST_UPDATED) AS VARCHAR(64)) ) AS KEY FROM TABLE1 GROUP BY ID, NAME
);

This attempt is definitely a NO-NO. Not only the string concatenations takes a huge amount of processing cycles, it is also UGLY in my point of view. Using a generated random data of 100k records, it takes an unacceptable amount of time to complete the delete query.

Then, my second attempt got to deal with DB2 support of row_number() function, which I greatly appreciated from IBM.


Second attempt:

DELETE FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY ID, NAME ORDER BY LAST_UPDATED DESC)
FROM TABLE1
) AS X (ROWNUM) WHERE ROWNUM > 1;

WOW, an optimized yet elegant query to achieve my goal. With the same random data set, it only took less than 1 minute to complete the query.

No comments: