DB2 Nerd: Be Smart. Get A Row Number !

Imagine there is a table with no primary key or unique constraint and volumes of duplicated rows keep coming in. The only piece of information that differentiates these rows is a timstamp column that stores the date and time where record is inserted.

ID (VARCHAR)	NAME (VARCHAR)	AMOUNT (INT)	LAST_UPDATED (TIMESTAMP)
1	Eddy	80000.00	2006-03-14-00.10.31.999999
1	Eddy	90000.00	2006-03-14-00.09.31.999999
1	Eddy	90000.00	2006-03-14-00.11.31.999999
2	Lee Sin Ti	100.00	2006-03-14-00.10.31.999999
2	Lee Sin Ti	200.00	2006-03-14-00.09.31.999999

Table1

Based on this, your DB2 query is to remove all the outdated records while maintaining only the latest entries.

An idiotic first attempt:

DELETE FROM TABLE1
WHERE
ID || NAME || CAST (AMOUNT AS VARCHAR(32)) || CAST(LAST_UPDATED AS VARCHAR(64))
NOT IN
(
SELECT (ID || NAME || CAST(AMOUNT AS VARCHAR(32)) || CAST(MAX(LAST_UPDATED) AS VARCHAR(64)) ) AS KEY FROM TABLE1 GROUP BY ID, NAME
);

This attempt is definitely a NO-NO. Not only the string concatenations takes a huge amount of processing cycles, it is also UGLY in my point of view. Using a generated random data of 100k records, it takes an unacceptable amount of time to complete the delete query.

Then, my second attempt got to deal with DB2 support of row_number() function, which I greatly appreciated from IBM.

Second attempt:

DELETE FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY ID, NAME ORDER BY LAST_UPDATED DESC)
FROM TABLE1
) AS X (ROWNUM) WHERE ROWNUM > 1;

WOW, an optimized yet elegant query to achieve my goal. With the same random data set, it only took less than 1 minute to complete the query.

DB2 Nerd

Friday, August 22, 2008

Be Smart. Get A Row Number !

No comments:

Interesting Sites

Blog Archive