I recently needed a SQL query that could find duplicate rows in any table, without specifying each column. With some help from Tom H., I have a solution. And with some modifications, a foray into dynamic SQL and a stored procedure, I now use this little beauty:
CREATE PROCEDURE GetDuplicates @TableName NVARCHAR(64) AS BEGIN SET NOCOUNT ON DECLARE @Query NVARCHAR(MAX) SET @Query = ' WITH CTE AS ( SELECT *, CHECKSUM(*) AS chksum, ROW_NUMBER() OVER(ORDER BY GETDATE()) AS rownum FROM ' + @TableName + ' ) SELECT DISTINCT T1.* FROM CTE T1, CTE T2 WHERE T2.chksum = T1.chksum AND T2.rownum <> T1.rownum' EXEC sp_executesql @Query END
So this is doing a few things. At the highest level, it’s a stored procedure, which we could execute like this:
EXEC [dbo].[GetDuplicates] @TableName = N'Users'
This is passing
Users as the name of the table in which we wish to find duplicates. The actual stored procedure uses a very simple example of dynamic SQL (for more information on the pros and cons of dynamic SQL, read this). By writing the actual query we want to use in an
NVARCHAR variable, it becomes possible to replace the table in the
FROM clause to the value passed into the stored procedure. We then pass the variable containing the query to the
sp_executesql system stored procedure, which is kind of like
eval() in PHP.
Now, on to what the query is actually supposed to do. A common table expression is a named result set whose lifetime exists for the duration of a single subsequent
DELETE statement. The query that populates this result set (named
SELECT *, CHECKSUM(*) AS chksum, ROW_NUMBER() OVER(ORDER BY GETDATE()) AS rownum FROM Users
CHECKSUM function is a hash function that takes columns as arguments, which is used to uniquely identify the values of all the fields in a row; this includes derived columns. The call to
ROW_NUMBER() is used to uniquely identify each row, despite the values in its fields. The point being, if two rows have the same checksum value, but (obviously) different row numbers, then they are duplicates. So with that in mind, we do an implicit inner left join:
SELECT DISTINCT T1.* FROM CTE T1, CTE T2 WHERE T2.chksum = T1.chksum AND T2.rownum <> T1.rownum
Remember, we wanted the same checksum, but different row numbers.
T2 are aliases for the same temporary result set we populated earlier, and the
DISTINCT T1.* eliminates extra columns and rows. The final result will each duplicate row. It might look something like this:
Username Password Birthday chksum rownum
foo bar 1999-12-31 00:00:00.000 -1327027712 1
foo bar 1999-12-31 00:00:00.000 -1327027712 7
foo bar 1999-12-31 00:00:00.000 -1327027712 8
foo bar 1999-12-31 00:00:00.000 -1327027712 9
foo bar 1999-12-31 00:00:00.000 -1327027712 10
hillary ilikepie 1999-12-31 00:00:00.000 956702162 4
hillary ilikepie 1999-12-31 00:00:00.000 956702162 5
hillary ilikepie 1999-12-31 00:00:00.000 956702162 6
jim pass1 1999-12-31 00:00:00.000 2026298685 2
jim pass1 1999-12-31 00:00:00.000 2026298685 3
So if you want to know how many duplicate rows there are, there’s one row in the result for each. Sorry, I got lazy and everybody in this example is eleven years old.