SQL query to find duplicate rows, in any table

I recently needed a SQL query that could find duplicate rows in any table, without specifying each column.  With some help from Tom H., I have a solution.  And with some modifications, a foray into dynamic SQL and a stored procedure, I now use this little beauty:

CREATE PROCEDURE GetDuplicates
	@TableName NVARCHAR(64)
AS
BEGIN
	SET NOCOUNT ON

	DECLARE @Query NVARCHAR(MAX)
	SET @Query = '

	WITH CTE AS (
		SELECT
			*,
			CHECKSUM(*) AS chksum,
			ROW_NUMBER() OVER(ORDER BY GETDATE()) AS rownum
		FROM
			' + @TableName + '
	)
	SELECT
		DISTINCT T1.*
	FROM
		CTE T1, CTE T2
	WHERE
		T2.chksum = T1.chksum AND
		T2.rownum <> T1.rownum'

	EXEC sp_executesql @Query
END

So this is doing a few things.  At the highest level, it’s a stored procedure, which we could execute like this:

EXEC [dbo].[GetDuplicates] @TableName = N'Users'

This is passing Users as the name of the table in which we wish to find duplicates.  The actual stored procedure uses a very simple example of dynamic SQL (for more information on the pros and cons of dynamic SQL, read this).  By writing the actual query we want to use in an NVARCHAR variable, it becomes possible to replace the table in the FROM clause to the value passed into the stored procedure. We then pass the variable containing the query to the sp_executesql system stored procedure, which is kind of like eval() in PHP.

Now, on to what the query is actually supposed to do.  A common table expression is a named result set whose lifetime exists for the duration of a single subsequent SELECT, INSERT, UPDATE or DELETE statement.  The query that populates this result set (named CTE) is

SELECT
	*,
	CHECKSUM(*) AS chksum,
	ROW_NUMBER() OVER(ORDER BY GETDATE()) AS rownum
FROM
	Users

The CHECKSUM function is a hash function that takes columns as arguments, which is used to uniquely identify the values of all the fields in a row; this includes derived columns.  The call to ROW_NUMBER() is used to uniquely identify each row, despite the values in its fields. The point being, if two rows have the same checksum value, but (obviously) different row numbers, then they are duplicates. So with that in mind, we do an implicit inner left join:

SELECT
	DISTINCT T1.*
FROM
	CTE T1, CTE T2
WHERE
	T2.chksum = T1.chksum AND
	T2.rownum <> T1.rownum

Remember, we wanted the same checksum, but different row numbers. T1 and T2 are aliases for the same temporary result set we populated earlier, and the DISTINCT T1.* eliminates extra columns and rows. The final result will each duplicate row. It might look something like this:

Username Password Birthday chksum rownum
foo bar 1999-12-31 00:00:00.000 -1327027712 1
foo bar 1999-12-31 00:00:00.000 -1327027712 7
foo bar 1999-12-31 00:00:00.000 -1327027712 8
foo bar 1999-12-31 00:00:00.000 -1327027712 9
foo bar 1999-12-31 00:00:00.000 -1327027712 10
hillary ilikepie 1999-12-31 00:00:00.000 956702162 4
hillary ilikepie 1999-12-31 00:00:00.000 956702162 5
hillary ilikepie 1999-12-31 00:00:00.000 956702162 6
jim pass1 1999-12-31 00:00:00.000 2026298685 2
jim pass1 1999-12-31 00:00:00.000 2026298685 3

So if you want to know how many duplicate rows there are, there’s one row in the result for each. Sorry, I got lazy and everybody in this example is eleven years old.

This entry was posted in SQL and tagged , , , . Bookmark the permalink.
  • Delicious
  • Facebook
  • Reddit
  • StumbleUpon
  • Twitter
  • RSS Feed
  • Google
  • Digg

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>