Skip to main content

Identify and Delete duplicate Records in SQL Server

     
Scenario: I have accumulated data into SQL Server database from three different servers (oracle, O2 and oracle) for a financial reporting. After data accumulated , we found more duplicate rows; business requirement was to remove all duplicate rows because of final data will be import to another system.


For this purpose, I want to write script to find out duplicate rows from my table. Finally, I want to delete only duplicate rows not original row.

Original row means if i have total 2 rows those same or duplicate, so I want to delete only one row that is duplicate.

Business Web Hosting


Let’s go............................

Step 1: Create a student Table below Script:

CREATE TABLE [Students](
    [ID] [int] NULL,
    [Name] [varchar](50) NULL,
    [address] [varchar](50) NULL,
    [contact] [varchar](50) NULL,
    [email] [varchar](50) NULL
)

Step 2: insert Record that shown below:
 
Insert Duplicate Records
  
Step 3: Selecting Duplicate Rows

WITH CTE AS
(
    SELECT ROW_NUMBER()
          OVER
          (
                  PARTITION BY [ID] , [Name],[address],[contact],[email]
                        ORDER BY ( [ID])
          ) AS RowID,
          * FROM Students
)

 SELECT * FROM CTE
        WHERE [ID] IN ( SELECT [ID] FROM CTE WHERE RowID >1 ) 


Identify Duplicate Records

Step 4: Deleting Duplicate Rows

WITH CTE AS
(
    SELECT    ROW_NUMBER()
    OVER
    (
            PARTITION BY [ID] , [Name],[address],[contact],[email]
            ORDER BY ( [ID])
    ) AS RowID,
    * FROM Students
)

Delete From CTE where RowID Not In
(
    SELECT MIN(RowID)
        FROM CTE
        WHERE [ID] IN ( SELECT [ID] FROM CTE where RowID >1)  
)

Result: 2 rows Deleted. 

 Again select duplicate rows by Step-3. Display No records.


Note: I will discuss ASAP about major keywords are WITH , ROW_NUMBER() and PARTITION BY in another post.

Comments

Unknown said…
Why not simply this? Only one scan of the data...

WITH CTE AS
(
SELECT ROW_NUMBER()
OVER(PARTITION BY [ID] , [Name],[address],[contact],[email] ORDER BY [ID]) AS RowID
FROM Students
)
DELETE
FROM CTE
WHERE [RowID] >= 2;

Popular posts from this blog

What are the difference between DDL, DML , DCL and TCL commands?

Database Command Types DDL Data Definition Language  (DDL) statements are used to define the database structure or schema.  -           CREATE - to create objects in the database -           ALTER - alters the structure of the database -           DROP - delete objects from the database -           TRUNCATE - remove all records from a table, including all spaces allocated for the records are removed -           COMMENT - add comments to the data dictionary -           RENAME - rename an object DML Data Manipulation Language  (DML) statements are used for managing data in database. DML commands are not auto-committed. It means changes made by DML command are not permanent to d...

Display Image Immediately from fileUpload Control

Client Side Programming is more efficient for user to very first interection Like Jquery . Now i provide some jquery script to display Image from FileUpload Control directly.. For this purpose we use reference . <script src="../../../JS/JQueryAutocomplete/jquery-1.7.2.js" type="text/javascript"></script>. You just download from jquery.com. Now code below... <script language="javascript" type="text/javascript"> /* Write FileUpload Change event .. */ $('#FileUpload1').bind('change', function() {                                 displayImage(this);                    }); //-------------------------// function displayImage(input)             {        ...

DBCC command to RESEED Table Identity Value – Reset Table Identity

Generally,  we are using identity column into table where identity value incrementally inserted.  DBCC command RESEED is used to reseed (reset) column identity. For example, your table has 20 rows that means last identity value is 20. if you want to reseed identity value will be 30, you can run following TSQL. DBCC  CHECKIDENT  ( yourtable ,  reseed ,  29 ); i f You reseed identity value which is below in current table, it will violate   the uniqueness constraint as soon as the values start to duplicate and will generate error. If you delete last row in current table and insert new row into current table, your new inserted identity value will be last (deleted identity value + incremental value). For example, you delete last three rows where Identity value is 20 - 22 ,so if you now insert new row,  identity value will be 23. That means seed value keep into transaction log.