
Identify and Delete Duplicate Records in SQL Server

     
Scenario: I accumulated data into a SQL Server database from three different servers (Oracle, O2 and Oracle) for financial reporting. After the data was loaded, we found many duplicate rows. The business requirement was to remove all duplicate rows, because the final data would be imported into another system.


For this purpose, I want to write a script that finds the duplicate rows in my table and then deletes only the duplicates, not the original row.

"Original row" means that if two rows are identical, I want to keep one of them and delete only the other.



Let’s go.

Step 1: Create a Students table with the script below:

CREATE TABLE [Students](
    [ID] [int] NULL,
    [Name] [varchar](50) NULL,
    [address] [varchar](50) NULL,
    [contact] [varchar](50) NULL,
    [email] [varchar](50) NULL
)

Step 2: Insert records, including some duplicates, as shown below:

[Figure: Insert Duplicate Records]
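
The screenshot with the original rows is not reproduced here, so the following is only a minimal sketch with made-up sample data. The names, addresses, contact numbers, and emails are assumptions for illustration. The set is chosen so that exactly two rows are duplicates, which matches the "2 rows deleted" result reported in Step 4. The multi-row VALUES form assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data (not the author's original rows).
-- The rows for ID 1 and ID 3 each appear twice, so two rows are duplicates.
INSERT INTO [Students] ([ID], [Name], [address], [contact], [email])
VALUES
    (1, 'Alice', '12 Park Road',   '555-0101', 'alice@example.com'),
    (1, 'Alice', '12 Park Road',   '555-0101', 'alice@example.com'),  -- duplicate of the row above
    (2, 'Bob',   '34 Lake Street', '555-0102', 'bob@example.com'),
    (3, 'Carol', '56 Hill Avenue', '555-0103', 'carol@example.com'),
    (3, 'Carol', '56 Hill Avenue', '555-0103', 'carol@example.com');  -- duplicate of the row above
```

Because the table has no primary key or unique constraint, SQL Server accepts the repeated rows without complaint.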
  
Step 3: Selecting Duplicate Rows

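WITH CTE AS
(
    -- Number the rows within each group of fully identical rows;
    -- any row with RowID > 1 is a duplicate.
    SELECT ROW_NUMBER()
           OVER
           (
               PARTITION BY [ID], [Name], [address], [contact], [email]
               ORDER BY [ID]
           ) AS RowID,
           *
    FROM Students
)
-- Show every row whose ID has at least one duplicate (the original and its copies).
SELECT * FROM CTE
WHERE [ID] IN (SELECT [ID] FROM CTE WHERE RowID > 1);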


[Figure: Identify Duplicate Records]
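
As a cross-check on the ROW_NUMBER() query, a plain GROUP BY over the same five columns can report how many copies of each row exist. This is a common alternative and not part of the original script; the column alias Copies is an illustrative name.

```sql
-- List each distinct row that occurs more than once, with its number of copies.
SELECT [ID], [Name], [address], [contact], [email],
       COUNT(*) AS Copies
FROM Students
GROUP BY [ID], [Name], [address], [contact], [email]
HAVING COUNT(*) > 1;
```

For each row returned, Copies - 1 is the number of rows a duplicate cleanup should remove. GROUP BY places NULL values in the same group, as PARTITION BY does, so both queries agree on what counts as a duplicate even though every column here allows NULL.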

Step 4: Deleting Duplicate Rows

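WITH CTE AS
(
    -- Same numbering as in Step 3: RowID restarts at 1 for each group of identical rows.
    SELECT ROW_NUMBER()
           OVER
           (
               PARTITION BY [ID], [Name], [address], [contact], [email]
               ORDER BY [ID]
           ) AS RowID,
           *
    FROM Students
)
-- When duplicates exist, MIN(RowID) below is 1, so every row with RowID > 1
-- is deleted and the first row of each group is kept.
DELETE FROM CTE
WHERE RowID NOT IN
(
    SELECT MIN(RowID)
    FROM CTE
    WHERE [ID] IN (SELECT [ID] FROM CTE WHERE RowID > 1)
);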

Result: 2 rows deleted.

Run the Step 3 query again to select duplicate rows. It now returns no records.
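
Because a committed DELETE cannot be undone, it may be worth running the cleanup inside an explicit transaction and checking the row count before committing. The following is a sketch under that assumption, not the author's script. It uses a plain RowID > 1 filter, which removes every copy after the first in each group. The variable name @deleted and the expected count of 2 are illustrative, so substitute the count you expect for your own data.

```sql
BEGIN TRANSACTION;

WITH CTE AS
(
    SELECT ROW_NUMBER()
           OVER
           (
               PARTITION BY [ID], [Name], [address], [contact], [email]
               ORDER BY [ID]
           ) AS RowID
    FROM Students
)
DELETE FROM CTE
WHERE RowID > 1;

-- @@ROWCOUNT must be read immediately after the DELETE,
-- because the next statement resets it.
DECLARE @deleted int = @@ROWCOUNT;

-- Assumed expectation: exactly 2 duplicate rows. Anything else is rolled back.
IF @deleted = 2
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;
```

If the count is not what you expected, the rollback leaves the table unchanged and you can inspect the duplicates again with the Step 3 query.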


Note: I will discuss the major keywords used here (WITH, ROW_NUMBER() and PARTITION BY) in another post.

Comments

Unknown said…
Why not simply this? Only one scan of the data...

WITH CTE AS
(
SELECT ROW_NUMBER()
OVER(PARTITION BY [ID] , [Name],[address],[contact],[email] ORDER BY [ID]) AS RowID
FROM Students
)
DELETE
FROM CTE
WHERE [RowID] >= 2;
