How to remove duplicates in PostgreSQL?

Today somebody asked me how to remove duplicates which accidentally made it into a table. The problem is that a normal DELETE filtering on the duplicated value won't do, because it would delete both copies, not just the surplus one.

The magic word is "ctid"

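To solve the problem, you can use a hidden system column called "ctid". The "ctid" identifies the physical location of a row version inside its table as a pair (block number, position within the block). It can change when a row is updated or the table is rewritten, so it is suitable for a one-off cleanup like this one, not as a long-term key. Here is an example: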

test=# CREATE TABLE t_test (id int4);
CREATE TABLE

test=# INSERT INTO t_test VALUES (1), (2), (2), (3);
INSERT 0 4

As you can see, the value 2 shows up twice. To find out how to remove the duplicate, we can query the "ctid":

test=# SELECT ctid, * FROM t_test;
 ctid  | id
-------+----
 (0,1) |  1
 (0,2) |  2
 (0,3) |  2
 (0,4) |  3
(4 rows)
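
Before deleting anything, it can help to see which values are affected and how many surplus rows each one has. The following is a minimal sketch against the t_test table from above; the column aliases "copies" and "surplus" are made up for this example.

```sql
-- List every value that occurs more than once,
-- together with the number of copies and of surplus rows.
SELECT id,
       count(*)     AS copies,
       count(*) - 1 AS surplus
FROM   t_test
GROUP  BY id
HAVING count(*) > 1
ORDER  BY id;
```

With the four rows shown above, this should report only the value 2, with two copies and one surplus row.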

We can make use of the fact that the ctid differs between the two copies. The subselect finds the lowest ctid for each value that occurs more than once, and the DELETE removes exactly that row:

test=# DELETE FROM t_test
       WHERE ctid IN (SELECT min(ctid)
                      FROM   t_test
                      GROUP BY id
                      HAVING count(*) > 1
                     )
       RETURNING * ;
 id
----
  2
(1 row)
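
Since a cleanup like this is destructive, one cautious way to run it is inside a transaction, so the result can be inspected before it becomes permanent. This is only a sketch of that workflow around the statement shown above, not something the original session did:

```sql
BEGIN;

-- Same statement as above: remove the row with the lowest ctid
-- for every value that occurs more than once.
DELETE FROM t_test
WHERE  ctid IN (SELECT min(ctid)
                FROM   t_test
                GROUP  BY id
                HAVING count(*) > 1)
RETURNING *;

-- Inspect what is left before making the change permanent.
SELECT ctid, * FROM t_test ORDER BY id;

ROLLBACK;  -- replace with COMMIT once the result looks right
```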

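This query works nicely as long as no value shows up more than twice, because each run removes only one row per duplicated value. To handle any number of copies, we can use a window function: the running count, ordered by ctid, is 1 for the first copy of each value and greater than 1 for every further copy. In the following run the value 2 was stored three times (the output matches a table holding 1, 2, 2, 2, 3):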

test=# SELECT ctid 
       FROM (SELECT ctid, id,
                    count(*) OVER (PARTITION BY id ORDER BY ctid)
             FROM   t_test
            ) AS x
       WHERE count > 1;
 ctid
------
 (0,3)
 (0,4)
(2 rows)

test=# DELETE FROM t_test
       WHERE ctid IN (SELECT ctid
                      FROM (SELECT ctid, id,
                                   count(*) OVER (PARTITION BY id ORDER BY ctid)
                            FROM t_test
                           ) AS x
                      WHERE count > 1
                     )
       RETURNING * ;
 id
----
  2
  2
(2 rows)
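
The same idea can also be written with row_number() instead of a running count. This is a variant sketch, not the query used in the session above; the alias "rn" is made up for this example. It numbers the copies of each value by ctid and deletes every copy except the first.

```sql
-- Keep the copy with the lowest ctid for each id, delete all others.
DELETE FROM t_test
WHERE  ctid IN (SELECT ctid
                FROM  (SELECT ctid,
                              row_number() OVER (PARTITION BY id
                                                 ORDER BY ctid) AS rn
                       FROM   t_test) AS x
                WHERE  rn > 1)
RETURNING *;
```

Compared with count(*) OVER (... ORDER BY ctid), row_number() states the intent directly and does not rely on the default window frame turning the count into a running count.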

Now we can check for the result:

test=# SELECT ctid, * FROM t_test;
 ctid  | id
-------+----
 (0,1) |  1
 (0,2) |  2
 (0,5) |  3
(3 rows)
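
As a final check, and assuming the column is really meant to be unique, something along these lines can confirm the cleanup and stop duplicates from coming back. The constraint name t_test_id_unique is made up for this example.

```sql
-- Should return no rows once every surplus copy is gone.
SELECT id, count(*)
FROM   t_test
GROUP  BY id
HAVING count(*) > 1;

-- Optional: reject future duplicates.
-- This fails if any duplicates are still present.
ALTER TABLE t_test
      ADD CONSTRAINT t_test_id_unique UNIQUE (id);
```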

2 Comments

Jorge Fernandez

9 years ago

Finally a crystal-clear explanation... thank you so much!

Bjoern Schilberg

10 years ago

Just a quick note here. Window Functions appeared first in PostgreSQL 8.4.

Hans-Jürgen Schönig
