Deleting 1 million rows from PostgreSQL

Get that garbage out

Are you struggling to manage large amounts of data in your database? Do you want to learn how to efficiently delete unnecessary data and improve the performance of your database?

In this article, we will discuss the DELETE and TRUNCATE commands in PostgreSQL and how they can be used to clean up your database.

Although we will be using PostgreSQL, the information in this article applies to other SQL databases as well. By the end of this article, you will have a better understanding of how to get rid of large amounts of data in your database efficiently.

TRUNCATE

Let's first look at what the TRUNCATE command does, and then we will explore it further.

TRUNCATE quickly removes all rows from a table or a set of tables. It is similar to a DELETE command without a WHERE clause, but it is faster because it doesn't scan the table. In PostgreSQL, TRUNCATE also reclaims the disk space immediately, without requiring the VACUUM command (we will look into VACUUM later on), which can be useful for optimizing performance.

The query for TRUNCATE looks like this...

TRUNCATE tableName;

Essential things to keep in mind while using the TRUNCATE command

  • It acquires an ACCESS EXCLUSIVE lock on the table, which blocks all other concurrent operations on the table, including plain SELECT queries.

  • You can't remove data partially with it, i.e. you can't use a WHERE clause with TRUNCATE.

  • It cannot be used on a table that has foreign-key references from other tables unless those tables are truncated in the same command. The CASCADE parameter does this for you automatically, but it should be used with care, because it truncates every referencing table as well and can easily remove data you intended to keep.

Scenarios where TRUNCATE can be useful

  • You don’t need any data present in a table and there are no foreign-key references from another table
    >>> Perfect scenario to use TRUNCATE.

  • You don’t need any data present in the tables but there are foreign-key references from another table
    >>> Truncate the referencing (child) tables together with the parent table in a single TRUNCATE command, listing all of them explicitly, rather than relying on the CASCADE parameter (in PostgreSQL). See the first sketch after this list.

  • You need to delete the majority of the data (> 60%) and there are no foreign-key references from another table
    >>> Select the data that is needed and move it to another temporary table.
    >>> TRUNCATE the original table.
    >>> Rename the new table (see the second sketch below).

  • You need to delete partial data but there are foreign-key references from another table
    >>> In this scenario, it is better to use the DELETE command irrespective of the amount of data that needs to be deleted. You can’t escape table scans, so why not use the DELETE command...
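
For the second scenario, here is a minimal sketch with hypothetical table names (childTable holds a foreign key to parentTable):

-- Hypothetical tables: childTable references parentTable.
-- Listing both in one statement satisfies the foreign-key check,
-- so neither CASCADE nor a table scan is needed.
TRUNCATE childTable, parentTable;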

When creating the new table in the above steps, you may refer to this link.
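
And here is a rough sketch of those three steps, assuming a hypothetical table tableName with a to_be_deleted flag; treat it as a starting point rather than a drop-in script:

BEGIN;
-- 1. Select the data that is needed and move it to another table.
CREATE TABLE tableName_keep AS
    SELECT * FROM tableName WHERE to_be_deleted = FALSE;
-- 2. TRUNCATE the original table (instantly reclaims its space).
TRUNCATE tableName;
-- 3. Rename the new table. The empty original is dropped first so the
--    name becomes free. Indexes, constraints and triggers are not copied
--    by CREATE TABLE ... AS, so recreate them on the renamed table.
DROP TABLE tableName;
ALTER TABLE tableName_keep RENAME TO tableName;
COMMIT;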

DELETE

The DELETE command, as the name suggests, deletes a row or a set of rows from a table. You can use the WHERE clause with the DELETE statement to delete only the rows that fulfil a particular condition.

DELETE looks like this...

DELETE from tableName
WHERE to_be_deleted=TRUE;

Things to keep in mind about DELETE

  • When rows in PostgreSQL are deleted, they are not physically removed from the table right away; they are only marked as dead, and the space can be reclaimed later using the VACUUM command.
    The database performs periodic autovacuum, but during mass deletion it is important to keep this in mind and run the VACUUM command manually to prevent performance issues.

  • You can't use the LIMIT parameter with DELETE. So something like this will throw an error.

    DELETE from tableName
    WHERE to_be_deleted=TRUE
    LIMIT 10;
    

Preparing your database for a mind-boggling amount of deletions

  • To improve deletion performance, drop indexes that are not needed during the deletion and recreate them after it is complete. This means dropping indexes on columns that are not used in the WHERE clause or in ORDER BY. Every index has to be updated for each deleted row, and heavy deletion also fragments indexes, making them less effective. Dropping the unused ones therefore means less work per deleted row and keeps the remaining indexes effective.

  • Remove any “on delete” triggers if possible.

  • If other tables hold foreign-key references to the table you are deleting from, create indexes on those foreign-key columns in the referencing tables. Without them, PostgreSQL has to scan each referencing table for every deleted row to check the constraint. Also, perform the above steps on the linked tables. A sketch of these preparation steps follows this list.
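
Every table, column, index and trigger name in the sketch below is hypothetical, so adapt it to your own schema:

-- Drop an index that the deletion's WHERE/ORDER BY doesn't use,
-- and recreate it once the clean-up is finished.
DROP INDEX IF EXISTS idx_tableName_unused_column;
-- CREATE INDEX idx_tableName_unused_column ON tableName (unused_column);

-- Temporarily disable an "on delete" trigger, if your application allows it.
ALTER TABLE tableName DISABLE TRIGGER my_on_delete_trigger;
-- ALTER TABLE tableName ENABLE TRIGGER my_on_delete_trigger;  -- afterwards

-- Index the referencing column in a child table so the foreign-key check
-- doesn't scan it for every deleted row.
CREATE INDEX IF NOT EXISTS idx_childTable_fk ON childTable (tableName_id);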

With the preparation done, let's now address the elephant in the room: how to DELETE.

Let's see how to actually delete the data

Delete the data in batches. But wait, didn't I just say that using LIMIT with DELETE is not possible? There is a workaround, which we will get to in a moment, but first let me tell you why batching matters. Whenever a row is deleted, a row-level lock is acquired. In SQL Server, once a single statement holds too many row locks (the threshold is around 5,000), the lock escalates and the whole table stays locked for the duration of the transaction. PostgreSQL does not escalate row locks to a table lock, but a single huge DELETE still holds all of its row locks until the transaction ends and generates a huge amount of transaction log, so batching is just as important here.

Deleting in batches of 1,000-5,000 records is ideal. So what you can do is write a script that deletes half a million records in batches of 1,000-5,000.

Before we move forward, let's first see how to delete in batches. The idea is to select, say, 1,000 ids from the table and then ask SQL to delete the rows whose ids were selected. This looks something like this...

DELETE FROM tableName
WHERE id IN 
(
    SELECT id 
    FROM tableName
    WHERE to_be_deleted=TRUE
    LIMIT 1000        
);

To actually delete half a million rows, you can write the loop either in SQL itself or in any other programming language. For example, here is what it could look like in JavaScript with Knex (a query builder).

// Assumed setup (not shown in the original snippet): moment for timestamps
// and a Knex instance connected to the target PostgreSQL database.
const moment = require('moment');
const KNEX = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL,
});

const runner = async () => {
  try {
    console.log(`Started ${moment().toISOString()}`);

    // Half a million rows, deleted in batches of 1000.
    const iterations = 500000 / 1000;

    for (let index = 0; index < iterations; index++) {
      console.log(`Running ------------ ${index}`);
      try {
        // One batch: pick up to 1000 ids flagged for deletion and delete them.
        await KNEX.raw(`
            DELETE from "tableName"
            WHERE "tableName"."id" IN (
                SELECT "tableName"."id"
                FROM "tableName"
                WHERE "tableName"."toBeDeleted" = TRUE
                LIMIT 1000
            )
        `);
      } catch (error) {
        console.log(error);
      }
      console.log(`Completed ------------ ${index}`);
    }

    console.log(`Completed ${moment().toISOString()}`);
  } catch (error) {
    console.log(error);
  }
};

runner();

After deleting 500,000 rows, the next important step is to run the VACUUM ANALYZE command on the affected tables. This command reclaims the storage occupied by dead rows, as deletions in PostgreSQL don't physically remove the data from the table. The ANALYZE parameter collects statistics about the table content, which helps the query planner choose the most efficient execution plan. Note that plain VACUUM does not rebuild fragmented indexes; if the indexes have become bloated, recreate them (or run REINDEX) once the clean-up is finished. Running VACUUM ANALYZE after each large batch is therefore essential for maintaining the performance of the database.
NOTE: VACUUM ANALYZE causes a substantial increase in I/O traffic, which might cause poor performance for other active sessions.

While the deletion is taking place you will see the free space in your database decreasing instead of increasing. This is fixed after running VACUUM.
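
If you want to watch this happening, one option (not covered above) is PostgreSQL's per-table dead-row counter, which should drop back down once VACUUM has run; the table name below is hypothetical:

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'tableName';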

The command for running VACUUM is...

VACUUM VERBOSE ANALYZE tableName;

Now repeat the above steps until you have deleted all the required data.

I hope this article helped you out!

I would love to hear your thoughts on this topic. Please share your ideas and experiences in the comments section below.