PostgreSQL COPY result set to file

Introduction

In this article, we are going to see how we can use the PostgreSQL COPY command to export a large result set to an external file.

ETL (Extract, Transform, Load)

When implementing an ETL (Extract, Transform, Load) process, you might have to extract a large result set from a relational database in order to transform it according to some specific business rules.

Let’s assume we have to export the result set of the following SQL query:

SELECT
    p.id AS post_id,
    p.title AS post_title,
    pc.id AS comment_id,
    pc.review AS comment_review
FROM post p
INNER JOIN post_comment pc ON pc.post_id = p.id

If we have only 2 post records, each one having 5 associated post_comment child rows, then the result set would look like this:

| post_id | post_title                                 | comment_id | comment_review                                |
| ------- | ------------------------------------------ | ---------- | --------------------------------------------- |
| 1       | High-Performance Java Persistence - page 1 | 1          | Excellent book to understand Java Persistence |
| 1       | High-Performance Java Persistence - page 1 | 2          | Must-read for Java developers                 |
| 1       | High-Performance Java Persistence - page 1 | 3          | Five Stars                                    |
| 1       | High-Performance Java Persistence - page 1 | 4          | A great reference book                        |
| 1       | High-Performance Java Persistence - page 1 | 5          | The ultimate guide to a critical topic        |
| 2       | High-Performance Java Persistence - page 2 | 6          | Excellent book to understand Java Persistence |
| 2       | High-Performance Java Persistence - page 2 | 7          | Must-read for Java developers                 |
| 2       | High-Performance Java Persistence - page 2 | 8          | Five Stars                                    |
| 2       | High-Performance Java Persistence - page 2 | 9          | A great reference book                        |
| 2       | High-Performance Java Persistence - page 2 | 10         | The ultimate guide to a critical topic        |

However, in our case, we have one million post records that will be joined with five million post_comment child rows, so our query result set will contain five million entries.

Exporting a large result set using Streams

Streaming is a well-known method of extracting a large result set. However, if we choose this option, we need to be careful with database systems whose JDBC drivers prefetch the entire result set by default, such as PostgreSQL and MySQL.

As I explained in this article, PostgreSQL and MySQL prefetch the entire JDBC ResultSet even before iterating it.

Therefore, if we want to use streaming, we need to set the statement fetch size to a positive integer value, which defines the maximum number of records fetched at a time from the database server by the JDBC client.
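
To make this concrete, here is a minimal plain-JDBC sketch of what the fetch size does; the dataSource and BATCH_SIZE references are illustrative. Note that the PostgreSQL JDBC driver only honors the fetch size when auto-commit is disabled, so that the query runs inside a transaction and the driver can use a server-side cursor:

// Plain JDBC sketch: with auto-commit disabled, the PostgreSQL driver uses a
// server-side cursor and fetches at most BATCH_SIZE rows per database round-trip,
// instead of loading the entire result set into memory.
try (Connection connection = dataSource.getConnection()) {
    connection.setAutoCommit(false);

    try (PreparedStatement statement = connection.prepareStatement("""
            SELECT p.id, p.title, pc.id, pc.review
            FROM post p
            INNER JOIN post_comment pc ON pc.post_id = p.id
            """)) {
        statement.setFetchSize(BATCH_SIZE);

        try (ResultSet resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                // Process one row at a time, e.g. append it to the CSV file
            }
        }
    }

    connection.commit();
}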

Assuming that we will export the query result into a file named post_and_comments.csv, the first thing we will do in our integration test is to delete any existing file matching this name:

String fileName = "post_and_comments.csv";

Path filePath = Paths.get(
    queryResultSetOutputFolder, 
    fileName
)
.toAbsolutePath();

if (Files.exists(filePath)) {
    Files.delete(filePath);
}
assertFalse(Files.exists(filePath));

Now, we are going to execute our SQL query and wrap the result into a Java Stream that we will save to a file:

long startNanos = System.nanoTime();

try(Stream<String> tuples = entityManager.createNativeQuery("""
    SELECT 'post_id,post_title,comment_id,comment_review'
    UNION ALL
    SELECT concat_ws(',',
        p.id,
        p.title,
        pc.id,
        pc.review
    )
    FROM post p
    INNER JOIN post_comment pc ON pc.post_id = p.id
    """, String.class)
.setHint(AvailableHints.HINT_FETCH_SIZE, BATCH_SIZE)
.getResultStream()) {
    Files.write(filePath, (Iterable<String>) tuples::iterator);
}

assertTrue(Files.exists(filePath));

long endNanos = System.nanoTime();

long lineCount;
try (Stream<String> stream = Files.lines(filePath)) {
    lineCount = stream.count();
}

LOGGER.info(
    "Fetched and saved [{}] records in [{}] ms",
    lineCount - 1,
    TimeUnit.NANOSECONDS.toMillis(
        endNanos - startNanos
    )
);
LOGGER.info(
    "File [{}] size is [{}]",
    fileName,
    FileUtils.byteCountToDisplaySize(Files.size(filePath))
);

Notice that we used the Hibernate-specific HINT_FETCH_SIZE query hint to set the statement fetch size, which allows us to stream the data from the database without running out of memory.

When executing the above integration test, we get the following results:

Fetched and saved [5000000] records in [15262] ms
File [post_and_comments.csv] size is [445 MB]

Streaming the result set and saving it to a file takes around 15.3 seconds.

Exporting a large result set using the PostgreSQL COPY command

PostgreSQL provides the COPY command, which we can use to export a query result set to a file.

First, we will create the following exportQueryResultSet method that allows us to export the result set of a given SQL query to the provided file path:

private int exportQueryResultSet(
        String query, 
        Path filePath) {
    return doInJPA(entityManager -> {
        return entityManager.createNativeQuery(
            String.format("""
                COPY (%s)
                TO '%s'
                WITH CSV HEADER
                """,
                query,
                filePath
            )
        )
        .executeUpdate();
    });
}

The PostgreSQL COPY command executes the provided query, formats the result set as CSV, and writes it to the specified file.

With the exportQueryResultSet method in place, we can export the result set of our SQL query like this:

long startNanos = System.nanoTime();

int copyCount = exportQueryResultSet("""
    SELECT
        p.id AS post_id,
        p.title AS post_title,
        pc.id AS comment_id,
        pc.review AS comment_review
    FROM post p
    INNER JOIN post_comment pc ON pc.post_id = p.id
    """,
    filePath
);

LOGGER.info(
    "Copy has exported [{}] records in [{}] ms",
    copyCount,
    TimeUnit.NANOSECONDS.toMillis(
        System.nanoTime() - startNanos
    )
);

assertTrue(Files.exists(filePath));
LOGGER.info(
    "File [{}] size is [{}]",
    fileName,
    FileUtils.byteCountToDisplaySize(Files.size(filePath))
);

When running the above test case, we get the following output:

Copy has exported [5000000] records in [7813] ms
File [post_and_comments.csv] size is [445 MB]

The export was done in 7.8 seconds, which is almost twice as fast as the previous streaming option because it does not incur the networking overhead of moving the data from the database to the application.

Awesome, right?
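
One caveat: COPY ... TO '<file>' is executed by the PostgreSQL server process, so the target path must be writable on the database server's filesystem, and the statement requires superuser privileges or membership in the pg_write_server_files role. If the CSV file needs to end up on the application side instead, the same COPY command can be streamed to the JDBC client using the PostgreSQL driver's CopyManager API. Here is a minimal sketch, assuming a plain JDBC Connection obtained from a dataSource and the filePath used earlier:

// Streams the COPY output to the JDBC client and writes it to a local file,
// instead of writing the file on the database server's filesystem.
try (Connection connection = dataSource.getConnection();
     Writer writer = Files.newBufferedWriter(filePath)) {

    CopyManager copyManager = connection
        .unwrap(PGConnection.class)
        .getCopyAPI();

    // copyOut returns the number of rows that were exported
    long rowCount = copyManager.copyOut("""
        COPY (
            SELECT p.id, p.title, pc.id, pc.review
            FROM post p
            INNER JOIN post_comment pc ON pc.post_id = p.id
        )
        TO STDOUT
        WITH CSV HEADER
        """,
        writer
    );
}

This trades some of the performance gain for flexibility, since the data has to travel over the network again, but it keeps the export driven by a single COPY statement.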

Conclusion

Exporting a large result set can be done either via streaming or via a database-specific export command, such as the PostgreSQL COPY command.

While streaming is a well-known option, the PostgreSQL COPY command can export the query result set to a file much faster since the data does not have to be transferred over the network from the database server to the application.
