Scaling a Spring application with a YugabyteDB cluster

Introduction

In this article, we are going to see how a YugabyteDB cluster makes it easy to scale the data access layer of a Spring application.

As I explained in this article, YugabyteDB is an open-source distributed SQL database that offers all the benefits of a typical relational database (e.g., SQL, strong consistency, ACID transactions) with the advantages of a globally-distributed auto-sharded database system (e.g., NoSQL document databases).

How to create a local YugabyteDB cluster

As I showed in this article, creating a single-node YugabyteDB Docker container is extremely easy.

Configuring a multi-node YugabyteDB cluster out of several Docker containers is just as easy.

First, we will create a Docker network that will be used by our YugabyteDB cluster:

docker network create yugabyte-network

Afterward, we will create the first YugabyteDB node:

docker run -d --name yugabyte-replica1 --net=yugabyte-network ^
 -p7001:7000 -p9000:9000 -p5433:5433 ^
 yugabytedb/yugabyte:latest bin/yugabyted start ^
 --base_dir=/tmp/yugabyte ^
 --daemon=false

The caret (i.e., ^) is the line-continuation character of the Windows command prompt. I used it here to tell my local Windows OS to continue the command on the next line instead of executing it when reaching the end of the line.

If you’re using Linux or Mac OS, then replace ^ with the \ line-continuation character.

After creating the first node, we can add two more nodes so that we will end up with a three-node cluster:

docker run -d --name yugabyte-replica2 --net=yugabyte-network ^
 yugabytedb/yugabyte:latest bin/yugabyted start ^
 --base_dir=/tmp/yugabyte ^
 --join=yugabyte-replica1 ^
 --daemon=false
 
docker run -d --name yugabyte-replica3 --net=yugabyte-network ^
 yugabytedb/yugabyte:latest bin/yugabyted start ^
 --base_dir=/tmp/yugabyte ^
 --join=yugabyte-replica1 ^
 --daemon=false

That’s it!

If we open the YugabyteDB Admin web server UI, which is available on port 7001 of the localhost because of the -p7001:7000 parameter we passed to docker run when creating the yugabyte-replica1 node, we can see our newly created 3-node cluster:

YugabyteDB Cluster

Configuring Spring to use the YugabyteDB cluster

While you can use the PostgreSQL JDBC Driver to connect to a single-node YugabyteDB database, if you have a cluster of nodes, it’s much better to use the YugabyteDB-specific JDBC Driver, which you can get from Maven Central:

<dependency>
    <groupId>com.yugabyte</groupId>
    <artifactId>jdbc-yugabytedb</artifactId>
    <version>${yugabytedb.version}</version>
</dependency>

As I explained in this article, it’s very important to use a connection pool when you are connecting to any database system, and YugabyteDB is no different.

Therefore, our DataSource configuration is going to look as follows:

@Bean(destroyMethod = "close")
public DataSource dataSource() {
    YBClusterAwareDataSource dataSource = new YBClusterAwareDataSource();
    dataSource.setURL(url());
    dataSource.setUser(username());
    dataSource.setPassword(password());
    
    HikariConfig hikariConfig = new HikariConfig();
    hikariConfig.setMaximumPoolSize(maxConnections);
    hikariConfig.setAutoCommit(false);
    hikariConfig.setDataSource(dataSource);
    
    return new HikariDataSource(hikariConfig);
}

private String url() {
    return String.format(
        "jdbc:yugabytedb://%s:%d/%s?load-balance=true",
        host,
        port,
        database
    );
}

Notice that we are wrapping the YBClusterAwareDataSource into a HikariDataSource so that we can allow HikariCP to manage the YugabyteDB physical connections.
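To make the JDBC URL format concrete, here is a minimal, standalone sketch of what the url() method produces. The YugabyteJdbcUrl class is a hypothetical helper, and the values are assumptions: the 127.0.0.1 host and the 5433 port match the -p5433:5433 mapping used when the first node was created, while yugabyte is assumed to be the database name.

```java
import java.util.Locale;

// Hypothetical helper used only for illustration; it is not part of the
// Spring configuration shown above.
final class YugabyteJdbcUrl {

    // Builds the same URL format as the url() method above.
    // Locale.ROOT keeps the port digits independent of the default locale.
    static String build(String host, int port, String database) {
        return String.format(
            Locale.ROOT,
            "jdbc:yugabytedb://%s:%d/%s?load-balance=true",
            host,
            port,
            database
        );
    }

    public static void main(String[] args) {
        // Assumed values: local Docker port mapping and the "yugabyte" database
        System.out.println(build("127.0.0.1", 5433, "yugabyte"));
        // prints jdbc:yugabytedb://127.0.0.1:5433/yugabyte?load-balance=true
    }
}
```

The load-balance=true parameter is the only part that differs from a typical PostgreSQL-style URL, apart from the yugabytedb subprotocol.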

Spring batch processing task

To demonstrate how the YugabyteDB cluster works, let’s build a Spring batch processing task.

Let’s assume we have a Post entity that’s mapped as follows:

@Entity
@Table(name = "post")
public class Post {

    @Id
    @GeneratedValue
    private Long id;

    private String title;

    @Enumerated(EnumType.ORDINAL)
    private PostStatus status;

    public Long getId() {
        return id;
    }

    public Post setId(Long id) {
        this.id = id;
        return this;
    }

    public String getTitle() {
        return title;
    }

    public Post setTitle(String title) {
        this.title = title;
        return this;
    }

    public PostStatus getStatus() {
        return status;
    }

    public Post setStatus(PostStatus status) {
        this.status = status;
        return this;
    }
}

We will define the PostRepository that will provide the data access methods for the Post entity:

@Repository
public interface PostRepository extends BaseJpaRepository<Post, Long> {
}

We also need to provide the @EnableJpaRepositories configuration that tells Spring where the Spring Data Repositories are located and which class implements the BaseJpaRepository interface:

@EnableJpaRepositories(
    value = "com.vladmihalcea.book.hpjp.spring.batch.repository",
    repositoryBaseClass = BaseJpaRepositoryImpl.class
)

Note that the PostRepository extends the BaseJpaRepository from the Hypersistence Utils open-source project and not the default JpaRepository from Spring Data JPA.

While the JpaRepository is a popular choice, few Java developers are aware of its problems. Not only does it provide a save method that doesn’t always translate to the proper JPA operation, but inheriting the findAll method in every single Repository, even the ones that can have hundreds of millions of records in the associated database tables, is a terrible idea.

The PostRepository is used by the ForumService, which defines a createPosts method that can save a large volume of Post records using multiple threads and JDBC-level batching:

@Service
@Transactional(readOnly = true)
public class ForumService {

    private static final Logger LOGGER = LoggerFactory.getLogger(
        ForumService.class
    );

    private static final ExecutorService executorService = Executors
        .newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );

    private final PostRepository postRepository;

    private final TransactionTemplate transactionTemplate;

    private final int batchProcessingSize;

    public ForumService(
        @Autowired PostRepository postRepository,
        @Autowired TransactionTemplate transactionTemplate,
        @Autowired int batchProcessingSize) {
        this.postRepository = postRepository;
        this.transactionTemplate = transactionTemplate;
        this.batchProcessingSize = batchProcessingSize;
    }

    @Transactional(propagation = Propagation.NEVER)
    public void createPosts(List<Post> posts) {
        CollectionUtils.spitInBatches(posts, batchProcessingSize)
            .map(postBatch -> executorService.submit(() -> {
                try {
                    transactionTemplate.execute((status) -> 
                        postRepository.persistAll(postBatch)
                    );
                } catch (TransactionException e) {
                    LOGGER.error("Batch transaction failure", e);
                }
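            }))
            // Materialize the futures first; a lazy sequential Stream would
            // otherwise submit one batch and wait for it before the next one
            .toList()
            .forEach(future -> {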
                try {
                    future.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (ExecutionException e) {
                    LOGGER.error("Batch execution failure", e);
                }
            });
    }
}

There are several aspects worth mentioning about the ForumService class:

  • The createPosts method uses @Transactional(propagation = Propagation.NEVER) because transaction management is handled by each batch processing task.
  • The CollectionUtils.spitInBatches is used to split a List<T> into a Stream<List<T>> where each Stream element has at most batchProcessingSize elements.
  • The Post entities are persisted using the persistAll method from the BaseJpaRepository.
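The splitting and waiting logic can be sketched with plain JDK classes. This is only an approximation under stated assumptions: BatchSplitSketch, splitInBatches, and processConcurrently are hypothetical names, not the CollectionUtils utility used by the ForumService, and the task body merely counts elements where the real service would run a transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical stand-in, not the CollectionUtils class used by the ForumService.
final class BatchSplitSketch {

    // Splits the list into consecutive chunks of at most batchSize elements.
    static <T> Stream<List<T>> splitInBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batchCount = (items.size() + batchSize - 1) / batchSize;
        return IntStream.range(0, batchCount)
            .mapToObj(i -> items.subList(
                i * batchSize,
                Math.min(items.size(), (i + 1) * batchSize)
            ));
    }

    // Submits one task per batch, then waits for all tasks to complete.
    // Returns the number of elements handled by the tasks.
    static <T> int processConcurrently(List<T> items, int batchSize, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        try {
            // Submit every batch before waiting on any of the futures
            List<Future<?>> futures = new ArrayList<>();
            splitInBatches(items, batchSize).forEach(batch ->
                futures.add(executor.submit(() -> {
                    // Stand-in for the per-batch transaction
                    processed.addAndGet(batch.size());
                }))
            );
            for (Future<?> future : futures) {
                future.get();
            }
            return processed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Batch failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // prints [1, 2, 3, 4] then [5, 6, 7, 8] then [9, 10]
        splitInBatches(ids, 4).forEach(System.out::println);
        // prints 10
        System.out.println(processConcurrently(ids, 4, 2));
    }
}
```

Collecting the futures into a list before calling get matters because a sequential Stream is lazy: chaining the waiting step directly onto the submitting step would handle one batch at a time.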

Since Hibernate does not use JDBC-batching by default, we will set the following Hibernate configurations in the Java-based Spring Bean configuration:

protected Properties additionalProperties() {
    Properties properties = new Properties();
    
    properties.setProperty(
        "hibernate.jdbc.batch_size", 
        String.valueOf(batchProcessingSize())
    );
    properties.setProperty(
        "hibernate.order_inserts", 
        "true"
    );
    properties.setProperty(
        "hibernate.order_updates", 
        "true"
    );
    
    return properties;
}

Testing time

When sending a large chunk of Post entities to the createPosts method:

List<Post> posts = LongStream.rangeClosed(1, POST_COUNT)
    .mapToObj(postId -> new Post()
        .setTitle(
            String.format("High-Performance Java Persistence - Page %d",
                postId
            )
        )
        .setStatus(PostStatus.PENDING)
    )
    .toList();

forumService.createPosts(posts);

In the YugabyteDB Tablet Servers view, the Write ops/sec column indicates that all three YugabyteDB nodes are used equally to save the post table records:

YugabyteDB Cluster Tablet Servers Writes

So, the write operations are propagated to all nodes, not just to a single Primary node, as is the case with the Single-Primary Replication strategy.

This is because YugabyteDB employs a shared-nothing architecture where each node is the primary for a given data subset. For instance, the first node can be the owner of Post with the id value of 1 and handle read and write operations associated with this record.

The other two nodes will keep a copy of this Post with the id value of 1 for high-availability purposes. At the same time, the second node can be the owner of the Post record with the id value of 2, and the other two nodes will be keeping a copy of this particular record.

Therefore, the YugabyteDB cluster is not used only for writing. Because the data is automatically sharded, the three nodes are also used when executing read operations.
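The ownership idea can be illustrated with a toy model. Please note that this is not YugabyteDB’s actual sharding or replication algorithm: the ToyShardingSketch class and its modulo-based placement are assumptions made only to show how every node can be the owner of some records while holding copies of the others.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not YugabyteDB's actual sharding or replication algorithm.
final class ToyShardingSketch {

    private final List<String> nodes;

    ToyShardingSketch(List<String> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    // The node that owns the record and serves its reads and writes.
    String ownerOf(long id) {
        return nodes.get(ownerIndex(id));
    }

    // The remaining nodes, which only keep a copy for high availability.
    List<String> replicasOf(long id) {
        List<String> replicas = new ArrayList<>(nodes);
        replicas.remove(ownerIndex(id));
        return replicas;
    }

    // Assumed placement rule: id 1 goes to the first node, id 2 to the second, etc.
    private int ownerIndex(long id) {
        return (int) Math.floorMod(id - 1, (long) nodes.size());
    }

    public static void main(String[] args) {
        ToyShardingSketch cluster = new ToyShardingSketch(List.of(
            "yugabyte-replica1", "yugabyte-replica2", "yugabyte-replica3"
        ));
        System.out.println(cluster.ownerOf(1));    // prints yugabyte-replica1
        System.out.println(cluster.ownerOf(2));    // prints yugabyte-replica2
        System.out.println(cluster.replicasOf(1)); // prints [yugabyte-replica2, yugabyte-replica3]
    }
}
```

With such a placement, consecutive identifiers land on different owners, so both reads and writes end up spread across all three nodes.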

For instance, if we fetch 1000 Post entities:

LongStream.rangeClosed(1, 1000)
    .boxed()
    .forEach(id ->
        assertNotNull(forumService.findById(id))
    );

We can see in the Tablet Servers view that all nodes were used when fetching these records:

YugabyteDB Cluster Tablet Servers Reads

Cool, right?

Conclusion

Most often, a Spring application uses a Single-Primary replicated relational database since this has been the typical replication architecture employed by PostgreSQL, MySQL, Oracle, or SQL Server.

However, scaling a single-primary replicated database requires a read-write and read-only transaction routing strategy. For this reason, reads will be scaled horizontally on Replica nodes, while writes can only be scaled vertically on the Primary node.
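For comparison, the routing decision required by a single-primary setup can be sketched as follows. The SinglePrimaryRouterSketch class and the node names are hypothetical; in a real Spring application, this decision would usually be made by a routing DataSource rather than by a hand-written class like this one.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical router used for illustration; node names are placeholders.
final class SinglePrimaryRouterSketch {

    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger nextReplica = new AtomicInteger();

    SinglePrimaryRouterSketch(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = List.copyOf(replicas);
    }

    // Read-write transactions always go to the Primary node,
    // while read-only transactions rotate across the Replica nodes.
    String route(boolean readOnly) {
        if (!readOnly || replicas.isEmpty()) {
            return primary;
        }
        int index = Math.floorMod(nextReplica.getAndIncrement(), replicas.size());
        return replicas.get(index);
    }

    public static void main(String[] args) {
        SinglePrimaryRouterSketch router = new SinglePrimaryRouterSketch(
            "primary", List.of("replica1", "replica2")
        );
        System.out.println(router.route(false)); // prints primary
        System.out.println(router.route(true));  // prints replica1
        System.out.println(router.route(true));  // prints replica2
    }
}
```

No matter how many Replica nodes are added, every read-write transaction still ends up on the same Primary node.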

A YugabyteDB cluster doesn’t have this limitation since it can scale both the read and write operations across the entire cluster, making it easier for you to scale a Spring application.

This research was funded by Yugabyte and conducted in accordance with the blog ethics policy.

While the article was written independently and reflects entirely my opinions and conclusions, the amount of work involved in making this article happen was compensated by Yugabyte.

16 Comments on “Scaling a Spring application with a YugabyteDB cluster”

  1. Hi Vlad, Rafael,

    The “smart driver” (YBClusterAwareDataSource) is actually smarter than that 😉
    There’s no need for a load balancer (and YB-Master is not involved there). It is all client-side.
    The JDBC endpoint (127.0.0.1:5433 here) is any of the T-Servers. When load-balance=true, what the driver does is use this connection to ask for the list of nodes by querying yb_servers(), and then it picks one at random (taking care to balance the load) to return to the app. The list of servers is cached and refreshed every 5 minutes. So this is transparent to the application, doesn’t need an additional load balancer, and doesn’t have to go through a proxy.

    If you add &loggerLevel=TRACE to the URL you can see this when it gets the list.

    • Thanks, Franck.

      If I understood you correctly, this intelligence happens when we use the YBClusterAwareDataSource (with its JDBC driver), but what happens when using PostgreSQL’s Driver or DataSource?

      • Yes, with the PostgreSQL Driver you need an HA proxy. For example, a headless service in Kubernetes.

  2. Vlad, thanks for the series (indeed so much that you offer). I’ve read all the posts (on yugabyte), and besides its own strengths, you’ve stressed how easily it can be used in place of postgres. That’s certainly compelling–and again you’ve stressed how it even trumps pg in some ways.

    But have I missed any discussion of this: how easily can a pg db be converted to being a yb one? Or might it be as easy as that yb can use a pg db, unchanged? Or is it perhaps just a simple dump out of pg and a restore into yb?

    (You did discuss in one article how index definition needed to change, so perhaps it’s not quite as plug-and-play as some might hope.)

    I hope you don’t mind me asking here, rather than trying to find the answer elsewhere. This seemed a topic that perhaps other readers here would appreciate. Might it even warrant its own post? Perhaps you already had it in mind. 🙂

    Apologies if somehow you help me see something foolish in the whole question.

    • YugabyteDB uses the PostgreSQL wire protocol, so it’s compatible in relation to DML statements.

      There are DDL or other configuration changes due to the fact that the underlying storage engine is different, so it’s not a drop-in replacement for PostgreSQL, but it’s much easier to use it than to migrate to Google Spanner or Cosmos DB.

      You will need to have integration tests to prove that most of your business use cases work fine on YugabyteDB.

      While I’ve just started using YugabyteDB, Franck Pachot and Denis Magda are experts, so they can surely help you answer your questions.

    • Hi Charlie,

      Let me expand on the compatibility question.

      When we talk about Postgres compatibility, we should define several compatibility levels:

      1) Wire-level compatibility – a database serializes and deserializes commands that are exchanged with clients (apps, tools) using the Postgres protocol. If the database is wire-level compatible with Postgres, then you should be able to connect to it with psql or standard drivers.

      2) Syntax-level compatibility – a database supports Postgres SQL dialect. You can execute the Postgres version of DDL and DML.

      3) Feature-level compatibility – a database implements Postgres features such as triggers, stored procedures, materialized views, and more.

      4) Runtime-level compatibility – a database matches PostgreSQL execution semantics at runtime. More specifically, runtime-compatible databases should support queries to the system catalog, error messages, and error codes. The runtime-compatible databases usually work and look like Postgres.

      Also, if a database is feature-compatible, then it’s assumed that it’s also syntax- and wire-compatible. If it’s runtime-compatible, then it’s also feature-compatible with Postgres.

      YugabyteDB is runtime-compatible with PostgreSQL, which makes it easy to migrate to YugabyteDB without a full application rewrite. Plus, it means that most of the drivers, libraries, tools, and frameworks created for Postgres would also work for YugabyteDB.

      Finally, if we say that some database is feature-compatible with Postgres, it doesn’t mean that it’s 100% compatible. The only 100% compatible database with Postgres is Postgres! Instead, we should talk about low, moderate or high compatibility. For instance, it might mean that that database is:
      * 100% compatible at the wire level
      * 80% compatible at the syntax level (because some commands might not be supported yet)
      * 70% compatible at the feature level (some Postgres features are missing)

      YugabyteDB is highly compatible with Postgres.

      Check this YouTube series for a hands-on explanation of different compatibility levels:

      • Thanks, Vlad and Denis. That all makes sense, and I appreciate the clarification–indeed the elaboration, and all the more video, to boot. 🙂 It all “sings to my heart”, in this era of Twitter/Slack-level “answers”, and makes me all the more encouraged to explore yb!

  3. Hi Vlad,

    Thank you for another great article.

    Is it reasonable/feasible to use this setup (the containerized multi-node YugabyteDB cluster) in a production environment?

    Thank you.

  4. Hi Vlad, great article. Thanks for sharing 👏🏻👏🏻

    One question, in this URL "jdbc:yugabytedb://%s:%d/%s?load-balance=true", which host and port did you use?

    • In my case, I used:

      private String host = "127.0.0.1";
      private int port = 5433;
      

      But, it all depends on how you configure the Docker container.

      • I got it, thanks!

        So this instance (127.0.0.1:5433) is responsible for load balancing the transactions among all those replicas, right?

        That’s nice because I thought we had to configure all replicas in the JDBC URL or let an external proxy take care of it.

      • Yes, that’s the location of the YB-Master, which redirects the calls to the cluster of Tablet Servers. The first node I created is both the YB-Master leader and a tablet server, but if it fails, then another node will become the new YB-Master leader.
