Scaling a Spring application with a YugabyteDB cluster
Introduction
In this article, we are going to see that scaling the data access layer of a Spring application can be done very easily with a YugabyteDB cluster.
As I explained in this article, YugabyteDB is an open-source distributed SQL database that offers all the benefits of a typical relational database (e.g., SQL, strong consistency, ACID transactions) with the advantages of a globally-distributed auto-sharded database system (e.g., NoSQL document databases).
How to create a local YugabyteDB cluster
As I showed in this article, creating a single-node YugabyteDB Docker container is extremely easy.
However, configuring a multi-node YugabyteDB cluster with Docker is just as easy.
First, we will create a Docker network that will be used by our YugabyteDB cluster:
docker network create yugabyte-network
Afterward, we will create the first YugabyteDB node:
docker run -d --name yugabyte-replica1 --net=yugabyte-network ^
  -p7001:7000 -p9000:9000 -p5433:5433 ^
  yugabytedb/yugabyte:latest bin/yugabyted start ^
  --base_dir=/tmp/yugabyte ^
  --daemon=false
The caret (^) operator I used here instructs my local Windows command line to continue the command on the next line instead of executing it when reaching the end of the line. If you’re using Linux or macOS, replace ^ with the backslash (\) line-continuation operator.
After creating the first node, we can add two more nodes so that we will end up with a three-node cluster:
docker run -d --name yugabyte-replica2 --net=yugabyte-network ^
  yugabytedb/yugabyte:latest bin/yugabyted start ^
  --base_dir=/tmp/yugabyte ^
  --join=yugabyte-replica1 ^
  --daemon=false

docker run -d --name yugabyte-replica3 --net=yugabyte-network ^
  yugabytedb/yugabyte:latest bin/yugabyted start ^
  --base_dir=/tmp/yugabyte ^
  --join=yugabyte-replica1 ^
  --daemon=false
That’s it!
If we open the YugabyteDB Admin web UI, which, in our case, is available on port 7001 on localhost because of the -p7001:7000 parameter we provided to docker run when the yugabyte-replica1 node was created, we will be able to see our newly created three-node cluster:
Configuring Spring to use the YugabyteDB cluster
While you can use the PostgreSQL JDBC Driver to connect to a single-node YugabyteDB database, if you have a cluster of nodes, it’s much better to use the YugabyteDB-specific JDBC Driver, which you can get from Maven Central:
<dependency>
    <groupId>com.yugabyte</groupId>
    <artifactId>jdbc-yugabytedb</artifactId>
    <version>${yugabytedb.version}</version>
</dependency>
As I explained in this article, it’s very important to use a connection pool when you are connecting to any database system, and YugabyteDB is no different.
Therefore, our DataSource configuration is going to look as follows:
@Bean(destroyMethod = "close")
public DataSource dataSource() {
    YBClusterAwareDataSource dataSource = new YBClusterAwareDataSource();
    dataSource.setURL(url());
    dataSource.setUser(username());
    dataSource.setPassword(password());

    HikariConfig hikariConfig = new HikariConfig();
    hikariConfig.setMaximumPoolSize(maxConnections);
    hikariConfig.setAutoCommit(false);
    hikariConfig.setDataSource(dataSource);

    return new HikariDataSource(hikariConfig);
}

private String url() {
    return String.format(
        "jdbc:yugabytedb://%s:%d/%s?load-balance=true",
        host,
        port,
        database
    );
}
Notice that we are wrapping the YBClusterAwareDataSource into a HikariDataSource so that HikariCP can manage the YugabyteDB physical connections.
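If you want to sanity-check this wiring, a minimal sketch along these lines can be used; the DataSourceSmokeTest class name and the SELECT version() query are illustrative, not part of the original configuration:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import javax.sql.DataSource;

public class DataSourceSmokeTest {

    // Verifies that HikariCP hands out working connections whose underlying
    // physical connections were opened by the YBClusterAwareDataSource.
    public static void verify(DataSource dataSource) throws Exception {
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery("SELECT version()")) {
            if (resultSet.next()) {
                System.out.println(resultSet.getString(1));
            }
            // The pool was configured with auto-commit disabled,
            // so end the transaction explicitly.
            connection.commit();
        }
    }
}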
Spring batch processing task
To demonstrate how the YugabyteDB cluster works, let’s build a Spring batch processing task.
Consider we have a Post entity that’s mapped as follows:
@Entity
@Table(name = "post")
public class Post {

    @Id
    @GeneratedValue
    private Long id;

    private String title;

    @Enumerated(EnumType.ORDINAL)
    private PostStatus status;

    public Long getId() {
        return id;
    }

    public Post setId(Long id) {
        this.id = id;
        return this;
    }

    public String getTitle() {
        return title;
    }

    public Post setTitle(String title) {
        this.title = title;
        return this;
    }

    public PostStatus getStatus() {
        return status;
    }

    public Post setStatus(PostStatus status) {
        this.status = status;
        return this;
    }
}
We will define the PostRepository that will provide the data access methods for the Post entity:
@Repository
public interface PostRepository extends BaseJpaRepository<Post, Long> {
}
We also need to provide the @EnableJpaRepositories configuration that tells Spring where the Spring Data repositories are located, as well as the implementation of the BaseJpaRepository interface:

@EnableJpaRepositories(
    value = "com.vladmihalcea.book.hpjp.spring.batch.repository",
    repositoryBaseClass = BaseJpaRepositoryImpl.class
)
Note that the PostRepository extends the BaseJpaRepository from the Hypersistence Utils open-source project and not the default JpaRepository from Spring Data JPA.

While the JpaRepository is a popular choice, few Java developers are aware of its problems. Not only does it provide a save method that doesn’t always translate to the proper JPA operation, but inheriting the findAll method in every single Repository, even the ones whose associated database tables can have hundreds of millions of records, is a terrible idea.
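To make the difference concrete, here is a hedged sketch of the explicit alternative; the persist method below comes from BaseJpaRepository, while the commented-out save call illustrates the default Spring Data behavior:

// With BaseJpaRepository, the JPA operation is explicit:
postRepository.persist(
    new Post()
        .setTitle("High-Performance Java Persistence")
        .setStatus(PostStatus.PENDING)
);

// With the default JpaRepository, save() silently chooses between
// an EntityManager persist and merge call. Merging a detached entity
// triggers an extra SELECT before the INSERT or UPDATE is issued.
// postRepository.save(post);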
The PostRepository is used by the ForumService, which defines a createPosts method that can save a large volume of Post records using multiple threads and JDBC-level batching:
@Service
@Transactional(readOnly = true)
public class ForumService {

    private static final Logger LOGGER = LoggerFactory.getLogger(
        ForumService.class
    );

    private static final ExecutorService executorService = Executors
        .newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );

    private final PostRepository postRepository;

    private final TransactionTemplate transactionTemplate;

    private final int batchProcessingSize;

    public ForumService(
            @Autowired PostRepository postRepository,
            @Autowired TransactionTemplate transactionTemplate,
            @Autowired int batchProcessingSize) {
        this.postRepository = postRepository;
        this.transactionTemplate = transactionTemplate;
        this.batchProcessingSize = batchProcessingSize;
    }

    @Transactional(propagation = Propagation.NEVER)
    public void createPosts(List<Post> posts) {
        CollectionUtils.spitInBatches(posts, batchProcessingSize)
            .map(postBatch -> executorService.submit(() -> {
                try {
                    transactionTemplate.execute((status) ->
                        postRepository.persistAll(postBatch)
                    );
                } catch (TransactionException e) {
                    LOGGER.error("Batch transaction failure", e);
                }
            }))
            .forEach(future -> {
                try {
                    future.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (ExecutionException e) {
                    LOGGER.error("Batch execution failure", e);
                }
            });
    }
}
There are several aspects worth mentioning about the ForumService class:
- The createPosts method uses @Transactional(propagation = Propagation.NEVER) because the transaction management is handled by each batch processing task.
- The CollectionUtils.spitInBatches method is used to split a List<T> into a Stream<List<T>> where each Stream element has at most batchProcessingSize elements (see the sketch after this list).
- The Post entities are persisted using the persistAll method from the BaseJpaRepository.
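The actual spitInBatches implementation ships with Hypersistence Utils; purely as an illustration of the idea (this is not the library’s code), such a helper could be sketched like this:

import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public final class BatchSplitter {

    // Illustrative sketch only; Hypersistence Utils provides the real one.
    public static <T> Stream<List<T>> splitInBatches(List<T> list, int batchSize) {
        int batchCount = (list.size() + batchSize - 1) / batchSize;
        return IntStream.range(0, batchCount)
            .mapToObj(batch -> list.subList(
                batch * batchSize,
                Math.min((batch + 1) * batchSize, list.size())
            ));
    }
}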
Since Hibernate does not use JDBC batching by default, we will set the following Hibernate configuration properties in the Java-based Spring bean configuration:
protected Properties additionalProperties() {
    Properties properties = new Properties();
    properties.setProperty(
        "hibernate.jdbc.batch_size",
        String.valueOf(batchProcessingSize())
    );
    properties.setProperty(
        "hibernate.order_inserts",
        "true"
    );
    properties.setProperty(
        "hibernate.order_updates",
        "true"
    );
    return properties;
}
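With hibernate.jdbc.batch_size set, Hibernate groups the generated INSERT statements using the JDBC batching API, conceptually like the plain-JDBC sketch below; the table and columns match the Post mapping above, and the snippet only illustrates the mechanism. Note that insert batching also requires an identifier generator that assigns ids upfront, such as SEQUENCE, since the IDENTITY strategy disables it:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

import javax.sql.DataSource;

public class JdbcBatchingSketch {

    public static void insertPosts(DataSource dataSource, List<Post> posts)
            throws Exception {
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(
                 "INSERT INTO post (id, title, status) VALUES (?, ?, ?)")) {
            for (Post post : posts) {
                statement.setLong(1, post.getId());
                statement.setString(2, post.getTitle());
                statement.setInt(3, post.getStatus().ordinal());
                // Statements are buffered on the client side...
                statement.addBatch();
            }
            // ...and flushed to the database in a single round trip.
            statement.executeBatch();
            connection.commit();
        }
    }
}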
Testing time
When sending a large chunk of Post entities to the createPosts method:
List<Post> posts = LongStream.rangeClosed(1, POST_COUNT)
    .mapToObj(postId -> new Post()
        .setTitle(
            String.format(
                "High-Performance Java Persistence - Page %d",
                postId
            )
        )
        .setStatus(PostStatus.PENDING)
    )
    .toList();

forumService.createPosts(posts);
In the YugabyteDB Tablet Servers view, the Write ops/sec columns indicate that all three YugabyteDB nodes are used equally to save the post table records:
So, the write operations are propagated to all nodes, not just to a single Primary node, as is the case with the Single-Primary Replication strategy.
This is because YugabyteDB employs a shared-nothing architecture where each node is the primary for a given data subset. For instance, the first node can be the owner of the Post with the id value of 1 and handle the read and write operations associated with this record.

The other two nodes will keep a copy of this Post with the id value of 1 for high-availability purposes. At the same time, the second node can be the owner of the Post record with the id value of 2, and the other two nodes will keep a copy of that particular record.
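To build some intuition about record ownership, here is a toy sketch of hash-based routing. It is a simplification, not YugabyteDB’s actual algorithm, which hashes the partition key onto tablets that are replicated across the nodes:

import java.util.List;

public class ShardRoutingSketch {

    // Toy model of the three-node cluster created earlier.
    private static final List<String> NODES = List.of(
        "yugabyte-replica1",
        "yugabyte-replica2",
        "yugabyte-replica3"
    );

    // The node acting as the owner of the record with the given id.
    public static String ownerOf(long postId) {
        return NODES.get(Math.floorMod(Long.hashCode(postId), NODES.size()));
    }

    public static void main(String[] args) {
        // Both reads and writes for a given id are served by its owner,
        // so the load spreads across all nodes as the ids vary.
        for (long id = 1; id <= 6; id++) {
            System.out.println("Post " + id + " -> " + ownerOf(id));
        }
    }
}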
Therefore, the YugabyteDB cluster is not used only for writing. All three nodes are also used when executing read operations because the data is automatically sharded.
For instance, if we fetch 1000 Post entities:
LongStream.rangeClosed(1, 1000)
    .boxed()
    .forEach(id ->
        assertNotNull(
            forumService.findById(id)
        )
    );
We can see in the Tablet Servers view that all the nodes were used when fetching these records:
Cool, right?
Conclusion
Most often, a Spring application uses a Single-Primary replicated relational database since this has been the typical replication architecture employed by PostgreSQL, MySQL, Oracle, or SQL Server.
However, scaling a single-primary replicated database requires a read-write and read-only transaction routing strategy. With such a routing strategy, reads can be scaled horizontally on the Replica nodes, while writes can only be scaled vertically on the Primary node.
A YugabyteDB cluster doesn’t have this limitation since it can scale both the read and write operations across the entire cluster, making it easier for you to scale a Spring application.
This research was funded by Yugabyte and conducted in accordance with the blog ethics policy.
While the article was written independently and reflects entirely my opinions and conclusions, the amount of work involved in making this article happen was compensated by Yugabyte.

Hi Vlad, Rafael,
The “smart driver” (YBClusterAwareDataSource) is actually smarter than that 😉
There’s no need for a load balancer (and YB-Master is not involved there). It is all client-side.
The JDBC endpoint (127.0.0.1:5433 here) is any of the T-Servers. When load-balance=true, what the driver does is use this connection to ask for the list of nodes, by querying yb_servers(), and it takes one at random (taking care to load balance) to return to the app. The list of servers is cached and refreshed every 5 minutes. So this is transparent to the application, doesn’t need an additional load balancer, and doesn’t have to go through a proxy.
If you add &loggerLevel=TRACE to the URL you can see this when it gets the list.
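For illustration, that node list can also be queried directly over JDBC; in the sketch below, the connection URL and the yugabyte/yugabyte credentials are assumptions matching the Docker setup from the article, and the host column name may vary between versions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ListClusterNodes {

    public static void main(String[] args) throws Exception {
        // Connect to any node; yb_servers() returns the cluster member list
        // that the smart driver uses for client-side load balancing.
        try (Connection connection = DriverManager.getConnection(
                 "jdbc:yugabytedb://localhost:5433/yugabyte",
                 "yugabyte", "yugabyte");
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(
                 "SELECT * FROM yb_servers()")) {
            while (resultSet.next()) {
                System.out.println(resultSet.getString("host"));
            }
        }
    }
}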
Thanks, Franck.
If I understood you correctly, this intelligence happens when we use the YBClusterAwareDataSource (with its JDBC driver), but what happens when using PostgreSQL’s Driver or DataSource?

The YugabyteDB Driver can do the load balancing. The PostgreSQL Driver cannot do that.
Yes, with the PostgreSQL Driver you need an HA proxy. For example, a headless service in Kubernetes.
Vlad, thanks for the series (indeed so much that you offer). I’ve read all the posts (on yugabyte), and besides its own strengths, you’ve stressed how easily it can be used in place of postgres. That’s certainly compelling, and again you’ve stressed how it even trumps pg in some ways.
But have I missed any discussion of this: how easily can a pg db be converted to being a yb one? Or might it be as easy as that yb can use a pg db, unchanged? Or is it perhaps just a simple dump out of pg and a restore into yb?
(You did discuss in one article how index definition needed to change, so perhaps it’s not quite as plug-and-play as some might hope.)
I hope you don’t mind me asking here, rather than trying to find the answer elsewhere. This seemed a topic that perhaps other readers here would appreciate. Might it even warrant its own post? Perhaps you already had it in mind. 🙂
Apologies if somehow you help me see something foolish in the whole question.
YugabyteDB uses the PostgreSQL wire protocol, so it’s compatible in relation to DML statements.
There are DDL or other configuration changes due to the fact that the underlying storage engine is different, so it’s not a drop-in replacement for PostgreSQL, but it’s much easier to use it than to migrate to Google Spanner or Cosmos DB.
You will need to have integration tests to prove that most of your business use cases work fine on YugabyteDB.
While I’ve just started using YugabyteDB, Franck Pachot and Denis Magda are experts, so they can surely help you answer your questions.
Hi Charlie,
Let me expand on the compatibility question.
When we talk about Postgres compatibility, we should define several compatibility levels:
1) Wire-level compatibility – a database serializes and deserializes commands that are exchanged with clients (apps, tools) using the Postgres protocol. If the database is wire-level compatible with Postgres, then you should be able to connect to it with psql or standard drivers.
2) Syntax-level compatibility – a database supports Postgres SQL dialect. You can execute the Postgres version of DDL and DML.
3) Feature-level compatibility – a database implements Postgres features such as triggers, stored procedures, materialized views, and more.
4) Runtime-level compatibility – a database matches PostgreSQL execution semantics at runtime. More specifically, runtime-compatible databases should support queries to the system catalog, error messages, and error codes. The runtime-compatible databases usually work and look like Postgres.
Also, if a database is feature-compatible, then it’s assumed that it’s also syntax- and wire-compatible. If it’s runtime-compatible, then it’s also feature-compatible with Postgres.
YugabyteDB is runtime-compatible with PostgreSQL, which makes it easy to migrate to YugabyteDB without a full application rewrite. Plus, it means that most of the drivers, libraries, tools, and frameworks created for Postgres would also work for YugabyteDB.
Finally, if we say that some database is feature-compatible with Postgres, it doesn’t mean that it’s 100% compatible. The only database that’s 100% compatible with Postgres is Postgres! Instead, we should talk about low, moderate, or high compatibility. For instance, it might mean that the database is:
* 100% compatible at the wire level
* 80% compatible at the syntax level (because some commands might not be supported yet)
* 70% compatible at the feature level (some Postgres features are missing)
YugabyteDB is highly compatible with Postgres.
Check this YouTube series for a hands-on explanation of the different compatibility levels.
Thanks, Vlad and Denis. That all makes sense, and I appreciate the clarification–indeed the elaboration, and all the more video, to boot. 🙂 It all “sings to my heart”, in this era of Twitter/Slack-level “answers”, and makes me all the more encouraged to explore yb!
Hi Vlad,
Thank you for another great article.
Is it reasonable/feasible to use this setup (the containerized multi-node YugabyteDB cluster) in a production environment?
Thank you.
Yes, it should be fine to run it in production, as illustrated by their list of clients.
Thanks for all explanation, Vlad. 👊🏻
You’re welcome.
Hi Vlad, great article. Thanks for sharing 👏🏻👏🏻
One question: in this URL, "jdbc:yugabytedb://%s:%d/%s?load-balance=true", which host and port did you use?

In my case, I used:
But, it all depends on how you configure the Docker container.
I got it, thanks!
So this instance (127.0.0.1:5433) is responsible for load balancing the transactions among all those replicas, right?

That’s nice because I thought we had to configure all replicas in the JDBC URL or let an external proxy take care of it.
Yes, that’s the location of the YB-Master, which redirects the calls to the cluster of Tablet Servers. The first node I created is both the YB-Master leader and a tablet server, but if it fails, then another node will become the new YB-Master leader.