The best UUID type for a database Primary Key
Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?
Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.
So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!
Introduction
In this article, we are going to see what UUID (Universally Unique Identifier) type works best for a database column that has a Primary Key constraint.
While the standard 128-bit random UUID is a very popular choice, you’ll see that this is a terrible fit for a database Primary Key column.
Standard UUID and database Primary Key
A universally unique identifier (UUID) is a 128-bit pseudo-random sequence that can be generated independently without the need for a single centralized system in charge of ensuring the identifier’s uniqueness.
The RFC 4122 specification defines five standardized versions of UUID, which are implemented by various database functions or programming languages.
For instance, the UUID()
MySQL function returns a version 1 UUID number.
And the Java UUID.randomUUID()
function returns a version 4 UUID number.
For many devs, using these standard UUIDs as a database identifier is very appealing because:
- The ids can be generated by the application. Hence no central coordination is required.
- The chance of identifier collision is extremely low.
- The id value being random, you can safely send it to the UI as the user would not be able to guess other identifier values and use them to see other people’s data.
But, using a random UUID as a database table Primary Key is a bad idea for multiple reasons.
First, the UUID is huge. Every single record will need 16 bytes for the database identifier, and this impacts all associated Foreign Key columns as well.
Second, the Primary Key column usually has an associated B+Tree index to speed up lookups or joins, and B+Tree indexes store data in sorted order.
However, indexing random values using B+Tree causes a lot of problems:
- Index pages will have a very low fill factor because the values come randomly. So, a page of 8kB will end up storing just a few elements, therefore wasting a lot of space, both on the disk and in the database memory, as index pages could be cached in the Buffer Pool.
- Because the B+Tree index needs to rebalance itself in order to maintain its equidistant tree structure, the random key values will cause more index page splits and merges as there is no pre-determined order of filling the tree structure.
If you’re using SQL Server or MySQL, then it’s even worse because the entire table is basically a clustered index.
And all these problems will affect the secondary indexes as well because they store the Primary Key value in the secondary index leaf nodes.
In fact, almost any database expert will tell you to avoid using the standard UUIDs as database table Primary Keys:
- Percona: UUIDs are Popular, but Bad for Performance
- UUID or GUID as Primary Keys? Be Careful!
- Identity Crisis: Sequence v. UUID as Primary Key
TSID – Time-Sorted Unique Identifiers
If you plan to store UUID values in a Primary Key column, then you are better off using a TSID (time-sorted unique identifier).
One such implementation is offered by the Hypersistence TSID OSS library, which provides a 64-bit TSID that’s made of two parts:
- a 42-bit time component
- a 22-bit random component
The random component has two parts:
- a node identifier (0 to 20 bits)
- a counter (2 to 22 bits)
The node identifier can be provided by the tsidcreator.node
system property when bootstrapping the application:
-Dtsidcreator.node="12"
The node identifier can also be provided via the TSIDCREATOR_NODE
environment variable:
export TSIDCREATOR_NODE="12"
The library is available on Maven Central, so you can get it via the following dependency:
<dependency> <groupId>io.hypersistence</groupId> <artifactId>hypersistence-tsid</artifactId> <version>${hypersistence-tsid.version}</version> </dependency>
You can create a TSID
object like this:
TSID tsid = TSID.fast()
From the Tsid
object, we can extract the following values:
- the 64-bit numerical value,
- the Crockford’s Base32 String value that encodes the 64-bit value,
- the Unix milliseconds since epoch that is stored in the
42
-bit sequence
To visualize these values, we can print them into the log:
long tsidLong = tsid.toLong(); String tsidString = tsid.toString(); long tsidMillis = tsid.getUnixMilliseconds(); LOGGER.info( "TSID numerical value: {}", tsidLong ); LOGGER.info( "TSID string value: {}", tsidString ); LOGGER.info( "TSID time millis since epoch value: {}", tsidMillis );
And we get the following output:
TSID numerical value: 388400145978465528 TSID string value: 0ARYZVZXW377R TSID time millis since epoch value: 1670438610927
When generating ten values:
for (int i = 0; i < 10; i++) { LOGGER.info( "TSID numerical value: {}", TSID.fast().toLong() ); }
We can see that the values are monotonically increasing:
TSID numerical value: 388401207189971936 TSID numerical value: 388401207189971937 TSID numerical value: 388401207194165637 TSID numerical value: 388401207194165638 TSID numerical value: 388401207194165639 TSID numerical value: 388401207194165640 TSID numerical value: 388401207194165641 TSID numerical value: 388401207194165642 TSID numerical value: 388401207194165643 TSID numerical value: 388401207194165644
Awesome, right?
Using the TSID in your application
Because the default TSID factories come with a synchronized random value generator, it’s better to use a custom TsidFactory
that provides the following optimizations:
- It can generate the random values using a
ThreadLocalRandom
, therefore avoiding Thread blocking on synchronized blocks - It can use a small number of node bits, therefore leaving us more bits for the random-generated numerical value.
So, we can define the following TsidUtil
that provides us a TsidFactory
to use whenever we want to generate a new TSID
object:
public static class TsidUtil { public static final String TSID_NODE_COUNT_PROPERTY = "tsid.node.count"; public static final String TSID_NODE_COUNT_ENV = "TSID_NODE_COUNT"; public static TSID.Factory TSID_FACTORY; static { String nodeCountSetting = System.getProperty( TSID_NODE_COUNT_PROPERTY ); if(nodeCountSetting == null) { nodeCountSetting = System.getenv( TSID_NODE_COUNT_ENV ); } int nodeCount = nodeCountSetting != null ? Integer.parseInt(nodeCountSetting) : 256; int nodeBits = (int) (Math.log(nodeCount) / Math.log(2)); TSID_FACTORY = TSID.Factory.builder() .withRandomFunction(TSID.Factory.THREAD_LOCAL_RANDOM_FUNCTION) .withNodeBits(nodeBits) .build(); } }
When running this test case that operates 16 threads on 2 nodes:
int threadCount = 16; int iterationCount = 100_000; CountDownLatch endLatch = new CountDownLatch(threadCount); ConcurrentMap<TSID, Integer> tsidMap = new ConcurrentHashMap<>(); long startNanos = System.nanoTime(); AtomicLong collisionCount = new AtomicLong(); int nodeCount = 2; for (int i = 0; i < threadCount; i++) { final int threadId = i; new Thread(() -> { TSID.Factory tsidFactory = TsidUtils.getTsidFactory( nodeCount, threadId % nodeCount ); for (int j = 0; j < iterationCount; j++) { TSID tsid = tsidFactory.generate(); Integer existingTsid = tsidMap.put( tsid, (threadId * iterationCount) + j ); if(existingTsid != null) { collisionCount.incrementAndGet(); } } endLatch.countDown(); }).start(); } LOGGER.info("Starting threads"); endLatch.await(); LOGGER.info( "{} threads generated {} TSIDs in {} ms with {} collisions", threadCount, new DecimalFormat("###,###,###").format( threadCount * iterationCount ), TimeUnit.NANOSECONDS.toMillis( System.nanoTime() - startNanos ), collisionCount );
We get the following result:
16 threads generated 1,600,000 TSIDs in 893 ms with 2 collisions
So, there can be collisions since the TSID is optimized for being compact, not unique.
You can use the
@Retry
annotation for Hypersistence Utils if you get a collision failure.Chekc out this article for more details about how you can use the
@Retry
annotation.
Using the TSID as a Primary Key value
Since the TSID is a time-sorted 64-bit number, the best way to store it in the database is to use a bigint
column type:
CREATE TABLE post ( id bigint NOT NULL, title varchar(255), PRIMARY KEY (id) )
And on the application side, you need to use a 64-bit number, like the Java Long
object type:
@Entity @Table(name = "post") public class Post { @Id private Long id; private String title; }
That’s it!
If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.
And there is more!
You can earn a significant passive income stream from promoting all these amazing products that I have been creating.
If you're interested in supplementing your income, then join my affiliate program.
Conclusion
Using the standard UUID as a Primary Key value is not a good idea unless the first bytes are monotonically increasing.
For this reason, using a time-sorted TSID is a much better idea. Not only that it requires half the number of bytes as a standard UUID, but it fits better as a B+Tree index key.
While SQL Server offers a time-sorted GUID via the NEWSEQUENTIALID
, the size of the GUID is 128 bits, so it’s twice as large as a TSID.
The same issue is with version 7 of the UUID specification, which provides a time-sorted UUID. However, it uses the same canonical format (128 bits) which is way too large. The impact of the Primary Key column storage is amplified by every referencing Foreign Key columns.
If all your Primary keys are 128-bit UUIDs, then the Primary Key and Foreign Key indexes are going to require a lot of space, both on the disk and in the database memory, as the Buffer Pool holds both table and index pages.

Great article, Vlad! Just a comment for the MySQL users: MySQL can generate timestamp-based (type 1) UUIDs. Those can be stored in binary form with their time-low and time-high parts swapped. This addresses most storage and query performance concerns.
That’s good to know.
Hi Vlad,
Great library and interesting article, congratulations.
I’m just wondering how it works in a distributed system. Let’s say there are two nodes running the library and there is a subtle difference in system time. Is there a risk of collision?
Regards.
MA
In a distributed system, you can assign a different node identifier for each node. Hence there will be no such collision unless, on the same node, two threads manage to generate at the very same second the same random number.
So, the probability of that event is rather low. And, even then, there’s no issue since unicity is provided by the PK, and you can just automatically retry the Tx that throws the ConstraintViolation using the
@Retry
annotation, as explained in this article.When do we use TSID instead of a DB sequence?
The TSID is useful when:
Thank you! I will try both generators and play around. I think this topic should be mentioned in books too.
BR,
Chris
I will be mentioned in the second edition of High-Performance Java Persistence.
Doesn’t this result in guessable IDs if multiple IDs are generated within the same millisecond? If the TSID is also used within an web API an attacker could guess other valid/used TSIDs, right? With a UUID guessing IDs is (nearly) impossible but with TSIDs it seams quite easy (just increment/decrement). An attacker who could create one TSID every millisecond, even every used TSID could be generated.
So the security of an application should not rely on the TSID IMHO. Or am I missing something?
A SEQUENCE or IDENTITY generates even more easily guessable ids, yet this problem can be addressed by the data adapter layer.
No one said you have to expose the ids externally. The auto-generated ids are the most common PK option, and there are ways you can prevent this situation using externalIds. The same pattern applies to TSIDs.
Yes, you can use a UUID as the external id. Just don’t save it in the DB.
The security of your app is not derived from the PK column type. Paying the performance penalty for using a UUID PK doesn’t make your app secure. All you need to do is make sure you never expose a guessable id externally. Just like you don’t expose your JPA entities externally, and you’re using external DTOs or Value Objects to decouple the web and the data access layer, you can replace the guessable ids with external ids.
Big thank! I have one question. What do you think about cuid2 (https://github.com/paralleldrive/cuid2). Here is the link for Java implementation (https://github.com/thibaultmeyer/cuid-java).
Regards.
As stated in the README:
That’s exactly why TSID is so useful.
Thank you for your answer.
You’re welcome
I never knew of about TSID. Many thanks, for educating people like me.
Subscribe for more
The problem with tsid is that they require too much precision to be used in REST APIs with javascript clients.
E.g.
395228849641804517
cannot be properly represented in a javascript application“`
let foo = 395228849641804517;
console.log(“foo”, foo);
=>
395228849641804540
The TSID has both a numeric and a String representation of that number.
The number form is great for the DB column value while you can use the Base32 encoded value on the web. You can decode the String value in the app.
The only way I can think of achieving this in a spring boot / spring data rest / JPA application would be to use
Tsid
as the PK (rather than the long representation) and then you need to add anAttributeConverter<Tsid, Long>
, a springConverter<String, Tsid>
,Converter<Tsid, String>
and a jackson Serializer/Deserializer which sounds like a lot of work.Is there another option I am missing?
It’s much simpler than that. The beauty of Hibernate is that it’s highly customizable, so your entities will need just that:
And an IdentifierGenerator will take care of that as illustrated by this example
https://github.com/vladmihalcea/high-performance-java-persistence/blob/57996c68640ac3cb5ae86796533bc9f555261bf8/core/src/test/java/com/vladmihalcea/book/hpjp/hibernate/identifier/tsid/TsidIdentifierTest.java
You won’t need to implement it as I’ll create it in the Hibernate Types project anyway.
As for the web layer, it’s not a good idea to use entities there anyway, so you’d need an adapter that transforms the/from UI DTOs.
Great article, thanks! Do you have any suggestions on how to get a node id in a simple way on Kubernetes?
You can provide the node identifier via that system property or environment variable.
Therefore, you can set it for each pod
https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
I was referring to the different approaches one could take to decide on the actual value of the node ID with minimal effort/coordination between the services.
In that case, the easiest approach is to not use the node identifier and allocate all those 22 bits to a random number.
Hey Vlad, which representation do you recommend to be persisted on the database, the string format, or the long one?
The long representation is better when saving it to the DB. The String one can be used as an external identifier.
Interesting. I know you’re taking here about SQL, but UUID is pretty much the standard for record identifiers in NoSQL I believe (certainly seems to be in Mongo, which is the only one I’ve used much). Foreign keys aren’t so much of an issue there, but indexing still is. How have they got round the indexing issues – or should we be using something else there too?
For LSM storages, the fact that UUID is not time-sorted is not an issue, but the extra space is.
Many applications could do just fine with a 64-bit Tsid. The fact that many use UUID is that they have no idea there are better alternatives.
ULID has a very similar approach: https://github.com/ulid/spec
Yes, that’s true. But, ULID is like the version 7 of UUID. While time-sorted, it’s still huge. Tsid is half the size.
Excellent article, Vlad!
This really makes one think!
I’m glad I could help