The misnomer of "UUID"v7s

April 13, 2024 reads

Universally Unique Identifiers, or UUIDs, have long been the de-facto way to identify a unique object in computer systems. When you see a traditional UUID, you can be confident that that random sequence has never been generated before and will never be generated again for all of human history. That's a strong promise, and it's what enables these identifiers to power some of the largest distributed systems at scale. For instance, Amazon Web Services processes over 1,000,000,000 requests every second and assigns each one a unique X-Amzn-RequestId header. There has never been a duplicate, and there never will be. There's beauty in this simplicity.

Yet, traditional UUIDv4's are inefficient to store in databases at scale. By design, nearly every bit is random. Therefore, storing it in a data structure like a Postgres B-tree will result in index bloat at scale and waste valuable database time.

Is it possible to solve this inefficiency, while preserving randomness? Enter UUIDv7. The first 48 bits of a UUIDv7 are not random. They store the timestamp at the time of generation. This gives them some interesting properties, like sortability and more efficient indexing, and solve the aforementioned problems.

Since UUIDv7's birth in June 2022, it has taken the internet by storm. It's growing fastest in indie hacker groups, but people still question its potential for use in production systems at scale. At the core of this hesitance lies the question, does the reduction in randomness seriously compromise the uniqueness of UUIDv7s? Let's answer this question by breaking down the collision probabilities mathematically.

First, the basics. UUIDv4s have 122 bits of randomness. UUIDv7s have 74.

Let's explore what this means for a practical collision. We can understand collision probabilities using the framework of the Birthday Problem. The probability of a collision, given $n$ attempts to generate an event, can be described by the following equation.

$P = 1 - e^{-\frac{n^2}{2d}}$

Let's be conservative, saying that we won't tolerate a collision probability greater than 0.00001% (one in ten million). Setting $P=0.0000001$ , we can solve for $n$ .

$0.0000001 = 1 - e^{-\frac{n^2}{2 \times 2^{74}}}$

$e^{-\frac{n^2}{2 \times 2^{74}}} = 1 - 0.0000001$

$n=61,464,570$

I chose the initial number of one in ten million randomly. Here, you can play around with the $P$ you'll tolerate, and see the probability of a collision.

Collision Probability Calculator

What is the maximum collision probability you are willing to accept? P = 1e-7

Here is the number of UUIDv7s you can generate before getting a collision with that probability: n = 61,464,570

You might think, sixty million actually seems pretty low — surely, the difficulty of a collision should be orders of magnitude greater?

You'd be right, assuming that the upper 48 timestamp bits are identical as well — that is, all sixty million UUIDs were generated in the same millisecond. Practically, with current CPU clock times, we can safely assume this would never happen (assuming that identifiers are being generated on a single computer).

But what happens when your identifiers are being generated across a distributed set of computers at the same time? Let's say you're operating at AWS scale, processing billions of requests a second across millions of CPUs. 74 bits of randomness is not nearly entropy here. You'd see collisions within a few seconds. UUIDv4, not v7, is the answer here.

UUIDv7s are safe choices for many practical business use cases, but calling them "universally unique identifiers" is not technically accurate. Universal uniqueness implies that no matter how many computers are generating them, we will never see a collision. Rather, UUIDv7s are unique to the context of the computer generating them.

Thus, we've arrived at a few conclusions.

UUIDv7s are safe to use for most production systems, assuming you generate values on a single database server.
The "UUID" part of UUIDv7s is a misnomer. They are not universally unique, but rather, only locally so.