Exploring PostgreSQL 18's new UUIDv7 support
278 points
by s4i
5 days ago
| 22 comments
| aiven.io
| HN
crazygringo
2 days ago
[-]
> Using UUIDv7 is generally discouraged for security when the primary key is exposed to end users in external-facing applications or APIs. The main issue is that UUIDv7 incorporates a 48-bit Unix timestamp as its most significant part, meaning the identifier itself leaks the record's creation time... Experts recommend using UUIDv7 only for internal keys and exposing a separate, truly random UUIDv4 as an external identifier.

So this basically defeats the entire performance improvement of UUIDv7. Because anything coming from the user will need to look up a UUIDv4, which means every new row needs to create an extra random UUIDv4 which gets inserted into a second B-tree index, which recreates the very performance problem UUIDv7 is supposedly solving.

In other words, you can only use UUIDv7 for rows that never need to be looked up by any data coming from the user. And maybe that exists sometimes for certain data in JOINs... but it seems like it might be more the exception than the rule, and you never know when an internal ID might need to become an external one in the future.

reply
tracker1
2 days ago
[-]
This is only really true if leaking the creation time of the record is itself a security concern.
reply
donjoe
2 days ago
[-]
To me, the most important question is: how do I scale v7 in an environment of 20+ engineers?

When using v7, I need some sort of audit that checks in every API contract for the usage of v7 and potential information leakage.

Detecting V7 uuids in the API contract would probably require me to enforce a special key name (uuidv7 & uuid for v4) for easier audit.

Engineers will get this wrong more than once - especially in a mixed team of Jr/sr.

Also, the API contracts will look a bit inconsistent: some resources will get addressed by v7, others by v4. On top, by using v4 on certain resources, I'd leak the information that those resources addressed by v4 will contain sensitive information.

By sticking to v4, I'd have the same identifier for all resources across the API. When needed, I can expose the creation timestamp in the response separately. Audit is much simpler since the fields state explicitly what they will contain.

reply
jandrewrogers
2 days ago
[-]
It is human engineer problems all the way down.

UUIDv4 is explicitly forbidden in some high-reliability/high-assurance environments because there is a long history of engineers using weak entropy sources to generate UUIDv4 despite the warnings to use a strong entropy source, which is only discovered when it causes bugs in production. Apparently some engineers don't understand what "strong entropy source" means.

Mixing UUID types should be detectable because type is part of the UUID. But then many companies have non-standard UUID that overwrite the type field mixed with standard UUID across their systems. In practice, you often have to treat UUID as an opaque 128-bit integer with no attached semantics.

reply
sgarland
2 days ago
[-]
> Detecting V7 uuids in the API contract would probably require me to enforce a special key name (uuidv7 & uuid for v4) for easier audit.

Unless I'm missing something, check it on receipt, and reject it if it doesn't match. `uuid.replace("-", "")[12]` or `uuid >> 76 & 0xf`.

Regardless of difficulty, this comes down to priorities. Potential security concerns aside (I maintain this really does not matter nearly as much as people think for the majority of companies), it's whether or not you care about performance at scale. If your table is never going to get over a few million rows, it doesn't matter. If you're going to get into the hundreds of millions, it matters a great deal, especially if you're using them as PKs, and doubly so if you're using InnoDB.

reply
parthdesai
2 days ago
[-]
> By sticking to v4, I'd have the same identifier for all resources across the API. When needed, I can expose the creation timestamp in the response separately. Audit is much simpler since the fields state explicitly what they will contain

Good luck if you're operating at a decent scale, and need to worry about db maintenance/throughput. Ask the DBA at your company what they would prefer.

reply
lazide
2 days ago
[-]
If you read the prior comment, this is now an ouroborus
reply
AdieuToLogic
2 days ago
[-]
>>> Using UUIDv7 is generally discouraged for security when the primary key is exposed to end users in external-facing applications or APIs.

>> So this basically defeats the entire performance improvement of UUIDv7. Because anything coming from the user will need to look up a UUIDv4, which means every new row needs to create an extra random UUIDv4 which gets inserted into a second B-tree index, which recreates the very performance problem UUIDv7 is supposedly solving.

> This is only really true if leaking the creation time of the record is itself a security concern.

No, as "leaking the creation time" is not a concern when API's return resources having properties representing creation/modification timestamps.

Where exposing predictable identifiers creates a security risk, such as exposing UUIDv7 or serial[0] types used as database primary keys, is it enables attackers to be able to synthesize identifiers which match arbitrary resources much quicker than when random identifiers are employed.

0 - https://www.postgresql.org/docs/current/datatype-numeric.htm...

reply
delifue
2 days ago
[-]
With proper data permission check, having predictable ID is totally fine. And UUIDv7's random part is large enough so that it's much harder to predict than auto increment id.

If your security relies on attacker don't know your ID (you don't do proper data permission check), your security is flawed.

reply
AdieuToLogic
19 hours ago
[-]
> With proper data permission check, having predictable ID is totally fine.

That qualification is doing a lot of work in this sentence. For supporting evidence as to why this is the case, a quick search for "CVE PHP security vulnerabilities" or "CVE NodeJS security vulnerabilities" will produce voluminous results.

> And UUIDv7's random part is large enough so that it's much harder to predict than auto increment id.

Usually. One common scenario where using UUIDv7 for primary keys in a persistent store can be exploited similar to sequential integer ID's is when there are queries supporting pagenation and/or those leveraging the temporal ordering UUIDv7 supports intrinsically. For example:

  id > aSynthesizedUUIDv7Value
Note that this does not require successful identification of either the `rand_a` or `rand_b` UUIDv7 fields[0].

> If your security relies on attacker don't know your ID (you don't do proper data permission check), your security is flawed.

Again, I agree with this in theory. But as the saying[1] goes:

  In Theory There Is No Difference Between Theory and 
  Practice, While In Practice There Is
0 - https://www.rfc-editor.org/rfc/rfc9562.html#name-uuid-versio...

1 - https://quoteinvestigator.com/2018/04/14/theory/

reply
pinkgolem
2 days ago
[-]
Is that not quit commen for invites/no user account shares?
reply
javawizard
2 days ago
[-]
Indeed, but one could easily argue that 128 bits of entropy aren't sufficient for a good invite token in the first place.
reply
pinkgolem
2 days ago
[-]
I am just puzzled why delifue calls something that, as far as I know is pretty standard across the industrie, bad practice
reply
treve
2 days ago
[-]
There's 2 cases being discussed. A UUIDv7 is a bad secret, but it's fine for many other ids. If I can guess your user id, it shouldn't really matter because your business logic should prevent me from doing anything with that information. If I can guess your password reset token it's a different story because I don't need anything else beyond that token to do damage.
reply
tracker1
4 hours ago
[-]
But the random part of a UUIDv7 is 74 bits... larger than a 64-bit integer of random values. Larger than many systems use in total when generating random keys for such things. Likely a larger number of values than the total number of comments here on HN over a couple decades. It's emphatically NOT guessable.
reply
nesarkvechnep
2 days ago
[-]
Because it is?
reply
skrebbel
2 days ago
[-]
No?
reply
MikeNotThePope
2 days ago
[-]
Exactly. I wrote about that a few days ago.

Primary keys using UUID v7 are (potentially) an HR violation.

https://mikenotthepope.com/primary-keys-using-uuid-v7-are-po...

reply
beaker52
2 days ago
[-]
Which part is in violation of the age discrimination laws here, the fact that k-sortable uuids divulge the information, or the fact someone is using them to discriminate against a candidate?

If it’s the latter (which, reading wikipedias summary suggests it is), then the entire premise that k-sortable uuids are a “HR violation” is bunk.

The problem with arguing about timestamps leaking this kind of information is that _anything_ can leak this kind of vaguely dated information.

- Seen on a website that ceased to exist after 2010? Gotchya!

- Indexed by Waybackmachine? Gotchya!

- Used <different uuid scheme> for records created before 2022? Gotchya!

The only way to prevent divulging temporal clues about an entity is to never reveal its existence in any kind of correlatable way (which, as far as I’m prepared to think right now, seems to defeat the point of revealing it to a UI at all).

reply
cuu508
2 days ago
[-]
What's the scenario here?

I submit my application in 2025 and get rejected.

20 years later I submit another application to the same company, using my existing 20 years old user profile, and now get rejected because somebody figures out I'm old by looking at my user id?

reply
MikeNotThePope
2 days ago
[-]
Essentially, yes. It's a narrow use case. I just thought it was interesting, so I wrote up a short summary. Not worth a deep analysis.
reply
tracker1
4 hours ago
[-]
The same systems that literally ask for your education history?
reply
gilfoy
2 days ago
[-]
So you just delay getting your old ass rejected until they interact with you instead of some uuid?
reply
da_chicken
2 days ago
[-]
Are there really any performances benefits of UUIDv7 over UUIDv4 that should ever come up in the context of an HR system? Just how many job applicants are you tracking?

I don't understand why you considered UUIDv7 in the first place.

reply
MikeNotThePope
2 days ago
[-]
You can sort records by ID, retrieve the last N records by ID, etc. It's just easier than using a timestamp.
reply
Incipient
1 day ago
[-]
The guaranteed lack of collision is a big one. I pretty much always add an updated column, and sqlalchemy essentially gives you both for free.
reply
kvirani
2 days ago
[-]
Which I have to assume is rare, right?
reply
wongarsu
2 days ago
[-]
We used to leak approximate creation time all the time back when everyone used sequential keys. If anything sequential keys are far worse: they leak the approximate number of records, make it easy to observe the rate at which new keys are created, and once you know that you can deduce the approximate creation date of any key.

UUIDv4 removes all three of those vectors. UUIDv7 still removes two of three. It doesn't leak record count or the rate at which you create them, only creation time. And you still can't guess adjacent keys. It's a pretty narrow information leakage for something you routinely reveal on purpose.

reply
johnisgood
2 days ago
[-]
I often see sequential order IDs, and they get incremented by one, so I can guesstimate the amount of orders they get within a minute by creating my own orders. I watched this happen as I was intentionally removing and creating new orders (as they did not support modification of existing but not yet accepted ones). What may I do with this information though as an user that would be damaging? Legitimate question, intent is not harm, but I genuinely do not see how this is a bad thing.

I can see it being bad for tracking IDs, but not order IDs, unless you are allowed to view any orders that do not belong to your account, which is just fundamentally bad security and using UUIDv4 or a random string would simply be obscuring security.

reply
hinkley
2 days ago
[-]
It’s also industrial espionage on competitors or potential acquisitions.
reply
teddyh
2 days ago
[-]
reply
blackenedgem
2 days ago
[-]
UUIDv7s are much worse for creation time though imo. For sequential IDs an attacker needs to be have a lot of data to narrow the creation time. That raises the barrier of entry considerably to the point that only a committed attacker could infer the time.

With UUIDv7 the creation time is always leaked without any sampling. A casual attacker could quite easily lookup the time and become motivated in probing and linking the account further

reply
AdieuToLogic
2 days ago
[-]
> For sequential IDs an attacker needs to be have a lot of data to narrow the creation time.

When sequential integer ID's are externalized, an attacker does not need creation times to perform predictive attacks. All they need to do is apply deltas to known identifiers.

reply
wredcoll
2 days ago
[-]
It seems wildly paranoid, even for securitt researchers.
reply
oulipo2
2 days ago
[-]
I remember in the cracking days, where we were trying to crack ElGamal encryption or other, we noticed when some code had been written in eg Delphi (which used a weak RNG based on datetime), then when you tried to guess when the code was compiled and the key were generated, you could get a rough timerange, and if you bruteforced through that timerange as a seed to the RNG, and tried to generate the random ElGamal key from that, you would widely reduce the range of possibilities (eg bruteforce 10M ints, instead of billions or more)
reply
noir_lord
2 days ago
[-]
An online casino got hit a similar way a long time ago, iirc someone realised the seed for a known prng was the system clock, so you could brute force every shuffle either side of the approx time stamp and compare the results to some known cards (I.e. the ones you’d been dealt) once you had a match you knew what everyone else had.

Always thought that was elegant (the attach not using the time as the seed).

reply
hipratham
2 days ago
[-]
Can you not just add salt to seed and make it true random? seems like under engineered solution to me.
reply
lazide
2 days ago
[-]
Difficulty - they used the date as the salt.
reply
hinkley
2 days ago
[-]
I stopped airplane maintenance software from shipping with a particularly egregious form of this for SSL session key generation. It’s hard to get a good random seed on a real time operating system. I tell you hwut.
reply
tracker1
4 hours ago
[-]
Depending on access to sensor data, it's possible to use a mix of various sensors as well as the time for seed generation. Though baseline static from RF is better assuming that is possible as well.

Of course, it's always possible for something to do something stupid, like weak rng.

reply
ibejoeb
2 days ago
[-]
There are some practical applications that are not necessarily related to security. If you are storing something like a medical record, you don't want use it as a public ID for a patient visit, because the date is subject to HIPAA.
reply
tracker1
4 hours ago
[-]
Can you please give me a legitimate use case where you would have the ID of a medical case without also having the Date/Time of that corresponding record?
reply
ibejoeb
4 hours ago
[-]
It's not that you can't possess the timestamp of an event. It's that you can't publish certain things that are deemed potentially identifiable.

Dates are specifically cited as potential vectors for de-anonymization. For example, you can't disclose that "Bob H presented to the clinic on October 10th" because that's a lot of information that can be used to find out who Bob H is.

Here's a practical example of what I'm talking about. Suppose you have an app for physicians that allows them to message each other to discuss a case. They can share relevant information for diagnostic purposes, e.g., "34y/o male from the southern Louisianna presented with a rash." They share de-identified photos and chat about ddx, treatment protocol, etc. All of that is cool. However, if the record of that visit is identified with a UUIDv7, and that ID is used as part of the URL you've exposed the time of the visit, and that would be a problem.

reply
mulmen
2 days ago
[-]
But they would have to relate that ID to patient data like their identity right? The date alone cannot be a HIPAA issue. That means every date is a HIPAA violation because people go to the doctor every day.
reply
ownagefool
2 days ago
[-]
This is probably not really true.

You wouldn't be publishing patient visits publically, the only folks that'd legitimatly see that record would be those which access to that visit, and they'd most likely need to know the time of said visit. This access should be controlled via AuthN, AuthZ and audited.

You'd also generally do a lot of time-based lookups on this data; what visits do I have today, this week, and so on. You might also want an additional DateTime field for timezones and offsets, but the v7 is probably better than v4 for this usecase.

reply
replygirl
2 days ago
[-]
it's not about the individual record, it's about correlating records. if you can sequence everything in time it gets a lot easier to deanonymize data
reply
Macha
2 days ago
[-]
However, if your API has a (very common) createdAt field on these objects, the ability to get the creation time from the identifier is rather academic.
reply
inopinatus
2 days ago
[-]
The concern is not limited to access of the full records. The concern extends to any incidental expression of identifiers, especially those sent via insecure side channels such as SMS or email.

In most cases this forms a compliance matter rather than an open attack vector, but it nevertheless remains that one has to answer any question along the lines "did you minimise the privacy surface?" in the negative, or at least, with a caveat.

reply
hinkley
2 days ago
[-]
And that’s why some people are rabid about “no SELECT *”.
reply
tracker1
4 hours ago
[-]
Why would you have Ids of medical events without the details of those events, generally including date+time?
reply
tracker1
2 days ago
[-]
Can you provide an example of where you would legitimately have the ID for a medical record interaction, but not a date/time associated?
reply
tyre
2 days ago
[-]
Email is not secure but sending an email with a link to "Information about your appointment" is fine. If that link goes to `/appointments/sjdhfaskfhjaksdjf`, there is no leaked data. If it goes to `/appointments/20251017lkafjdslfjalsdkjfa`, then the link itself contains PHI.

Whether creation date is PHI…I could see the argument being yes, since it correlates to medical information (when someone sought treatment, which could be when symptoms present.)

reply
lazide
2 days ago
[-]
Notably, this is an absurd argument. Every system I’ve dealt with right now sends the date/time/location/practitioner clear text in the email (or some variant thereof).

The only thing that seems to be protected is ‘reason for appointment’, and not all systems do that.

Everyone signs paperwork to authorize this when they first engage with the medical providers!

reply
ensignavenger
2 days ago
[-]
Email may not be secure, but neither are faces and phones, and yet medical professionals use those all the time.
reply
ensignavenger
2 days ago
[-]
Fat fingered fax... faxes, not faces!
reply
Too
2 days ago
[-]
Your comment here has id 45622189 and the UI tells me in plain sight that you posted it 11h ago. Assuming the ids are sequential, these two combined tells me more about HN vs a uuid ”leaking” something that’s already expected to be public.
reply
rat9988
2 days ago
[-]
Maybe, but what's your point?
reply
oconnor663
2 days ago
[-]
It's relatively common for it to be a privacy concern. Imagine if I'm making an online payment or something, and one of the IDs involved tells you exactly when I created my bank account. That's a decent proxy for my age.
reply
love2read
2 days ago
[-]
1) I would argue that the year that you created your bank account is not a good proxy for age. 2) I would question where you think the uuid representing your age from your bak would leak to considering it’s still a bank account id 3) I would question whether you consider that the vast majority of uuids aren’t used for high stakes ids such as online banking ids
reply
paulddraper
2 days ago
[-]
A bank account number (assuming that is what are talking about, not some token) is already very sensitive information. Like, legal status protected information.

Knowing approximate age is a relatively small leak compared to that.

reply
zie
2 days ago
[-]
bank account numbers are printed on every check you ever wrote. Most people don't write checks anymore, though online bill pay sends physical checks still sometimes. They never really were sensitive information.

Bank security does not depend on your bank account being private information. Pretty much all bank security rounds to the bank having a magic undo button, so they can undo any bad transactions after it comes to light that it was a bad transaction. Sure they do some filtering on the front-end now to eliminate the need to use the magic undo button, but that's just extra icing to keep the undo button's use to a dull roar.

reply
whalesalad
1 day ago
[-]
Yeah if you’re relying on unguessable public IDs as your security model you’re not doing security.
reply
dethos
2 days ago
[-]
Exactly
reply
nitwit005
2 days ago
[-]
It was a concern in the past, as people used password creation tools that were deterministic based on the current time.

There was previously an article linked here about recovering access to some bitcoin by feeding all possible timestamps in a date range to the password creation tool they used, and trying all of those passwords.

reply
matthew16550
2 days ago
[-]
Using UUIDv4 as primary key has unexpected downsides because data locality matters in surprising places [1].

A UUIDv7 primary key seems to reduce / eliminate those problems.

If there is also an indexed UUIDv4 column for external id, I suspect it would not be used as often as the primary key index so would not cancel out the performance improvements of UUIDv7.

[1] https://www.cybertec-postgresql.com/en/unexpected-downsides-...

reply
AdieuToLogic
2 days ago
[-]
> Using UUIDv4 as primary key has unexpected downsides because data locality matters in surprising places.

Very true, as detailed by the link you kindly provided. Which is why a technique I have found useful is to have both an internal `id` PK `serial`[0] column (never externalized to other processes) and another column with a unique constraint having a UUIDv4 value, such as `external_id`, explicitly for providing identifiers to out-of-process collaborators.

0 - https://www.postgresql.org/docs/current/datatype-numeric.htm...

reply
crazygringo
2 days ago
[-]
> I suspect it would not be used as often as the primary key index

That doesn't matter because it's the creation of the index entry that matters, not how often it's used for lookup. The lookup cost is the same anyways.

reply
matthew16550
2 days ago
[-]
The page I linked shows uses after creation where the cost can be different.
reply
crazygringo
2 days ago
[-]
Making the assumption:

> Since workloads commonly are interested in recently inserted rows

That's only true for very specific types of applications. There's nothing general about that.

Plenty of applications grab rows from all time, and there's nothing special about the most recent ones. The most recent might also be the least popular rows, since few things reference them.

reply
oconnore
2 days ago
[-]
If this is a concern, pass your UUIDv7 ID through an ECB block cipher with a 0 IV. 128 bit UUID, 128 bit AES block. Easy, near zero overhead way to scramble and unscramble IDs as they go in/out of your application.

There is no need to put the privacy preserving ID in a database index when you can calculate the mapping on the fly

reply
10000truths
2 days ago
[-]
This is, strictly speaking, an improvement, but not by much. You can't change the cipher key because your downstream users are already relying on the old-key-scrambled IDs, and you lose all the benefits of scrambling as soon as the key is leaked. You could tag your IDs with a "key version" to change the key for newly generated IDs, but then that "key version" itself constitutes an information leak of sorts.
reply
DSingularity
2 days ago
[-]
Why do you need forward secrecy?
reply
10000truths
2 days ago
[-]
I edited that out of my post, as I'm not sure it's the correct term to use, but the problem remains. If the key leaks, then all IDs scrambled with that key can be de-scrambled, and you're back to square one.
reply
blackenedgem
2 days ago
[-]
Then that's just worse and more complicated than storing a 64 bit bigint + 128 UUIDv4. Your salt (AES block) is larger than a bigint. Unless you're talking about a fixed value for the AES (is that a thing) but then that's peppering which is security through obfuscation.
reply
cyberax
2 days ago
[-]
Uhh... What? You just use AES with a fixed key and IV in block mode.

You put in 128 bits, you get out 128 bits. The encryption is strong, so the clients won't be able to infer anything from it, and your backend can still get all the advantages of sequential IDs.

You also can future-proof yourself by reserving a few bits from the UUID for the version number (using cycle-walking).

reply
grapesodaaaaa
2 days ago
[-]
I still feel like calling something like uuid.v4() is easier and less cognitively complex.
reply
cyberax
2 days ago
[-]
There are advantages in monotonically increasing UUIDs, they work better with BTrees and relational databases.
reply
grapesodaaaaa
2 days ago
[-]
I just meant having UUIDv7 internally, and UUIDv4 externally if date leakage is a concern (both on the same object).

UUIDv7 still works great in distributed systems and has algorithmic advantages as you have mentioned.

reply
jongjong
2 days ago
[-]
Great point. Also, having to support multiple IDs is a maintenance headache.

IMO, a major problem solved by UUIDs is the ability to create IDs on the client-side, hence, they are inherently user-facing. A major reason why this is an important use case for UUIDs is because it allows clients to avoid accidental duplication of records when an insertion fails due to network issues. It provides insertion idempotence.

For example, when the user clicks on a button on a form to insert a record into a database, the client can generate the UUID on the client-side, then attach it to a JSON object, then send the object to the server for insertion; in the meantime, if there is a network issue and it's unclear whether or not the record was inserted, the code can automatically retry (or user can manually retry) and there is no risk of duplication of data if you use the same UUID.

This is impossible to do with auto-incrementing IDs because those are generated by the database in a centralized way so the user cannot know the ID head of time and thus, if there is a network failure while submitting a form, the client cannot automatically know whether or not the record was successfully inserted; if they retry, they may create a duplicate record in the database. There is no way to make the operation idempotent without relying on some kind of fixed ID which has a uniqueness constraint on the database side.

reply
macote
2 days ago
[-]
You don't need to add a UUIDv4 column, you could just encrypt your UUIDv7 with format-preserving encryption (FPE).
reply
whattheheckheck
2 days ago
[-]
What's the computational complexity of doing that conversion vs the lookup table of uuidv4 for each uuidv7?
reply
benjiro
2 days ago
[-]
DB lookups + extra index are way more expensive then hardware assisted decoding.

If your UUIDv4 is cached, your still suffering from extra storage and index. Not a issue on a million row system but imagine a billion, 10 billion.

And what if its not cached. Great, now your hitting the disk.

Computers do not suffering from lacking CPU performance, especially when you can deploy CPU instruction sets. Hell, you do not even need encryption. How about making a simple bit shift where you include a simple lookup identifier. Black box sure, and not great if leaked but you have other things to worry about if your actual shift pattern is leaked. Use extra byte or two for iding the pattern.

Obfuscating your IDs is easy. No need for full encryption.

reply
sagarm
2 days ago
[-]
Hardware assisted is a red herring here. As you noted the real problem is that random reads have poor data locality, which degrades your database performance in a way that is expensive to resolve.
reply
jandrewrogers
2 days ago
[-]
Why would it be computationally complex? The encryption is implemented in the silicon, it is close to free for all practical purposes. The lookup table would be wildly more expensive in almost all cases due to comparatively poor memory locality.
reply
tekne
2 days ago
[-]
Question: why not use UUIDv7 but encrypt the user-facing ID you hand out? Then it's just a quick decrypt-on-lookup, and you have the added bonus of e.g. being able to give different users different IDs
reply
gigatexal
2 days ago
[-]
In a well normalized setup idk maybe not. Uuidv4 for your external ids and then have a mapping table to correspond that to something you’d use internally. Then you can torch an exposed uuid update the mapping table and generate a new one and none of your pointers and foreign keys need to change internally.
reply
crazygringo
2 days ago
[-]
The point is, that mapping table incurs the same indexing cost that was trying to be eliminated in the first place. Normalization is irrelevant.
reply
Quekid5
2 days ago
[-]
I wonder if there is a name for such a mapping table in RDBMS-land...?
reply
gigatexal
2 days ago
[-]
We call them lookup or mapping tables.
reply
sgarland
2 days ago
[-]
Who are these "experts?" I'm a DBRE, and also very security conscious, and think this is an absurd what-if for most companies.

If it does matter for your application, then don't expose it - use an opaque id with something like AEAD, and expose that.

reply
lukebechtel
2 days ago
[-]
how risky is exposing creation time really though? I feel like for most applications this is uncritical
reply
Biganon
2 days ago
[-]
I wouldn't say necessarily "risky", it's more that it forces your hand when you wouldn't want to reveal an entity's creation time. Say you use these IDs for users of your site, and they're used in API queries / URLs etc., then it's trivial to know when a user created their account. Sure, many sites already expose this information, but not all of them do; what if you don't want it exposed? What if you consider that a user's seniority is nobody's business, that it could bias the behavior of other users towards them, etc.?
reply
morshu9001
2 days ago
[-]
It takes consideration. There are plenty of systems like Facebook and Twitter that use IDs somewhat exposing time, but the things they're IDing already have public creation timestamps.
reply
sverhagen
2 days ago
[-]
When you see v7 vs. V4, you'd expect the higher number to be better, hopefully better in all aspects, I wouldn't have expected such a thoughtful consideration to be required before upgrading. UUID-b would've been a better name then ;)
reply
jpalawaga
2 days ago
[-]
that is pretty common with uuid. for example in many cases you'll still want a plain uuid4 instead of e.g.uuid 5. maybe you want 5. it's usecase dependent.

for a specification such as uuid, there is not much to improve upon--just rearranging the bytes and their meanings.

reply
ownagefool
2 days ago
[-]
Meh.

You probably shouldn't / don't need to use v7 for your Users table because the age of your User probably has limted to no bearing on the look up patterns. For example, our Steam and Amazon accounts are pretty old, but we likely still use them.

However, your Orders table is significantly more likely to be looked up based on time, so a v7 makes a lot of sense here.

Now I'd argue the security implications are overblown, but in general tems you might also allow someone to look up a user, i.e. you can view my Steam profile, or maybe my Amazon wishlist. You probably don't need to be looking up another Users Order.

Alternativly, if your building an Enterprise Risk Solution, you could take a view that you don't want people knowing how old the risk is, but most solutions would show you some history and would believe that to be pertinent information.

There will be instances of getting it wrong, but it isn't actually _that_ complicated.

reply
saaspirant
2 days ago
[-]
I am using it in a table where sorting by id (primary key) should also sort it by created time (newer records should have "bigger" id).

The id would be exposed to users. An integer would expose the number of records in it.

Am I using right guys?

reply
djantje
2 days ago
[-]
DB multi-master, or the DB not being responsible for primary key generation, is the use case, I think.

And then having uuidv7 as primary and foreign keys, can give you a performance gain.

reply
Illniyar
2 days ago
[-]
If leaking creation time is a concern, can we not just fake the timestamp? We can do so in a way that most performance benefits remain - so like starting with a base time of 1970 and then adding base time to it intermittently, having random months and days to new records (or maybe based on the user's id - so the user's record are temporally consistent but they aren't with other user records).

I'm sure there might be a middle ground where most of the performance gains remain but the deanonymizing risk is greatly reduced.

Edit: encrypting the value in transit seems a simpler solution really

reply
hu3
2 days ago
[-]
In that case, auto increments can also be bumped from time to time. And start from a billion.

They're more performant than uuidv7. Why would I still use UIID? Perhaps I would still want uuids because they can be generated in client and because they make incorrect JOINs return no rows.

reply
tonyhart7
2 days ago
[-]
Yeah, just use uuidv4 and another "ULID" if thats the case

which is pointless

reply
pqdbr
2 days ago
[-]
Great article, specially for this part:

> What can go wrong with using UUIDv7 Using UUIDv7 is generally discouraged for security when the primary key is exposed to end users in external-facing applications or APIs. The main issue is that UUIDv7 incorporates a 48-bit Unix timestamp as its most significant part, meaning the identifier itself leaks the record's creation time.

> This leakage is primarily a privacy concern. Attackers can use the timing data as metadata for de-anonymization or account correlation, potentially revealing activity patterns or growth rates within an organization. While UUIDv7 still contains random data, relying on the primary key for security is considered a flawed approach. Experts recommend using UUIDv7 only for internal keys and exposing a separate, truly random UUIDv4 as an external identifier.

reply
SahAssar
2 days ago
[-]
> Experts recommend

What experts? For what scenarios specifically? When do they consider time-of-creation to be sensitive?

reply
dgb23
2 days ago
[-]
Or just generate them in bulk and take them from a list?
reply
hn_throwaway_99
2 days ago
[-]
> Experts recommend using UUIDv7 only for internal keys and exposing a separate, truly random UUIDv4 as an external identifier.

So then what's the point? How I always did things in the past was use an auto increment big int as the internal primary key, and then use a separate random UUID for the external facing key. I think this recommendation from "experts" is pretty dumb because you get very little benefit using UUIDV7 (beyond some portability improvements) if you're still using a separate internal key.

While I wouldn't use UUIDV7 as a secure token like I would UUIDV4, I don't see anything wrong with using UUIDV7 as externally exposed object keys - you're still going to need permissions checks anyway.

reply
morshu9001
2 days ago
[-]
I asked a similar question, and yeah it seems like this is entirely for distributed systems, even then only some of them. Your basic single DB Postgres should just have a serial PK.
reply
crazygringo
2 days ago
[-]
For distributed databases where you can't use autoincrement.

Or where, for some reason, the ID needs to be created before being inserted into the database. Like you're inserting into multiple services at once.

reply
sgarland
2 days ago
[-]
Many distributed databases have mechanisms to use an auto-increment, actually - often, generating large chunks at a time to hand out.
reply
grapesodaaaaa
2 days ago
[-]
Our “distributed database” at a fortune 90 company spans at least 10 different database products.

UUIDv4 lets us sidestep this.

Is it bad design? Probably. Is it going to happen at huge companies? Yes.

reply
sgarland
1 day ago
[-]
You’re not wrong, of course. It’s a natural consequence of the eschewing of DBAs, and the increasingly powerful compute available - even if someone did notice that the slowdown was due to the PK choice, they can often “fix” that by paying more money.
reply
andy_ppp
2 days ago
[-]
I wish Postgres would just allow you look up records by the random component of the field, what are the chances of collisions with 80 bits of randomness? My guess is it’s still enough.
reply
jagged-chisel
2 days ago
[-]
You can certainly create that index.
reply
andy_ppp
2 days ago
[-]
Yes, just obviously if it’s automated and part of Postgres people will use it without having to think too much and it removes one of the objections to what I think for most large systems is a sensible way to go rather than controversial because security.
reply
mamcx
2 days ago
[-]
What could be better is to allow to create a type with custom display, in/out and internally set the native type IN SQL (this require to do it in c)
reply
themafia
2 days ago
[-]
> growth rates

I honestly don't see how.

reply
gopalv
3 days ago
[-]
UUIDv7 is only bad for range partitioning and privacy concerns.

The "naturally sortable" is a good thing for postgres and for most people who want to use UUID, because there is no sorted distribution buckets where the last bucket always grows when inserting.

I want to see something like HBase or S3 paths when UUIDv7 gets used.

reply
vlovich123
2 days ago
[-]
> UUIDv7 is only bad for range partitioning and privacy concerns.

It's no worse for privacy than other UUID variants if the "privacy" you're worried about leaking is the creation time of the UUID.

As for range partitioning, you can of course choose to partition on the hash of the UUIDv7 at the cost of giving up cheaper rights / faster indices. On the other hand, that of course gives up locality which is a common challenge of partitioning schemes. It depends on the end-to-end design of the system but I wouldn't say that UUIDv7 is inherently good or bad or better/worse than other UUID schemes.

reply
saghm
2 days ago
[-]
Isn't it at least a bit worse than v4, which has no timestamp at all? There might be concerns around non-secure randomness being used to generate the bits, but I don't feel like it's accurate to claim that's indistinguishable from a literal timestamp.
reply
ibejoeb
2 days ago
[-]
UUIDv4 doesn't leak creation time.
reply
parthdesai
2 days ago
[-]
Why is it bad for range partitioning? If anything, it's better? With UUIDv7, you basically can partition on primary key, thus you can have "global" unique constraint.
reply
wara23arish
2 days ago
[-]
confused why it would be worse for range partitioning?

I assume there would be some type of index on the timestamp portion & the uuid portion?

wouldn’t that make it better for partitioning since we’d only need to query partitions that match the timestamp portion

reply
morshu9001
3 days ago
[-]
The article compares UUIDv7 vs v4, but doesn't say why you'd do either instead of just serial/bigserial, which has always been my goto. Did I miss something?
reply
molf
2 days ago
[-]
Good question. There's a few reasons to pick UUID over serial keys:

- Serial keys leak information about the total number of records and the rate at which records are added. Users/attackers may be able to guess how many records you have in your system (counting the number of users/customers/invoices/etc). This is a subtle issue that needs consideration on a case by case basis. It can be harmless or disastrous depending on your application.

- Serial keys are required to be created by the database. UUIDs can be created anywhere (including your backend or frontend application), which can sometimes simplify logic.

- Because UUIDs can be generated anywhere, sharding is easier.

The obvious downside to UUIDs is that they are slightly slower than serial keys. UUIDv7 improves insert performance at the cost of leaking creation time.

I've found that the data leaked by serial keys is problematic often enough; whereas UUIDs (v4) are almost always fast enough. And migrating a table to UUIDv7 is relatively straightforward if needed.

reply
MBCook
2 days ago
[-]
Not only can you make a good guess at how many customers/etc exist, you can guess individual ones.

World’s easiest hack. You’re looking at /customers/3836/bills? What happens if you change that to 4000? They’re a big company. I bet that exists.

Did they put proper security checks EVERYWHERE? Easy to test.

But if you’re at /customers/{big-long-hex-string}/bill the chances of you guessing another valid ID are basically zero.

Yeah it’s security through obscurity. But it’s really good obscurity.

reply
neya
2 days ago
[-]
This advice assumes /customers/:id/bills is public. Protected routes shouldn't expose sensitive information such as bills anyway, so this is more of an authorization issue (who can access which resource) more than privacy concerns. So this means, if you can access customes/4000/bills, then that's an application logic issue more than the type of ID itself.

In a well designed application, you shouldn't be able to guess whether a record exists or not simply by accessing a protected URL. As a counter argument - normal BIGINT or serial PKs are performant and are more than enough for most applications.

reply
andrewjf
1 day ago
[-]
You describe a world where human skill is required to prevent these class of bugs, time and time again we've proven that people are people and bugs happen.

Systems must be _structurally architected_ with security in mind.

Security is layered, using a random key with 128-bit space makes guessing UUIDs infeasible. But _also_ you should be doing AuthZ on the records, and also you should be doing rate limiting on API so they can't be brute forced, either.

reply
morshu9001
2 days ago
[-]
You normally aren't supposed to expose the PK anyway.
reply
bruce511
2 days ago
[-]
That advice was born primarily _because_ of the bigint/serial problem. If the PK is UUIDv4 then exposing the PK is less significant.

In some use cases it can be possible to exclude, or anonymize the PK, but in other cases a PK is necessary. Once you start building APIs to allow others to access your system, a UUIDv4 is the best ID.

There are some performance issues with very large tables though. If you have very large tables (think billions of rows) then UUIDv7 offers some performance benefits at a small security cost.

Personally I use v4 for almost all my tables because only a very small number of them will get large enough to matter. But YMMV.

reply
morshu9001
2 days ago
[-]
It's not about table size so much as number of joins. You don't need to trade off between security and performance if you simply expose a uuid4 secondary col on a serial PK'd table.
reply
edoceo
2 days ago
[-]
So the client side can create the ID before insert - that's the case that (mostly) drives it for me. The other is where you have distributed systems and then later want to merge the data and not have any ID conflicts.
reply
saagarjha
2 days ago
[-]
Allowing the client to generate IDs for you seems like a bad idea?
reply
morshu9001
2 days ago
[-]
Client = backend here, right? So you could make a bunch of rows that relate to each other then insert, without having to ping the DB each time to assign a serial ID. Normally the latter is what I do, but I can imagine a scenario where it'd be slow.
reply
wongarsu
2 days ago
[-]
The usual flow would be INSERT ... RETURNING id, which gives you the db-generated id for the record you just inserted with no performance penalty. That doesn't work for circular dependencies and it limits the amount of batching you can do. But typically those are smaller penalties than the penalty from having a 128 bit primary key vs a 64 bit key
reply
morshu9001
2 days ago
[-]
Yeah, that's what I do
reply
bramhaag
2 days ago
[-]
It can be quite elegant. You can avoid the whole temporary or external ID mess when the client generates the ID, this is particularly useful for offline-first clients.

Of course you need to be sure the server will accept the ID, but that is practically guaranteed by the uniqueness property of UUIDs.

reply
coolspot
2 days ago
[-]
“client” here may refer to a backend app server. So you can have 10-100s of backend servers inserting into a same table without having a single authority coordinating IDs.
reply
morshu9001
2 days ago
[-]
That table is still a single authority, isn't it? But I guess fewer steps is still faster.
reply
tracker1
2 days ago
[-]
Except if you're using a sharding or clustering database system, where the record itself may be stored to separate servers as well as the key generation itself.
reply
morshu9001
2 days ago
[-]
In those cases yes. There's still a case for sequential there depending on the use pattern, but write-heavy benefits from not waiting on one server for IDs.
reply
clintonb
1 day ago
[-]
Client-generated IDs are necessary for distributed or offline-first systems. My company, Vori, builds a POS for grocery stores. The POS generates UUIDv7 IDs for all data it creates and that data is eventually synced to our backend. The sync time can range from less than 1 second for a store with fast Internet to hours if a store is offline.

Is a collision possible? Yes, but the likelihood of a collision is so low that it's not worth agonizing over (although I did when I was designing the system).

reply
markstos
2 days ago
[-]
Why?
reply
jrochkind1
2 days ago
[-]
yup, I'd say those are the two biggies.
reply
Deadron
2 days ago
[-]
For when you inevitably need to expose the ids to the public the uuids prevent a number of attacks that sequential numbers are vulnerable to. In theory they can also be faster/convenient in a certain view as you can generate a UUID without needing something like a central index to coordinate how they are created. They can also be treated as globally unique which can be useful in certain contexts. I don't think anyone would argue that their performance overall is better than serial/bigserial though as they take up more space in indexes.
reply
xienze
2 days ago
[-]
People really overthink this. You can safely expose internal IDs by doing a symmetric cipher, like a Feistel cipher. Even sequential IDs will appear random.
reply
whiskey-one
2 days ago
[-]
Looks easy on the surface, but the problem is key rotation.
reply
morshu9001
2 days ago
[-]
But these are internal IDs only, and public ones should be a separate col. Being able to generate uuid7 without a central index is useful in distributed systems, but this is a Postgres DB already.

Now, the index on the public IDs would be faster with a uuid7 than a uuid4, but you have a similar info leak risk that the article mentions.

reply
rcfox
2 days ago
[-]
"Distributed systems" doesn't have to mean some fancy, purpose-built thing. Just correlating between two Postgres databases might be a thing you need to do. Or a database and a flat text file.
reply
morshu9001
2 days ago
[-]
I usually just have a uuid4 secondary for those correlations, with a serial primary. I've done straight uuid4 PK before, things got slow on not very large data because it affected every single join.
reply
ibejoeb
2 days ago
[-]
If you need an opaque ID like a uuid because, for example, you need the capability to generate non-colliding IDs generated by disparate systems, the best way I've found is to separate these two concerns. Use a UUIDv4 for public purposes and a bigint internally. You don't need to worry about exposing creation time, and you can still manage your data in the home system with all the properties that a total ordering affords.
reply
tracker1
2 days ago
[-]
Now coordinate those sequential ids on a sharded or otherwise clustered database system.
reply
ibejoeb
2 days ago
[-]
That's the point. Those are only system-unique, not universally. It's a lower-level attribute that is an implementation detail, like for referential integrity in an rdbms. At that point, if you need it, you have atomic increment.
reply
nextaccountic
2 days ago
[-]
uuids can be generated by multiple services across your stack

bigserial must by generated by the db

reply
coolspot
2 days ago
[-]
But what if we just use milliseconds as our bigserial? And maybe add some hw-random number at the end to avoid conflicts? Wait
reply
tracker1
2 days ago
[-]
Somehow +1 on this comment just doesn't feel like enough.
reply
crazygringo
2 days ago
[-]
Oh yeah, it would be an identifier but it would be unique. Across the universe of all devices, effectively. Should come up with a name for that
reply
simongr3dal
2 days ago
[-]
I believe the concern is if your primary key in the database is a serial number it might be exposed to users unless you do extra work to hide that ID from any external APIs and if there are any flaws in your authorization checks it can allow enumeration attacks exposing private or semi-private info. With UUIDs being virtually unguessable that makes it less of a concern.
reply
morshu9001
2 days ago
[-]
uuid7 is still guessable though, as the article says. The assumption is that these are internal only PKs.
reply
molf
2 days ago
[-]
There is a big difference though. Serial keys allow attackers to guess the rate at which data is being added.

UUID7 allows anyone to know the time of creation, but not how many records have been created (approximately) in a particular time frame. It leaks data about the record itself, but not about other records.

reply
tracker1
2 days ago
[-]
Far, far less than sequential Ids, and the random part is some pretty big values numerically... I mean there's billions of possible values for every MS on the generating server... you aren't going to practically "guess" at them.
reply
e12e
2 days ago
[-]
Guessable with 80 bits of entropy?
reply
mhuffman
2 days ago
[-]
>why you'd do either instead of just serial/bigserial, which has always been my goto. Did I miss something?

So the common response is sequential ID crawling by bad actors. UUIDs are generally un-guessable and you can throw them into slop DBs like Mongo or storage like S3 as primary identifiers without worrying about permissions or having a clever interested party pwn your whole database. A common case of security through obscurity.

reply
martinky24
2 days ago
[-]
You don’t scale horizontally, do you?
reply
rcfox
2 days ago
[-]
Do most people? Not everyone is Google.
reply
martinky24
2 days ago
[-]
Many people have more than 1 server that need to generate coherent identifiers amongst one another. That's not a "Google scale" thing.
reply
rcfox
2 days ago
[-]
Your comment heavily implied (to me) scaling databases horizontally. Yes, it's not necessarily "Google scale" either, but it's a ton of extra complexity that I'm happy to avoid. But a Google employee is probably going to approach every public-facing project with the assumption of scaling everything horizontally.

With multiple servers talking to a single database, I'd still prefer to let the database handle generating IDs.

reply
morshu9001
2 days ago
[-]
Yeah, there's too much advice jumping straight to uuid4 or 7 PKs for no particular reason. If you're doing a sharded DB, maybe, and even then it depends.

Speaking of Google, Spanner recommends uuid4, and specifically not any uuid that includes a timestamp at the start like uuid7.

reply
morshu9001
2 days ago
[-]
This is Postgres. There is Citus, but that still supports (maybe recommends?) serial PKs.
reply
caymanjim
2 days ago
[-]
Tangential, but I'm grateful to this article for teaching me that Postgres has "table foo" as shorthand for "select * from foo". I won't use that in code, but I'll happily use it for interactive queries.
reply
stickfigure
2 days ago
[-]
It never occurred to me that Postgres is more efficient when inserting monotonic values. It's the nature of B+ trees so it makes sense. But in the world of distributed databases, monotonic inserts create hot partitions and scalability problems, so evenly-distributed ids are preferred.

In other words, "don't try this with CRDB".

reply
chuckadams
2 days ago
[-]
It's the nature of B+ trees, multiplied by the nature of clustered indexes: if you use a UUIDv4 as a primary key, your entire row gets moved to random locations, which really sucks when you normally retrieve them sequentially. With a non-clustered index (say, your UUIDv4 id you use for public APIs when you don't want to leak the v7 info) then you'll still get more fragmentation with the random data, but it's something autovacuum can usually keep up with. But it's more work it has to do on top of everything else it does.
reply
masklinn
2 days ago
[-]
Gp mentioned Postgres, which does not have clustered indexes. It has table clustering, which is a point operation rewriting the entire table but not a persistent property.
reply
chuckadams
2 days ago
[-]
Ah, I forgot CLUSTER was something run by hand on PG. Same footgun then, but you have to load and aim it yourself instead of being fully automatic like it is in MySQL, where it appears you can't opt out of clustering by the PK (similar story in SQL Server, but you can change which index it clusters by). Thanks for the clarification.
reply
evanelias
2 days ago
[-]
This is a bit pedantic, but you’re conflating MySQL with InnoDB. (In MySQL’s model, you can opt out of having a clustered index by choosing a storage engine that doesn’t use clustering.)

In practical terms, choosing a non-InnoDB storage engine is pretty rare with upstream MySQL, but perhaps slightly less so in Percona Server or MariaDB.

reply
baq
2 days ago
[-]
Leaky abstractions in databases are one of the reasons every developer should read the table of contents of the hot databases used by the things he’s working on. IME almost no one does that.
reply
therealdrag0
2 days ago
[-]
Can you elaborate on the hot partition bit?
reply
pmontra
2 days ago
[-]
My customers return created_at attributes in all their API calls so UUIDv7 won't harm them at all. They also use sequential ids. Only one of them ever used UUIDv4 as primary key. We didn't have any performance problem but the whole production system was run by one psql insurance and one Elixir application server. Probably almost any architectural choice is good at that scale.
reply
perrygeo
2 days ago
[-]
Is there an unavoidable tradeoff here? Keys that order nicely (auto-incrementing integers, UUIDv7) naturally leak information. Keys that are more secure (UUIDv4) can have performance problems because they have poor locality.

Or are there any random id generators that can compromise, remain sequential-ish without leaking exact timestamps and global ordering?

reply
inopinatus
2 days ago
[-]
Symmetric encryption of IDs at the edge. Optional embedded HMAC. Optional text encoding. For monotonic bigserial values I'm somewhat fond of base58(AES_K1(id{8} || HMAC_K2(id{8})[0..7])) with purpose/table-salted HKDF subkeys from a scrypt'd system passphrase. The hot path of this is pretty fast. As with all cryptographic solutions it comes with a whole new jungle of pitfalls, caveats, and tradeoffs, but it works.
reply
morshu9001
1 day ago
[-]
How big would the resulting public ID be?
reply
inopinatus
22 hours ago
[-]
That depends on exact scheme and text encoding, but in the example I give above, they are 22 characters, and I will even pad them in the text encoder for length consistency.
reply
mjb
2 days ago
[-]
Yes. The spatial locality benefits drop off quite quickly. A hashed uuidv7-like scheme with a rotating salt, for example, would keep short term locality and it's performance benefits while not having long term locality and it's downsides.
reply
AlotOfReading
2 days ago
[-]
The tradeoff is unavoidable. At one end is UUIDv4. At the far end is a gray code with excellent locality, but inherently allows you to know which half of the indices the record is from (even without inverting it). UUIDv7 is a pretty good middle ground.
reply
Rafert
2 days ago
[-]
> Using UUIDv7 is generally discouraged for security when the primary key is exposed to end users in external-facing applications or APIs.

I would not call this “generally discouraged” when APIs generally surface a created_at timestamp in their responses. A real life example are Stripe IDs which have similar properties (k-sorted) as UUIDv7: https://brandur.org/nanoglyphs/026-ids#ulids

reply
mfrye0
2 days ago
[-]
I can confirm on the performance benefits. I wanted to start with uuidv7 for a new DB earlier this year, so I put together a function to use in the meantime. Once the function is available natively, we'll just migrate to use it instead.

For anyone interested:

CREATE FUNCTION uuidv7() RETURNS uuid AS $$ -- Get base random UUID and overlay timestamp select encode( set_bit( set_bit( overlay(uuid_send(gen_random_uuid()) placing substring(int8send((extract(epoch from clock_timestamp())*1000)::bigint) from 3) from 1 for 6), 52, 1), -- Set version bits to 0111 53, 1), 'hex')::uuid; $$ LANGUAGE sql volatile;

reply
delifue
2 days ago
[-]
I disagree with this

> While UUIDv7 still contains random data, relying on the primary key for security is considered a flawed approach

The correct way is 1. generate ID on server side, not client side 2. always validate data access permission of all IDs sent from client

Predictable ID is only unsafe if you don't validate data access permission of IDs sent from client. Also, UUIDv7 is much less predictable than auto-increment ID.

But I do agree that having create time in public-facing ID can leak analytical information.

reply
bearjaws
2 days ago
[-]
I really disagree that the privacy risk is enough to not use it at all, even in a healthcare setting.

There are wild scenarios you can come up with where you may leak something, but that assumes the information isn't coming over anyway.

"Reveals account creation time" - most APIs return this in API responses by default.

When have you seen just a list of UUIDs and no other more revealing metadata?

Meanwhile what pwns 99% of companies? Phishing.

reply
sverhagen
2 days ago
[-]
API responses should be limited to authenticated users. IDs are often present in hyperlinks that are included in insecure emails, or in URLs that, being routed through all sorts of networking hops may be captured and available as metadata.
reply
gnatolf
2 days ago
[-]
For me, the shear length of uuids is annoying in payloads of tokens etc. I wish there was a common way to abbreviate those, similar to the git way.
reply
pmontra
2 days ago
[-]
It's a 128 bit number. If you express that number in base 62 (26 upper case letters + 26 downcase letters + 10 digits) you need only a bit more than 20 characters. You can compress it further by increasing the base and using other 8 bit ASCII characters.
reply
Merad
2 days ago
[-]
Crockford base32 [0] is the best compromise, IMO. Reasonable length of 26 chars. Uses only alphanumeric characters and avoids issues with case sensitivity and confusing characters (0 vs O, etc.).

0: https://www.crockford.com/base32.html

reply
lucasyvas
2 days ago
[-]
These are all non-issues - don’t allow an end user to determine a serial primary key as always.

And the amount of information it leaks is negligible - they might know the oldest and the newest and there’s an infinite gulf in between.

It’s better and more practical than SERIAL or BIGSERIAL in every way - if you need a random/external ID, add a second column. Done.

reply
morshu9001
2 days ago
[-]
Why not serial PK with uuid4 secondary? Every join uses your PK and will be faster.
reply
Biganon
2 days ago
[-]
> if you need a random/external ID, add a second column. Done.

As others have stated, it completely defeats the performance purpose, if you need to lookup using another ID.

reply
pilif
2 days ago
[-]
One thing that’s not quite clear to me is how safe it is to generate v7 uuids on the client.

That’s one of the nice properties of v4 uuids: you can make up a primary key of a new entity directly on the client and the database can use it directly. Sure: there is tiny collision risk, but it’s so small, you can get away with mostly ignoring it

With v7 however, such a large chunk of the uuid is based on the time, so I’m not sure whether it’s still safe to ignore collisions in any application, especially when you consider client’s clocks to probably be very inaccurate.

Am I overthinking things here?

reply
PhilippGille
2 days ago
[-]
How many clients requests do you get in the same millisecond?

With UUIDv7 it's split into:

- 48 bits: Unix timestamp in milliseconds

- 12 bis: Sub-millisecond timestamp fraction for additional ordering

- 62 bits: Random data for uniqueness

- 6 bits: Version and variant identifiers

So >4,600,000,000,000,000,000 IDs per fraction of a millisecond.

And unprecise time on the client doesn't matter, because some are ahead and some behind, vut that doesn't make them more likely to clash.

reply
cenamus
2 days ago
[-]
Does that factor in the birthday paradox?
reply
qeternity
2 days ago
[-]
If the client can generate a uuid4 they can also reuse a known uuid4
reply
turrini
2 days ago
[-]
Something like this [1] or an adaptation may address their security considerations. Discussed here [2]

[1] https://github.com/stateless-me/uuidv47

[2] https://news.ycombinator.com/item?id=45275973

reply
burnt-resistor
2 days ago
[-]
Sequential primary keys are pretty important for scalable, stable sorting by record creation time using the primary keys' index similar to serial (int) but avoids the guessing vulnerability. For this use-case, an UUID "v9"-like approach can be a better option: https://uuidv9.jhunt.dev
reply
klysm
1 day ago
[-]
I don’t care at all about “leaking” the creation time for records. I think the documentation is overly zealous
reply
rvitorper
2 days ago
[-]
Does anyone have performance issues with uuidv4? I worked with a db with 10s of billions of rows, no issues whatsoever. Would love to hear the mileage of fellow engineers
reply
zerd
1 day ago
[-]
I've had issues in a database with billions of rows where the PKs were a UUID. Indices on PK, and also foreign keys from other tables pointing to that table were pretty big, so much so that the indices themselves didn't all fit in memory. Like we would have an index on customer_id, document_id, both UUIDv4. DB didn't have UUID support, so they were stored as strings, so just 1 billion rows took ~30 GiB memory for PK index, 60GiB for the composite indices etc. So eventually the indices would not fit in memory. If we had UUID support or stored as bytes it might have halved it, but eventually become too big.

If you needed to look up say the 100 most recent documents, that would require ~100+ disk seeks at random locations just to look up the index due to the random nature of UUIDv4. If they were sequential or even just semi-sequential that would reduce the number of lookups to just a few, and they would be more likely to be cached since most hits would be to more recent rows. Having it roughly ordered by time would also help with e.g. partitioning. With no partitioning, as the table grows, it'll still have to traverse the B-Tree that has lots of entries from 5 years ago. With partitioning by year or year-month it only has to look at a small subset of that, which could fit easily in memory.

reply
cipehr
2 days ago
[-]
What database were you using? For example with SQL server, by default it clusters data on disk by primary key. Random (non-sequential) PKs like uuidv4 require random cluster shuffling to insert a row “in the middle” of a cluster, increasing io load and causing performance issues.

Postgres on the other hand doesn’t do clustered indexing on the PK… if I recall correctly.

reply
rvitorper
2 days ago
[-]
Postgres. It was also a single instance, which made it significantly easier. But nice to know that this is an issue on SQL Server
reply
masklinn
2 days ago
[-]
Postgres is not immune to uuid issues, just less sensitive, uuidv4 still does not play well with btree indexes, bloating them and impacting their performance.
reply
ahoka
2 days ago
[-]
Do you also require your users to register in the alphabetical order of their names?
reply
rkomorn
2 days ago
[-]
New viral marketing idea just dropped: registration open to b-names only! Get in now before you lose your chance forever to the c-names!
reply
ahoka
2 days ago
[-]
Then cluster it differently? The whole problem uuidv7 in databases solves is a non-issue in most cases.
reply
crazygringo
2 days ago
[-]
Honestly not really. Yes random keys make inserts slower. But if inserts are only 1% of your database load, then yeah it's basically no issues whatsoever.

On the other hand, if you're basically logging to your database so inserts are like 99% of the load, then it's something to consider.

reply
rvitorper
2 days ago
[-]
Makes sense. Thanks for the comment
reply
MaKey
2 days ago
[-]
Interesting that aiven is still around after they've lost customer data a few years back.
reply
qntmfred
2 days ago
[-]
any thoughts on uuidv7 vs ulid, nanoid, etc for url-safe encodings?
reply
nikisweeting
2 days ago
[-]
ULID is the best balance imo, it's more compact, can be double clicked to select, and case-insensitive so it can be saved on macOS filesystems without conflicts.

Now someone should make a UUIDv7 -> ULID adapter lib that 1:1 translates UUIDv7 <-> ULID preserving all the timestamp resolution and randomness bits so we can use the db-level UUIDv7 support to store ULIDs.

reply
masklinn
2 days ago
[-]
A uuid is a 128b number with a specific structure. You can encode them in base32 if you want, there is no need for any sort of conversion scheme.
reply
nikisweeting
2 days ago
[-]
You need to convert it to perserve the timestamp info correctly so that a ULID library reading the base32 format would reproduce the same timestamp.
reply
masklinn
2 days ago
[-]
What I'm saying is that ULID is irrelevant and unnecessary, if you want "double clicked to select, and case-insensitive" you just encode your UUIDs in base32. They're still UUIDs.
reply
clintonb
1 day ago
[-]
reply
thewisenerd
2 days ago
[-]
i guess that depends on what you mean by url-safe

uuidv7 (-) and nanoid (_-) have special characters which urlencode to themselves.

none are small enough that you want someone reading them over the phone; but from a character legibility, ulid makes more sense.

reply
6r17
3 days ago
[-]
Great read - short, effective ; I know what I learned. Very good job
reply