Fri 07 May 2021

How to optimize ORDER BY RANDOM()

Tom Witowsky (@devgummibeer) shared on Twitter a scaling issue with his service, which helps any developer share and highlight their open source work. As the service grows, more and more data is stored in the database and needs to be browsed. One particularly slow query that he needed help optimizing is fetching random users, organizations, and repositories that are already part of the service.

Source: How to optimize ORDER BY RANDOM(), an article by Tobias Petry.

Cryptographic shuffle

What if I needed to shuffle a list but couldn't hold the whole thing in memory? Or what if I didn't want to shuffle a list, but just traverse it in a shuffled manner? (That is, visit each element once and only once, in a randomized way.) What if I wanted to traverse it, but didn't want to precompute or store the traversal for some reason?

This would allow me to publish items from a list in an order that was unpredictable from the outside, but in fact deterministic and based on a secret key, and without precomputing anything (or worrying about collisions). Or I could use it to assign small non-sequential IDs that would eventually saturate the space of n-character strings in a pseudorandom order, obscuring the true size of the set for anyone who could just view some subset of the assigned IDs. They wouldn't even be able to tell if there were gaps in the list of IDs they could observe.

Source: Cryptographic shuffle, an article by Tim McCormack.