UUIDs solve the global uniqueness and ease of use problem at the same time, but introduces another problem in the database layer - memory footprint. A UUID is a 36 byte ASCII string of characters representing a 16 byte value. I've seen them stored as CHAR(36), VARCHAR(36), etc. The problem with storing them in their ASCII form is that the index will grow beyond the capacity of the database server faster than when they're stored in a binary form. Memory will fill faster, queries will hit the disk more frequently, and it will ultimately cost more money to save the exact same information. It's not worth it.
Data-Driven Analysis
What follows is the result of a test I performed on my laptop to see just how much worse storing a UUID as VARCHAR(36) is, as compared with the less bulky BINARY(16). Just guessing, it will be 36/16 worse, right? I wanted to be sure, since MySQL does many things under the hood.
Database Tables
We'll create 2 separate tables, one for binary representation and the other for ascii. These two tables will have a key on the UUID field, but that key will not be primary. Typically, you'll use the UUID as the primary key, but I found key sizes easier to report if they were not primary.
CREATE TABLE `uuidbinary` (`id` BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, `uuid` BINARY(16) NOT NULL, `value` BIGINT NOT NULL, KEY (`uuid`) USING HASH) ENGINE=INNODB;
CREATE TABLE `uuidvarchar` (`id` BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, `uuid` VARCHAR(36) NOT NULL, `value` BIGINT NOT NULL, KEY (`uuid`) USING HASH) ENGINE=INNODB;
Script
Rather than bore you with a stupid-simple PHP script, let me give you pseudo code.
connect to db
for 0 to 1000
query "insert into uuidbinary values (null, unhex(<uuid without dashes>), <random int>)"
for 0 to 1000
query "insert into uuidvarchar values (null, <uuid with dashes>, <random int>"
I split the inserts so that I could monitor their execution time independently. The only queries that were run were INSERT INTO queries.
Results
Total Index Size - Linear Axis |
Total Index Size - Logarithmic Axis |
Storage Size - Linear Axis |
Storage Size - Logarithmic Axis |
Index Size / Total Size |
Insertion Time / row |
Extrapolation
Using either the data size, the index size, or the sum of both, the varchar format ends up using about 1.5x as much space as the binary format. You're probably doing something wrong if you're using a relational database like MySQL, and only storing IDs once, though. For something like a user ID, you are likely storing the ID at least 5 times. The 1.5x becomes 7.5x. If your database server is running with 7.5 GB of RAM dedicated to user IDs, you can reduce it to 1.5 GB by storing those UUIDs in the BINARY(16) format.
Storing the user ID 5 times seems excessive, but at Lucidchart, we store it in the following ways:
- Once per user
- User Record
- User Attributes
- Zero to many per user
- Payments
- Documents
- Team
- Folders
- Invitations
- Valid Sessions
- Newsletters
- Images
- Image Tags
- Spellchecker Words
- Statistics (millions per user)
- Aggregated statistics (thousands per user)
As you can imagine, the more times you use the same ID, the more you can save by using my recommended method.
Update: I've made a new post about how to use UUIDs with Play and Anorm.
Update: I've made a new post about how to use UUIDs with Play and Anorm.
nice
ReplyDelete