4min

Snowflake at Instagram

Snowflake at Instagram

Instagram uses a technique called snowflaking to generate IDs in their databases.

Requirements

Instagram's ID generation system is designed with specific requirements in mind:

  • IDs Sortable by Time: Essential for pagination, filtering, and batch processing.
  • ~64 Bits for Efficient Indexing: Ensuring optimal performance for data indexing.
  • No New Services Required: Implementation within the existing infrastructure.

Structure of Snowflake IDs

The core of Instagram's strategy lies in the structure of its snowflake IDs:

  • 41 Bits - Epoch since Jan 1, 2011: Establishing a time reference point.
  • 13 Bits - DB Shard ID: Identifying the logical shard within the database.
  • 10 Bits - Per Shard Sequence Number: Ensuring uniqueness within each shard.

Implementation Details

To meet these requirements, Instagram's logical shards are distributed across physical DB servers, usually ranging from 10 to 15 servers. Each logical shard corresponds to a database created on a physical server, such as a MySQL server.

Instagram utilizes a technique known as "snowflaking" in its database during the INSERT process to meet specific requirements efficiently. Sorting IDs by time is crucial for tasks like pagination, filtering, and batch processing. To ensure this, the IDs are approximately 64 bits in size for efficient indexing.

Another vital requirement is to implement the snowflaking technique within the existing infrastructure, avoiding the introduction of new services.

The snowflake ID structure consists of three parts: 41 bits for the epoch since Jan 1, 2011, 13 bits for the DB shard ID, and 10 bits for the per-shard sequence number.

Instagram employs logical shards distributed across physical DB servers, typically in the range of 10 to 15 servers. Each logical shard represents a database created on a physical server, such as a MySQL server.

Example

  • MySQL Server → Physical
  • CREATE DATABASE → Logical
CREATE DATABASE insta5;
\c insta5;
CREATE SEQUENCE insta5.photos_id_seq;
CREATE TABLE insta5.photos (
    id bigint DEFAULT nextval('insta5.photos_id_seq'),
    -- Other columns in the table
);
CREATE OR REPLACE FUNCTION
  insta5.next_id(OUT result bigint) AS $$
DECLARE
    epoch bigint := 1314220021721;
    seq_id bigint;
    now_ms bigint;
    shard_id bigint := 5;
BEGIN
    SELECT nextval('insta5.photos_id_seq') % 1024 INTO seq_id;
    now_ms := extract(epoch from now()) * 1000;
    result := (now_ms - epoch) << 23;
    result := result | (shard_id << 10);
    result := result | seq_id;
END;
$$ LANGUAGE plpgsql;

First, we're creating a brand new database called "insta5" where we'll store all sorts of data.

Next, we're setting up a specific table within this database named "photos." This table will have several columns, but they're not shown here.

Now, let's talk about the interesting part – a function called "insta5.next_id." This function is designed to generate unique identifiers for records in our "photos" table.

Inside this function:

  • We have a variable called "epoch," which represents a point in time (1314220021721 milliseconds since some reference point).
  • Another variable called "shard_id" is set to 5.
  • We calculate a "seq_id" by taking the remainder of a certain sequence value divided by 1024. This "seq_id" will help make our identifiers unique.

Then, we calculate the final "result" identifier:

  • First, we figure out the current time in milliseconds, which is not shown here.
  • We subtract the "epoch" time from the current time and shift the result left by 23 bits.
  • We also incorporate the "shard_id" by shifting it left by 10 bits.
  • Finally, we combine everything to create a unique identifier for our records.