Cassandra data modeling isn’t really something you can do justice to in a brief overview, and it is a subject that can take years to fully grasp, but this should be a good starting point.
Cassandra distributes data around the cluster via the partition key.
CREATE TABLE my_key.my_table_by_postal_code (postal_code text, id uuid, balance float, PRIMARY KEY(postal_code, id));
In the above table the partition key is postal_code and the clustering column is id. The partition key will locate the data on the cluster for us. The clustering column allows us multiple rows per partition key so that we can filter how much data we read per partition.
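To make that concrete, here is a minimal sketch using the DataStax Python driver; the contact point is an assumption, and the uuid is just the example value used later in this post:

from uuid import UUID
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumption: a local node
session = cluster.connect('my_key')

# the partition key ('77002') tells the driver which replicas own the data;
# the clustering column (id) narrows the read to a single row within that partition
row = session.execute(
    "SELECT * FROM my_table_by_postal_code WHERE postal_code = %s AND id = %s",
    ('77002', UUID('7895c6ff-008b-4e4c-b0ff-ba4e4e099326'))).one()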
The ‘optimal’ query is one that retrieves data from only one node, and not so much data that GC pressure or latency issues result. The following query breaks that rule by retrieving two partitions at once via the IN clause.
SELECT * FROM my_key.my_table_by_postal_code WHERE postal_code IN ('77002', '77043');
This can be slower than doing two separate queries asynchronously, especially if those partitions are on two different nodes (imagine if there are 1000+ partitions in the IN statement). In summary, the simple rule to stick to is “1 partition per query”.
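For example, here is a minimal sketch of the “one partition per query” approach with the DataStax Python driver; the contact point is an assumption and error handling is omitted:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumption: a local node
session = cluster.connect('my_key')

query = "SELECT * FROM my_table_by_postal_code WHERE postal_code = %s"

# one query per partition; the driver runs them concurrently and can route
# each request directly to a replica that owns that partition
futures = [session.execute_async(query, (code,)) for code in ('77002', '77043')]
rows = [row for future in futures for row in future.result()]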
A common mistake when data modeling is to jam as much data as possible into a single partition.
A common approach to optimize around partition lookup is to create a table per query, and write to all of them on update. The following example has two related tables, each solving a different query:
--query by postal_code
CREATE TABLE my_key.my_table_by_postal_code (postal_code text, id uuid, balance float, PRIMARY KEY(postal_code, id));
SELECT * FROM my_key.my_table_by_postal_code WHERE postal_code = '77002';
--query by id
CREATE TABLE my_key.my_table (id uuid, name text, address text, city text, state text, postal_code text, country text, balance float, PRIMARY KEY(id));
SELECT * FROM my_key.my_table WHERE id = 7895c6ff-008b-4e4c-b0ff-ba4e4e099326;
You can update both tables at once with a logged batch:
BEGIN BATCH
INSERT INTO my_key.my_table (id, city, state, postal_code, country, balance) VALUES (7895c6ff-008b-4e4c-b0ff-ba4e4e099326, 'Bordeaux', 'Gironde', '33000', 'France', 56.20);
INSERT INTO my_key.my_table_by_postal_code (postal_code, id, balance) VALUES ('33000', 7895c6ff-008b-4e4c-b0ff-ba4e4e099326, 56.20);
APPLY BATCH;
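The same logged batch can be issued from application code; here is a minimal sketch with the DataStax Python driver (the contact point is an assumption):

from uuid import UUID
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType, SimpleStatement

cluster = Cluster(['127.0.0.1'])  # assumption: a local node
session = cluster.connect('my_key')

row_id = UUID('7895c6ff-008b-4e4c-b0ff-ba4e4e099326')

# a logged batch so both denormalized tables are updated together
batch = BatchStatement(batch_type=BatchType.LOGGED)
batch.add(SimpleStatement(
    "INSERT INTO my_table (id, city, state, postal_code, country, balance) "
    "VALUES (%s, %s, %s, %s, %s, %s)"),
    (row_id, 'Bordeaux', 'Gironde', '33000', 'France', 56.20))
batch.add(SimpleStatement(
    "INSERT INTO my_table_by_postal_code (postal_code, id, balance) "
    "VALUES (%s, %s, %s)"),
    ('33000', row_id, 56.20))
session.execute(batch)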
A common design pattern is to have one table act as the authoritative one for the data; if for some reason there is a mismatch or conflict in other tables, as long as one table is considered “the source of truth” it is easy to fix any conflicts later. This is typically the table that would match what we see in a typical relational database and has all the data needed to generate all related views or indexes for different query methods. Taking the prior example, my_table is the source of truth:
--source of truth table
CREATE TABLE my_key.my_table (id uuid, name text, address text, city text, state text, postal_code text, country text, balance float, PRIMARY KEY(id));
SELECT * FROM my_key.my_table WHERE id = 7895c6ff-008b-4e4c-b0ff-ba4e4e099326;
--based on my_key.my_table and so we can query by postal_code
CREATE TABLE my_key.my_table_by_postal_code (postal_code text, id uuid, balance float, PRIMARY KEY(postal_code, id));
SELECT * FROM my_key.my_table_by_postal_code WHERE postal_code = '77002';
Next we discuss strategies for keeping related tables in sync.
Materialized views are a feature that ships with Cassandra but is currently considered rather experimental. If you want to use them anyway:
CREATE MATERIALIZED VIEW my_key.my_table_by_postal_code
AS SELECT postal_code, id, balance
FROM my_key.my_table
WHERE postal_code IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (postal_code, id);
Materialized views at least run faster than the comparable BATCH insert pattern, but they have a number of bugs and known issues that are still pending fixes.
Secondary indexes are the original server-side approach to handling different query patterns, but they have a large number of downsides: a secondary index is local to each node, so a query that is not restricted to a single partition has to fan out across the cluster, and performance degrades badly on very high or very low cardinality columns. For just those two reasons I think it’s rare that one can use secondary indexes and expect reasonable performance. However, you can build an index table by hand and query that data asynchronously to avoid some of the downsides.
CREATE TABLE my_key.my_table_by_postal_code_2i (postal_code text, id uuid, PRIMARY KEY(postal_code, id));
SELECT * FROM my_key.my_table_by_postal_code_2i WHERE postal_code = '77002';
--retrieve all rows then asynchronously query the resulting ids
SELECT * FROM my_key.my_table WHERE id = ad004ff2-e5cb-4245-94b8-d6acbc22920a;
SELECT * FROM my_key.my_table WHERE id = d30e9c65-17a1-44da-bae0-b7bb742eefd6;
SELECT * FROM my_key.my_table WHERE id = e016ae43-3d4e-4093-b745-8583627eb1fe;
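Here is a rough sketch of that two-step read with the DataStax Python driver (again, the contact point is an assumption):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumption: a local node
session = cluster.connect('my_key')

# step 1: read the hand-built index partition to collect the matching ids
index_rows = session.execute(
    "SELECT id FROM my_table_by_postal_code_2i WHERE postal_code = %s",
    ('77002',))

# step 2: fan out one asynchronous query per id against the source-of-truth table
lookup = "SELECT * FROM my_table WHERE id = %s"
futures = [session.execute_async(lookup, (row.id,)) for row in index_rows]
results = [future.result().one() for future in futures]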
This is a good basic first exercise, as it requires using multiple tables for the same data, but not too many of them.
This builds on the basics from the previous exercise and requires a more involved understanding of the concepts. It will require many tables and some difficult trade-offs on partition sizing. There is no one correct way to do this.
This will require some extensive time series modeling and takes some of the lessons from the Music Service further. The table(s) used will be informed by the query.