
Lambda+: Cassandra and Spark for Scalable Architecture

UPDATE: For some background on Spark Streaming and Cassandra, please consult some of my previous blog posts on the subject.

Many of you have heard of the Lambda Architecture, and a few of you may have used it. If you’ve not heard of it, the name doesn’t communicate much. In short, the Lambda Architecture has three primary components and a two-pronged approach to analytics:

  • Speed Layer: records are analyzed as they’re created.
  • Batch Layer: runs the same analytics as the Speed Layer, but over all records, using a database designed for this type of work such as ElephantDB.
  • Serving Layer: reads data from both the Speed Layer and the Batch Layer. If they disagree on the same records, the data from the Batch Layer is treated as authoritative.

This is a very brief overview with some generalizations, so for those interested in the details I suggest reading up more at http://lambda-architecture.net/.

    OK, great, so how do Cassandra & Spark fit in?

    With Cassandra & Spark we can build something that achieves the same goals as the Lambda Architecture, but more simply and with fewer moving pieces: the Speed Layer and the Batch Layer are combined into a single data store running on Cassandra, and Spark plus Spark Streaming give us a single code base responsible for both batch analytics and streaming. This paper will detail how this saves complexity compared to the traditional Lambda Architecture.

    Deployment Overview

  • 5 Cassandra nodes also running Spark. We place them on the same hardware for data locality reasons.
  • 3 messaging servers to handle ingest. I have chosen Kafka brokers for fast ingest; this can however be any message queue.
  • 3 application servers running the preferred application server of a given organization. This can be Jetty, NodeJS + Nginx, Ruby + Puma, or really anything the organization chooses. I have chosen Tomcat for this demonstration.
  • 1 Spark Streaming job running continuously on the cluster, processing a batch every second.
  • 2 Spark batch jobs, each running once an hour at different times.
  • A dashboard for reading the data served up by the application servers.

    Schema

    CREATE KEYSPACE lambda_plus WITH REPLICATION =
    { 'class': 'NetworkTopologyStrategy', 'Analytics': 3 };

    CREATE TABLE lambda_plus.records (
      time_bucket timestamp,
      sensor_id int,
      data double,
      ts timestamp,
      PRIMARY KEY (time_bucket, sensor_id)
    ); // populated by the Application Servers

    CREATE TABLE lambda_plus.bucket_rollups (
      time_bucket timestamp,
      average_reading double,
      max_reading double,
      min_reading double,
      PRIMARY KEY (time_bucket)
    ); // populated by Spark Streaming and the
       // Spark Job, and therefore updated every second

    CREATE TABLE lambda_plus.sensor_rollups (
      sensor_id int,
      last_reading timestamp,
      average_reading double,
      max_reading double,
      min_reading double,
      PRIMARY KEY (sensor_id)
    ); // populated by the Spark Job only and therefore
       // stale for up to an hour

    Application Components

    Sensor Data Ingest


    Sensors submit an HTTP request to the application servers, which then perform two functions:

    1. Submit a record to Kafka.
    2. Submit the record directly into the lambda_plus.records table in Cassandra, on a 100ms time bucket (failure handling for the two concurrent writes is an exercise left to the reader, and a paper in and of itself). A sketch of both writes follows below.
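
    Here is a minimal sketch of that dual write, assuming the Kafka producer API and the DataStax Java driver. The topic name, contact points, and the 100ms bucketing helper are illustrative, not part of the original deployment.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import com.datastax.driver.core.Cluster

    // Sketch of the dual write an application server performs per sensor reading.
    object SensorIngest {
      private val props = new Properties()
      props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      private val producer = new KafkaProducer[String, String](props)

      private val session =
        Cluster.builder().addContactPoint("cassandra1").build().connect("lambda_plus")
      private val insert = session.prepare(
        "INSERT INTO records (time_bucket, sensor_id, data, ts) VALUES (?, ?, ?, ?)")

      def handleReading(sensorId: Int, data: Double, ts: Long): Unit = {
        // 1. Submit the record to Kafka for the Spark Streaming job to pick up.
        producer.send(new ProducerRecord("sensor-readings", sensorId.toString, s"$sensorId,$data,$ts"))
        // 2. Write the record into lambda_plus.records on a 100ms time bucket.
        val bucket = new java.util.Date(ts - (ts % 100))
        session.execute(insert.bind(bucket, Integer.valueOf(sensorId),
          java.lang.Double.valueOf(data), new java.util.Date(ts)))
      }
    }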

    Spark Streaming Job


    A Spark Streaming job runs on the cluster with a one-second batch interval. It takes messages from Kafka, aggregates the results, and flushes them to the lambda_plus.bucket_rollups table.
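
    A rough sketch of that job, assuming the spark-streaming-kafka receiver and the DataStax spark-cassandra-connector; the message format, topic name, and ZooKeeper address are assumptions for illustration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._

    // One-second streaming job: parse readings from Kafka, roll them up per
    // 100ms time bucket, and write the rollups to lambda_plus.bucket_rollups.
    object StreamingRollups {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("lambda-plus-streaming")
          .set("spark.cassandra.connection.host", "cassandra1")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Assumed message format: "sensorId,data,epochMillis" on topic "sensor-readings".
        val lines = KafkaUtils
          .createStream(ssc, "zk1:2181", "lambda-plus", Map("sensor-readings" -> 1))
          .map(_._2)

        val rollups = lines.map { line =>
          val Array(_, data, ts) = line.split(",")
          (new java.util.Date(ts.toLong - (ts.toLong % 100)), data.toDouble)
        }.groupByKey()
          .map { case (bucket, readings) =>
            (bucket, readings.sum / readings.size, readings.max, readings.min)
          }

        rollups.saveToCassandra("lambda_plus", "bucket_rollups",
          SomeColumns("time_bucket", "average_reading", "max_reading", "min_reading"))

        ssc.start()
        ssc.awaitTermination()
      }
    }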

    Spark Batch Job — Bucket Rollups


    This job operates on every row older than 5 minutes in the lambda_plus.records table (the delay allows time series buckets to be completely filled and all sensor retries to have arrived). The role of this job is to correct any errors in the Spark Streaming results that may have occurred due to message loss.
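
    A sketch of what this batch job could look like, again assuming the spark-cassandra-connector; the five-minute cutoff is applied with a simple filter here, which is illustrative rather than tuned for production.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Hourly batch job: re-aggregate every record older than five minutes and
    // overwrite bucket_rollups, correcting any loss on the streaming side.
    object BucketRollupBatch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("bucket-rollup-batch")
          .set("spark.cassandra.connection.host", "cassandra1"))
        val cutoff = new java.util.Date(System.currentTimeMillis() - 5 * 60 * 1000)

        sc.cassandraTable("lambda_plus", "records")
          .filter(row => row.getDate("ts").before(cutoff))
          .map(row => (row.getDate("time_bucket"), row.getDouble("data")))
          .groupByKey()
          .map { case (bucket, readings) =>
            (bucket, readings.sum / readings.size, readings.max, readings.min)
          }
          .saveToCassandra("lambda_plus", "bucket_rollups",
            SomeColumns("time_bucket", "average_reading", "max_reading", "min_reading"))

        sc.stop()
      }
    }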

    Spark Batch Job — Sensor Rollups


    This job also operates on every row older than 5 minutes in the lambda_plus.records table (again, so that time series buckets are complete and all sensor retries have arrived). It takes each sensor found and aggregates the results into lambda_plus.sensor_rollups. You’ll note this table is not filled by the Spark Streaming job, as that would require a historical lookup; while doable, this can be expensive when there is a lot of historical data and may not be efficient. This is a classic design tradeoff, and one must decide how important really fresh data is.
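
    The per-sensor version is nearly identical; a sketch under the same assumptions as the bucket rollup job:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Hourly per-sensor rollup over all records older than five minutes.
    object SensorRollupBatch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sensor-rollup-batch")
          .set("spark.cassandra.connection.host", "cassandra1"))
        val cutoff = new java.util.Date(System.currentTimeMillis() - 5 * 60 * 1000)

        sc.cassandraTable("lambda_plus", "records")
          .filter(row => row.getDate("ts").before(cutoff))
          .map(row => (row.getInt("sensor_id"), (row.getDate("ts"), row.getDouble("data"))))
          .groupByKey()
          .map { case (sensorId, readings) =>
            val values = readings.map(_._2)
            val lastReading = readings.maxBy(_._1.getTime)._1
            (sensorId, lastReading, values.sum / values.size, values.max, values.min)
          }
          .saveToCassandra("lambda_plus", "sensor_rollups",
            SomeColumns("sensor_id", "last_reading", "average_reading",
              "max_reading", "min_reading"))

        sc.stop()
      }
    }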

    Dashboard

    The application servers driving the dashboard can use plain CQL queries to get up-to-the-second data for bucket rollups and hour-delayed data for sensor rollups. This provides low latency and high throughput for answering queries; our dashboard is using Cassandra in the optimal fashion.
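
    For illustration, here is what those reads might look like with the DataStax Java driver; the contact point, time bucket value, and sensor id are placeholders.

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    // Dashboard reads: plain CQL against the two rollup tables.
    object DashboardReads {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("cassandra1").build()
        val session = cluster.connect("lambda_plus")

        // Up-to-the-second data, kept fresh by the Spark Streaming job.
        val bucketRows = session.execute(
          "SELECT average_reading, max_reading, min_reading FROM bucket_rollups WHERE time_bucket = ?",
          new java.util.Date()).all().asScala

        // Per-sensor data, refreshed by the hourly batch job (up to an hour stale).
        val sensorRows = session.execute(
          "SELECT last_reading, average_reading, max_reading, min_reading FROM sensor_rollups WHERE sensor_id = ?",
          Integer.valueOf(42)).all().asScala

        (bucketRows ++ sensorRows).foreach(println)
        cluster.close()
      }
    }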

    Lambda Architecture compared

    Speed Layer

    The Speed Layer is identical between this architecture and Lambda; a lot of customers already use Spark and Cassandra for the traditional Speed Layer.

    Batch Layer

    Instead of writing to a separate database, our Batch Layer writes directly to the same Speed Layer tables, which are then served up by Cassandra. Contrast this with the traditional Lambda Architecture, where the Batch Layer uses a different code base and a different database than the Speed Layer; I think you’ll agree the Lambda+ approach is much simpler.

    Serving Layer

    Serving can all be done out of Cassandra, instead of out of a separate Batch Layer database. This provides operational simplicity over the traditional Lambda Architecture and results not only in fewer servers but also in less code complexity.

    Conclusion

    This approach will not only scale as you add more nodes and more data, it also gives you the best of several approaches at once, with the operational simplicity of a single data store.