Several years ago DataStax changed DataStax Enterprise from a version of Cassandra with some closed-source plugins and a handful of fixes (most of which were backports of fixes from very old versions) into a fundamentally different architecture. We will refer to this new architecture as TPC, although several other internals were also changed that may have relied or depended on TPC but were not themselves part of TPC.
Today I want to review one of those features: TPC Backpressure. In theory, TPC Backpressure stops a node from running out of memory or becoming unresponsive and therefore only partially down (in Cassandra this is often worse than being fully down). TPC Backpressure has a complex implementation and can be difficult to explain, but for the sake of brevity I am going to focus on two primary thresholds; when either is passed, the node will pause new requests, allowing the existing requests to complete.
The backpressure active regex tells you when it is engaged:
"DEBUG\s+\[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - TPC backpressure is active on core (?P<core_num>\d+) with global local/remote pending tasks at (?P<global_pending>\d+)/(?P<remote_pending>\d+)"
You can also just use ripgrep for the phrase "TPC backpressure is active" and see how often backpressure is engaged.
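For instance, a quick count of the matching lines (assuming the usual debug.log name; adjust the path to your own logs):
rg -c "TPC backpressure is active" debug.log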
If I want to be extra fancy I can take the regex above, combine it with ripgrep, and output the global and remote pending values:
for global:
rg "DEBUG\s+\[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - TPC backpressure is active on core (?P<core_num>\d+) with global local/remote pending tasks at (?P<global_pending>\d+)/(?P<remote_pending>\d+)" -or '$global_pending'
for remote:
rg "DEBUG\s+\[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - TPC backpressure is active on core (?P<core_num>\d+) with global local/remote pending tasks at (?P<global_pending>\d+)/(?P<remote_pending>\d+)" -or '$remote_pending'
The higher the value, the longer it will take to clear out, so it can be used to measure severity. Once you have the totals, you can try to find the times the worst ones occurred at. You can find the 5 worst totals with the following:
rg "DEBUG\s+\[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - TPC backpressure is active on core (?P<core_num>\d+) with global local/remote pending tasks at (?P<global_pending>\d+)/(?P<remote_pending>\d+)" -or '$global_pending' --no-filename | sort -n | tail -n 5
134
813
1111
1831
1911
Then run your search again, but filter for the worst value:
rg "DEBUG\s+\[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - TPC backpressure is active on core (?P<core_num>\d+) with global local/remote pending tasks at (?P<global_pending>\d+)/(?P<remote_pending>\d+)" | rg 1911
debug.log:DEBUG [CoreThread-1] 2020-01-01 01:01:01,000 NoSpamLogger.java:92 - TPC backpressure is active on core 1 with global local/remote pending tasks at 1911/120
Now you can compare that to node latencies at that time and see if it matches the latency and timeout problems you are trying to track down.
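For example, reusing the same regex, you can print the timestamp next to the global pending value so it is easy to line up against your latency graphs:
rg "DEBUG\s+\[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - TPC backpressure is active on core (?P<core_num>\d+) with global local/remote pending tasks at (?P<global_pending>\d+)/(?P<remote_pending>\d+)" -or '$date $global_pending'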
There is some art to the next step: one needs to evaluate whether raising the backpressure limits will actually help or hurt the situation. But let us say the GC is low, the CPU usage is low, and the latencies are bad; it could just be a workload for which the defaults are too low.
To know the next steps we need to take stock of the state of the system.
sperf core gc
is a good tool for this and can quickly tell you how severe GC is. For example, in this report:
gcinspector version 0.6.4
. <300ms + 301-500ms ! >500ms
------------------------------
2020-01-10 16:28:37 236 ...............+++........++.....+.+...+++.++.+.+......+.....+.!!+!.!++!+..++.+.!..+.!++++.!.++++.+.+++..+!!.......!..+.+....+.....++.+...+.+.++.+.+!!++.+.+.!++.+.+...+.+..+.+.+.+..+++....++..+++....+..++.+++.+!+..+.+.+.+!......+++....+
busiest period: 2020-01-10 16:28:37 (75041ms)
GC pauses max p99 p75 p50 p25 min
--- --- --- --- --- ---
ms 800 729 358 282 243 201
Worst pauses in ms:
[800, 735, 729]
Collections by type
--------------------
* G1 Young: 236
The GC here is extremely heavy and the target GC pause was 500ms; this is not a cluster you want to raise backpressure limits on. Now let us look at a cluster that has good GC and would at least be safe to raise backpressure limits on:
gcinspector version 0.6.4
. <300ms + 301-500ms ! >500ms
------------------------------
2020-01-10 16:28:37 10 .........
busiest period: 2020-01-10 16:28:37 (75041ms)
GC pauses max p99 p75 p50 p25 min
--- --- --- --- --- ---
ms 300 289 250 225 210 201
Worst pauses in ms:
[300, 289, 250]
Collections by type
--------------------
* G1 Young: 10
In this case the system is not very busy with GC activity, so if there are high pending writes and reads and we find backpressure is being activated, raising those limits may help system performance.
sperf core statuslogger
can generate some very useful reports to track the number of pending operations in your cluster. Below I’ve shared an artificial example:
sperf core statuslogger version: 0.6.13
Summary (111,111 lines)
Summary (0 skipped lines)
dse versions: {'6.x'}
cassandra versions: unknown
first log time: 2021-01-01 01:11:01.111000+00:00
last log time: 2021-01-01 01:01:01.111000+00:00
duration: 10.00 minutes
total stages analyzed: 111
total nodes analyzed: 1
GC pauses max p99 p75 p50 p25 min
--- --- --- --- --- ---
ms 500 500 400 350 250 250
total GC events: 11
busiest tables by ops across all nodes
------------------------------
* debug.log: test.tester: 1,000,000 ops / 11.11 mb data
busiest table by data across all nodes
------------------------------
* debug.log: test.tester: 111,111 ops / 11.11 mb data
busiest stages across all nodes
------------------------------
* TPC/9 local backpressure: 11115 (debug.log)
* TPC/4 local backpressure: 6791 (debug.log)
* TPC/all/BACKPRESSURED_MESSAGE_DECODE active: 6618 (debug.log)
* TPC/4/BACKPRESSURED_MESSAGE_DECODE active: 5743 (debug.log)
* TPC/1 local backpressure: 5414 (debug.log)
* TPC/2 local backpressure: 5111 (debug.log)
* TPC/all/WRITE_REMOTE pending: 4967 (debug.log)
* TPC/all/WRITE_RESPONSE active: 4771 (debug.log)
* TPC/3/WRITE_RESPONSE active: 4460 (debug.log)
busiest stages in PENDING
------------------------------
debug.log:
TPC/all/WRITE_REMOTE: 4967
TPC/all/READ_LOCAL: 4599
TPC/3/WRITE_REMOTE: 1455
TPC/1/WRITE_REMOTE: 1442
TPC/2/READ_LOCAL: 1441
TPC/3/READ_LOCAL: 1034
TPC/9/READ_LOCAL: 935
TPC/9/WRITE_REMOTE: 988
busiest LOCAL BACKPRESSURE
------------------------------
debug.log:
avg backpressure 1332.06
max backpressure 11111
Here we can see there is a lot of backpressure and relatively little GC happening. In a case like this, that would suggest to me we can raise the backpressure limits to see if latency improves and more of the system is utilized.
The two main parameters I like to focus on are (both in cassandra.yaml):
tpc_concurrent_requests_limit
- the number of concurrently executed read/write requests per core, to be increased only if the CPU is under utilized, or decreased only in case of excessive memory pressure (i.e. long GCs/OOM).
tpc_pending_requests_limit
- the number of pending read/write requests per core, to be increased only if getting too many timeouts.
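As a sketch of what raising them could look like in cassandra.yaml; the numbers below are purely illustrative, not recommendations, and the defaults vary by DSE version, so confirm what your nodes currently use before changing anything:
# illustrative values only; confirm your version's defaults first
tpc_concurrent_requests_limit: 128
tpc_pending_requests_limit: 500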
Try raising these and repeat your tests, continuing to monitor GC and CPU utilization.
DSE 6.8 adds work stealing, so badly balanced data models or apps with anti-patterns have a less significant impact on cluster health. The downside is that writes will still go to a particular core, which means data models with hotspots can still back up a single core (more on that below).
By default in DSE 6.8, work stealing is fairly conservative because the developers wanted us to keep the benefits of TPC, but nearly every cluster seems to have an application that would benefit from work stealing, so it is there. However, some clusters run apps that are primarily anti-patterns, and in those cases the defaults for work stealing may be too minimal.
sperf core statuslogger
Run it and see if there is a high number of pending operations, especially on individual cores; if so you may benefit from raising work stealing. Another thing you can use is ripgrep on your debug.log with this regex:
DEBUG \[(?P<thread>.*)\] (?P<date>.{10} .{12}) *(?P<source_file>[^:]*):(?P<source_line>[0-9]*) - Stole (?P<tasks>\d+) tasks since last time\.
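For example, to pull out the five largest stolen-task counts the same way we did for the pending totals earlier (assuming the default debug.log naming):
rg "Stole (?P<tasks>\d+) tasks since last time\." -or '$tasks' --no-filename | sort -n | tail -n 5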
To jvm-server.options add the following and restart DSE:
-Ddse.tpc.work_stealing_max_unparks=<half of tpc_cores> (defaults to 1/4 of the TPC cores so if I had 8 cores, the default is 2, raising that to 4 may help work stealing)
-Ddse.tpc.work_stealing_max_tasks=512 (default = 32)
Review if this helps the total pending tasks or not.
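To compare before and after, you can re-run the earlier counts; assuming you also have live access to the node, nodetool tpstats on DSE 6 reports the TPC stages with their pending counts (exact output varies by version), which is a quicker way to watch the effect than waiting on logs:
rg -c "TPC backpressure is active"
nodetool tpstats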
Writes will continue to pin to an individual core. The classic, very naive time-series data model, where the partition key is a time bucket, will under TPC pin all writes to one core, and all requests to that core will back up. Eventually this leads to backpressure, despite the node itself not being under much stress and in fact being heavily underutilized.
Scenario: total pending writes per core, sampled each minute:
| node | cpu usage | core 1 | core 2 | core 3 | core 4 |
| ---- | --------- | ------ | ------ | ------ | ------ |
| 1    | 25%       | 134211 | 0      | 0      | 0      |
| 2    | 25%       | 0      | 0      | 0      | 12441  |
| 3    | 25%       | 0      | 13441  | 0      | 0      |
| 4    | 0%        | 0      | 0      | 0      | 0      |
| 5    | 0%        | 0      | 0      | 0      | 0      |
| 6    | 0%        | 0      | 0      | 0      | 0      |
One minute later:

| node | cpu usage | core 1 | core 2 | core 3 | core 4 |
| ---- | --------- | ------ | ------ | ------ | ------ |
| 1    | 0%        | 0      | 0      | 0      | 0      |
| 2    | 25%       | 0      | 0      | 0      | 14552  |
| 3    | 25%       | 0      | 12442  | 0      | 0      |
| 4    | 25%       | 0      | 0      | 12498  | 0      |
| 5    | 0%        | 0      | 0      | 0      | 0      |
| 6    | 0%        | 0      | 0      | 0      | 0      |
You get the idea; this repeats every minute, effectively using only one CPU core on 3 of the nodes for an entire minute. That is not useful from a scaling standpoint, and it exposes problematic data models and hotspots pretty quickly; worse, it will have global effects if those individual cores hit the backpressure thresholds.
The only fix for this is to change the data model and add an extra column to the partition key, which will require a new table. There are a few common approaches; a typical one, adding a shard or bucket column to the partition key, is sketched below.
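As a sketch (the table and column names here are made up for illustration, and the shard count of 16 is arbitrary), the naive model and a sharded version might look like:
CREATE TABLE metrics_by_hour (
    hour_bucket timestamp,
    reading_time timestamp,
    value double,
    PRIMARY KEY ((hour_bucket), reading_time)
);
-- every write for a given hour lands in the same partition, and therefore on the same core

CREATE TABLE metrics_by_hour_sharded (
    hour_bucket timestamp,
    shard int,  -- e.g. hash of the source id modulo 16, computed by the writer
    reading_time timestamp,
    value double,
    PRIMARY KEY ((hour_bucket, shard), reading_time)
);
-- writes for the same hour now spread across 16 partitions (and cores); reads fan out over the shards
The trade-off is that reading a full hour now means querying all 16 shards, so pick a shard count that matches your write throughput rather than the largest number you can think of.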
This was a rather simplistic view of backpressure, and readers who have moved beyond this level of depth can safely ignore this article; it is intended to help an audience understand the most basic behaviors of TPC Backpressure and is not to be viewed as the final word on backpressure tuning. It should, however, cover 80% of the cases where backpressure is engaged and what the next steps are.