The code was simple: read a bunch of information, do some minor transformations, and flush to Cassandra. Nothing crazy. But during the user's fault-tolerance testing, the job would seemingly hang indefinitely whenever a node was down.
That was it; in fact, that's always it: if something mysteriously "just stops", you usually have a small cluster and no replicas (RF 1). One may ask why anyone would ever run Cassandra with a single replica, and while I concede that's a very fair question, it was the case here.
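For context, RF 1 is just a property of the keyspace. Here is a minimal sketch of the kind of definition that gets you there, executed through the connector's session; the keyspace name, contact point, and replication strategy are illustrative, not taken from the job in question:

```scala
import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

// Hypothetical keyspace: SimpleStrategy with replication_factor 1 means
// every row lives on exactly one node. Lose that node, lose the row.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local test cluster

CassandraConnector(conf).withSessionDo { session =>
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS demo
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)
}
```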
For example, with RF 1 and three nodes, any row I write only goes to one of those nodes. If that node dies, how would I retrieve the data? Wouldn't the read just fail? OK, wait a minute, why is the job not failing?
This is a bit of misdirection: the other nodes were timing out (probably a slow IO layer). If we read the connector docs, query.retry.count gives us 10 retries, and the default read.timeout_ms is 120000 (which, confusingly, is also the write timeout). That's 1.2 million milliseconds, or 20 minutes, to fail a single task that is timing out. If Spark then retries that task 4 times (the default), it could take 80 minutes to fail the job, assuming of course that all the writes time out.
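For reference, here is a minimal sketch of where those knobs live in a SparkConf, assuming the DataStax Spark Cassandra Connector; the lowered values are purely illustrative, not what the original job used:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cassandra-flush-job") // hypothetical app name
  // Connector side: how long a single request may run before timing out
  // (default 120000 ms) and how many times it is retried (default 10).
  // 120000 ms x 10 retries = 1,200,000 ms = 20 minutes per task attempt.
  .set("spark.cassandra.read.timeout_ms", "30000")  // assumption: lowered to fail faster
  .set("spark.cassandra.query.retry.count", "3")    // assumption: lowered to fail faster
  // Spark side: each failed task is re-attempted up to this many times
  // (default 4), which is how you get to roughly 80 minutes before the
  // job finally dies.
  .set("spark.task.maxFailures", "2")               // assumption: lowered to fail faster
```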
Short term
Long term