This has come up a few times in the last few days, so I thought I’d share the available options and their tradeoffs.
Probabilistic tracing is a handy feature for finding expensive queries when you have little control over who has access to the cluster, i.e. most enterprises in my experience (ironic, considering all the process). It’s too expensive to turn up very high in production, but in development it’s a good way to get an idea of what a query turns into. Read more about probabilistic tracing here:
and the command syntax:
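For reference, the probability is set per node with nodetool; a quick sketch, with 0.001 chosen only as an illustrative sample rate:

```
# Trace roughly 0.1% of requests on this node
nodetool settraceprobability 0.001

# Check the current setting, and turn it back off when you're done
nodetool gettraceprobability
nodetool settraceprobability 0
```

The traces land in the system_traces keyspace, which is why you don’t want to leave this cranked up in production.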
Set the TRACE logging level on the java-driver RequestHandler on the Spark nodes you’re curious about.
Say I have a typical join query:
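Something along these lines from the DSE spark shell, where sqlContext can see Cassandra tables as keyspace.table; the keyspace, tables, and columns below are made up for illustration:

```scala
// Run from `dse spark`; sqlContext is provided by the shell.
// my_ks.customers and my_ks.orders are hypothetical Cassandra tables.
val joined = sqlContext.sql("""
  SELECT c.id, c.name, o.total
  FROM my_ks.customers c
  JOIN my_ks.orders o ON c.id = o.customer_id
""")
joined.show()
```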
Now, on the Spark nodes, configure logging for the DataStax Java driver’s RequestHandler.
In my case, using the tarball install, the executor logging config is dse-4.8.4/resources/spark/conf/logback-spark-executor.xml. In that file I just added the following inside the &lt;configuration&gt; element.
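A sketch of that entry; com.datastax.driver.core.RequestHandler is the driver class that handles each request, and everything else in the file stays as shipped:

```xml
<!-- Log each request issued by the java driver embedded in the executors -->
<logger name="com.datastax.driver.core.RequestHandler" level="TRACE"/>
```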
The executor logs on the Spark nodes will now contain the driver’s TRACE output. In my case that’s /var/lib/spark/worker/worker-0/app-20160203094945-0003/0/stdout, where app-20160203094945-0003 is the application ID of the job.
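To pull just the driver’s lines out of that stdout (path taken from my run above; your worker directory and application ID will differ):

```
grep RequestHandler /var/lib/spark/worker/worker-0/app-20160203094945-0003/0/stdout
```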
You’ll note this is a dumb table scan, limited only to the token ranges the node owns, and that the actual tokens involved are not visible. I leave it to the reader to repeat this exercise with pushdown, such as secondary indexes (2i) and partition restrictions.
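As a starting point for that exercise, here is a hypothetical query restricted on a partition key, so the connector can push the predicate down rather than scanning whole token ranges; same made-up tables as above:

```scala
// Restricting on the (hypothetical) partition key customer_id lets the
// connector push the filter down to Cassandra instead of doing a full scan.
val pushed = sqlContext.sql("""
  SELECT o.total
  FROM my_ks.orders o
  WHERE o.customer_id = 42
""")
pushed.explain()  // inspect the physical plan the connector produces
pushed.show()
```

Rerun it with TRACE still on and compare what shows up in the executor logs against the full scan above.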