Tuning Indexing Speed in DSE Search 5.x

What follows is a summary of the things you need to consider when dealing with indexing speed problems in DSE 5.x. Note that the indexing model is wildly different in DSE 6.x, so this will not apply there; I will have a guide on that soon.

Dropped Mutations

With DSE Search 5.x, dropped mutations have three causes beyond those of regular Cassandra:

To minimize IO contention and merge stalling, take the following steps:

Delayed Updates

Records aren’t available for search as soon as users would like, or the Cassandra data and the Solr index are “out of sync”. Sometimes this is just how Solr works with DSE. However, there is one nasty bug affecting 5.1.x where stale data may stick around in the Solr index on Solr-enabled tables that use <rt>true</rt> in solrconfig.xml. The fix is simple: upgrade to DSE 5.1.17.
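As a rough way to spot a node that might be exposed to the stale-data bug, the check below combines the two conditions named above: a 5.1.x release older than 5.1.17 and <rt>true</rt> in solrconfig.xml. This is only a sketch. The function names are hypothetical, the version string is assumed to be plain dotted numbers, and treating every 5.1 release before 5.1.17 as affected is an assumption drawn from the wording above, not from a bug report.

```python
import xml.etree.ElementTree as ET

FIXED_IN = (5, 1, 17)  # release named in the text as containing the fix


def parse_version(version):
    """Turn a string like '5.1.12' into the tuple (5, 1, 12)."""
    return tuple(int(part) for part in version.split(".")[:3])


def rt_enabled(solrconfig_xml):
    """Return True if any <rt> element in the config contains 'true'."""
    root = ET.fromstring(solrconfig_xml)
    return any((el.text or "").strip().lower() == "true" for el in root.iter("rt"))


def may_have_stale_rt_data(version, solrconfig_xml):
    """Assumption: every 5.1.x release before 5.1.17 with rt enabled is at risk."""
    v = parse_version(version)
    return v[:2] == (5, 1) and v < FIXED_IN and rt_enabled(solrconfig_xml)


# Hypothetical, heavily trimmed solrconfig.xml used only for illustration.
sample = "<config><indexConfig><rt>true</rt></indexConfig></config>"
print(may_have_stale_rt_data("5.1.12", sample))
```

Version strings with suffixes (for example a build tag) would need extra parsing before this check could be used.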

Consistency Problems

Solr only consults a SINGLE replica when querying data, whereas Cassandra can query all the replicas. The fix is easy: run nodetool repair -pr on all the nodes. However, this will not prevent further inconsistency; periodic inconsistency is fundamentally a DSE Search design issue.
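Because -pr repairs only each node's primary range, the command has to be run against every node. A minimal sketch of doing that one node at a time is below. The host addresses and function names are hypothetical, and it assumes nodetool is on the PATH and that each node accepts remote JMX connections via -h, which may not be the case on your cluster (running nodetool locally on each node is the alternative).

```python
import subprocess


def repair_command(host):
    """Build the argv for a primary-range repair against one node."""
    return ["nodetool", "-h", host, "repair", "-pr"]


def repair_all(hosts, run=subprocess.run):
    """Repair each node in turn; with the default runner a failure raises and stops the loop."""
    for host in hosts:
        run(repair_command(host), check=True)


# Dry run: record the commands instead of executing them.
calls = []
repair_all(["10.0.0.1", "10.0.0.2"], run=lambda argv, check: calls.append(argv))
print(calls)
```

Passing the runner as a parameter keeps the loop testable without touching a live cluster; leaving it at the default runs the real commands sequentially.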

Mutation Lag Problems

Users may complain that records matched by DSE Search do not match the actual data returned. An example follows:

SELECT * FROM my_table WHERE solr_query='{"q":"status:processed"}';

-- [id] | [status]
--  1   | processed
--  2   | completed

This behavior is easy to explain: the Solr index updates AFTER Cassandra has already been updated, so there are periods when Cassandra and Solr are out of sync. Since this is the reality, the fixes for this can be challenging:
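One client-side mitigation, offered here as an assumption rather than as one of the author's fixes, is to re-check each returned row against the value that was searched for and drop the ones that no longer match. The sketch below assumes rows arrive as plain dicts and that the query was an exact match on a single field; the function name is hypothetical.

```python
def drop_stale_matches(rows, field, expected):
    """Keep only rows whose current value for `field` equals `expected`."""
    return [row for row in rows if row.get(field) == expected]


# Rows as returned by the status:processed query in the example above.
rows = [
    {"id": 1, "status": "processed"},
    {"id": 2, "status": "completed"},
]
print(drop_stale_matches(rows, "status", "processed"))
```

This only removes false positives. Rows that should match but have not been indexed yet will still be missing from the result, and analyzed text fields cannot be re-checked with a simple equality test.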

Shrinking an Index

A good way to index more rows per second is to index less per row. Some text field types generate more data on disk and require more IO and CPU to index. The following is the list of the most expensive text field types to index, in order from most (#1) to least (#5):


This is a whirlwind walk-through of the various tunings one can do to get indexing throughput up in DSE Search. I have used this framework many times and realized gains of 10x in indexing throughput just by following these steps. For more in-depth tuning that takes IO and CPU into account, stay tuned.