What follows is a summary of the things you need to consider dealing with indexing speed problems in DSE 5.x (note the model of indexing is wildly different in DSE 6.x and will not apply, I will have a guide on that soon).
With Solr 5.x dropped mutations have three causes over regular Cassandra:
To minimize IO contention and merge stalling do the following steps:
solr_data_dir
to it’s own dedicated drive or raid array see set the location of search indexes for 5.0 or 5.1maxMergeCount
to 8 x num_tokens
or 2 x max_concurrency_solr_per_core
, whichever is higher, in solrconfig.xml for each Solr enabled table.back_pressure_threshold_per_core
up to 100k (assuming adequate heap available). This can also be done at runtime via nodetool sjk mx -b "com.datastax.bdp:type=search,index=keyspace.table_name,name=IndexPool" -f BackPressureThreshold --set -v 100000
.enable_back_pressure_adaptive_nrt_commit
to minimize dropped mutations. Note: this setting will not eliminate them and sometimes cause them. Therefore, disable enable_back_pressure_adaptive_nrt_commit
to get rid of them entirely.max_solr_concurrency_per_core
to reduce the IO load, raise the same to increase indexing rate. We need to find the right level for each use case and set of hardware.<rt>false</rt>
in their solrconfig.xml:
ALTER SEARCH INDEX CONFIG ON demo.health_data SET autoCommitTime = 60000
; This will mean updates to search are visible only after a minute.RELOAD SEARCH INDEX ON demo.health_data
;<ramBufferSizeMB></ramBufferSizeMB>
by some multiple (up to several gigabytes).Records aren’t available for search as soon as the users would like them or are “out of sync” between Cassandra data and Solr. Sometimes, this is just how Solr works with DSE.
However, there is one nasty bug that affects 5.1.x where stale data may stick around in the Solr Index on Solr enabled tables using <rt>true</rt>
in solrconfig.xml. The fix is simple, upgrade to DSE 5.1.17
Solr only consults a SINGLE replica when querying data, and Cassandra can query all the replicas. The fix is easy: nodetool repair -pr on all the nodes. However, this will not prevent further inconsistency, and periodic inconsistency is fundamentally a DSE search design issue.
Mutation lag problems The users may complain about matching records on DSE Search that do not match the actual data found. An example of this follows:
SELECT * from my_table where solr_query='{"q":"status:processed"}'
-- [id] | [status]
-- 1 | processed
--- 2 | completed
This behavior is easy to explain, the Solr index updates AFTER Cassandra is already updated, so there are periods Cassandra and Solr will potentially be out of sync. Since this is the reality, the fixes for this can be challenging:
autoCommitTime
, consider all the steps above in the “Dropped Mutations” section as this may be a throughput issue.A good way to index more rows per second is to index less per row. Some text types generate more data on disk and require more IO and CPU to index the values. The following is the list of most expensive text field types to index in order of most (at #1) to least (#5)
NGramTokenizer
StandardTokenizer
or WhiteSpaceTokenizer
.omitNorms=true
(this disables length normalization for the field and saves some memory)omitNorms=true
and omitTermFreqAndPositions=true
(Queries that rely on term position with this option will silently fail to find documents.)This is a very whirlwind walk through of the various tunings one can do to get indexing throughput up in DSE Search. I myself have used this framework many times and realized gains of 10x in indexing throughput just doing these steps. For more in deph tuning taking into account IO and CPU stay tuned.