
(improvement) Optimize DCAware/RackAware/TokenAware/HostFilter policies with host distance caching and overall perf. improvements #651

Open
mykaul wants to merge 9 commits into scylladb:master from mykaul:query_plan_opt_2

Conversation


@mykaul mykaul commented Jan 22, 2026

Refactor DCAwareRoundRobinPolicy to use a Copy-On-Write (COW) strategy for managing host distances.

Results (5 DCs × 3 racks × 3 nodes = 45 nodes, 100K queries, median of 5 iterations):

Policy                         | Kops/s | master Kops/s | Speedup
---------------------------------------------------------------
DCAware                        |    209 |            88 |   2.4x
RackAware                      |    173 |            59 |   2.9x
TokenAware(DCAware)            |     62 |            15 |   4.1x
TokenAware(RackAware)          |     60 |            14 |   4.3x
Default(DCAware)               |    132 |            73 |   1.8x
HostFilter(DCAware)            |     63 |            44 |   1.4x

Key changes:

  • Introduce _remote_hosts to cache REMOTE hosts, enabling O(1) distance lookups during query planning. IGNORED hosts do not need to be stored in the cache; for LOCAL, a simple datacenter comparison suffices.
  • Add _refresh_remote_hosts to handle node changes.
  • LRU cache for token-to-replicas lookup in TokenAwarePolicy (default 1024 entries, auto-invalidated on topology change).
  • TokenAwarePolicy skips distance re-sorting for DCAware/RackAware child policies (they already yield in distance order), with a fallback re-sort for other child policies.
  • TokenAwarePolicy no longer uses __slots__ to avoid breaking downstream subclasses.
  • LWT queries skip replica shuffling for deterministic plans.
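
The cached-distance idea in the first bullet can be sketched roughly as follows. This is a simplified, hypothetical illustration (the `Host`, `HostDistance`, and class names are stand-ins, not the driver's real API):

```python
from collections import namedtuple
from enum import Enum

# Stand-ins for the driver's Host and HostDistance (assumptions).
Host = namedtuple("Host", "address datacenter")


class HostDistance(Enum):
    LOCAL = 0
    REMOTE = 1
    IGNORED = 2


class DCAwareSketch:
    """Simplified sketch of the cached-distance idea; not the real class."""

    def __init__(self, local_dc, used_hosts_per_remote_dc=0):
        self.local_dc = local_dc
        self.used_hosts_per_remote_dc = used_hosts_per_remote_dc
        # Copy-on-write: rebuilt as a fresh dict on topology changes,
        # never mutated in place, so readers need no lock.
        self._remote_hosts = {}

    def _refresh_remote_hosts(self, all_hosts):
        per_dc = {}
        new_cache = {}
        for host in all_hosts:
            if host.datacenter == self.local_dc:
                continue  # LOCAL is decided by comparison, not cached
            n = per_dc.get(host.datacenter, 0)
            if n < self.used_hosts_per_remote_dc:
                new_cache[host] = HostDistance.REMOTE
                per_dc[host.datacenter] = n + 1
        self._remote_hosts = new_cache  # atomic reference swap

    def distance(self, host):
        # O(1): DC comparison for LOCAL, dict lookup for REMOTE, and
        # IGNORED hosts simply have no cache entry.
        if host.datacenter == self.local_dc:
            return HostDistance.LOCAL
        return self._remote_hosts.get(host, HostDistance.IGNORED)
```

The key property is that `distance()` on the hot path never scans the host list; topology changes pay the rebuild cost instead.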

This is a different attempt from #650 to add caching to host distance to make query planning faster.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@mykaul mykaul marked this pull request as draft January 22, 2026 18:14

mykaul commented Jan 23, 2026

This is interesting, my change has exposed this -

2026-01-23 18:26:45.488 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'DOWN', 'address': ('127.0.0.3', 9042)}, stream_id=-1, trace_id=None)>

2026-01-23 18:26:45.489 WARNING [libevreactor:376]: Host 127.0.0.3:9042 has been marked down                      <--- host .3 is marked as DOWN

2026-01-23 18:26:45.489 DEBUG [thread:73]: First connection created to 127.0.0.2:9042 for shard_id=0
2026-01-23 18:26:45.489 DEBUG [thread:73]: Finished initializing connection for host 127.0.0.2:9042
2026-01-23 18:26:45.489 DEBUG [thread:73]: Added pool for host 127.0.0.2:9042 to session
2026-01-23 18:26:45.489 DEBUG [thread:73]: Removed connection pool for <Host: 127.0.0.3:9042 dc1>
2026-01-23 18:26:45.490 DEBUG [thread:73]: Shutting down connections to 127.0.0.3:9042
2026-01-23 18:26:45.490 DEBUG [thread:73]: Closing connection (139753730215760) to 127.0.0.3:9042
2026-01-23 18:26:48.496 DEBUG [test_ip_change:35]: Change IP address for node3
2026-01-23 18:26:48.534 DEBUG [test_ip_change:40]: Start node3 again with ip address 127.0.0.33
2026-01-23 18:26:48.551 DEBUG [cluster:772]: node3: Starting scylla: args=['/home/ykaul/github/python-driver/tests/integration/ccm/test_ip_change/node3/bin/scylla', '--options-file', '/home/ykaul/github/python-driver/tests/integration/ccm/test_ip_change/node3/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.0.33', '--smp', '1', '--memory', '512M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.0.33', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=True
2026-01-23 18:26:49.947 INFO [cluster:775]: node3: Started scylla: pid: 186960
2026-01-23 18:26:49.947 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:50.164 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='TOPOLOGY_CHANGE', event_args={'change_type': 'NEW_NODE', 'address': ('127.0.0.33', 9042)}, stream_id=-1, trace_id=None)>
2026-01-23 18:26:50.165 DEBUG [libevreactor:376]: Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'UP', 'address': ('127.0.0.33', 9042)}, stream_id=-1, trace_id=None)>
2026-01-23 18:26:50.448 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:50.948 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:51.449 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.3', '127.0.0.2']
2026-01-23 18:26:51.569 DEBUG [thread:73]: [control connection] Refreshing node list and token map
2026-01-23 18:26:51.570 DEBUG [thread:73]: [control connection] Updating host ip from 127.0.0.3:9042 to 127.0.0.33:9042 for (c989a851-2dcb-4b05-8a0c-fb1658a32e21)

2026-01-23 18:26:51.570 WARNING [thread:73]: Host 127.0.0.33:9042 has been marked down            <-- due to an IP change, the host is marked as down!?!

2026-01-23 18:26:51.571 DEBUG [thread:73]: [control connection] Finished fetching ring info
2026-01-23 18:26:51.949 DEBUG [test_ip_change:45]: ['127.0.0.1', '127.0.0.33', '127.0.0.2']

Need to understand this better :-/


mykaul commented Jan 23, 2026

            if host is None:
                host = self._cluster.metadata.get_host_by_host_id(host_id)
                if host and host.endpoint != endpoint:
                    log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
                    old_endpoint = host.endpoint
                    host.endpoint = endpoint
                    self._cluster.metadata.update_host(host, old_endpoint)
                    reconnector = host.get_and_set_reconnection_handler(None)
                    if reconnector:
                        reconnector.cancel()
                    self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)

So first we update the host with the new endpoint, then mark it as down?


mykaul commented Jan 23, 2026

This fixes it for me:

diff --git a/cassandra/cluster.py b/cassandra/cluster.py
index a9c1d00e..099043ea 100644
--- a/cassandra/cluster.py
+++ b/cassandra/cluster.py
@@ -3831,14 +3831,16 @@ class ControlConnection(object):
                 host = self._cluster.metadata.get_host_by_host_id(host_id)
                 if host and host.endpoint != endpoint:
                     log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
-                    old_endpoint = host.endpoint
-                    host.endpoint = endpoint
-                    self._cluster.metadata.update_host(host, old_endpoint)
                     reconnector = host.get_and_set_reconnection_handler(None)
                     if reconnector:
                         reconnector.cancel()
                     self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)
 
+                    old_endpoint = host.endpoint
+                    host.endpoint = endpoint
+                    self._cluster.metadata.update_host(host, old_endpoint)
+                    self._cluster.on_up(host)
+
             if host is None:
                 log.debug("[control connection] Found new host to connect to: %s", endpoint)
                 host, _ = self._cluster.add_host(endpoint, datacenter=datacenter, rack=rack, signal=True, refresh_nodes=False, host_id=host_id)

which also makes sense to me.
@dkropachev - I think this fix should go in a separate issue and PR, no? (Context: start with #651 (comment) - my changes here failed due to the wrong update order for a host that changed its IP.)


mykaul commented Jan 23, 2026

I think CI failure is unrelated and is #359


mykaul commented Jan 24, 2026

By using the (not amazing) benchmark from #653 , I got the following results:

For master branch as a baseline:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2309     | 433       
RackAware                      | 100000     | 0.3607     | 277       
TokenAware(DCAware)            | 100000     | 1.3262     | 75        
TokenAware(RackAware)          | 100000     | 1.4343     | 69        

This branch (with just DC aware improvements):

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1280     | 781       
RackAware                      | 100000     | 0.3572     | 279       
TokenAware(DCAware)            | 100000     | 1.1620     | 86        
TokenAware(RackAware)          | 100000     | 1.4435     | 69        

** 433 -> 781 Kops/sec improvement **

With improvement to rack aware (on top of master), I got:

=== Performance Benchmarks ===
Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2306     | 433       
RackAware                      | 100000     | 0.3084     | 324       
TokenAware(DCAware)            | 100000     | 1.3031     | 76        
TokenAware(RackAware)          | 100000     | 1.3440     | 74        

** 277 -> 324 Kops/sec improvement **

And on top of this branch:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1283     | 779       
RackAware                      | 100000     | 0.2905     | 344       
TokenAware(DCAware)            | 100000     | 1.1454     | 87        
TokenAware(RackAware)          | 100000     | 1.3293     | 75        

** 277 -> 344 Kops/sec improvement **

And finally, for #650 :

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.2325     | 430       
RackAware                      | 100000     | 0.3611     | 276       
TokenAware(DCAware)            | 100000     | 1.5826     | 63        
TokenAware(RackAware)          | 100000     | 1.6927     | 59        

which kinda makes me suspect that branch is no good :-/


mykaul commented Jan 24, 2026

Sent separate PR with the IP-change fix quoted above - #654

@mykaul mykaul changed the title from "(improvement)Optimize DCAwareRoundRobinPolicy with host distance caching" to "(improvement)Optimize DCAware/RackAware RoundRobinPolicy with host distance caching" Jan 24, 2026

mykaul commented Jan 24, 2026

With rack aware added (3rd commit), these are the current numbers:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1235     | 809       
RackAware                      | 100000     | 0.2934     | 340       
TokenAware(DCAware)            | 100000     | 1.1371     | 87        
TokenAware(RackAware)          | 100000     | 1.3291     | 75    


mykaul commented Jan 24, 2026

Now that I also cache non-local hosts, not just remote (duh!), perf. is better:

Policy                         | Ops        | Time (s)   | Kops/s    
----------------------------------------------------------------------
DCAware                        | 100000     | 0.1247     | 802       
RackAware                      | 100000     | 0.1624     | 615       
TokenAware(DCAware)            | 100000     | 1.2408     | 80        
TokenAware(RackAware)          | 100000     | 1.3087     | 76   


mykaul commented Jan 24, 2026

Added some optimization for TokenAware as well (need to improve the commit message).
Current results:

Policy                         | Ops    | Time (s) | Kops/s | master Kops/s | Improvement
-----------------------------------------------------------------------------------------
DCAware                        | 100000 | 0.1266   | 790    | 433           | x1.8
RackAware                      | 100000 | 0.1670   | 598    | 277           | x2.1
TokenAware(DCAware)            | 100000 | 0.2663   | 375    | 75            | x5
TokenAware(RackAware)          | 100000 | 0.3009   | 332    | 69            | x4.8

So reasonable improvement, at least in this micro-benchmark.

@mykaul mykaul changed the title from "(improvement)Optimize DCAware/RackAware RoundRobinPolicy with host distance caching" to "(improvement)Optimize DCAware/RackAware/TokenAware RoundRobinPolicy with host distance caching" Jan 25, 2026

mykaul commented Jan 25, 2026

Last push, I think I'm done:

    Policy                         | Ops        | Time (s)   | Kops/s | (master)
    ----------------------------------------------------------------------
    DCAware                        | 100000     | 0.0989     | 1010 | 433
    Default(DCAware)               | 100000     | 0.1532     | 652  | ?
    HostFilter(DCAware)            | 100000     | 0.3303     | 302  | ?
    RackAware                      | 100000     | 0.1149     | 870  | 277 
    TokenAware(DCAware)            | 100000     | 0.2112     | 473  | 75
    TokenAware(RackAware)          | 100000     | 0.2249     | 444  | 69

@mykaul mykaul changed the title from "(improvement)Optimize DCAware/RackAware/TokenAware RoundRobinPolicy with host distance caching" to "(improvement)Optimize DCAware/RackAware/TokenAware/HostFilter policies with host distance caching and overall perf. improvements" Jan 26, 2026
@mykaul mykaul marked this pull request as ready for review January 27, 2026 17:40
@mykaul mykaul force-pushed the query_plan_opt_2 branch from cb24c45 to 5f283d1 Compare March 7, 2026 21:26
mykaul added 7 commits March 12, 2026 18:50
…tate

Refactor `DCAwareRoundRobinPolicy` to simplify distance calculations and memory usage.

Key changes:
- Remove `_hosts_by_distance` and the complex caching of LOCAL hosts.
- `distance()` now checks `host.datacenter` directly for LOCAL calculation, which is correct and static.
- Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`.
- Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes).
- This removes the overhead of maintaining a redundant LOCAL cache and ensures correct behavior even if a local host is marked down.
- Snapshot `self._remote_hosts` to a local variable before reads for GIL-free Python 3.13+ safety.
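
The snapshot idiom in the last bullet is just a single local rebind before iteration; a hypothetical illustration (not the driver's real class):

```python
class SnapshotSketch:
    """Illustrative only; mirrors the idiom, not the real policy class."""

    def __init__(self):
        self._remote_hosts = {}  # replaced wholesale by refreshes (COW)

    def make_query_plan(self):
        # Bind the current dict once. A concurrent refresh swaps
        # self._remote_hosts to a *new* dict rather than mutating this
        # one, so the loop below sees a consistent snapshot even on
        # free-threaded (GIL-free) Python 3.13+.
        remote = self._remote_hosts
        for host in remote:
            yield host
```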

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | Mem KB
---------------------------------------------------------
DCAware                   |     195 |      +84% |    1.5
RackAware                 |      68 |        0% |    9.3
TokenAware(DCAware)       |      19 |       +6% |    1.7
TokenAware(RackAware)     |      16 |       -6% |    2.8
Default(DCAware)          |     136 |      +49% |    1.6
HostFilter(DCAware)       |      69 |      +30% |    1.7

DCAware nearly doubles throughput (+84%), memory footprint drops from 9.3 KB to 1.5 KB.
DefaultPolicy and HostFilter benefit transitively since they wrap DCAware.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…distances

Refactor `RackAwareRoundRobinPolicy` to simplify distance calculations and memory usage.

Add `self._remote_hosts` to cache remote host distances and `self._non_local_rack_hosts` for non-local-rack host distances.
- Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`.
- Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes).
- Snapshot `self._remote_hosts` to a local variable before reads for GIL-free Python 3.13+ safety.

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | delta | Mem KB
-----------------------------------------------------------------
DCAware                   |     188 |      +77% |       |    1.5
RackAware                 |     146 |     +115% | +115% |    2.0
TokenAware(DCAware)       |      20 |      +11% |       |    1.7
TokenAware(RackAware)     |      19 |      +12% |  +19% |    2.2
Default(DCAware)          |     123 |      +35% |       |    1.6
HostFilter(DCAware)       |      57 |       +8% |       |    1.7

RackAware more than doubles throughput (+115%), from 68 to 146 Kops/s.
TokenAware(RackAware) also benefits transitively (+12% cumulative).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…y planning.

Optimize TokenAwarePolicy query plan generation

This patch significantly improves the performance of TokenAwarePolicy by
reducing overhead in list materialization and distance calculation.

Key changes:
1. Introduced `make_query_plan_with_exclusion()` to the LoadBalancingPolicy
   interface.
   - This allows a parent policy (like TokenAware) to request a plan from
     a child policy while efficiently skipping a set of already-yielded
     hosts (replicas).
   - Implemented optimized versions in `DCAwareRoundRobinPolicy` and
     `RackAwareRoundRobinPolicy`. These implementations integrate the
     exclusion check directly into their generation loops, avoiding the
     need for inefficient external filtering or full list materialization.

2. Optimized `TokenAwarePolicy.make_query_plan`:
   - Removed list materialization of the child query plan.
   - Replaced multiple passes over replicas (checking `child.distance`
     each time) with a single pass that buckets replicas into local/remote
     lists.
   - Utilizes `make_query_plan_with_exclusion` to yield the remainder
     of the plan.
   - Added `__slots__` to reduce memory overhead and attribute access cost.
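
The exclusion idea above can be sketched as a generator that folds the membership check into the loop (a simplified, hypothetical form; the PR's actual method signature may differ):

```python
def make_query_plan_with_exclusion(child_plan, excluded):
    """Yield the child policy's plan, skipping already-yielded hosts
    (e.g. replicas that TokenAware produced first)."""
    # Avoid a redundant copy when the caller already passes a set.
    if not isinstance(excluded, set):
        excluded = set(excluded)
    for host in child_plan:
        if host not in excluded:
            yield host
```

Because the check happens inside the generation loop, no intermediate list of the child plan is ever materialized.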

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | delta | Mem KB
-----------------------------------------------------------------
DCAware                   |     159 |      +50% |       |    1.5
RackAware                 |     128 |      +88% |       |    2.0
TokenAware(DCAware)       |      81 |     +350% | +305% |    2.9
TokenAware(RackAware)     |      64 |     +276% | +237% |    3.6
Default(DCAware)          |     199 |     +119% |  +62% |    1.6
HostFilter(DCAware)       |      56 |       +6% |       |    1.7

TokenAware throughput increases ~4x: DCAware-wrapped goes from 18 to 81 Kops/s
(+350%), RackAware-wrapped from 17 to 64 Kops/s (+276%).
DefaultPolicy also benefits strongly (+119%) from the new exclusion interface.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Use index-based loops and excluded fast-paths in DC/Rack-aware policies
to speed up TokenAware replica selection and exclusion handling.
Snapshot `self._remote_hosts` to local variables in make_query_plan
and make_query_plan_with_exclusion for GIL-free Python 3.13+ safety.

Removed hot-path try/except and avoid eager list conversion.
Updated TokenAware tests/mocks for token_map and deterministic ordering.

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | delta | Mem KB
-----------------------------------------------------------------
DCAware                   |     201 |      +90% |  +26% |    1.5
RackAware                 |     163 |     +140% |  +27% |    2.0
TokenAware(DCAware)       |      96 |     +433% |  +19% |    1.7
TokenAware(RackAware)     |      88 |     +418% |  +38% |    2.2
Default(DCAware)          |     137 |      +51% |  -31% |    1.6
HostFilter(DCAware)       |      65 |      +23% |  +16% |    1.7

Index-based loops add +19-38% to TokenAware, +26-27% to DC/RackAware.
Cumulative: DCAware ~2x, RackAware ~2.4x, TokenAware ~5x vs master.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
DefaultLoadBalancingPolicy: add make_query_plan_with_exclusion
- forward exclusions to child policy
- preserve target_host preference while skipping excluded hosts

HostFilterPolicy: add make_query_plan_with_exclusion
- forward exclusions to child policy
- filter excluded hosts via predicate in exclusion-aware plans
- add isinstance guard to avoid redundant set() conversion when
  excluded is already a set

Also add isinstance guard in DefaultLoadBalancingPolicy.make_query_plan_with_exclusion
for consistency.

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | delta | Mem KB
-----------------------------------------------------------------
DCAware                   |     206 |      +94% |       |    1.5
RackAware                 |     172 |     +153% |       |    2.0
TokenAware(DCAware)       |      97 |     +439% |       |    1.7
TokenAware(RackAware)     |      89 |     +424% |       |    2.2
Default(DCAware)          |     132 |      +45% |   -4% |    1.6
HostFilter(DCAware)       |      54 |       +2% |  -17% |    1.7

This commit primarily adds exclusion-aware interfaces to Default and
HostFilter policies. The throughput deltas are within noise; the main
benefit is architectural — enabling these policies to participate in
the exclusion protocol for composition with TokenAware.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add an LRU cache (OrderedDict-based, default size 1024) to
TokenAwarePolicy that avoids repeated token-to-replica lookups for the
same (keyspace, routing_key) pair.  The cache is automatically
invalidated when the token_map object identity changes (topology
rebuild), using direct reference comparison (`is not`) instead of
`id()` to avoid stale cache hits from id reuse after GC.  Set
cache_replicas_size=0 to disable.

Only the non-tablet code path is cached; the tablet path is unchanged.

Thread-safety fixes:
- Add `super().__init__()` call to initialize `_hosts_lock` from
  LoadBalancingPolicy base class.
- Add `_cache_lock` (threading.Lock) to protect the OrderedDict-based
  LRU cache, since `move_to_end()` + `popitem()` sequences are not
  atomic even under CPython's GIL.
- Add `_hosts_lock` and `_cache_lock` to `__slots__`.

Includes 7 new unit tests for cache hit, miss (different key/keyspace),
topology invalidation, eviction, disabled mode, and tablet bypass.

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | delta | Mem KB
-----------------------------------------------------------------
DCAware                   |     200 |      +89% |       |    1.5
RackAware                 |     167 |     +146% |       |    2.0
TokenAware(DCAware)       |      64 |     +256% |  -34% |  207.5
TokenAware(RackAware)     |      62 |     +265% |  -30% |   87.1
Default(DCAware)          |     142 |      +56% |       |    1.6
HostFilter(DCAware)       |      66 |      +25% |       |    1.7

Note: The cache shows a regression vs the previous commit in this
micro-benchmark because mock get_replicas is O(1). In production with
real metadata token ring lookups, the cache amortizes that cost. The
cache adds ~87-208 KB memory for 1024 entries. The primary value of
this commit is correctness (thread-safety, cache invalidation) and
amortized lookup cost for real workloads with repeated partition keys.
Restructure the shuffle block in TokenAwarePolicy.make_query_plan to
use a nested 'if' instead of 'and', making it explicit that LWT queries
skip both the shuffle and the list copy (Paxos leader optimization).

Includes 4 new unit tests verifying: LWT deterministic ordering, LWT
skips list copy, non-LWT shuffle works, and LWT+cache determinism.
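
The restructured block amounts to the following shape (a sketch with hypothetical names, not the driver's exact code):

```python
import random


def order_replicas(replicas, is_lwt, shuffle_replicas):
    """Sketch of the nested-'if' shuffle block described above."""
    if not is_lwt:
        if shuffle_replicas:
            replicas = list(replicas)  # copy only when we will shuffle
            random.shuffle(replicas)
    # LWT path: ring order is preserved (no shuffle, no copy), so every
    # client picks the same coordinator, i.e. the Paxos leader optimization.
    return replicas
```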

Benchmark (100K queries, 45-node/5-DC topology, Python 3.14, median of 5 runs):
Policy                    |  Kops/s | vs master | Mem KB
---------------------------------------------------------
DCAware                   |     204 |      +92% |    1.5
RackAware                 |     180 |     +165% |    2.0
TokenAware(DCAware)       |      60 |     +233% |  207.5
TokenAware(RackAware)     |      57 |     +235% |   87.1
Default(DCAware)          |     132 |      +45% |    1.6
HostFilter(DCAware)       |      66 |      +25% |    1.7

Cumulative improvement over master (all 7 commits):
  DCAware:                +92%  (106 -> 204 Kops/s)
  RackAware:             +165%  ( 68 -> 180 Kops/s)
  TokenAware(DCAware):   +233%  ( 18 ->  60 Kops/s)
  TokenAware(RackAware): +235%  ( 17 ->  57 Kops/s)
  Default(DCAware):       +45%  ( 91 -> 132 Kops/s)
  HostFilter(DCAware):    +25%  ( 53 ->  66 Kops/s)

Note: TokenAware benchmark uses unique keys per query (0% cache hit
rate). Real workloads with repeated partition keys will benefit further
from the replica cache added in the previous commit. The LWT shortcut
itself has no measurable impact on non-LWT queries (which this
benchmark exercises); its benefit is deterministic ordering for LWT.
@mykaul mykaul force-pushed the query_plan_opt_2 branch from 5f283d1 to bd6a9c5 Compare March 12, 2026 20:46

mykaul commented Mar 12, 2026

Latest numbers:
Cumulative results (master -> final branch):
Policy                 | Master     | Branch     | Improvement
---------------------------------------------------------------
DCAware                | 106 Kops/s | 204 Kops/s | +92%
RackAware              |  68 Kops/s | 180 Kops/s | +165%
TokenAware(DCAware)    |  18 Kops/s |  60 Kops/s | +233%
TokenAware(RackAware)  |  17 Kops/s |  57 Kops/s | +235%
Default(DCAware)       |  91 Kops/s | 132 Kops/s | +45%
HostFilter(DCAware)    |  53 Kops/s |  66 Kops/s | +25%


Copilot AI left a comment


Pull request overview

This PR optimizes load balancing policies (DCAwareRoundRobinPolicy, RackAwareRoundRobinPolicy, TokenAwarePolicy, HostFilterPolicy) with host distance caching and general performance improvements. The key insight is caching computed host distance data (remote hosts, non-local-rack hosts) and replica lookups to avoid repeated computation in the hot query-planning path.

Changes:

  • Introduce _remote_hosts (COW dict) on DCAwareRoundRobinPolicy and RackAwareRoundRobinPolicy for O(1) distance lookups, plus _non_local_rack_hosts for rack-aware iteration; both refreshed on topology changes.
  • Add an LRU replica cache to TokenAwarePolicy (keyed by (keyspace, routing_key), invalidated on token map changes) and restructure make_query_plan to use direct distance bucketing instead of repeated yield_in_order scans.
  • Add make_query_plan_with_exclusion API to LoadBalancingPolicy and all subclasses, enabling TokenAwarePolicy to skip already-yielded replicas when querying the child policy for remaining hosts.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cassandra/policies.py Core optimization: distance caching in DC/RackAware policies, LRU replica cache in TokenAwarePolicy, new make_query_plan_with_exclusion method across policies, formatting cleanup
tests/unit/test_policies.py New tests for make_query_plan_with_exclusion, replica cache (hit/miss/eviction/invalidation/disabled), LWT determinism, tablet bypass; test mocks updated for new token_map-based replica resolution; formatting cleanup


mykaul added 2 commits March 16, 2026 18:51
- Reapply distance sorting (LOCAL_RACK -> LOCAL -> REMOTE) to non-replica
  hosts in TokenAwarePolicy.make_query_plan, restoring behavioral parity
  with the original yield_in_order approach
- Remove __slots__ from TokenAwarePolicy to avoid breaking downstream
  subclasses that add instance attributes
- Fix DCAwareRoundRobinPolicy.on_up to only refresh remote host cache
  when a host is actually new, avoiding unnecessary rebuilds during
  reconnection storms
…olicies

TokenAwarePolicy.make_query_plan now uses isinstance to detect when the
child policy already yields hosts in distance order (DCAware, RackAware),
streaming non-replica hosts directly instead of collecting and re-sorting
them by distance.  Other child policies (e.g. RoundRobinPolicy) still
get the distance re-sort fallback to guarantee correct ordering.
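
The streaming-vs-resort decision can be sketched like this (hypothetical helper and names, not the PR's exact code):

```python
def yield_non_replicas(child_policy, child_plan, distance_ordered_types):
    """Sketch of the isinstance fast path: stream when the child policy
    already yields hosts in distance order, else re-sort as a fallback."""
    if isinstance(child_policy, distance_ordered_types):
        yield from child_plan  # e.g. DCAware/RackAware: already ordered
    else:
        # Fallback: materialize and re-sort by the child's distance()
        # to guarantee correct ordering for arbitrary child policies.
        for host in sorted(child_plan, key=child_policy.distance):
            yield host
```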

Copilot AI left a comment


Pull request overview

This PR optimizes the load balancing policies in the ScyllaDB Python driver by introducing distance caching via Copy-On-Write (COW) strategy, an LRU cache for token-to-replica lookups in TokenAwarePolicy, and a new make_query_plan_with_exclusion API to avoid redundant iteration.

Changes:

  • Introduces _remote_hosts and _non_local_rack_hosts cached dictionaries/lists in DCAwareRoundRobinPolicy and RackAwareRoundRobinPolicy for O(1) distance lookups, rebuilt on topology changes.
  • Adds an LRU cache (OrderedDict) in TokenAwarePolicy for token-to-replica lookups, invalidated by token_map object identity change; includes LWT deterministic ordering (no shuffle).
  • Adds make_query_plan_with_exclusion() to LoadBalancingPolicy and its subclasses to efficiently skip already-yielded replicas in TokenAwarePolicy.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cassandra/policies.py Core performance optimizations: COW distance caching, LRU replica cache, make_query_plan_with_exclusion API, LWT shuffle skip, code formatting
tests/unit/test_policies.py New tests for exclusion-based query planning, LRU cache behavior, LWT determinism, cache invalidation, and test infrastructure updates for new mock requirements


Comment on lines +1236 to +1238
child_policy.make_query_plan_with_exclusion.assert_called()
elif child_policy.make_query_plan_with_exclusion.called:
child_policy.make_query_plan_with_exclusion.assert_called()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
