(Improvement) faster metadata schema parsing by mykaul · Pull Request #745 · scylladb/python-driver

mykaul · 2026-03-14T09:10:36Z

Reduce the time and memory required to parse schema metadata on refresh. The biggest change is to the SELECT query on system_schema.columns, whose result set is large, especially when there are many tables.
Overall performance results (new vs. previous):
Row creation + access: 323 ns/row vs 485 ns/row (1.50x faster)
_build_table_columns: 9.0 us/table vs 9.9 us/table (1.10x faster)
Full pipeline (100 tables x 20 cols): 0.79 ms vs 1.57 ms (1.98x faster) <--- not bad, finally an improvement measured in ms.
Memory per row: 48 bytes vs 272 bytes (5.7x reduction)
__slots__ per instance: 80 bytes (saves ~104 bytes of __dict__ overhead)

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

Copilot

Pull request overview

This PR aims to reduce CPU time and memory usage during schema metadata refresh by avoiding per-row dict allocations and trimming schema column queries, while also shrinking metadata object overhead.

Changes:

Introduces an internal _RowView + _row_factory and routes schema query result handling through it to reduce per-row allocations.
Adds __slots__ to several metadata model classes and replaces some OrderedDict usages with plain dict to reduce memory overhead.
Narrows the system_schema.columns query in SchemaParserV3 to only the fields needed by the parser and refactors _build_table_columns to classify rows in a single pass.

cassandra/metadata.py

Add __slots__ to KeyspaceMetadata, UserType, Aggregate, Function, TableMetadata, TableMetadataV3, TableMetadataDSE68, ColumnMetadata, IndexMetadata, TriggerMetadata, and MaterializedViewMetadata. This reduces per-instance memory overhead by eliminating __dict__ on these frequently-instantiated objects during schema refresh. All attributes previously set as class-level defaults are now initialized in __init__ to satisfy the slots contract.
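A minimal sketch of the `__slots__` pattern this commit message describes. The two classes below are hypothetical stand-ins, not the driver's actual `ColumnMetadata`; they only illustrate why class-level defaults have to move into `__init__` once a name is listed in `__slots__`.

```python
class ColumnWithDict:
    """Ordinary class: every instance carries its own __dict__."""

    is_static = False  # class-level default, shared until an instance overrides it

    def __init__(self, name, cql_type):
        self.name = name
        self.cql_type = cql_type


class ColumnWithSlots:
    """Slotted class: attributes live in fixed slots, with no per-instance __dict__."""

    __slots__ = ("name", "cql_type", "is_static")

    def __init__(self, name, cql_type):
        self.name = name
        self.cql_type = cql_type
        # A name listed in __slots__ cannot also be a class-level default
        # (Python raises ValueError when the class is created), so the
        # default is assigned here instead.
        self.is_static = False


plain = ColumnWithDict("id", "uuid")
slotted = ColumnWithSlots("id", "uuid")

has_dict_plain = hasattr(plain, "__dict__")      # instance dict is present
has_dict_slotted = hasattr(slotted, "__dict__")  # instance dict is absent
```

One visible side effect of the pattern: assigning an attribute that is not listed in `__slots__` raises `AttributeError`, so code that attaches ad-hoc attributes to such objects would need adjusting.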

Python 3.7+ guarantees dict preserves insertion order, making OrderedDict unnecessary. Replace OrderedDict() with {} in TableMetadata.columns, TableMetadata.triggers, and MaterializedViewMetadata.columns. Remove the now-unused OrderedDict import.
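The ordering guarantee can be checked with a short, self-contained comparison. The column names are made up for illustration.

```python
from collections import OrderedDict

names = ("pk", "ck", "value")

# Plain dict: iteration follows insertion order (a language guarantee since Python 3.7).
columns = {}
for name in names:
    columns[name] = name.upper()

# OrderedDict built from the same sequence, for comparison.
ordered = OrderedDict((name, name.upper()) for name in names)

same_order = list(columns) == list(ordered)
```

The two types are not identical in every respect: `OrderedDict` equality is order-sensitive and it offers `move_to_end`, while a plain `dict` does neither. The commit message does not say whether any caller relied on those behaviours, so that remains an open assumption here.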

….columns Replace SELECT * with an explicit column list for the system_schema.columns query in SchemaParserV3 (inherited by V4). Only the 7 columns actually consumed by the parser are fetched: keyspace_name, table_name, column_name, clustering_order, kind, position, type. This reduces network transfer and deserialization overhead during schema refresh.
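As a rough illustration, the narrowed statement can be assembled from the seven column names listed above. The variable names are hypothetical; the commit message does not show how the driver stores the query string.

```python
# The seven columns the commit message says the parser consumes.
NEEDED_COLUMNS = (
    "keyspace_name",
    "table_name",
    "column_name",
    "clustering_order",
    "kind",
    "position",
    "type",
)

# Previous form: fetch every column of system_schema.columns.
select_all = "SELECT * FROM system_schema.columns"

# Narrowed form: fetch only what the parser reads.
select_needed = "SELECT {} FROM system_schema.columns".format(", ".join(NEEDED_COLUMNS))
```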

Introduce _RowView, a __slots__-based read-only row wrapper that stores data as tuples with a shared column-name-to-index map, and _row_factory that creates these views. Replace dict_factory in _SchemaParser._handle_results and get_column_from_system_local (both reachable from the V4 code path). This eliminates per-row dict allocation during schema parsing. All rows from the same result set share a single index map object. Also refactor SchemaParserV4._build_keyspace_metadata_internal to read from the row without mutating it, since _RowView is read-only. Note: V22-only dict_factory call sites are left unchanged as they do not affect the V3/V4 code path (V3 and V4 fully override _query_all).
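The idea can be sketched as follows. This is an illustrative reconstruction based only on the description above, not the PR's actual code: the names `RowView` and `row_factory` and the method bodies are assumptions. The `(colnames, rows)` signature mirrors the driver's existing row factories such as `dict_factory`.

```python
class RowView:
    """Read-only row: a tuple of values plus a name-to-index map shared by all rows."""

    __slots__ = ("_index", "_values")

    def __init__(self, index, values):
        self._index = index    # same dict object for every row of a result set
        self._values = values  # this row's values, in column order

    def __getitem__(self, name):
        # Raises KeyError for an unknown column, like a dict would.
        return self._values[self._index[name]]

    def get(self, name, default=None):
        pos = self._index.get(name)
        return default if pos is None else self._values[pos]

    def __contains__(self, name):
        return name in self._index

    # No __setitem__ is defined, so item assignment raises TypeError.


def row_factory(colnames, rows):
    """Build the index map once, then wrap each row tuple without copying into a dict."""
    index = {name: i for i, name in enumerate(colnames)}
    return [RowView(index, tuple(row)) for row in rows]


rows = row_factory(
    ["keyspace_name", "table_name", "kind"],
    [("ks", "t1", "partition_key"), ("ks", "t1", "regular")],
)
first_kind = rows[0]["kind"]
missing = rows[1].get("no_such_column")
```

Because the view is read-only, code that previously mutated the row dict has to copy the values it needs instead, which is why the commit also changes `_build_keyspace_metadata_internal`.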

…able_columns Rewrite _build_table_columns to classify columns by kind in a single pass instead of iterating col_rows three times with list comprehensions. This also fixes a bug where the third pass filtered on 'clustering_key' instead of 'clustering', causing clustering columns to leak through and get re-processed as regular columns. Additionally, use in-place sort() instead of sorted() to avoid creating intermediate list copies, and append the already-built column_meta object to partition_key/clustering_key instead of re-looking it up from meta.columns by name.

Combined benchmark results for the full optimization series (A-F):
Row creation + access: 323 ns/row vs 485 ns/row (1.50x faster)
_build_table_columns: 9.0 us/table vs 9.9 us/table (1.10x faster)
Full pipeline (100 tables x 20 cols): 0.79 ms vs 1.57 ms (1.98x faster)
Memory per row: 48 bytes vs 272 bytes (5.7x reduction)
__slots__ per instance: 80 bytes (saves ~104 bytes __dict__ overhead)
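The single-pass classification described in this commit message might look roughly like the sketch below. `classify_columns` is a hypothetical helper operating on plain dicts. The real `_build_table_columns` also builds column metadata objects and handles cases the commit message does not detail, which are omitted here.

```python
def classify_columns(col_rows):
    """Split column rows by kind in one pass, then sort the key columns by position."""
    partition_rows, clustering_rows, other_rows = [], [], []
    for row in col_rows:
        kind = row.get("kind")
        if kind == "partition_key":
            partition_rows.append(row)
        elif kind == "clustering":
            # Matching on 'clustering' (not 'clustering_key') keeps these rows
            # out of the remaining bucket, which is the bug fix described above.
            clustering_rows.append(row)
        else:
            other_rows.append(row)

    # In-place sort avoids the extra list copy that sorted() would create.
    partition_rows.sort(key=lambda r: r.get("position"))
    clustering_rows.sort(key=lambda r: r.get("position"))
    return partition_rows, clustering_rows, other_rows


col_rows = [
    {"column_name": "v", "kind": "regular", "position": -1},
    {"column_name": "ck", "kind": "clustering", "position": 0},
    {"column_name": "pk2", "kind": "partition_key", "position": 1},
    {"column_name": "pk1", "kind": "partition_key", "position": 0},
]
partition, clustering, other = classify_columns(col_rows)
```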

Copilot

Pull request overview

This PR optimizes schema metadata refresh by reducing per-row allocations during schema parsing and narrowing the system_schema.columns select list to only the fields needed for building table/column metadata.

Changes:

Introduces an internal lightweight row representation (_RowView + _row_factory) and uses it in schema parser result handling to reduce time/memory overhead.
Reduces the system_schema.columns query to fetch only required columns and refactors _build_table_columns to classify rows in a single pass.
Adds unit tests for _RowView and _row_factory behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
cassandra/metadata.py	Adds `_RowView`/`_row_factory`, switches schema parser row handling away from per-row dicts, tightens `system_schema.columns` query, and updates metadata classes/docstrings/slots.
tests/unit/test_metadata.py	Adds unit tests validating `_RowView` and `_row_factory` semantics (getitem/get/contains/read-only/shared index map).

mykaul marked this pull request as draft March 14, 2026 09:10

mykaul changed the title ~~(Improvement) metadata schema parsing~~ (Improvement) faster metadata schema parsing Mar 14, 2026

mykaul requested a review from Copilot March 14, 2026 09:11

Copilot started reviewing on behalf of mykaul March 14, 2026 09:11

Copilot AI reviewed Mar 14, 2026

This was referenced Mar 14, 2026

Tracking: Vector search (VectorType) performance improvement PRs #746

Open

Tracking: General (non-vector) performance improvement PRs #747

Open

mykaul added 5 commits March 14, 2026 23:42

mykaul force-pushed the improvement/metadata-schema-parsing branch from 3a7f4bc to 4f9e4f5 Compare March 14, 2026 21:42

mykaul requested a review from Copilot March 15, 2026 15:40

Copilot started reviewing on behalf of mykaul March 15, 2026 15:41

Copilot AI reviewed Mar 15, 2026

mykaul self-assigned this Mar 16, 2026

mykaul wants to merge 5 commits into scylladb:master from mykaul:improvement/metadata-schema-parsing
