(improvement) query: add Cython-aware serializer path in BoundStatement.bind()#749
mykaul wants to merge 2 commits into scylladb:master
Pull request overview
This PR introduces an optional, Cython-accelerated serialization path for BoundStatement.bind() to reduce per-value overhead (especially for large VectorType columns) when Cython serializers are available and no column encryption policy is enabled.
Changes:
- Add a new Cython `cassandra.serializers` extension (with `.pyx` + `.pxd`) providing `Serializer` implementations and a `make_serializers()` factory.
- Add lazy caching of per-column serializer objects on `PreparedStatement` via a `_serializers` property.
- Split the bind loop into three branches: column encryption policy, Cython fast path, and pure-Python fallback (with reduced per-value overhead in the fallback).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `cassandra/serializers.pyx` | Adds Cython serializer implementations (scalar + VectorType) and lookup/factory functions. |
| `cassandra/serializers.pxd` | Exposes the Serializer cdef interface for Cython usage. |
| `cassandra/query.py` | Integrates the optional Cython serializer path into PreparedStatement/BoundStatement.bind() with lazy caching and branch selection. |
Pull request overview
Adds an optimized parameter binding path that can leverage Cython Serializer objects (when available) to speed up BoundStatement.bind(), while preserving the existing column-encryption behavior and improving the plain-Python fallback.
Changes:
- Add `PreparedStatement._serializers` lazy cache and a three-way bind loop in `BoundStatement.bind()` (CE policy / Cython fast path / Python fallback).
- Introduce Cython `Serializer` implementations for Float/Double/Int32/Vector and a `make_serializers()` factory.
- Extend unit tests to exercise the Cython-serializer bind branch via injected stub serializers.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `cassandra/query.py` | Adds Cython-serializer availability detection, caches serializers on PreparedStatement, and selects the new bind fast path when safe. |
| `cassandra/serializers.pyx` | Implements Cython serializers (including optimized VectorType) and factory/lookup helpers. |
| `cassandra/serializers.pxd` | Declares the Cython Serializer interface for cross-module typing. |
| `tests/unit/test_parameter_binding.py` | Adds tests for the new bind branch and error-wrapping behavior using stub serializers. |
Pull request overview
Adds a Cython-aware fast path to BoundStatement.bind() so prepared statements can use cached per-column serializer objects (when available and no column encryption policy is active), reducing Python dispatch and per-value overhead during binding.
Changes:
- Add lazy `PreparedStatement._serializers` cache and split `BoundStatement.bind()` into CE-policy, Cython-serializer, and pure-Python paths.
- Introduce Cython `Serializer` implementations (including optimized `VectorType` serialization) and a `make_serializers()` factory.
- Expand/adjust unit tests to cover the new Cython bind branch via injected stub serializers.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `cassandra/query.py` | Adds serializer caching and a new bind fast path (plus significant formatting-only edits in the same module). |
| `cassandra/serializers.pyx` | Introduces Cython serializer implementations and factory/lookup helpers. |
| `cassandra/serializers.pxd` | Declares the Cython Serializer interface for cross-module use. |
| `tests/unit/test_parameter_binding.py` | Adds tests to exercise the Cython bind path without requiring compiled Cython. |
…torType

Add cassandra/serializers.pyx and cassandra/serializers.pxd implementing Cython-optimized serialization that mirrors the deserializers.pyx architecture.

Implements type-specialized serializers for the three subtypes commonly used in vector columns:
- SerFloatType: 4-byte big-endian IEEE 754 float
- SerDoubleType: 8-byte big-endian double
- SerInt32Type: 4-byte big-endian signed int32

SerVectorType pre-allocates a contiguous buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes. GenericSerializer delegates to the Python-level cqltype.serialize() classmethod.

Range checks for float32 and int32 values prevent silent truncation from C-level casts, matching the behavior of struct.pack().

Factory functions find_serializer() and make_serializers() allow easy lookup and batch creation of serializers for column types.

Benchmarks show ~30x speedup over the current io.BytesIO baseline and ~3x speedup over Python struct.pack for Vector<float, 1536> serialization.

No setup.py changes needed - the existing cassandra/*.pyx glob already picks up new .pyx files.
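For readers without the Cython sources at hand, the difference between the element-by-element `io.BytesIO` baseline and a pre-sized buffer can be approximated in pure Python. This is only a sketch: both function names are hypothetical, and `struct.pack_into` stands in for the C-level byte swap that the commit message describes, so it does not reproduce the quoted speedups.

```python
import io
import struct


def serialize_vector_bytesio(values):
    """Baseline: one struct.pack call and one buffer write per element."""
    buf = io.BytesIO()
    for v in values:
        buf.write(struct.pack(">f", v))  # 4-byte big-endian IEEE 754 float
    return buf.getvalue()


def serialize_vector_preallocated(values):
    """Size the output once, then pack every element in a single call."""
    n = len(values)
    buf = bytearray(4 * n)  # contiguous buffer, allocated up front
    struct.pack_into(">%df" % n, buf, 0, *values)
    return bytes(buf)


vec = [0.5, -1.25, 3.0]
# Both strategies must produce identical wire bytes.
assert serialize_vector_bytesio(vec) == serialize_vector_preallocated(vec)
```

Both variants raise `OverflowError` for a value that does not fit in a float32, which is the behavior the range checks in the Cython module are said to match.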
…nt.bind()

When Cython serializers (from cassandra.serializers) are available and no column encryption policy is active, BoundStatement.bind() now uses pre-built Serializer objects cached on the PreparedStatement instead of calling cqltype classmethods. This avoids per-value Python method dispatch overhead and enables the ~30x vector serialization speedup from the Cython serializers module.

The bind loop is split into three paths:
1. Column encryption policy path (unchanged behavior)
2. Cython serializers path (new fast path)
3. Plain Python path (no CE, no Cython -- removes per-value ColDesc/CE check)

Depends on PR scylladb#748 (Cython serializers module) and PR scylladb#630 (CE-policy bind split).
7de782f to 8c03c2f
Pull request overview
Adds a Cython-aware fast path to BoundStatement.bind() by reworking the bind loop into distinct branches (column encryption vs Cython serializers vs pure-Python), and introduces the Cython serializer module used by the new path.
Changes:
- Add lazy cached `PreparedStatement._serializers` and update `BoundStatement.bind()` to use Cython `Serializer` objects when available and CE policy is not active.
- Factor bind-time serialization error wrapping into a shared helper and extend wrapping to include `OverflowError`.
- Add unit tests that exercise the new Cython bind branch via injected stub serializers, plus overflow wrapping coverage.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `cassandra/query.py` | Adds _serializers cache on PreparedStatement, a shared bind error wrapper, and splits BoundStatement.bind() into CE / Cython / Python paths. |
| `cassandra/serializers.pyx` | Introduces Cython Serializer implementations (scalar + vector) and serializer factory helpers. |
| `cassandra/serializers.pxd` | Exposes the Serializer cdef interface for Cython interop. |
| `tests/unit/test_parameter_binding.py` | Adds unit tests to validate the new Cython bind path behavior and error wrapping. |
    """Wrap serialization errors with column context for all bind loop paths."""
    actual_type = type(value)
    message = ('Received an argument of invalid type for column "%s". '
               'Expected: %s, Got: %s; (%s)' % (col_spec.name, col_spec.type, actual_type, exc))
    raise TypeError(message)

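A self-contained sketch of how such a wrapper might be used from a bind loop follows. `ColumnSpec`, `raise_bind_error`, and `serialize_int32` are hypothetical names chosen for illustration, not the driver's own; only the message format is taken from the excerpt above. The `except` clause lists `OverflowError` alongside `TypeError` and `struct.error`, reflecting the review note that wrapping was extended to cover overflow.

```python
import struct
from collections import namedtuple

# Hypothetical stand-in for one entry of a prepared statement's column metadata.
ColumnSpec = namedtuple("ColumnSpec", ["name", "type"])


def raise_bind_error(col_spec, value, exc):
    """Re-raise a serialization failure as TypeError with column context."""
    message = ('Received an argument of invalid type for column "%s". '
               'Expected: %s, Got: %s; (%s)'
               % (col_spec.name, col_spec.type, type(value), exc))
    raise TypeError(message)


def serialize_int32(col_spec, value):
    """Serialize one bound value, funnelling all failures through the wrapper."""
    try:
        return struct.pack(">i", value)
    except (TypeError, struct.error, OverflowError) as exc:
        raise_bind_error(col_spec, value, exc)


spec = ColumnSpec(name="id", type="int")
assert serialize_int32(spec, 7) == b"\x00\x00\x00\x07"
```

Routing every branch through one helper keeps the error text identical no matter which bind path produced the failure.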
    """Raise OverflowError for values outside the signed int32 range.

    This matches the behavior of struct.pack('>i', value), which raises
    struct.error for values outside [-2147483648, 2147483647]. The check
    must be done on the Python int *before* the C-level <int32_t> cast,
    which would silently truncate.
    """
    if value > 2147483647 or value < -2147483648:
        raise OverflowError(
            "Value %r out of range for int32 "
            "(must be between -2147483648 and 2147483647)" % (value,)
        )

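The truncation the docstring warns about can be reproduced from pure Python. In the sketch below, `ctypes.c_int32` imitates the unchecked C-level cast, and `check_int32_range` is a hypothetical stand-in mirroring the check in the excerpt above.

```python
import ctypes
import struct

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def check_int32_range(value):
    """Reject values a C int32 cast would otherwise wrap around silently."""
    if value > INT32_MAX or value < INT32_MIN:
        raise OverflowError(
            "Value %r out of range for int32 "
            "(must be between %d and %d)" % (value, INT32_MIN, INT32_MAX)
        )


too_big = INT32_MAX + 1

# An unchecked cast wraps to the most negative int32 without any error.
assert ctypes.c_int32(too_big).value == INT32_MIN

# struct.pack refuses the same value, which is the behavior being matched.
try:
    struct.pack(">i", too_big)
    raise AssertionError("struct.pack should have rejected the value")
except struct.error:
    pass

# The explicit check restores the failure ahead of the cast.
try:
    check_int32_range(too_big)
    raise AssertionError("range check should have rejected the value")
except OverflowError:
    pass
```

The two failure modes raise different exception types (`struct.error` versus `OverflowError`), which is presumably why the shared bind error wrapper had to be extended to catch `OverflowError` as well.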
    cdef inline bytes _serialize_generic(self, object values, int protocol_version):
        """Fallback: element-by-element Python serialization for non-optimized types."""
        import io
        from cassandra.marshal import uvint_pack

        serialized_size = self.subtype.serial_size()
        buf = io.BytesIO()
        for item in values:

Summary
- When Cython serializers (`cassandra.serializers`, from PR #748, "(improvement) serializers: add Cython-optimized serialization for VectorType") are available and no column encryption policy is active, `BoundStatement.bind()` uses pre-built `Serializer` objects cached on the `PreparedStatement` instead of calling cqltype classmethods.
- The plain Python path drops the per-value `ColDesc` construction and `ce_policy` check that was previously done unconditionally.

Dependencies
- PR #748 (`cassandra/serializers.pyx`): provides `make_serializers()` and the `Serializer` classes used by the new fast path.

Performance
End-to-end `BoundStatement.bind()` benchmarks

Measured on a single CPU core (pinned), Python 3.14, comparing the Cython serializers path vs the plain Python path. Workloads covered:
- `Vector<float, 128>` (1 column)
- `Vector<float, 768>` (1 column)
- `Vector<float, 1536>` (1 column)
- `Vector<float, 1536>` + 3 scalars (realistic INSERT)

Key observations:
- The Cython vector serializer replaces the per-element `io.BytesIO` loop with a pre-allocated C buffer and inline byte-swap.
- Both non-CE paths skip the per-value `ColDesc` namedtuple construction and `ce_policy` checks.

Memory savings
The old bind loop created a `ColDesc` namedtuple (~72 bytes) for every bound value, even when no column encryption policy was set. Both new non-CE paths eliminate this entirely.

Per `bind()` call, for N columns, the items compared are:
- `ColDesc` namedtuples
- `ce_policy and ...` evaluations
- the `ce_policy.column_type(...)` ternary

Per VectorType column (D-dimensional float/double/int32), additionally, comparing `VectorType.serialize()` with `SerVectorType`:
- `io.BytesIO()` instance
- `bytes` per element
- `buf.getvalue()` final copy
- C buffer (`malloc`)
- Python method calls (`subtype.serialize` + `buf.write`)

Concrete example: A prepared INSERT with 5 scalar columns + 1 `Vector<float, 1536>` column eliminates ~1545 transient Python objects and ~88 KB of transient allocations per `bind()` call, plus ~3076 Python method calls.

Design
`PreparedStatement._serializers` (lazy cached property)

On first access, calls `make_serializers([col.type for col in self.column_metadata])`, which returns a list of Cython `Serializer` objects (one per bind column). Returns `None` if the Cython serializers cannot be used, for example when the compiled `cassandra.serializers` module is unavailable.

The cache is safe because `column_metadata` is immutable after `PreparedStatement` creation. Thread safety relies on the benign-race pattern: the computation is idempotent and attribute assignment is atomic, so concurrent first accesses at worst build the list twice.

Three-way bind loop
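As a rough illustration of the lazy cache and the three branches, the toy model below uses stub classes in place of the real driver types. Everything here is hypothetical except the overall shape: the `serialize(value, protocol_version)` method on the stub, the `Prepared` class, its `have_cython` flag, and this `make_serializers` are illustrative stand-ins, and the column encryption branch is deliberately left unimplemented.

```python
import struct


class StubInt32Serializer:
    """Hypothetical stand-in for a compiled Cython Serializer object."""

    def serialize(self, value, protocol_version):
        return struct.pack(">i", value)


class Int32Type:
    """Toy cqltype with a classmethod serialize(), as used by the Python path."""

    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)


def make_serializers(col_types):
    """Illustrative factory: one serializer object per bind column."""
    return [StubInt32Serializer() for _ in col_types]


class Prepared:
    def __init__(self, column_types, have_cython=True, ce_policy=None):
        self.column_types = tuple(column_types)  # immutable after creation
        self.have_cython = have_cython
        self.ce_policy = ce_policy
        self._cached = None
        self._resolved = False

    @property
    def _serializers(self):
        # Benign race: two threads may both compute, but the results are
        # equivalent and each attribute assignment is atomic.
        if not self._resolved:
            self._cached = (make_serializers(self.column_types)
                            if self.have_cython else None)
            self._resolved = True
        return self._cached


def bind(prepared, values, protocol_version=4):
    if prepared.ce_policy is not None:
        # Path 1: column encryption policy (not sketched here).
        raise NotImplementedError("column encryption path is out of scope")
    serializers = prepared._serializers
    if serializers is not None:
        # Path 2: cached serializer objects, iterated in lockstep with values.
        return [ser.serialize(v, protocol_version)
                for ser, v in zip(serializers, values)]
    # Path 3: plain Python, calling the cqltype classmethod per value.
    return [t.serialize(v, protocol_version)
            for t, v in zip(prepared.column_types, values)]


fast = bind(Prepared([Int32Type, Int32Type]), [1, 2])
slow = bind(Prepared([Int32Type, Int32Type], have_cython=False), [1, 2])
assert fast == slow  # both paths must yield the same wire bytes
```

The branch is chosen once per `bind()` call rather than once per value, which is where the per-value savings described in this PR come from.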
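The loop has three paths: column encryption policy (path 1, unchanged behavior), Cython serializers (path 2), and plain Python (path 3). Path 2 uses `zip(serializers, values, col_meta)` to iterate all three in lockstep without index overhead. Path 3 removes the per-value `ColDesc` construction and `ce_policy` check from the original code.

Testing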
- Existing unit tests (`test_query.py`, `test_types.py`)
- Pure-Python fallback (`_HAVE_CYTHON_SERIALIZERS = False`)