tests/benchmarks: Add VectorType deserialization benchmarks and expand test coverage #733
mykaul wants to merge 5 commits into scylladb:master
Add comprehensive benchmark comparing different deserialization strategies for VectorType with various numeric types and vector sizes.

The benchmark measures:
- Current element-by-element baseline
- struct.unpack bulk deserialization
- numpy frombuffer with tolist()
- numpy frombuffer zero-copy approach

Tested with common ML/AI embedding dimensions:
- Small vectors: 3-4 elements
- Medium vectors: 128-384 elements
- Large vectors: 768-1536 elements

Usage:

    export CASS_DRIVER_NO_CYTHON=1  # Test pure Python implementation
    python benchmarks/vector_deserialize.py

Includes CPU pinning for consistent measurements and result verification to ensure correctness of all optimization approaches.

Baseline Performance (per-operation deserialization time):

    Vector<float, 3>     : 0.88 μs
    Vector<float, 4>     : 0.78 μs
    Vector<float, 128>   : 4.72 μs
    Vector<float, 384>   : 15.38 μs
    Vector<float, 768>   : 32.43 μs
    Vector<float, 1536>  : 63.74 μs
    Vector<double, 128>  : 4.83 μs
    Vector<int, 128>     : 2.27 μs

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
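For readers unfamiliar with the first two strategies named in the benchmark commit above, here is a minimal standard-library sketch of the difference between element-by-element and bulk `struct.unpack` decoding. It is not the benchmark's code: the function names are hypothetical, and it assumes fixed-size float elements packed back to back in big-endian order with no per-element length prefix.

```python
import struct


def deserialize_elementwise(buf, count):
    """Baseline-style loop: unpack one big-endian float32 at a time."""
    unpack_from = struct.Struct('>f').unpack_from
    return [unpack_from(buf, i * 4)[0] for i in range(count)]


def deserialize_bulk(buf, count):
    """Bulk variant: a single struct.unpack call with a counted format."""
    return list(struct.unpack('>%df' % count, buf))


# Values chosen to be exactly representable as float32, so the
# round-trip comparison below is exact.
values = [1.0, 2.5, -3.0, 0.125]
payload = struct.pack('>4f', *values)

assert deserialize_elementwise(payload, 4) == values
assert deserialize_bulk(payload, 4) == values
```

Both functions return the same list; the bulk form simply replaces `count` Python-level calls with one call into the `struct` module, which is the kind of saving the benchmark is designed to measure.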
Vector type is supported on Scylla 2025.4 and above. Enable the integration tests. Tested locally against both 2025.4.2 and 2026.1 and they pass.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…numpy large vector deserialization

Add test_vector_cython_deserializer_variable_size_subtype to verify that DesVectorType correctly raises ValueError for variable-size subtypes (e.g. UTF8Type) and that the pure Python path handles them.

Add test_vector_numpy_large_deserialization to exercise the numpy deserialization path for vectors with >= 32 elements across all supported numeric types (float, double, int32, int64).
Pull request overview
Adds new benchmark and test coverage around VectorType deserialization, and refreshes integration test formatting to support vector-related testing scenarios.
Changes:
- Add a new `benchmarks/vector_deserialize.py` harness comparing multiple vector deserialization strategies across sizes/types.
- Add unit tests for `VectorType` large-vector deserialization and intended Cython fallback behavior.
- Reformat/clean up `tests/integration/standard/test_types.py` (imports/string literals/line wrapping) and keep vector test class enabled via `@requires_vector_type`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit/test_types.py | Adds new unit tests for vector deserialization behavior (including a Cython-deserializer expectation). |
| tests/integration/standard/test_types.py | Largely formatting/refactoring; keeps/organizes vector integration tests under @requires_vector_type. |
| benchmarks/vector_deserialize.py | New benchmark script to measure vector deserialization performance across approaches and configurations. |

            from cassandra.deserializers import find_deserializer
        except ImportError:
            self.skipTest('Cython deserializers not available')

        vt_text = VectorType.apply_parameters(['UTF8Type', 3], {})
        des_text = find_deserializer(vt_text)
        self.assertEqual(des_text.__class__.__name__, 'DesVectorType')

        # Cython path should raise for variable-size subtypes
        data = vt_text.serialize(['abc', 'def', 'ghi'], 5)
        with self.assertRaises(ValueError) as cm:
            des_text.deserialize_bytes(data, 5)
        self.assertIn('variable-size subtype', str(cm.exception))

find_deserializer() in cassandra/deserializers.pyx has no DesVectorType implementation (it falls back to GenericDeserializer for VectorType), and the returned deserializer classes also don’t expose a deserialize_bytes() Python method. As written, this test will fail when Cython deserializers are available (asserting DesVectorType and calling deserialize_bytes). Update the test to reflect the actual public API (or add a real DesVectorType + a Python-callable bytes entrypoint) and make the test skip/adjust when the vector-specific Cython deserializer isn’t present.
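One possible shape for the "skip/adjust" behavior suggested here is sketched below. It is an illustration only: `GenericDeserializer` and `find_deserializer_stub` are stand-ins defined in the sketch (not the driver's classes), and the test body is a placeholder for the real assertions.

```python
import unittest


class GenericDeserializer:
    """Stand-in for whatever the real find_deserializer() returns (assumption)."""


def find_deserializer_stub(cqltype):
    """Hypothetical replacement for cassandra.deserializers.find_deserializer."""
    return GenericDeserializer()


class VectorCythonFallbackTest(unittest.TestCase):
    def test_variable_size_subtype(self):
        des = find_deserializer_stub(object())
        # Skip instead of failing when no vector-specific deserializer exists.
        if des.__class__.__name__ != 'DesVectorType':
            self.skipTest('vector-specific Cython deserializer not present')
        # Skip as well when there is no Python-callable bytes entrypoint.
        if not callable(getattr(des, 'deserialize_bytes', None)):
            self.skipTest('deserialize_bytes() is not exposed to Python')
        # The real ValueError assertions would go here.


suite = unittest.defaultTestLoader.loadTestsFromTestCase(VectorCythonFallbackTest)
result = unittest.TestResult()
suite.run(result)
```

With the stub returning a generic deserializer, the test is reported as skipped rather than failed, which keeps the suite green on builds that lack the vector-specific path.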

    @test_category data_types:vector
    """
    import struct
    from cassandra.cqltypes import DoubleType

Unused import: DoubleType is imported but never referenced in this test. Please remove it to keep the test module clean.

Suggested change:

    - from cassandra.cqltypes import DoubleType


    if set(COLLECTION_TYPES) != set(["tuple", "list", "map", "set"]):
        raise NotImplemented(
            "Missing datatype not implemented: {}".format(
                set(COLLECTION_TYPES) - set(["tuple", "list", "map", "set"])
            )
        )

raise NotImplemented(...) will itself raise a TypeError because NotImplemented is a sentinel, not an exception type. If this guard is meant to fail the test when unexpected collection types are added, raise NotImplementedError instead.
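The difference is easy to confirm in isolation. The following standalone sketch (not taken from the test module; `guard_with` is a hypothetical helper) raises through each name and reports which exception type actually surfaces.

```python
def guard_with(exc_factory):
    """Raise via exc_factory and return the name of the exception that surfaced."""
    try:
        raise exc_factory("Missing datatype not implemented")
    except Exception as exc:
        return type(exc).__name__


# Calling the NotImplemented sentinel fails before anything is raised.
print(guard_with(NotImplemented))       # TypeError
# NotImplementedError is a real exception class and behaves as intended.
print(guard_with(NotImplementedError))  # NotImplementedError
```

The guard still aborts in the first case, but with a misleading `TypeError` and without the intended message about the missing datatype.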

    # Add parent directory to path
    sys.path.insert(0, '.')

sys.path.insert(0, '.') makes the benchmark’s imports depend on the current working directory (it will break if invoked from a different cwd). Prefer deriving the repo root from __file__ (as benchmarks/base.py does) and appending that path instead.

Suggested change:

    - # Add parent directory to path
    - sys.path.insert(0, '.')
    + import os
    + # Add repository root (parent of this file's directory) to sys.path
    + _this_dir = os.path.dirname(os.path.abspath(__file__))
    + _repo_root = os.path.dirname(_this_dir)
    + if _repo_root not in sys.path:
    +     sys.path.append(_repo_root)


    elif element_type == ShortType:
        values = list(range(min(vector_size, 32767)))
        pack_fn = int16_pack
    else:

create_test_data() returns fewer than vector_size elements for ShortType when vector_size > 32767 (range(min(vector_size, 32767))), which can silently produce undersized serialized buffers and misleading benchmark results. Either always generate exactly vector_size values (wrapping/clamping into the valid smallint range) or raise when an unsupported vector size is requested for ShortType.
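A minimal sketch of the first option (always generating exactly `vector_size` values by wrapping into the signed 16-bit range) could look like the following. `smallint_values` is a hypothetical helper, not a function from the benchmark, and the packing step assumes contiguous big-endian int16 elements.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def smallint_values(vector_size):
    """Return exactly vector_size ints, wrapped into the signed 16-bit range."""
    span = INT16_MAX - INT16_MIN + 1  # 65536 distinct smallint values
    return [((i - INT16_MIN) % span) + INT16_MIN for i in range(vector_size)]


values = smallint_values(40000)
# Every value fits in '>h', so packing never raises struct.error.
payload = struct.pack('>%dh' % len(values), *values)

assert len(values) == 40000
assert len(payload) == 2 * 40000
```

Indices 0 through 32767 map to themselves, and index 32768 wraps to -32768, so small vectors keep the same test data as before while oversized ones no longer shrink silently.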

    try:
        from cassandra.deserializers import find_deserializer
    except ImportError:
        return None, None, None

    protocol_version = 4

    # Get the Cython deserializer
    deserializer = find_deserializer(vector_type)

    # Check if we got the Cython deserializer
    if deserializer.__class__.__name__ != 'DesVectorType':
        return None, None, None

    start = time.perf_counter()
    for _ in range(iterations):
        result = deserializer.deserialize_bytes(serialized_data, protocol_version)
    end = time.perf_counter()

cassandra.deserializers.Deserializer (the Cython extension in this repo) does not expose a Python-callable deserialize_bytes() method, so this benchmark’s Cython path won’t work as written (it will either never run because there is no DesVectorType, or it will raise AttributeError if such a class is added without that method). Consider benchmarking via an existing public entrypoint (e.g., the row parser / protocol decoding path) or add an explicit Python wrapper method in the Cython deserializer API.
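Until such an entrypoint exists, the benchmark could at least degrade gracefully rather than raise `AttributeError`. The sketch below shows one way to do that; `time_optional` and `WithEntrypoint` are hypothetical names, and the `deserialize_bytes(data, protocol_version)` signature is taken from the snippet under review, not from the driver's actual API.

```python
import time


def time_optional(deserializer, data, protocol_version, iterations=1000):
    """Return mean seconds per call, or None if no Python-callable entrypoint exists."""
    entry = getattr(deserializer, 'deserialize_bytes', None)
    if not callable(entry):
        return None  # caller can report "not available" for this strategy
    start = time.perf_counter()
    for _ in range(iterations):
        entry(data, protocol_version)
    return (time.perf_counter() - start) / iterations


class WithEntrypoint:
    """Stand-in object that exposes the assumed method (illustration only)."""

    def deserialize_bytes(self, data, protocol_version):
        return list(data)


missing = time_optional(object(), b'\x00\x01', 4, iterations=10)
present = time_optional(WithEntrypoint(), b'\x00\x01', 4, iterations=10)
```

Here `missing` is `None` and `present` is a non-negative float, so the reporting code can distinguish "strategy unavailable" from a measured time.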
Adds benchmarks/vector_serialize.py mirroring the existing deserialization benchmark (vector_deserialize.py).

Tests four serialization strategies:
1. Current VectorType.serialize() baseline (io.BytesIO per-element loop)
2. Python struct.pack with batch format string (e.g., '>1536f')
3. Cython SerVectorType serializer (placeholder, not yet implemented)
4. BoundStatement.bind() end-to-end with 1 vector column

Covers float, double, and int32 subtypes at dimensions 3, 128, 768, 1536.

Initial results show struct.pack is 5-11x faster than the io.BytesIO baseline, confirming the opportunity for Cython serialization optimization.
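To make the first two serialization strategies concrete, here is a small standard-library sketch of a per-element `io.BytesIO` loop next to a single batched `struct.pack` call. The function names are hypothetical and the code is not taken from `VectorType.serialize()`; it assumes float32 elements written contiguously in big-endian order.

```python
import io
import struct


def serialize_per_element(values):
    """Loop-style baseline: pack and write each float32 separately."""
    buf = io.BytesIO()
    pack = struct.Struct('>f').pack
    for v in values:
        buf.write(pack(v))
    return buf.getvalue()


def serialize_batch(values):
    """Batch variant: one struct.pack call with a counted format such as '>1536f'."""
    return struct.pack('>%df' % len(values), *values)


vector = [0.5] * 1536
assert serialize_per_element(vector) == serialize_batch(vector)
assert len(serialize_batch(vector)) == 1536 * 4
```

Both paths produce identical bytes; any speed difference comes from how many Python-level calls are made per vector, which this sketch does not attempt to measure.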
Summary
Commits (4)
1. benchmarks: Add VectorType deserialization performance benchmark
New `benchmarks/vector_deserialize.py` (320 lines) testing: `VectorType.deserialize()`, raw `struct.unpack`, `numpy.frombuffer().tolist()`, Cython `DesVectorType`
2. benchmarks: expand vector sizes
Add double[768], double[1536], int32[64] configurations.
3. tests: enable vector integration tests on Scylla 2025.4+
Re-enable vector integration tests that were previously skipped for Scylla. Tested against Scylla 2025.4.2 and 2026.1.
4. tests: add coverage for variable-size VectorType Cython fallback and numpy large vector deserialization
`DesVectorType` raises `ValueError` for variable-size subtypes (UTF8Type) while pure Python handles them
No production code changes; benchmark and test files only.