tests/benchmarks: Add VectorType deserialization benchmarks and expand test coverage by mykaul · Pull Request #733 · scylladb/python-driver

mykaul · 2026-03-07T10:02:05Z

Summary

- Add VectorType deserialization benchmark harness testing 4 strategies across multiple vector sizes and types
- Expand benchmark configurations to include larger vector sizes and more type combinations
- Enable vector integration tests on Scylla 2025.4+
- Add unit test coverage for variable-size VectorType Cython fallback and numpy large vector deserialization

Commits (4)

1. benchmarks: Add VectorType deserialization performance benchmark

New benchmarks/vector_deserialize.py (320 lines) testing:

- 4 strategies: VectorType.deserialize(), raw struct.unpack, numpy.frombuffer().tolist(), Cython DesVectorType
- Vector sizes: 3, 4, 128, 384, 768, 1536 (float); 128 (double, int)
- Iteration counts scaled by vector size for stable measurements
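
As a rough illustration of two of the strategies listed above (this is not the code in benchmarks/vector_deserialize.py), the sketch below decodes a big-endian float vector element by element and with a single bulk struct.unpack call. The function names are hypothetical, and a fixed 4-byte float subtype is assumed.

```python
import struct

def deserialize_elementwise(buf, size):
    """Decode `size` big-endian float32 values one at a time."""
    return [struct.unpack_from('>f', buf, i * 4)[0] for i in range(size)]

def deserialize_bulk(buf, size):
    """Decode the whole vector with a single format string, e.g. '>128f'."""
    return list(struct.unpack('>%df' % size, buf))

size = 128
values = [float(i) for i in range(size)]   # small integers are exact in float32
payload = struct.pack('>%df' % size, *values)

# Both strategies must agree before their timings are worth comparing.
assert deserialize_elementwise(payload, size) == values
assert deserialize_bulk(payload, size) == values
```

The bulk variant does one format parse and one call per vector, which is presumably where most of the difference between the strategies comes from.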

2. benchmarks: expand vector sizes

Add double[768], double[1536], int32[64] configurations.

3. tests: enable vector integration tests on Scylla 2025.4+

Re-enable vector integration tests that were previously skipped for Scylla. Tested against Scylla 2025.4.2 and 2026.1.

4. tests: add coverage for variable-size VectorType Cython fallback and numpy large vector deserialization

- Test that DesVectorType raises ValueError for variable-size subtypes (UTF8Type) while pure Python handles them
- Exercise the numpy deserialization path for 64-element vectors across float, double, int32, int64
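
The four numeric subtypes named above map onto fixed-width big-endian struct codes. A minimal round-trip check over 64-element vectors might look like the following; it uses only the standard library rather than the driver's numpy path, and the FORMATS table and round_trip helper are illustrative names, not driver APIs.

```python
import struct

# Big-endian struct codes for the four fixed-size subtypes (illustrative table).
FORMATS = {'float': 'f', 'double': 'd', 'int32': 'i', 'int64': 'q'}

def round_trip(type_name, values):
    """Pack values as one contiguous big-endian vector and unpack them again."""
    fmt = '>%d%s' % (len(values), FORMATS[type_name])
    return list(struct.unpack(fmt, struct.pack(fmt, *values)))

for name in FORMATS:
    if name in ('float', 'double'):
        sample = [i * 0.5 for i in range(64)]   # exact in float32 and float64
    else:
        sample = list(range(-32, 32))           # 64 signed integers
    assert round_trip(name, sample) == sample
```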

No production code changes — benchmark and test files only.

Add comprehensive benchmark comparing different deserialization strategies for VectorType with various numeric types and vector sizes.

The benchmark measures:
- Current element-by-element baseline
- struct.unpack bulk deserialization
- numpy frombuffer with tolist()
- numpy frombuffer zero-copy approach

Tested with common ML/AI embedding dimensions:
- Small vectors: 3-4 elements
- Medium vectors: 128-384 elements
- Large vectors: 768-1536 elements

Usage:
  export CASS_DRIVER_NO_CYTHON=1  # Test pure Python implementation
  python benchmarks/vector_deserialize.py

Includes CPU pinning for consistent measurements and result verification to ensure correctness of all optimization approaches.

Baseline Performance (per-operation deserialization time):
  Vector<float, 3>    : 0.88 μs
  Vector<float, 4>    : 0.78 μs
  Vector<float, 128>  : 4.72 μs
  Vector<float, 384>  : 15.38 μs
  Vector<float, 768>  : 32.43 μs
  Vector<float, 1536> : 63.74 μs
  Vector<double, 128> : 4.83 μs
  Vector<int, 128>    : 2.27 μs

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
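
The measurement approach described in this commit message (CPU pinning, iteration counts scaled by vector size, per-operation time in microseconds) can be sketched roughly as follows. This is an assumption about the general shape, not the contents of benchmarks/vector_deserialize.py: pin_to_cpu, iterations_for and time_per_op_us are hypothetical helpers, and the scaling constants are arbitrary.

```python
import os
import struct
import time

def pin_to_cpu():
    """Best-effort pinning to one currently allowed CPU (Linux-only API)."""
    if not hasattr(os, 'sched_setaffinity'):
        return False
    try:
        os.sched_setaffinity(0, {min(os.sched_getaffinity(0))})
    except OSError:
        return False
    return True

def iterations_for(vector_size, budget=200_000):
    """Use fewer iterations for larger vectors so each run does similar work."""
    return max(100, budget // max(1, vector_size))

def time_per_op_us(fn, data, iterations):
    """Average wall-clock time of fn(data) in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn(data)
    return (time.perf_counter() - start) / iterations * 1e6

pin_to_cpu()
for size in (3, 128, 1536):
    payload = struct.pack('>%df' % size, *([1.0] * size))
    unpack = struct.Struct('>%df' % size).unpack
    per_op = time_per_op_us(unpack, payload, iterations_for(size))
    print('Vector<float, %d>: %.2f us' % (size, per_op))
```

The printed numbers depend on the machine and are not comparable to the baseline figures quoted in the commit message.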

Vector type is supported on Scylla 2025.4 and above. Enable the integration tests. Tested locally against both 2025.4.2 and 2026.1 and they pass. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

…numpy large vector deserialization

Add test_vector_cython_deserializer_variable_size_subtype to verify that DesVectorType correctly raises ValueError for variable-size subtypes (e.g. UTF8Type) and that the pure Python path handles them.

Add test_vector_numpy_large_deserialization to exercise the numpy deserialization path for vectors with >= 32 elements across all supported numeric types (float, double, int32, int64).

Copilot

Pull request overview

Adds new benchmark and test coverage around VectorType deserialization, and refreshes integration test formatting to support vector-related testing scenarios.

Changes:

- Add a new benchmarks/vector_deserialize.py harness comparing multiple vector deserialization strategies across sizes/types.
- Add unit tests for VectorType large-vector deserialization and intended Cython fallback behavior.
- Reformat/clean up tests/integration/standard/test_types.py (imports/string literals/line wrapping) and keep vector test class enabled via @requires_vector_type.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
tests/unit/test_types.py	Adds new unit tests for vector deserialization behavior (including a Cython-deserializer expectation).
tests/integration/standard/test_types.py	Largely formatting/refactoring; keeps/organizes vector integration tests under `@requires_vector_type`.
benchmarks/vector_deserialize.py	New benchmark script to measure vector deserialization performance across approaches and configurations.

Copilot · 2026-03-08T20:41:09Z

tests/unit/test_types.py

+            from cassandra.deserializers import find_deserializer
+        except ImportError:
+            self.skipTest('Cython deserializers not available')
+
+        vt_text = VectorType.apply_parameters(['UTF8Type', 3], {})
+        des_text = find_deserializer(vt_text)
+        self.assertEqual(des_text.__class__.__name__, 'DesVectorType')
+
+        # Cython path should raise for variable-size subtypes
+        data = vt_text.serialize(['abc', 'def', 'ghi'], 5)
+        with self.assertRaises(ValueError) as cm:
+            des_text.deserialize_bytes(data, 5)
+        self.assertIn('variable-size subtype', str(cm.exception))


find_deserializer() in cassandra/deserializers.pyx has no DesVectorType implementation (it falls back to GenericDeserializer for VectorType), and the returned deserializer classes also don’t expose a deserialize_bytes() Python method. As written, this test will fail when Cython deserializers are available (asserting DesVectorType and calling deserialize_bytes). Update the test to reflect the actual public API (or add a real DesVectorType + a Python-callable bytes entrypoint) and make the test skip/adjust when the vector-specific Cython deserializer isn’t present.
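
One way to make such a test tolerant of builds without a vector-specific Cython deserializer is to check both the class name and the entrypoint before asserting on behaviour. The sketch below uses stand-in classes rather than the real cassandra.deserializers module, so apart from the names DesVectorType, GenericDeserializer and deserialize_bytes (taken from the comment above) everything in it is hypothetical.

```python
import unittest

class GenericDeserializer:
    """Stand-in for a fallback deserializer with no bytes entrypoint."""

class DesVectorType:
    """Stand-in for a vector deserializer that rejects variable-size subtypes."""
    def deserialize_bytes(self, data, protocol_version):
        raise ValueError('variable-size subtype not supported')

def supports_vector_bytes(deserializer):
    """True only for a DesVectorType that exposes a callable deserialize_bytes."""
    return (type(deserializer).__name__ == 'DesVectorType'
            and callable(getattr(deserializer, 'deserialize_bytes', None)))

class VectorDeserializerTest(unittest.TestCase):
    deserializer = GenericDeserializer()   # swap in the object under test

    def test_variable_size_subtype(self):
        if not supports_vector_bytes(self.deserializer):
            self.skipTest('vector-specific Cython deserializer not available')
        with self.assertRaises(ValueError):
            self.deserializer.deserialize_bytes(b'', 5)
```

With the fallback object the test is reported as skipped instead of failing on the class-name assertion.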

Copilot · 2026-03-08T20:41:10Z

tests/unit/test_types.py

+        @test_category data_types:vector
+        """
+        import struct
+        from cassandra.cqltypes import DoubleType


Unused import: DoubleType is imported but never referenced in this test. Please remove it to keep the test module clean.

Suggested change

-        from cassandra.cqltypes import DoubleType

Copilot · 2026-03-08T20:41:10Z

tests/integration/standard/test_types.py

+        if set(COLLECTION_TYPES) != set(["tuple", "list", "map", "set"]):
+            raise NotImplemented(
+                "Missing datatype not implemented: {}".format(
+                    set(COLLECTION_TYPES) - set(["tuple", "list", "map", "set"])
+                )
+            )


raise NotImplemented(...) will itself raise a TypeError because NotImplemented is a sentinel, not an exception type. If this guard is meant to fail the test when unexpected collection types are added, raise NotImplementedError instead.
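
A small standalone demonstration of the difference, using a hypothetical check_collection_types helper in place of the test's inline guard:

```python
EXPECTED = {"tuple", "list", "map", "set"}

def check_collection_types(collection_types):
    """Fail loudly if the collection types differ from the ones the test covers."""
    if set(collection_types) != EXPECTED:
        raise NotImplementedError(
            "Missing datatype not implemented: {}".format(
                sorted(set(collection_types) - EXPECTED)
            )
        )

# NotImplemented is a sentinel value, so calling it fails before any raise runs.
try:
    NotImplemented("boom")
except TypeError as exc:
    sentinel_error = exc
```

With NotImplementedError the guard reports the unexpected type name; with the sentinel, the only symptom is an unrelated TypeError.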

Copilot · 2026-03-08T20:41:10Z

benchmarks/vector_deserialize.py

+
+# Add parent directory to path
+sys.path.insert(0, '.')


sys.path.insert(0, '.') makes the benchmark’s imports depend on the current working directory (it will break if invoked from a different cwd). Prefer deriving the repo root from __file__ (as benchmarks/base.py does) and appending that path instead.

Suggested change

-# Add parent directory to path
-sys.path.insert(0, '.')
+import os
+# Add repository root (parent of this file's directory) to sys.path
+_this_dir = os.path.dirname(os.path.abspath(__file__))
+_repo_root = os.path.dirname(_this_dir)
+if _repo_root not in sys.path:
+    sys.path.append(_repo_root)

Copilot · 2026-03-08T20:41:10Z

benchmarks/vector_deserialize.py

+    elif element_type == ShortType:
+        values = list(range(min(vector_size, 32767)))
+        pack_fn = int16_pack
+    else:


create_test_data() returns fewer than vector_size elements for ShortType when vector_size > 32767 (range(min(vector_size, 32767))), which can silently produce undersized serialized buffers and misleading benchmark results. Either always generate exactly vector_size values (wrapping/clamping into the valid smallint range) or raise when an unsupported vector size is requested for ShortType.
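
Both options mentioned here can be sketched in a few lines. The helper names below are hypothetical and are not part of the benchmark; the wrapping variant reuses values once the non-negative smallint range is exhausted.

```python
import struct

INT16_MAX = 32767

def make_short_values(vector_size):
    """Return exactly vector_size ints, wrapping into the range 0..32767."""
    span = INT16_MAX + 1
    return [i % span for i in range(vector_size)]

def make_short_values_strict(vector_size):
    """Alternative: refuse sizes that cannot hold distinct smallint values."""
    if vector_size > INT16_MAX + 1:
        raise ValueError('vector_size %d exceeds smallint range' % vector_size)
    return list(range(vector_size))

values = make_short_values(40000)
payload = struct.pack('>%dh' % len(values), *values)   # 2 bytes per element
```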

Copilot · 2026-03-08T20:41:11Z

benchmarks/vector_deserialize.py

+    try:
+        from cassandra.deserializers import find_deserializer
+    except ImportError:
+        return None, None, None
+
+    protocol_version = 4
+
+    # Get the Cython deserializer
+    deserializer = find_deserializer(vector_type)
+
+    # Check if we got the Cython deserializer
+    if deserializer.__class__.__name__ != 'DesVectorType':
+        return None, None, None
+
+    start = time.perf_counter()
+    for _ in range(iterations):
+        result = deserializer.deserialize_bytes(serialized_data, protocol_version)
+    end = time.perf_counter()


cassandra.deserializers.Deserializer (the Cython extension in this repo) does not expose a Python-callable deserialize_bytes() method, so this benchmark’s Cython path won’t work as written (it will either never run because there is no DesVectorType, or it will raise AttributeError if such a class is added without that method). Consider benchmarking via an existing public entrypoint (e.g., the row parser / protocol decoding path) or add an explicit Python wrapper method in the Cython deserializer API.
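
Until such a wrapper exists, the benchmark could probe for the method instead of assuming it. A minimal sketch of that guard follows; benchmark_optional_method and the stand-in classes are hypothetical, and the meaning given to the three returned fields is an assumption that only mirrors the `return None, None, None` convention in the snippet above.

```python
import time

def benchmark_optional_method(obj, method_name, data, protocol_version, iterations):
    """Time obj.<method_name>(data, protocol_version), or skip if it is missing.

    Returns (elapsed_seconds, per_op_microseconds, last_result); all three are
    None when the method is not exposed. The field meanings are an assumption.
    """
    method = getattr(obj, method_name, None)
    if not callable(method):
        return None, None, None
    result = None
    start = time.perf_counter()
    for _ in range(iterations):
        result = method(data, protocol_version)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / iterations * 1e6, result

class WithoutEntrypoint:
    """Stand-in for a deserializer that has no Python-callable method."""

class WithEntrypoint:
    """Stand-in exposing a trivial deserialize_bytes()."""
    def deserialize_bytes(self, data, protocol_version):
        return list(data)

skipped = benchmark_optional_method(WithoutEntrypoint(), 'deserialize_bytes', b'\x01\x02', 4, 10)
timed = benchmark_optional_method(WithEntrypoint(), 'deserialize_bytes', b'\x01\x02', 4, 10)
```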

Adds benchmarks/vector_serialize.py mirroring the existing deserialization benchmark (vector_deserialize.py).

Tests four serialization strategies:
1. Current VectorType.serialize() baseline (io.BytesIO per-element loop)
2. Python struct.pack with batch format string (e.g., '>1536f')
3. Cython SerVectorType serializer (placeholder, not yet implemented)
4. BoundStatement.bind() end-to-end with 1 vector column

Covers float, double, and int32 subtypes at dimensions 3, 128, 768, 1536.

Initial results show struct.pack is 5-11x faster than the io.BytesIO baseline, confirming the opportunity for Cython serialization optimization.
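
For orientation, the first two strategies in this commit message have roughly the following shape. This is a standalone sketch and not the code of VectorType.serialize() or of the benchmark; the function names are hypothetical and a float subtype is assumed.

```python
import io
import struct

def serialize_per_element(values):
    """Baseline shape: one struct.pack and one BytesIO write per element."""
    buf = io.BytesIO()
    for v in values:
        buf.write(struct.pack('>f', v))
    return buf.getvalue()

def serialize_batch(values):
    """Pack the whole vector with one format string, e.g. '>1536f'."""
    return struct.pack('>%df' % len(values), *values)

values = [i * 0.25 for i in range(1536)]
# Both approaches must produce identical bytes for the comparison to be fair.
assert serialize_per_element(values) == serialize_batch(values)
```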

mykaul added 4 commits March 7, 2026 12:00

benchmarks: expand vector sizes

4fa93b6

tests: enable vector integration tests on Scylla 2025.4+

05219ea

mykaul marked this pull request as draft March 7, 2026 10:23

mykaul requested a review from Copilot March 8, 2026 20:36

Copilot started reviewing on behalf of mykaul March 8, 2026 20:37

Copilot AI reviewed Mar 8, 2026

mykaul mentioned this pull request Mar 14, 2026

Tracking: Vector search (VectorType) performance improvement PRs #746

Open

mykaul mentioned this pull request Mar 14, 2026

(improvement) serializers: add Cython-optimized serialization for VectorType #748

Draft

8 tasks

mykaul wants to merge 5 commits into scylladb:master from mykaul:vector-tests-benchmarks
