tests/benchmarks: Add VectorType deserialization benchmarks and expand test coverage#733

Draft
mykaul wants to merge 5 commits into scylladb:master from mykaul:vector-tests-benchmarks
mykaul commented Mar 7, 2026

Summary

  • Add VectorType deserialization benchmark harness testing 4 strategies across multiple vector sizes and types
  • Expand benchmark configurations to include larger vector sizes and more type combinations
  • Enable vector integration tests on Scylla 2025.4+
  • Add unit test coverage for variable-size VectorType Cython fallback and numpy large vector deserialization

Commits (4)

1. benchmarks: Add VectorType deserialization performance benchmark

New benchmarks/vector_deserialize.py (320 lines) testing:

  • 4 strategies: VectorType.deserialize(), raw struct.unpack, numpy.frombuffer().tolist(), Cython DesVectorType
  • Vector sizes: 3, 4, 128, 384, 768, 1536 (float); 128 (double, int)
  • Iteration counts scaled by vector size for stable measurements
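
A minimal sketch of two of the strategies named above (element-by-element versus one bulk `struct.unpack` call), with iteration counts scaled by vector size. It assumes big-endian float32 elements; the helper names and the `iterations_for` scaling rule are hypothetical and are not taken from `benchmarks/vector_deserialize.py`.

```python
import struct
import timeit


def serialize_floats(values):
    """Pack floats as big-endian float32 (layout assumed for this sketch)."""
    return struct.pack(">%df" % len(values), *values)


def deserialize_elementwise(data, size):
    """Baseline: unpack one 4-byte element at a time."""
    return [struct.unpack_from(">f", data, i * 4)[0] for i in range(size)]


def deserialize_bulk(data, size):
    """Bulk: a single struct.unpack call with a counted format string."""
    return list(struct.unpack(">%df" % size, data))


def iterations_for(size, budget=200_000):
    """Hypothetical scaling rule: fewer iterations for larger vectors."""
    return max(100, budget // size)


def time_per_op_us(fn, data, size):
    """Average time of one call to fn, in microseconds."""
    n = iterations_for(size)
    total = timeit.timeit(lambda: fn(data, size), number=n)
    return total / n * 1e6


if __name__ == "__main__":
    for size in (3, 128, 1536):
        data = serialize_floats([float(i) for i in range(size)])
        for fn in (deserialize_elementwise, deserialize_bulk):
            print(size, fn.__name__, round(time_per_op_us(fn, data, size), 2), "us")
```

Both functions return the same list for the same buffer, so the timing difference reflects only the per-call overhead of unpacking one element at a time.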

2. benchmarks: expand vector sizes

Add double[768], double[1536], int32[64] configurations.

3. tests: enable vector integration tests on Scylla 2025.4+

Re-enable vector integration tests that were previously skipped for Scylla. Tested against Scylla 2025.4.2 and 2026.1.

4. tests: add coverage for variable-size VectorType Cython fallback and numpy large vector deserialization

  • Test that DesVectorType raises ValueError for variable-size subtypes (UTF8Type) while pure Python handles them
  • Exercise the numpy deserialization path for 64-element vectors across float, double, int32, int64

No production code changes — benchmark and test files only.

mykaul added 4 commits March 7, 2026 12:00
Add comprehensive benchmark comparing different deserialization strategies
for VectorType with various numeric types and vector sizes.

The benchmark measures:
- Current element-by-element baseline
- struct.unpack bulk deserialization
- numpy frombuffer with tolist()
- numpy frombuffer zero-copy approach

Tested with common ML/AI embedding dimensions:
- Small vectors: 3-4 elements
- Medium vectors: 128-384 elements
- Large vectors: 768-1536 elements

Usage:
  export CASS_DRIVER_NO_CYTHON=1  # Test pure Python implementation
  python benchmarks/vector_deserialize.py

Includes CPU pinning for consistent measurements and result verification
to ensure correctness of all optimization approaches.
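
As an illustration of the CPU pinning and result verification mentioned above, here is a rough sketch. It assumes a platform where `os.sched_setaffinity` exists (Linux) and falls back to doing nothing elsewhere; the function names are hypothetical and the benchmark script may do this differently.

```python
import math
import os


def pin_to_one_cpu():
    """Pin this process to a single allowed CPU; return its id, or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform (e.g. macOS, Windows)
    try:
        cpu = min(os.sched_getaffinity(0))  # pick a CPU we are allowed to use
        os.sched_setaffinity(0, {cpu})
        return cpu
    except OSError:
        return None


def results_match(expected, actual, rel_tol=1e-6):
    """Compare two numeric sequences element by element with a tolerance."""
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=1e-12)
        for e, a in zip(expected, actual)
    )
```

A tolerance is used rather than exact equality because values that round-trip through float32 generally do not compare equal to the original Python floats.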

Baseline Performance (per-operation deserialization time):
  Vector<float, 3>     :  0.88 μs
  Vector<float, 4>     :  0.78 μs
  Vector<float, 128>   :  4.72 μs
  Vector<float, 384>   : 15.38 μs
  Vector<float, 768>   : 32.43 μs
  Vector<float, 1536>  : 63.74 μs
  Vector<double, 128>  :  4.83 μs
  Vector<int, 128>     :  2.27 μs

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Vector type is supported on Scylla 2025.4 and above.
Enable the integration tests.

Tested locally against both 2025.4.2 and 2026.1 and they pass.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

…numpy large vector deserialization

Add test_vector_cython_deserializer_variable_size_subtype to verify that
DesVectorType correctly raises ValueError for variable-size subtypes
(e.g. UTF8Type) and that the pure Python path handles them.

Add test_vector_numpy_large_deserialization to exercise the numpy
deserialization path for vectors with >= 32 elements across all supported
numeric types (float, double, int32, int64).
@mykaul mykaul marked this pull request as draft March 7, 2026 10:23
@mykaul mykaul requested a review from Copilot March 8, 2026 20:36
Copilot AI left a comment

Pull request overview

Adds new benchmark and test coverage around VectorType deserialization, and refreshes integration test formatting to support vector-related testing scenarios.

Changes:

  • Add a new benchmarks/vector_deserialize.py harness comparing multiple vector deserialization strategies across sizes/types.
  • Add unit tests for VectorType large-vector deserialization and intended Cython fallback behavior.
  • Reformat/clean up tests/integration/standard/test_types.py (imports/string literals/line wrapping) and keep vector test class enabled via @requires_vector_type.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/unit/test_types.py | Adds new unit tests for vector deserialization behavior (including a Cython-deserializer expectation). |
| tests/integration/standard/test_types.py | Largely formatting/refactoring; keeps/organizes vector integration tests under @requires_vector_type. |
| benchmarks/vector_deserialize.py | New benchmark script to measure vector deserialization performance across approaches and configurations. |

Comment on lines +542 to +554
    from cassandra.deserializers import find_deserializer
except ImportError:
    self.skipTest('Cython deserializers not available')

vt_text = VectorType.apply_parameters(['UTF8Type', 3], {})
des_text = find_deserializer(vt_text)
self.assertEqual(des_text.__class__.__name__, 'DesVectorType')

# Cython path should raise for variable-size subtypes
data = vt_text.serialize(['abc', 'def', 'ghi'], 5)
with self.assertRaises(ValueError) as cm:
    des_text.deserialize_bytes(data, 5)
self.assertIn('variable-size subtype', str(cm.exception))
Copilot AI Mar 8, 2026

find_deserializer() in cassandra/deserializers.pyx has no DesVectorType implementation (it falls back to GenericDeserializer for VectorType), and the returned deserializer classes also don’t expose a deserialize_bytes() Python method. As written, this test will fail when Cython deserializers are available (asserting DesVectorType and calling deserialize_bytes). Update the test to reflect the actual public API (or add a real DesVectorType + a Python-callable bytes entrypoint) and make the test skip/adjust when the vector-specific Cython deserializer isn’t present.

@test_category data_types:vector
"""
import struct
from cassandra.cqltypes import DoubleType
Copilot AI Mar 8, 2026

Unused import: DoubleType is imported but never referenced in this test. Please remove it to keep the test module clean.

Suggested change
from cassandra.cqltypes import DoubleType

Comment on lines +725 to +730
if set(COLLECTION_TYPES) != set(["tuple", "list", "map", "set"]):
    raise NotImplemented(
        "Missing datatype not implemented: {}".format(
            set(COLLECTION_TYPES) - set(["tuple", "list", "map", "set"])
        )
    )
Copilot AI Mar 8, 2026

raise NotImplemented(...) will itself raise a TypeError because NotImplemented is a sentinel, not an exception type. If this guard is meant to fail the test when unexpected collection types are added, raise NotImplementedError instead.
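
The difference can be reproduced in isolation. The sketch below uses hypothetical function names and a hard-coded expected set; it only mirrors the shape of the guard quoted above.

```python
EXPECTED_COLLECTION_TYPES = {"tuple", "list", "map", "set"}


def guard_broken(collection_types):
    """Mirrors the quoted guard: calling NotImplemented raises TypeError."""
    if set(collection_types) != EXPECTED_COLLECTION_TYPES:
        raise NotImplemented(  # NotImplemented is a sentinel, not callable
            "Missing datatype not implemented: {}".format(
                set(collection_types) - EXPECTED_COLLECTION_TYPES
            )
        )


def guard_fixed(collection_types):
    """Same guard, raising the exception type that was intended."""
    if set(collection_types) != EXPECTED_COLLECTION_TYPES:
        raise NotImplementedError(
            "Missing datatype not implemented: {}".format(
                set(collection_types) - EXPECTED_COLLECTION_TYPES
            )
        )
```

With an unexpected type in the input, `guard_broken` fails with a `TypeError` that hides the intended message, while `guard_fixed` raises `NotImplementedError` naming the extra type.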

Comment on lines +31 to +33

# Add parent directory to path
sys.path.insert(0, '.')
Copilot AI Mar 8, 2026

sys.path.insert(0, '.') makes the benchmark’s imports depend on the current working directory (it will break if invoked from a different cwd). Prefer deriving the repo root from __file__ (as benchmarks/base.py does) and appending that path instead.

Suggested change
# Add parent directory to path
sys.path.insert(0, '.')
import os
# Add repository root (parent of this file's directory) to sys.path
_this_dir = os.path.dirname(os.path.abspath(__file__))
_repo_root = os.path.dirname(_this_dir)
if _repo_root not in sys.path:
    sys.path.append(_repo_root)

Comment on lines +53 to +56
elif element_type == ShortType:
    values = list(range(min(vector_size, 32767)))
    pack_fn = int16_pack
else:
Copilot AI Mar 8, 2026

create_test_data() returns fewer than vector_size elements for ShortType when vector_size > 32767 (range(min(vector_size, 32767))), which can silently produce undersized serialized buffers and misleading benchmark results. Either always generate exactly vector_size values (wrapping/clamping into the valid smallint range) or raise when an unsupported vector size is requested for ShortType.
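
One way to implement the first option (always generate exactly vector_size values, wrapped into the signed 16-bit range) is sketched below. `smallint_values` and `pack_smallints` are hypothetical helpers, not functions from the benchmark, and big-endian packing is an assumption.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def smallint_values(vector_size):
    """Return exactly vector_size ints, wrapping around the int16 range."""
    span = INT16_MAX - INT16_MIN + 1  # 65536 distinct values
    return [((i - INT16_MIN) % span) + INT16_MIN for i in range(vector_size)]


def pack_smallints(values):
    """Pack the values as big-endian signed 16-bit integers."""
    return struct.pack(">%dh" % len(values), *values)
```

The sequence counts up from 0 to 32767, wraps to -32768, and continues, so the serialized buffer always holds `2 * vector_size` bytes.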

Comment on lines +148 to +165
try:
    from cassandra.deserializers import find_deserializer
except ImportError:
    return None, None, None

protocol_version = 4

# Get the Cython deserializer
deserializer = find_deserializer(vector_type)

# Check if we got the Cython deserializer
if deserializer.__class__.__name__ != 'DesVectorType':
    return None, None, None

start = time.perf_counter()
for _ in range(iterations):
    result = deserializer.deserialize_bytes(serialized_data, protocol_version)
end = time.perf_counter()
Copilot AI Mar 8, 2026

cassandra.deserializers.Deserializer (the Cython extension in this repo) does not expose a Python-callable deserialize_bytes() method, so this benchmark’s Cython path won’t work as written (it will either never run because there is no DesVectorType, or it will raise AttributeError if such a class is added without that method). Consider benchmarking via an existing public entrypoint (e.g., the row parser / protocol decoding path) or add an explicit Python wrapper method in the Cython deserializer API.
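
Independently of which entrypoint is chosen, the benchmark could skip the strategy when the method is missing instead of failing with AttributeError. The sketch below is a generic guard with a hypothetical function name; `deserialize_bytes` is the method name used in the PR, and the sketch does not assume that any real deserializer class provides it.

```python
import time


def benchmark_bytes_entrypoint(deserializer, serialized_data, iterations,
                               protocol_version=4):
    """Time deserializer.deserialize_bytes if present; return None otherwise."""
    method = getattr(deserializer, "deserialize_bytes", None)
    if not callable(method):
        return None  # skip this strategy rather than raise AttributeError
    start = time.perf_counter()
    for _ in range(iterations):
        method(serialized_data, protocol_version)
    end = time.perf_counter()
    return (end - start) / iterations  # seconds per call
```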

Adds benchmarks/vector_serialize.py mirroring the existing deserialization
benchmark (vector_deserialize.py). Tests four serialization strategies:

1. Current VectorType.serialize() baseline (io.BytesIO per-element loop)
2. Python struct.pack with batch format string (e.g., '>1536f')
3. Cython SerVectorType serializer (placeholder, not yet implemented)
4. BoundStatement.bind() end-to-end with 1 vector column

Covers float, double, and int32 subtypes at dimensions 3, 128, 768, 1536.
Initial results show struct.pack is 5-11x faster than the io.BytesIO baseline,
confirming the opportunity for Cython serialization optimization.
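
For illustration, strategies 1 and 2 can be approximated as follows. This is a simplified sketch assuming big-endian float32 elements; it is not the driver's `VectorType.serialize()` code, and the function names are hypothetical.

```python
import io
import struct


def serialize_per_element(values):
    """Approximation of the baseline: write each element into an io.BytesIO."""
    buf = io.BytesIO()
    for value in values:
        buf.write(struct.pack(">f", value))
    return buf.getvalue()


def serialize_batch(values):
    """Batch variant: one struct.pack call with a counted format, e.g. '>1536f'."""
    return struct.pack(">%df" % len(values), *values)
```

Both functions produce identical bytes; the batch variant replaces one `struct.pack` call and one buffer write per element with a single call for the whole vector.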