
(improvement) Optimize VectorType deserialization with struct.unpack and numpy #730

Draft
mykaul wants to merge 2 commits into scylladb:master from mykaul:vector-struct-numpy-deser

Conversation

@mykaul

@mykaul mykaul commented Mar 7, 2026

Summary

  • Replace element-by-element VectorType deserialization with bulk struct.unpack for known numeric types (float, double, int32, int64, short), caching a struct.Struct object at type-creation time
  • Add numpy fast-path (np.frombuffer().tolist()) for vectors with >= 32 elements, delivering ~4x speedup for 768/1536-dimension float vectors

Performance (pure Python path, CASS_DRIVER_NO_CYTHON=1)

| Vector Config | Before | After (struct) | After (numpy) | Total Speedup |
|---|---|---|---|---|
| Vector<float, 3> | 0.88 µs | 0.25 µs | — (uses struct) | 3.58x |
| Vector<float, 128> | 4.72 µs | 4.06 µs | 1.87 µs | 2.5x |
| Vector<float, 768> | 32.43 µs | 30.72 µs | 8.45 µs | 3.8x |
| Vector<float, 1536> | 63.74 µs | 63.24 µs | 15.77 µs | 4.0x |

Details

Commit 1 — struct.unpack optimization:

  • At apply_parameters() time, cache a struct.Struct for the vector's subtype and dimension (e.g. '>768f' for Vector<float, 768>)
  • deserialize() becomes a single C-level bulk unpack: list(cached_struct.unpack(byts))
  • Serialization is likewise optimized via cached_struct.pack(*v)
  • Fallback for non-numeric fixed-size types uses pre-allocated result list + cached method reference
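The caching idea above can be sketched in isolation. This is a minimal illustration, not the driver's actual class layout: `FloatVectorType` here is a hypothetical stand-in that caches one `struct.Struct` per dimension and (un)packs in a single call.

```python
import struct

class FloatVectorType:
    """Hypothetical sketch of a fixed-dimension float vector codec."""

    def __init__(self, dimension):
        # '>%df' = big-endian, <dimension> IEEE-754 single-precision floats,
        # compiled once at type-creation time and reused for every row
        self._vector_struct = struct.Struct('>%df' % dimension)

    def serialize(self, values):
        # one bulk pack instead of per-element serialization
        return self._vector_struct.pack(*values)

    def deserialize(self, byts):
        # one C-level bulk unpack instead of N per-element calls
        return list(self._vector_struct.unpack(byts))

vec = FloatVectorType(3)
data = vec.serialize([1.0, 2.0, 3.0])
print(vec.deserialize(data))  # [1.0, 2.0, 3.0]
```

Compiling the `Struct` once moves format-string parsing out of the per-row hot path, which is where the small-vector wins come from.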

Commit 2 — numpy for large vectors:

  • For vectors >= 32 elements with a known numeric dtype, use np.frombuffer(byts, dtype='>f4', count=N).tolist()
  • numpy avoids intermediate Python object creation during unpacking; .tolist() batch-converts with better cache locality
  • Threshold of 32 chosen empirically: below this, struct.unpack is faster due to lower fixed overhead
  • _numpy_dtype cached on the class at type-creation time
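The hybrid dispatch described above can be condensed into a few lines. A hedged sketch, assuming the float case only: `'>f4'` is big-endian float32, matching the wire format of Vector<float, N>, and the 32-element threshold mirrors the one quoted in this PR.

```python
import struct
import numpy as np

NUMPY_THRESHOLD = 32  # below this, struct.unpack wins on lower fixed overhead

def deserialize_float_vector(byts, n):
    """Sketch of the hybrid fast path for a float vector of n elements."""
    if n >= NUMPY_THRESHOLD:
        # frombuffer reads the bytes without copying; tolist() batch-converts
        # to Python floats with better cache locality than a Python loop
        return np.frombuffer(byts, dtype='>f4', count=n).tolist()
    # small vectors: single bulk unpack
    return list(struct.unpack('>%df' % n, byts))

payload = struct.pack('>64f', *range(64))
print(deserialize_float_vector(payload, 64)[:4])  # [0.0, 1.0, 2.0, 3.0]
```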

Both commits modify only cassandra/cqltypes.py. No Cython dependency.

mykaul added 2 commits March 7, 2026 12:00
…ct.unpack

Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.

Optimized types:
- FloatType  ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type  ('>Ni' format)
- LongType   ('>Nq' format)
- ShortType  ('>Nh' format)
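The subtype-to-format mapping above can be expressed as a small lookup table. A sketch under the assumption that type names are keyed as strings; `_STRUCT_FORMATS` and `make_vector_struct` are illustrative names, not the driver's.

```python
import struct

# Big-endian (network order) struct format characters for the fixed-size
# numeric subtypes listed in the commit message.
_STRUCT_FORMATS = {
    'FloatType':  'f',  # 4-byte IEEE-754 single
    'DoubleType': 'd',  # 8-byte IEEE-754 double
    'Int32Type':  'i',  # 4-byte signed int
    'LongType':   'q',  # 8-byte signed int
    'ShortType':  'h',  # 2-byte signed int
}

def make_vector_struct(subtype_name, dimension):
    """Return a cached-style Struct for a supported subtype, else None."""
    fmt = _STRUCT_FORMATS.get(subtype_name)
    return struct.Struct('>%d%s' % (dimension, fmt)) if fmt else None

print(make_vector_struct('Int32Type', 4).size)  # 16
```

Returning None for unsupported subtypes lets the caller fall through to the element-by-element path.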

Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):

Small vectors (3-4 elements):
  Vector<float, 3>  : 0.88 μs → 0.25 μs  (3.58x faster)
  Vector<float, 4>  : 0.78 μs → 0.28 μs  (2.79x faster)

Medium vectors (128 elements):
  Vector<float, 128>  : 4.72 μs → 4.06 μs  (1.16x faster)
  Vector<double, 128> : 4.83 μs → 4.01 μs  (1.20x faster)
  Vector<int, 128>    : 2.27 μs → 1.25 μs  (1.82x faster)

Large vectors (384-1536 elements):
  Vector<float, 384>  : 15.38 μs → 14.67 μs  (1.05x faster)
  Vector<float, 768>  : 32.43 μs → 30.72 μs  (1.06x faster)
  Vector<float, 1536> : 63.74 μs → 63.24 μs  (1.01x faster)

The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: ~1.2x speedup

For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.

Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.
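The fallback path reads a length prefix per element and slices accordingly. A hedged sketch: for simplicity, a 1-byte length prefix stands in for the driver's unsigned-vint encoding (uvint_unpack), and `deserialize_variable` is an illustrative name.

```python
def deserialize_variable(byts, n, element_deser):
    """Element-by-element fallback for variable-size subtypes.

    Each element is preceded by its encoded length; a 1-byte prefix is
    used here as a stand-in for the driver's uvint encoding.
    """
    rv = []
    idx = 0
    for i in range(n):
        try:
            size = byts[idx]          # stand-in for uvint_unpack(byts[idx:])
            idx += 1
            rv.append(element_deser(byts[idx:idx + size]))
            idx += size
        except (IndexError, KeyError):
            raise ValueError(
                "Error reading additional data during vector "
                "deserialization after successfully adding %d elements" % i)
    return rv

data = bytes([3]) + b'foo' + bytes([2]) + b'hi'
print(deserialize_variable(data, 2, lambda b: b.decode()))  # ['foo', 'hi']
```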

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

For vectors with 32 or more elements, use numpy.frombuffer() which provides
1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.

The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)

Threshold of 32 elements balances code complexity with performance gains.

Benchmark results:
- float[128]:  2.15 μs → 1.87 μs (1.15x faster)
- float[384]:  6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)
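Numbers like these can be reproduced with a timeit micro-benchmark. A sketch, assuming float[768]; absolute timings will differ by machine, so only the relative ordering is meaningful.

```python
import struct
import timeit
import numpy as np

N = 768
payload = struct.pack('>%df' % N, *([0.5] * N))
s = struct.Struct('>%df' % N)  # cached, as in commit 1

# per-call time in µs: total seconds / 10_000 calls * 1e6
t_struct = timeit.timeit(lambda: list(s.unpack(payload)), number=10_000)
t_numpy = timeit.timeit(
    lambda: np.frombuffer(payload, dtype='>f4', count=N).tolist(),
    number=10_000)
print('struct: %.2f µs/call, numpy: %.2f µs/call' % (t_struct * 100, t_numpy * 100))
```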

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Copilot AI left a comment

Pull request overview

This PR optimizes VectorType (de)serialization in cassandra/cqltypes.py by introducing bulk numeric (de)serialization via a cached struct.Struct, and an optional numpy-based deserialization fast path for larger vectors.

Changes:

  • Cache a per-parameterized-vector struct.Struct to bulk unpack/pack common numeric vector subtypes.
  • Add an optional numpy frombuffer(...).tolist() deserialization fast-path for vectors with vector_size >= 32.
  • Refactor variable-size vector deserialization to a fixed-iteration loop with stricter bounds checks.


Comment on lines +56 to 57

    import numpy as np
Comment on lines 1500 to +1504

        try:
            size, bytes_read = uvint_unpack(byts[idx:])
            idx += bytes_read
            rv.append(cls.subtype.deserialize(byts[idx:idx + size], protocol_version))
            idx += size
    -   except:
    +   except (IndexError, KeyError):
            raise ValueError("Error reading additional data during vector deserialization after successfully adding {} elements"\
    -           .format(len(rv)))
    +           .format(i))
Comment on lines +1476 to +1479

    if cls._vector_struct is not None:
        if HAVE_NUMPY and cls.vector_size >= 32 and cls._numpy_dtype is not None:
            return np.frombuffer(byts, dtype=cls._numpy_dtype, count=cls.vector_size).tolist()
        return list(cls._vector_struct.unpack(byts))
