(improvement) serializers: add Cython-optimized serialization for VectorType #748

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/cython-serializers


mykaul commented Mar 14, 2026

Summary

Adds cassandra/serializers.pyx and cassandra/serializers.pxd implementing Cython-optimized serialization that mirrors the deserializers.pyx architecture.

What's included

  • Scalar serializers: SerFloatType (4-byte IEEE 754), SerDoubleType (8-byte), SerInt32Type (4-byte signed) — the three subtypes commonly used in vector columns
  • SerVectorType: Pre-allocates a contiguous char * buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes
  • GenericSerializer: Delegates to the Python-level cqltype.serialize() classmethod for all other types
  • Factory functions: find_serializer(cqltype) and make_serializers(cqltypes_list) for easy lookup and batch creation
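
For reference, the wire format these serializers target can be sketched in pure Python with `struct`. This is only an illustration, not the Cython code from the PR: the function names `ser_float`, `ser_double`, `ser_int32` and `ser_vector` are hypothetical, and the sketch assumes that a vector of fixed-size elements is the plain concatenation of its big-endian elements with no per-element length prefix.

```python
import struct

# Hypothetical pure-Python stand-ins for the scalar serializers named above.
def ser_float(value):
    # 4-byte big-endian IEEE 754 single precision
    return struct.pack(">f", value)

def ser_double(value):
    # 8-byte big-endian IEEE 754 double precision
    return struct.pack(">d", value)

def ser_int32(value):
    # 4-byte big-endian signed integer
    return struct.pack(">i", value)

def ser_vector(values, fmt_char, size):
    # Assumption: fixed-size elements are concatenated without length prefixes.
    # fmt_char is a struct format character: "f", "d" or "i".
    if len(values) != size:
        raise ValueError("expected %d elements, got %d" % (size, len(values)))
    return struct.pack(">%d%s" % (size, fmt_char), *values)
```

Under that assumption a `Vector<float, 1536>` value occupies 1536 * 4 = 6144 bytes, which is the size of the contiguous buffer a specialized serializer would allocate up front.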

Architecture

Mirrors deserializers.pyx exactly:

| Deserializer side | Serializer side |
| --- | --- |
| Deserializer base class | Serializer base class |
| DesFloatType, DesDoubleType, DesInt32Type | SerFloatType, SerDoubleType, SerInt32Type |
| DesVectorType (type-specialized) | SerVectorType (type-specialized) |
| GenericDeserializer | GenericSerializer |
| find_deserializer() | find_serializer() |
| make_deserializers() | make_serializers() |
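
The "typed serializer object plus factory lookup" shape can be sketched in plain Python as follows. The class and function names match the table, but the bodies are assumptions made for illustration: `Int32Type` and `UTF8Type` here are local stand-ins rather than the driver's cqltypes, and the `_SPECIALIZED` registry is a hypothetical lookup mechanism, not necessarily the one used in serializers.pyx.

```python
import struct

class Serializer:
    """Base class: one instance per column type, reused for every value."""
    def __init__(self, cqltype):
        self.cqltype = cqltype

    def serialize(self, value, protocol_version):
        raise NotImplementedError

class GenericSerializer(Serializer):
    """Fallback: delegate to the cqltype's own serialize() classmethod."""
    def serialize(self, value, protocol_version):
        return self.cqltype.serialize(value, protocol_version)

class SerInt32Type(Serializer):
    """Specialized path: encode directly, skipping the classmethod call."""
    def serialize(self, value, protocol_version):
        return struct.pack(">i", value)

# Stand-in cqltypes (hypothetical, for this sketch only).
class Int32Type:
    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)

class UTF8Type:
    @classmethod
    def serialize(cls, value, protocol_version):
        return value.encode("utf-8")

# Hypothetical registry mapping a cqltype to its specialized serializer.
_SPECIALIZED = {Int32Type: SerInt32Type}

def find_serializer(cqltype):
    for base, ser_cls in _SPECIALIZED.items():
        if issubclass(cqltype, base):
            return ser_cls(cqltype)
    return GenericSerializer(cqltype)

def make_serializers(cqltypes_list):
    return [find_serializer(ct) for ct in cqltypes_list]
```

The point of the design is that the lookup cost is paid once per column type, so the per-value work reduces to a single `serialize()` call on a pre-built object.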

Performance

Benchmarked on Vector<float, 1536> (typical embedding dimension):

| Method | us/op | Speedup |
| --- | --- | --- |
| Current VectorType.serialize() (io.BytesIO loop) | ~823 us | 1x (baseline) |
| Python struct.pack batch format string | ~74 us | ~11x |
| Cython SerVectorType | ~4 us | ~30x |
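
A rough way to reproduce the first two rows is sketched below. It is not the benchmark script used for the numbers above: `serialize_loop` only approximates the per-element `io.BytesIO` approach by calling `struct.pack` directly, the input values are arbitrary, and the Cython row is omitted because it needs the compiled extension. Timings depend on the machine, so no expected figures are given.

```python
import io
import struct
import timeit

DIM = 1536  # embedding dimension from the PR description
values = [float(i) / DIM for i in range(DIM)]

def serialize_loop(vals):
    # Per-element approach: pack one float at a time into a BytesIO buffer.
    buf = io.BytesIO()
    for v in vals:
        buf.write(struct.pack(">f", v))
    return buf.getvalue()

# Precompiled batch format: one call packs the whole vector.
_BATCH = struct.Struct(">%df" % DIM)

def serialize_batch(vals):
    return _BATCH.pack(*vals)

def bench(fn, number=200):
    # Average time per call in microseconds.
    return timeit.timeit(lambda: fn(values), number=number) / number * 1e6

if __name__ == "__main__":
    print("loop  : %.1f us/op" % bench(serialize_loop))
    print("batch : %.1f us/op" % bench(serialize_batch))
```

Both functions produce identical bytes, so any speedup measured this way comes purely from avoiding per-element Python calls.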

No setup.py changes needed — the existing cassandra/*.pyx glob already picks up new .pyx files.

Related PRs

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

…torType

Add cassandra/serializers.pyx and cassandra/serializers.pxd implementing
Cython-optimized serialization that mirrors the deserializers.pyx architecture.

Implements type-specialized serializers for the three subtypes commonly used
in vector columns:
- SerFloatType: 4-byte big-endian IEEE 754 float
- SerDoubleType: 8-byte big-endian double
- SerInt32Type: 4-byte big-endian signed int32

SerVectorType pre-allocates a contiguous buffer and uses C-level byte swapping
for float/double/int32 vectors, with a generic fallback for other subtypes.
GenericSerializer delegates to the Python-level cqltype.serialize() classmethod.

Factory functions find_serializer() and make_serializers() allow easy lookup
and batch creation of serializers for column types.

Benchmarks show ~30x speedup over the current io.BytesIO baseline and ~3x
speedup over Python struct.pack for Vector<float, 1536> serialization.

No setup.py changes needed - the existing cassandra/*.pyx glob already picks
up new .pyx files.
mykaul added a commit to mykaul/python-driver that referenced this pull request Mar 14, 2026
…nt.bind()

When Cython serializers (from cassandra.serializers) are available and no
column encryption policy is active, BoundStatement.bind() now uses
pre-built Serializer objects cached on the PreparedStatement instead of
calling cqltype classmethods. This avoids per-value Python method dispatch
overhead and enables the ~30x vector serialization speedup from the Cython
serializers module.

The bind loop is split into three paths:
1. Column encryption policy path (unchanged behavior)
2. Cython serializers path (new fast path)
3. Plain Python path (no CE, no Cython -- removes per-value ColDesc/CE check)

Depends on PR scylladb#748 (Cython serializers module) and PR scylladb#630 (CE-policy
bind split).
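
The three-path split described in this commit message might look roughly like the sketch below. Everything in it is hypothetical: `bind_values`, the stand-in classes, and the policy's `encrypt(idx, raw)` method are illustrative names, not the driver's API, and the real column encryption path does more than this (for example, deciding per column whether encryption applies).

```python
import struct

class Int32Type:
    """Stand-in for a cqltype exposing a serialize() classmethod."""
    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)

class FastInt32Serializer:
    """Stand-in for a pre-built Serializer cached on the prepared statement."""
    def serialize(self, value, protocol_version):
        return struct.pack(">i", value)

def bind_values(values, col_types, protocol_version,
                serializers=None, ce_policy=None):
    if ce_policy is not None:
        # Path 1: column encryption policy active (simplified here).
        return [
            ce_policy.encrypt(idx, ct.serialize(v, protocol_version))
            for idx, (v, ct) in enumerate(zip(values, col_types))
        ]
    if serializers is not None:
        # Path 2: pre-built serializer objects, no per-value type dispatch.
        return [s.serialize(v, protocol_version)
                for v, s in zip(values, serializers)]
    # Path 3: plain Python, no per-value encryption check.
    return [ct.serialize(v, protocol_version)
            for v, ct in zip(values, col_types)]
```

Choosing the path once, before the loop, is what removes the per-value checks: each branch runs a loop that contains only the work it actually needs.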
@mykaul mykaul marked this pull request as draft March 14, 2026 11:23
@mykaul mykaul requested a review from Copilot March 14, 2026 19:25

Copilot AI left a comment

Pull request overview

This PR introduces a new Cython extension module to accelerate CQL value serialization—especially VectorType—using the same general “typed Serializer object + factory lookup” approach as the existing Cython deserialization stack.

Changes:

  • Add cassandra/serializers.pyx implementing Cython serializers for FloatType, DoubleType, Int32Type, and an optimized VectorType serializer with generic fallback.
  • Add find_serializer() / make_serializers() factory helpers for serializer creation.
  • Add cassandra/serializers.pxd to expose the Serializer interface to other Cython modules.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| cassandra/serializers.pyx | New Cython-optimized serialization implementations and factory lookup. |
| cassandra/serializers.pxd | Cython declarations for the Serializer interface. |


Comment on lines +103 to +106
cpdef bytes serialize(self, object value, int protocol_version):
    cdef int32_t val = <int32_t>value
    cdef char out[4]
    cdef char *src = <char *>&val
Comment on lines +196 to +200
for i in range(self.vector_size):
    val = <float>values[i]
    src = <char *>&val
    dst = buf + i * 4
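
The per-element copy in the loop quoted above is where the PR description's "C-level byte swapping" would happen. The effect can be imitated in pure Python as a sketch: pack the floats in the host's native byte order, then reverse each 4-byte group on little-endian machines. The function name `pack_floats_swapped` is hypothetical, and this illustrates the technique only, not the code under review.

```python
import struct
import sys

def pack_floats_swapped(values):
    # "=" means native byte order with standard sizes (4 bytes per float).
    native = bytearray(struct.pack("=%df" % len(values), *values))
    if sys.byteorder == "little":
        # Reverse every 4-byte element in place to obtain big-endian order.
        for i in range(0, len(native), 4):
            native[i:i + 4] = native[i:i + 4][::-1]
    return bytes(native)
```

On either kind of host the result matches `struct.pack(">Nf", ...)`, which is the big-endian layout the serializers are described as producing.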

Comment on lines +258 to +261

try:
    for i in range(self.vector_size):
        val = <int32_t>values[i]
cqltype.serialize() classmethod.
"""

from libc.stdint cimport int32_t, uint32_t
Comment on lines +332 to +334
def make_serializers(cqltypes_list):
    """Create a list of Serializer objects for each given cqltype."""
    return [find_serializer(ct) for ct in cqltypes_list]
Comment on lines +209 to +212
    return PyBytes_FromStringAndSize(buf, buf_size)
finally:
    free(buf)

Comment on lines +315 to +320
cpdef Serializer find_serializer(cqltype):
    """Find a serializer for a cqltype."""

    # For VectorType, always use SerVectorType (it handles generic subtypes internally)
    if issubclass(cqltype, cqltypes.VectorType):
        return SerVectorType(cqltype)
Comment on lines +61 to +66
cpdef bytes serialize(self, object value, int protocol_version):
    cdef float val = <float>value
    cdef char out[4]
    cdef char *src = <char *>&val

    if is_little_endian:
mykaul added a commit to mykaul/python-driver that referenced this pull request Mar 16, 2026
…nt.bind()
