
fix: add speaker embedding matching to offline sync (issue #5907)#5946

Open
sungdark wants to merge 1 commit into BasedHardware:main from sungdark:fix/offline-sync-speaker-diarization

Conversation

@sungdark

Fix: Offline sync no speaker diarization (issue #5907)

Problem

Offline recording sync (sync_local_files / process_segment) was skipping the speaker identification pipeline, causing all transcribed segments to show generic 'SPEAKER_00', 'SPEAKER_01' labels instead of being matched against stored person embeddings. Live recording worked correctly because it runs speaker_identification_task which calls get_speech_profile_matching_predictions to identify speakers from their voice embeddings.

Solution

Added the same speaker embedding matching call to process_segment after postprocess_words returns. The get_speech_profile_matching_predictions function extracts speaker embeddings from the audio and matches them against stored person embeddings, setting is_user and person_id on each segment.

Changes

  • backend/routers/sync.py:
    • Added import for get_speech_profile_matching_predictions
    • Added speaker matching call in process_segment (after getting transcript segments, before storing them)
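The change described above can be sketched as a small, self-contained example. Only the name get_speech_profile_matching_predictions and the is_user / person_id fields come from the PR; the TranscriptSegment class, the helper name, and the sample data below are illustrative stand-ins:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the real transcript segment model.
@dataclass
class TranscriptSegment:
    text: str
    speaker: str
    is_user: bool = False
    person_id: Optional[str] = None

def apply_speaker_matches(transcript_segments, matches):
    # Mirrors the loop added in process_segment: copy each prediction
    # onto the corresponding segment's is_user / person_id fields.
    for i, seg in enumerate(transcript_segments):
        seg.is_user = matches[i]['is_user']
        seg.person_id = matches[i].get('person_id')

segments = [TranscriptSegment('hey there', 'SPEAKER_00'),
            TranscriptSegment('hello', 'SPEAKER_01')]
# Shape the PR expects get_speech_profile_matching_predictions to return:
# one dict per segment.
matches = [{'is_user': True}, {'is_user': False, 'person_id': 'person-123'}]
apply_speaker_matches(segments, matches)
print(segments[0].is_user, segments[1].person_id)  # True person-123
```

In the real code path this loop runs inside a try/except so that a matching failure leaves the default SPEAKER_* labels in place rather than failing the sync.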

Testing

The fix follows the same pattern used in postprocess_conversation.py's _handle_segment_embedding_matching function and the speaker_identification_task in transcribe.py.

Closes #5907

Offline sync (sync_local_files / process_segment) was skipping the
speaker identification pipeline, causing all transcribed segments to
show generic 'SPEAKER_00', 'SPEAKER_01' labels instead of being
matched against stored person embeddings.

Live recording runs speaker_identification_task which calls
get_speech_profile_matching_predictions to identify speakers from
their voice embeddings. This fix adds the same call to process_segment
after postprocess_words returns.

Fixes BasedHardware#5907

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR fixes issue #5907 by adding speaker embedding matching to the offline sync path (process_segment in backend/routers/sync.py), bringing it in line with the live-recording pipeline that already calls get_speech_profile_matching_predictions. The change is small and well-scoped: it mirrors the pattern established in _handle_segment_embedding_matching (postprocess_conversation.py) and wraps the new call in a try/except so failures degrade gracefully.

Key changes:

  • Imports get_speech_profile_matching_predictions from utils.stt.speech_profile
  • Calls the speaker-matching API after postprocess_words returns, before segments are stored or merged with an existing conversation
  • On failure the exception is caught and logged, segments retain their default SPEAKER_* labels rather than crashing the sync

Minor issues found:

  • path.replace('.bin', '.wav') is a no-op — paths passed to process_segment are always .wav files from segmented_paths; the variable should simply be wav_path = path
  • No bounds check before matches[i]: if the remote API returns fewer items than transcript_segments, the loop raises an IndexError that is swallowed by except Exception, silently skipping all speaker attribution

Confidence Score: 4/5

  • Safe to merge — the fix is wrapped in a try/except and only adds new behaviour to a previously-broken code path; any failure leaves offline sync no worse than before.
  • The logic is correct and follows the established pattern. The two issues flagged are both style/defensive-coding concerns (P2), not runtime blockers — failures are caught and logged. The IndexError risk is real but only manifests in an edge case where the speech-profile API returns a malformed response, and even then the silent fallback is acceptable rather than data-corrupting.
  • No files require special attention; backend/routers/sync.py is the only changed file and the concerns are minor.

Important Files Changed

Filename Overview
backend/routers/sync.py Adds speaker embedding matching to the offline sync path by calling get_speech_profile_matching_predictions after transcription. Functionally mirrors the live-recording pipeline. Two style-level concerns: (1) path.replace('.bin', '.wav') is a no-op since segmented paths are already .wav, and (2) no bounds check before indexing into matches, which would silently skip all speaker data if the API returns a shorter list.

Sequence Diagram

sequenceDiagram
    participant Client
    participant sync_local_files
    participant process_segment
    participant deepgram_prerecorded
    participant get_speech_profile_matching_predictions
    participant SpeechProfileAPI
    participant DB

    Client->>sync_local_files: POST /v1/sync-local-files (audio .bin files)
    sync_local_files->>sync_local_files: decode_files_to_wav (.bin → .wav)
    sync_local_files->>sync_local_files: retrieve_vad_segments (split into speech segments)
    sync_local_files->>process_segment: process each segmented .wav (thread)

    process_segment->>deepgram_prerecorded: transcribe via signed URL
    deepgram_prerecorded-->>process_segment: transcript_segments (SPEAKER_00, SPEAKER_01…)

    Note over process_segment: NEW: speaker embedding matching
    process_segment->>get_speech_profile_matching_predictions: uid + wav_path + segments
    get_speech_profile_matching_predictions->>SpeechProfileAPI: POST audio + segments
    SpeechProfileAPI-->>get_speech_profile_matching_predictions: [{is_user, person_id}, …]
    get_speech_profile_matching_predictions-->>process_segment: matches list
    process_segment->>process_segment: set seg.is_user / seg.person_id

    process_segment->>DB: store/merge conversation with identified speakers


# Speaker identification: match segments against stored person embeddings
# This uses the same pipeline as live recording (speaker_identification_task)
try:
    wav_path = path.replace('.bin', '.wav')

P2: Unnecessary .bin → .wav path substitution

path passed to process_segment is always a .wav file — it comes from segmented_paths, which are populated inside retrieve_vad_segments with paths like f'{path_dir}/{segment_timestamp}.wav'. The .replace('.bin', '.wav') call has no effect here (there is no .bin in the path), so wav_path is always equal to path.

The misleading substitution is a silent no-op today, but it implies that a .bin path might arrive here. If the call-site ever changes, the speaker matching step would silently try to open a file whose name was never transformed, causing the except block to swallow the error with no speaker data written.
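A two-line check confirms the substitution is a no-op on such paths (the values below are illustrative; the f-string pattern is the one quoted above):

```python
# Segment paths are built like f'{path_dir}/{segment_timestamp}.wav', so
# there is never a '.bin' substring for replace() to act on.
path_dir, segment_timestamp = '/tmp/sync-session', 12.5  # illustrative values
path = f'{path_dir}/{segment_timestamp}.wav'
wav_path = path.replace('.bin', '.wav')
print(wav_path == path)  # True: the substitution changed nothing
```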

Suggested change:

-wav_path = path.replace('.bin', '.wav')
+wav_path = path  # path is already a .wav segment from retrieve_vad_segments

Comment on lines +654 to +656:

for i, seg in enumerate(transcript_segments):
    seg.is_user = matches[i]['is_user']
    seg.person_id = matches[i].get('person_id')

P2: No bounds check before indexing matches

get_speech_profile_matching_predictions returns [{'is_user': False, 'person_id': None}] * len(segments) on the error paths, but on a successful API response it simply returns whatever the remote service returned — there is no guarantee the length matches transcript_segments. If the response contains fewer items, matches[i] raises an IndexError; if it contains more, extra matches are silently ignored.

The current except Exception wrapper will catch the IndexError and log it, so this is not a crash, but it means speaker identification is completely skipped when the API returns even one fewer result than expected.

Consider guarding the loop or falling back to a safe default when lengths differ:

for i, seg in enumerate(transcript_segments):
    if i < len(matches):
        seg.is_user = matches[i]['is_user']
        seg.person_id = matches[i].get('person_id')
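An alternative guard (not part of the PR or the suggestion above) is zip, which pairs items only up to the shorter list, so a short response still attributes the segments it can:

```python
class Seg:
    # Minimal illustrative segment with the two fields the loop writes.
    def __init__(self):
        self.is_user = False
        self.person_id = None

transcript_segments = [Seg(), Seg(), Seg()]
matches = [{'is_user': True}, {'is_user': False, 'person_id': 'p-1'}]  # one short

# zip truncates to len(matches): the third segment keeps its defaults
# instead of the loop raising IndexError.
for seg, match in zip(transcript_segments, matches):
    seg.is_user = match['is_user']
    seg.person_id = match.get('person_id')

print([s.is_user for s in transcript_segments])  # [True, False, False]
```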



Development

Successfully merging this pull request may close these issues.

Offline sync: no speaker diarization (works fine for live sync)
