fix: add speaker embedding matching to offline sync (issue #5907) #5946

sungdark wants to merge 1 commit into BasedHardware:main
Conversation
Offline sync (sync_local_files / process_segment) was skipping the speaker identification pipeline, causing all transcribed segments to show generic 'SPEAKER_00', 'SPEAKER_01' labels instead of being matched against stored person embeddings. Live recording runs speaker_identification_task which calls get_speech_profile_matching_predictions to identify speakers from their voice embeddings. This fix adds the same call to process_segment after postprocess_words returns. Fixes BasedHardware#5907
Greptile Summary

This PR fixes issue #5907 by adding speaker embedding matching to the offline sync path.

Key changes:
Minor issues found:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant sync_local_files
    participant process_segment
    participant deepgram_prerecorded
    participant get_speech_profile_matching_predictions
    participant SpeechProfileAPI
    participant DB
    Client->>sync_local_files: POST /v1/sync-local-files (audio .bin files)
    sync_local_files->>sync_local_files: decode_files_to_wav (.bin → .wav)
    sync_local_files->>sync_local_files: retrieve_vad_segments (split into speech segments)
    sync_local_files->>process_segment: process each segmented .wav (thread)
    process_segment->>deepgram_prerecorded: transcribe via signed URL
    deepgram_prerecorded-->>process_segment: transcript_segments (SPEAKER_00, SPEAKER_01…)
    Note over process_segment: NEW: speaker embedding matching
    process_segment->>get_speech_profile_matching_predictions: uid + wav_path + segments
    get_speech_profile_matching_predictions->>SpeechProfileAPI: POST audio + segments
    SpeechProfileAPI-->>get_speech_profile_matching_predictions: [{is_user, person_id}, …]
    get_speech_profile_matching_predictions-->>process_segment: matches list
    process_segment->>process_segment: set seg.is_user / seg.person_id
    process_segment->>DB: store/merge conversation with identified speakers
```
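The flow above can be sketched as a minimal Python script. The function names (`deepgram_prerecorded`, `get_speech_profile_matching_predictions`, `process_segment`) come from the PR, but the bodies here are illustrative stubs, not the real implementations:

```python
def deepgram_prerecorded(wav_path):
    # Stub: the real function transcribes via a signed URL and returns
    # segments with generic diarization labels (SPEAKER_00, SPEAKER_01, ...).
    return [
        {'text': 'hello', 'speaker': 'SPEAKER_00', 'is_user': False, 'person_id': None},
        {'text': 'hi', 'speaker': 'SPEAKER_01', 'is_user': False, 'person_id': None},
    ]

def get_speech_profile_matching_predictions(uid, wav_path, segments):
    # Stub: the real function POSTs audio + segments to the speech-profile
    # API and returns one {is_user, person_id} dict per segment.
    return [
        {'is_user': True, 'person_id': None},
        {'is_user': False, 'person_id': 'person_abc'},
    ]

def process_segment(uid, wav_path):
    segments = deepgram_prerecorded(wav_path)
    # NEW step added by this PR: overwrite generic labels with embedding matches.
    matches = get_speech_profile_matching_predictions(uid, wav_path, segments)
    for seg, match in zip(segments, matches):
        seg['is_user'] = match['is_user']
        seg['person_id'] = match.get('person_id')
    return segments
```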
Reviews (1): Last reviewed commit: "fix: add speaker embedding matching to o..."
```python
# Speaker identification: match segments against stored person embeddings
# This uses the same pipeline as live recording (speaker_identification_task)
try:
    wav_path = path.replace('.bin', '.wav')
```
Unnecessary `.bin` → `.wav` path substitution
path passed to process_segment is always a .wav file — it comes from segmented_paths, which are populated inside retrieve_vad_segments with paths like f'{path_dir}/{segment_timestamp}.wav'. The .replace('.bin', '.wav') call has no effect here (there is no .bin in the path), so wav_path is always equal to path.
The misleading substitution is a silent no-op today, but it implies that a .bin path might arrive here. If the call-site ever changes, the speaker matching step would silently try to open a file whose name was never transformed, causing the except block to swallow the error with no speaker data written.
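The no-op is easy to verify in isolation (the path below is a hypothetical example of the `f'{path_dir}/{segment_timestamp}.wav'` naming scheme):

```python
# Paths produced by retrieve_vad_segments end in '.wav', so there is
# no '.bin' substring for str.replace to act on.
path = '/tmp/sync/1717000000.wav'  # hypothetical segmented path
wav_path = path.replace('.bin', '.wav')
assert wav_path == path  # silent no-op: the string is returned unchanged
```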
```diff
- wav_path = path.replace('.bin', '.wav')
+ wav_path = path  # path is already a .wav segment from retrieve_vad_segments
```
```python
for i, seg in enumerate(transcript_segments):
    seg.is_user = matches[i]['is_user']
    seg.person_id = matches[i].get('person_id')
```
No bounds-check before indexing `matches`
get_speech_profile_matching_predictions returns [{'is_user': False, 'person_id': None}] * len(segments) on the error paths, but on a successful API response it simply returns whatever the remote service returned — there is no guarantee the length matches transcript_segments. If the response contains fewer items, matches[i] raises an IndexError; if it contains more, extra matches are silently ignored.
The current except Exception wrapper will catch the IndexError and log it, so this is not a crash, but it means speaker identification is completely skipped when the API returns even one fewer result than expected.
Consider guarding the loop or falling back to a safe default when lengths differ:
```python
for i, seg in enumerate(transcript_segments):
    if i < len(matches):
        seg.is_user = matches[i]['is_user']
        seg.person_id = matches[i].get('person_id')
```
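An equivalent guard, sketched here as an alternative (variable names assumed from the diff above, with dummy stand-in data), is to pair the two lists with `zip`, which stops at the shorter list and can never index out of range:

```python
from types import SimpleNamespace

# Dummy stand-ins for Deepgram segments and the matching-API response.
transcript_segments = [SimpleNamespace(is_user=False, person_id=None) for _ in range(3)]
matches = [
    {'is_user': True, 'person_id': None},
    {'is_user': False, 'person_id': 'p1'},
]  # deliberately one entry short

# zip truncates to the shorter sequence, so a short response cannot raise
# IndexError; the unmatched third segment simply keeps its defaults.
for seg, match in zip(transcript_segments, matches):
    seg.is_user = match['is_user']
    seg.person_id = match.get('person_id')
```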
Fix: no speaker diarization in offline sync (issue #5907)
Problem
Offline recording sync (sync_local_files / process_segment) was skipping the speaker identification pipeline, causing all transcribed segments to show generic 'SPEAKER_00', 'SPEAKER_01' labels instead of being matched against stored person embeddings. Live recording worked correctly because it runs speaker_identification_task which calls get_speech_profile_matching_predictions to identify speakers from their voice embeddings.
Solution
Added the same speaker embedding matching call to process_segment after postprocess_words returns. The get_speech_profile_matching_predictions function extracts speaker embeddings from the audio and matches them against stored person embeddings, setting is_user and person_id on each segment.
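Since the error paths of `get_speech_profile_matching_predictions` already return `[{'is_user': False, 'person_id': None}] * len(segments)`, one way to make the success path equally safe is to pad or trim the API response to the segment count before applying it. This is a sketch of that idea, not code from the PR; `normalize_matches` is a hypothetical helper:

```python
def normalize_matches(matches, n_segments):
    """Pad or trim the API response so it has exactly one entry per segment.

    Missing entries fall back to the same default dict the error paths use,
    so segments without a match are treated as unidentified speakers.
    """
    default = {'is_user': False, 'person_id': None}
    normalized = list(matches)[:n_segments]          # drop any surplus entries
    normalized += [dict(default) for _ in range(n_segments - len(normalized))]
    return normalized
```

With this in place, the assignment loop can index `matches[i]` unconditionally regardless of what the remote service returned.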
Changes
Testing
The fix follows the same pattern used in postprocess_conversation.py's _handle_segment_embedding_matching function and the speaker_identification_task in transcribe.py.
Closes #5907