Add structured metadata filtering to CZI dataset search #5

@LiudengZhang

Description

Follow-up to #4 — thanks @hansen7 for the duplicate-output fix (055305e). The parsing error is resolved, but the underlying relevance problem persists: when the agent calls search_czi_datasets({"query": "lung, Mus musculus", "n_datasets": 5}), the top results are embryo and skin datasets (similarity ~0.77) rather than lung datasets. The root cause is twofold. First, the function's docstring says "input is a string containing: tissue, condition, and organism," which tells the agent to pack everything into the query parameter — so the organism and tissue filter parameters that already exist in the signature are never actually used. Second, even when those filters are passed, they are silently skipped if the filtered set has fewer than n_datasets rows, with no warning or relaxation strategy, so the caller has no idea filtering was dropped.
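The second failure mode can be illustrated with a toy model. This is only a sketch of the behavior described above, not the project's actual code: the `CATALOGUE` rows and the function `candidates_with_silent_skip` are hypothetical, and the real metadata table and filter implementation are not shown in this issue.

```python
# Hypothetical toy catalogue; stands in for the real dataset metadata table.
CATALOGUE = [
    {"id": "d1", "organism": "Mus musculus", "tissue": "lung"},
    {"id": "d2", "organism": "Mus musculus", "tissue": "embryo"},
    {"id": "d3", "organism": "Mus musculus", "tissue": "skin"},
    {"id": "d4", "organism": "Homo sapiens", "tissue": "lung"},
]


def candidates_with_silent_skip(rows, n_datasets, organism=None, tissue=None):
    """Mimic the reported behavior: if the filtered set is too small,
    fall back to all rows without telling the caller."""
    filtered = [
        r for r in rows
        if (organism is None or r["organism"] == organism)
        and (tissue is None or r["tissue"] == tissue)
    ]
    if len(filtered) < n_datasets:
        # Filter abandoned here, and nothing in the return value says so.
        return rows
    return filtered


# Only one row matches lung + Mus musculus, so asking for 5 datasets
# discards the filter and every row becomes a ranking candidate.
result = candidates_with_silent_skip(
    CATALOGUE, n_datasets=5, organism="Mus musculus", tissue="lung"
)
```

In this toy case `result` holds all four rows, including the embryo, skin, and human datasets, and the caller has no way to tell that the filter was not applied.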

Proposed fix: (1) Update the docstring to explicitly instruct the agent to pass organism and tissue as separate parameters when the query contains those constraints, so the existing filter logic actually gets invoked. (2) When strict filtering returns fewer than n_datasets rows, apply controlled relaxation (e.g., drop tissue filter but keep organism, then fall back to unfiltered) and include a warning in the output so the agent knows the results are broader than requested. This keeps the current embedding-ranking approach intact but ensures structured metadata is used as a hard filter first. Happy to open a PR for this if it sounds reasonable.
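A minimal sketch of the controlled relaxation in (2), under stated assumptions: metadata rows are plain dicts with `organism` and `tissue` keys, and the names `filter_rows` and `select_candidates` are hypothetical helpers, not existing functions in the codebase. The embedding ranking would then run only over the returned candidates.

```python
def filter_rows(rows, organism=None, tissue=None):
    """Keep rows matching every filter that is set (case-insensitive)."""
    return [
        r for r in rows
        if (organism is None or r["organism"].lower() == organism.lower())
        and (tissue is None or r["tissue"].lower() == tissue.lower())
    ]


def select_candidates(rows, n_datasets, organism=None, tissue=None):
    """Return (candidates, warning); warning is None if strict filters held.

    Relaxation order assumed here: strict -> organism only -> unfiltered.
    """
    strict = filter_rows(rows, organism, tissue)
    if len(strict) >= n_datasets or (organism is None and tissue is None):
        return strict, None

    # Step 1: drop the tissue filter but keep the organism constraint.
    if organism is not None and tissue is not None:
        relaxed = filter_rows(rows, organism=organism)
        if len(relaxed) >= n_datasets:
            return relaxed, (
                f"Only {len(strict)} dataset(s) matched tissue={tissue!r}; "
                f"tissue filter dropped, organism={organism!r} kept."
            )

    # Step 2: fall back to the unfiltered set, and say so.
    return list(rows), (
        f"Fewer than {n_datasets} datasets matched the metadata filters; "
        "all filters dropped, results are unfiltered."
    )
```

On the agent side, the intent of (1) is that the call would look like `search_czi_datasets({"query": "lung", "organism": "Mus musculus", "tissue": "lung", "n_datasets": 5})` rather than packing everything into `query`. The exact argument shape is an assumption based on the parameter names mentioned above.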
