Skip to content

fast-get: fix a recent regression#10639

Open
lyakh wants to merge 4 commits intothesofproject:mainfrom
lyakh:fastget
Open

fast-get: fix a recent regression#10639
lyakh wants to merge 4 commits intothesofproject:mainfrom
lyakh:fastget

Conversation

@lyakh
Copy link
Collaborator

@lyakh lyakh commented Mar 20, 2026

Recent commits 7d0621e ("fast-get: fix partition leak for multi-thread usage") and 3c757dd ("fast-get: fix crash by freeing buffer before removing partition") introduced a regression when using SRC in DP mode in userspace on PTL and running two aplay -r 44100
back to back. The reason is that SRC isn't calling mod_fast_put() explicitly and is instead relying on automatic clean up, which then happens in a different thread.

In fact those two commits weren't needed AFAICS, since fast_put() shouldn't be called directly from userspace, instead only mod_fast_put() should be called, which is a syscall, and therefore fast_put() then will have access to all the memory.

Copilot AI review requested due to automatic review settings March 20, 2026 15:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a userspace regression in Zephyr’s fast-get cleanup path where memory-domain partition removal can occur from a different thread than the one that originally fast-get’d the buffer (observed with SRC in DP mode and back-to-back aplay -r 44100).

Changes:

  • Change fast_put() userspace partition removal to use the entry->thread memory domain instead of k_current_get().
  • Update debug logging to reference entry->thread instead of the current thread.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ,can you respond to copilot. Thanks !

@lyakh lyakh added the DNM Do Not Merge tag label Mar 23, 2026
#if CONFIG_USERSPACE
struct k_mem_domain *mdom = k_current_get()->mem_domain_info.mem_domain;

/* We only get there for large buffers */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relation between num_partitions and large buffers is not obvious. Could you make the comment clearer?

/* A userspace thread makes the request */
if (k_current_get() != entry->thread &&
if (mdom != entry->mdom &&
!fast_get_domain_exists(k_current_get(), ret,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the function fast_get_domain_exists iterates through partitions. Why not calling it fast_get_partition_exists?

Recent commits 7d0621e ("fast-get: fix partition leak for
multi-thread usage") and 3c757dd ("fast-get: fix crash by
freeing buffer before removing partition") introduced a regression
when using SRC in DP mode in userspace on PTL and running two
aplay -r 44100
back to back. The reason is that SRC isn't calling mod_fast_put()
explicitly and is instead relying on automatic clean up, which then
happens in a different thread.

In fact those two commits weren't needed AFAICS, since fast_put()
shouldn't be called directly from userspace, instead only
mod_fast_put() should be called, which is a syscall, and therefore
fast_put() then will have access to all the memory.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lyakh added 2 commits March 23, 2026 16:31
In fast_put() sram_ptr == entry->sram_ptr, use the former since it's
shorter. Also clarify a comment and rename a function to better match
its role.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Before dropping the DP thread memory domain, check if any partitions
are left still attached to it. If so, warn and remove them.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
/* We only get there for large buffers */
/*
* We only get there for large buffers, since small buffers with
* enabled userspace don't create fast-get entries
Copy link
Contributor

@wjablon1 wjablon1 Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, I think this part is well understood (you can deduce this info just by examining fast_get function itself). No need to further explain it.
The mystery part that I was referring to in my previous comment is this:

mdom->num_partitions > 1

I am not sure if this comment is even related. I am guessing the related comment is below:

/* A userspace thread makes the request */

Am I right? If yes, maybe you could just change it to:

/* if num_partitions > 0, the request comes form a userspace thread */

Or you can combine both comments like this:

/* We only get there for large buffers.
/* Check if the request comes from a userspace thread by examining num_partitions
*/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wjablon1 yes, that test is related to the "userspace thread" comment. @softwarecki wanted to replace that test with a K_USER check, then it will be clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if we already have a plan for this, forget about my comment and remove unnecessary clarification

@lyakh lyakh removed the DNM Do Not Merge tag label Mar 23, 2026
@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2026

@lrudyX I see a single WCL "module creation" test failing to create a copier module with this PR. Doesn't look related?

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2026

no PTL nocodec device was available. Re-run jenkins tests

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2026

SOFCI TEST

@lrudyX
Copy link

lrudyX commented Mar 24, 2026

@lrudyX I see a single WCL "module creation" test failing to create a copier module with this PR. Doesn't look related?

Test failed at deleting pipeline with SRC in DP, which was fixed by "Recent commits 7d0621e ("fast-get: fix partition leak for multi-thread usage") and 3c757dd ("fast-get: fix crash by freeing buffer before removing partition")". It is highly related.

Copy link
Contributor

@tmleman tmleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing!

Copy link
Collaborator

@singalsu singalsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK for commit "audio: src: free copied coefficients". I don't know enough to review the other commits.

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2026

"Internal Intel CI" https://sof-ci.01.org/sof-pr-viewer/#/build/PR10639/build14876904 passes now. Let me just remove the added debugging, after which it should be ready for merging

Until now SRC was relying on the final module clean up to free
copied with fast_get() coefficients. This however isn't clean. Fix it
and release partitions explicitly.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2026

The "Internal Intel CI" failure this time is a known intermittent "random HDA DMA" failure

Copy link
Collaborator

@softwarecki softwarecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fast_put flow is still incorrect for multiple module instances. The fast_get adds partitions to the memory domain of every instance that allocates the buffer, but the partition is removed only for the instance that releases the memory last.

Storing the memory domain pointer inside the entry also looks pointless in the current design. Should we instead maintain a separate entry per module instance?

Additionally, memory domain retrieval for the proxy variant needs fixing. The proxy already stores a pointer to the memory domain in its context structure.

case MOD_RES_FAST_GET:
fast_put(res->heap, container->sram_ptr);
#if CONFIG_USERSPACE
mdom = mod->mdom;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The userspace proxy already stores a pointer to the memory domain in a different place. Need to distinguishing the behavior for PROXY/APPLICATION.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@softwarecki is this patch breaking anything for the "thread" model? If not, I'd propose to merge it to fix the "application" model which is currently broken because of this fast-get regression. Once it's all fixed we can carefully check what we can improve

#if CONFIG_USERSPACE
struct userspace_context *user_ctx;
#endif /* CONFIG_USERSPACE */
#if CONFIG_SOF_USERSPACE_APPLICATION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is unnecessary, because the proxy already stores the memory domain in its context.

LOG_DBG("grant access to thread %p first was %p", k_current_get(),
entry->thread);
int err = fast_get_access_grant(k_current_get(), ret, size);
if (mdom != entry->mdom &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the purpose of this check. The entry allocation together with the mdom assignment happens in a different part of the code.

If the same module instance calls fast_get again for the same address (unlikely), then this condition will indeed trigger. But what about other instances of the same module? Do they always rely on fast_get_partition_exists on every call?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just checking for "partition exists" would be enough, but this would be an optimisation, we can optimise it later.

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2026

The fast_get adds partitions to the memory domain of every instance that allocates the buffer, but the partition is removed only for the instance that releases the memory last.

@softwarecki I don't think this is correct. That's exactly why fast_put() checks refcount before and after removing the partition, but the partition is removed every time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants