feat(policies): parallelize policy evaluations with bounded concurrency #2901
waveywaves wants to merge 7 commits into chainloop-dev:main from
Conversation
Use errgroup to evaluate policy attachments concurrently within VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU(). This reduces evaluation time from O(n*t) to O(t), where n is the number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level to prevent data races under concurrent access
- Move extism.SetLogLevel() from per-Verify() call to engine construction to avoid racing on global state

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
2 issues found across 6 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/policies/loader.go">
<violation number="1" location="pkg/policies/loader.go:181">
P2: Global cache mutex is held during remote GetPolicy I/O, serializing concurrent policy loads and causing avoidable head-of-line blocking.</violation>
</file>
<file name="pkg/policies/group_loader.go">
<violation number="1" location="pkg/policies/group_loader.go:122">
P1: Global cache mutex is held across remote RPC in `Load`, causing head-of-line blocking and serializing concurrent policy loads.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Use errgroup to evaluate policy attachments concurrently within VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU(). This reduces evaluation time from O(n*t) to O(t), where n is the number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level to prevent data races under concurrent access
- Use check-lock-check pattern in cache loaders to avoid holding the mutex during gRPC calls, enabling true parallel cache-miss fetches
- Move extism.SetLogLevel() to sync.Once to guarantee it is called exactly once across all concurrent WASM engine instances

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>

Use errgroup to evaluate policy attachments concurrently within VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10). This reduces evaluation time from O(n*t) to O(t), where n is the number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same policy/group ref, ensuring exactly one fetch per key
- Move extism.SetLogLevel() to sync.Once to guarantee it is called exactly once across all concurrent WASM engine instances

Adds race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>

Use errgroup to evaluate policy attachments concurrently within VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10). This reduces evaluation time from O(n*t) to O(t), where n is the number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same policy/group ref, ensuring exactly one fetch per key
- Use context.WithoutCancel inside singleflight callbacks to prevent the winning goroutine's cancellation from propagating to waiters
- Move extism.SetLogLevel() to sync.Once to guarantee it is called exactly once across all concurrent WASM engine instances

Adds 5 race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Both issues identified by cubic are resolved:
@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.
@waveywaves I have started the AI code review. It will take a few minutes to complete.
1 issue found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/policies/loader.go">
<violation number="1" location="pkg/policies/loader.go:200">
P2: Using context.WithoutCancel here removes the caller’s deadline/cancellation, so the GetPolicy RPC can outlive canceled/timed-out verification requests and continue running indefinitely under outage conditions.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
context.WithoutCancel detaches cancellation but also drops the deadline. Re-apply the original deadline so gRPC calls inside singleflight still have a bounded lifetime under server outage conditions.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Addressed the P2 finding identified by cubic — fixed in 6846021.
@jiparis this is ready for review when you get a chance — parallelizes policy evaluations with errgroup + singleflight for cache coalescing. All CI green. |
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/policies/loader.go">
<violation number="1" location="pkg/policies/loader.go:203">
P2: Singleflight request context inherits only the winner caller’s deadline, so concurrent waiters can fail early with a timeout unrelated to their own context deadlines.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Replace context.WithoutCancel + inherited deadline with a fixed 30s timeout from context.Background(). This ensures all singleflight waiters get uniform timeout behavior regardless of which goroutine won the race, preventing one caller's short deadline from failing the shared RPC.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Addressed the P2 identified by cubic — fixed in d6dd003.
@@ -0,0 +1,233 @@
//
// Copyright 2024-2026 The Chainloop Authors.

// defaultMaxConcurrency is the default number of concurrent policy evaluations.
// Set higher than NumCPU because policy evaluation is I/O-mixed (gRPC calls,
// file loads) rather than purely CPU-bound. Can be overridden via WithMaxConcurrency.
var defaultMaxConcurrency = max(runtime.NumCPU()*2, 10)
I will reduce this to a smaller limit because Chainloop attestations can operate in remote state mode, which means different policy evaluations will contend to store the result using optimistic locking. So, let's ensure it’s smaller. cc/ @jiparis any thoughts?
In that scenario evaluations will happen all at once and then update the remote state with the aggregated result, but I agree, let's reduce it to something more conservative for now to test the behaviour, something like max(runtime.NumCPU(), 5). We also don't want to hit hard on the AI-enabled policies.
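For context on the option being discussed: a conventional functional-options shape for `WithMaxConcurrency` could look like this sketch. The `PolicyVerifier` struct and constructor here are illustrative, not Chainloop's actual types (requires Go 1.21+ for the `max` builtin):

```go
package main

import (
	"fmt"
	"runtime"
)

// PolicyVerifier is a hypothetical stand-in for the PR's verifier type.
type PolicyVerifier struct {
	maxConcurrency int
}

type VerifierOption func(*PolicyVerifier)

// WithMaxConcurrency overrides the default evaluation parallelism. Values
// below 1 are ignored so a misconfigured caller keeps the safe default.
func WithMaxConcurrency(n int) VerifierOption {
	return func(v *PolicyVerifier) {
		if n >= 1 {
			v.maxConcurrency = n
		}
	}
}

func NewPolicyVerifier(opts ...VerifierOption) *PolicyVerifier {
	// Default mirrors the diff above; the maintainers' suggested
	// max(runtime.NumCPU(), 5) would be a one-line change here.
	v := &PolicyVerifier{maxConcurrency: max(runtime.NumCPU()*2, 10)}
	for _, o := range opts {
		o(v)
	}
	return v
}

func main() {
	fmt.Println(NewPolicyVerifier(WithMaxConcurrency(3)).maxConcurrency)
}
```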
if err != nil {
	return nil, NewPolicyError(err)
}
// Load material content once for all policies in this group
Summary
- Parallelize policy evaluation in VerifyStatement() and VerifyMaterial() using errgroup with bounded concurrency (max(NumCPU*2, 10))
- Add WithMaxConcurrency(n) option to PolicyVerifier for configurable parallelism
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level (the caches are package-level maps but were protected by per-instance mutexes — a latent data race)
- Use singleflight.Group to coalesce concurrent gRPC fetches for the same policy/group ref — ensures exactly one fetch per key, no duplicate RPCs or head-of-line blocking
- Use context.WithoutCancel inside singleflight callbacks to prevent the winning goroutine's cancellation from propagating to waiters
- Move extism.SetLogLevel() to sync.Once (it modifies global state; first engine's log level wins)

Race detector tests:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

All existing + new tests pass with -race enabled. All CI checks green.

Fixes #2899
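The sync.Once change in the summary can be sketched as follows; `globalLogLevel` and `newWASMEngine` are hypothetical stand-ins for the global state behind extism.SetLogLevel and the PR's engine constructor:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	logLevelOnce   sync.Once
	globalLogLevel string // stand-in for the global state extism.SetLogLevel mutates
)

type wasmEngine struct{}

// newWASMEngine sets the global log level exactly once across all engine
// instances, even when constructed concurrently: the first engine's level
// wins, and there is no data race on the global.
func newWASMEngine(level string) *wasmEngine {
	logLevelOnce.Do(func() {
		globalLogLevel = level
	})
	return &wasmEngine{}
}

func main() {
	newWASMEngine("debug")
	newWASMEngine("error") // no effect: the level is already set
	fmt.Println(globalLogLevel)
}
```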