
feat(policies): parallelize policy evaluations with bounded concurrency#2901

Open
waveywaves wants to merge 7 commits into chainloop-dev:main from waveywaves:feat/parallelize-policy-evaluations

Conversation

Contributor

@waveywaves waveywaves commented Mar 20, 2026

Summary

  • Parallelize policy attachment evaluations within VerifyStatement() and VerifyMaterial() using errgroup with bounded concurrency (max(NumCPU*2, 10))
  • Add WithMaxConcurrency(n) option to PolicyVerifier for configurable parallelism
  • Fix pre-existing concurrency safety issues:
    • Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level (the caches are package-level maps but were protected by per-instance mutexes — a latent data race)
    • Use singleflight.Group to coalesce concurrent gRPC fetches for the same policy/group ref — ensures exactly one fetch per key, no duplicate RPCs or head-of-line blocking
    • Use context.WithoutCancel inside singleflight callbacks to prevent the winning goroutine's cancellation from propagating to waiters
    • Move extism.SetLogLevel() to sync.Once (it modifies global state; first engine's log level wins)
  • Marshal statement JSON once per evaluation batch instead of per-policy
  • Add 5 race detector tests:
    • Concurrent VerifyStatement from 10 goroutines
    • Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
    • WithMaxConcurrency option validation
    • Sequential (maxConcurrency=1) vs parallel result equivalence
    • errgroup cancellation on policy load failure

All existing + new tests pass with -race enabled. All CI checks green.

Fixes #2899

Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU().
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Move extism.SetLogLevel() from per-Verify() call to engine
  construction to avoid racing on global state

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves waveywaves marked this pull request as draft March 20, 2026 17:23

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:181">
P2: Global cache mutex is held during remote GetPolicy I/O, serializing concurrent policy loads and causing avoidable head-of-line blocking.</violation>
</file>

<file name="pkg/policies/group_loader.go">

<violation number="1" location="pkg/policies/group_loader.go:122">
P1: Global cache mutex is held across remote RPC in `Load`, causing head-of-line blocking and serializing concurrent policy loads.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU().
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use check-lock-check pattern in cache loaders to avoid holding the
  mutex during gRPC calls, enabling true parallel cache-miss fetches
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10).
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same
  policy/group ref, ensuring exactly one fetch per key
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Adds race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10).
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same
  policy/group ref, ensuring exactly one fetch per key
- Use context.WithoutCancel inside singleflight callbacks to prevent
  the winning goroutine's cancellation from propagating to waiters
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Adds 5 race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves waveywaves marked this pull request as ready for review March 20, 2026 18:47
@waveywaves waveywaves marked this pull request as draft March 20, 2026 18:48
@waveywaves
Contributor Author

Both issues identified by cubic are resolved:

  1. loader.go (P2) and group_loader.go (P1) — mutex held during gRPC: Replaced defer mutex.Unlock() with singleflight.Group to coalesce concurrent fetches. The mutex is now only held for cache reads/writes (microseconds), never during I/O. context.WithoutCancel is used inside singleflight callbacks to prevent cross-goroutine cancellation propagation.

@waveywaves
Contributor Author

@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.


cubic-dev-ai bot commented Mar 20, 2026

@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.

@waveywaves I have started the AI code review. It will take a few minutes to complete.


@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 7 files

@waveywaves waveywaves marked this pull request as ready for review March 20, 2026 19:27

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 7 files



<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:200">
P2: Using context.WithoutCancel here removes the caller’s deadline/cancellation, so the GetPolicy RPC can outlive canceled/timed-out verification requests and continue running indefinitely under outage conditions.</violation>
</file>


context.WithoutCancel detaches cancellation but also drops the deadline.
Re-apply the original deadline so gRPC calls inside singleflight still
have a bounded lifetime under server outage conditions.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves
Contributor Author

Addressed the P2 finding identified by cubic — context.WithoutCancel now re-applies the original deadline via context.WithDeadline so gRPC calls inside singleflight still have a bounded lifetime. The cancel func is properly deferred.

Fixed in 6846021.

@waveywaves
Contributor Author

@jiparis this is ready for review when you get a chance — parallelizes policy evaluations with errgroup + singleflight for cache coalescing. All CI green.


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).



<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:203">
P2: Singleflight request context inherits only the winner caller’s deadline, so concurrent waiters can fail early with a timeout unrelated to their own context deadlines.</violation>
</file>


Replace context.WithoutCancel + inherited deadline with a fixed 30s
timeout from context.Background(). This ensures all singleflight
waiters get uniform timeout behavior regardless of which goroutine
won the race, preventing one caller's short deadline from failing
the shared RPC.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves
Contributor Author

Addressed the P2 identified by cubic — replaced context.WithoutCancel + inherited deadline with a fixed 30s timeout from context.Background() inside both singleflight callbacks. This ensures all waiters get uniform timeout behavior regardless of which goroutine won the race.

Fixed in d6dd003.

@@ -0,0 +1,233 @@
//
// Copyright 2024-2026 The Chainloop Authors.
Member


2026 :)

// defaultMaxConcurrency is the default number of concurrent policy evaluations.
// Set higher than NumCPU because policy evaluation is I/O-mixed (gRPC calls,
// file loads) rather than purely CPU-bound. Can be overridden via WithMaxConcurrency.
var defaultMaxConcurrency = max(runtime.NumCPU()*2, 10)
Member


I will reduce this to a smaller limit because Chainloop attestations can operate in remote state mode, which means different policy evaluations will contend to store the result using optimistic locking. So, let's ensure it’s smaller. cc/ @jiparis any thoughts?

Member

@jiparis jiparis Mar 23, 2026


In that scenario evaluations will happen all at once and then update the remote state with the aggregated result, but I agree, let's reduce it to something more conservative for now to test the behaviour, something like max(runtime.NumCPU(), 5). We also don't want to hit hard on the AI-enabled policies.

if err != nil {
return nil, NewPolicyError(err)
}
// Load material content once for all policies in this group
Member


👍


Development

Successfully merging this pull request may close these issues.

feat: parallelize policy evaluations
