
feat(policies): parallelize policy evaluations with bounded concurrency#2901

Open
waveywaves wants to merge 7 commits into chainloop-dev:main from waveywaves:feat/parallelize-policy-evaluations

Conversation

Contributor

@waveywaves waveywaves commented Mar 20, 2026

Summary

  • Parallelize policy attachment evaluations within VerifyStatement() and VerifyMaterial() using errgroup with bounded concurrency (max(NumCPU*2, 10))
  • Add WithMaxConcurrency(n) option to PolicyVerifier for configurable parallelism
  • Fix pre-existing concurrency safety issues:
    • Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level (the caches are package-level maps but were protected by per-instance mutexes — a latent data race)
    • Use singleflight.Group to coalesce concurrent gRPC fetches for the same policy/group ref — ensures exactly one fetch per key, no duplicate RPCs or head-of-line blocking
    • Use context.WithoutCancel inside singleflight callbacks to prevent the winning goroutine's cancellation from propagating to waiters
    • Move extism.SetLogLevel() to sync.Once (it modifies global state; first engine's log level wins)
  • Marshal statement JSON once per evaluation batch instead of per-policy
  • Add 5 race detector tests:
    • Concurrent VerifyStatement from 10 goroutines
    • Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
    • WithMaxConcurrency option validation
    • Sequential (maxConcurrency=1) vs parallel result equivalence
    • errgroup cancellation on policy load failure

All existing + new tests pass with -race enabled. All CI checks green.

Fixes #2899

Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU().
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Move extism.SetLogLevel() from per-Verify() call to engine
  construction to avoid racing on global state

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves waveywaves marked this pull request as draft March 20, 2026 17:23

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:181">
P2: Global cache mutex is held during remote GetPolicy I/O, serializing concurrent policy loads and causing avoidable head-of-line blocking.</violation>
</file>

<file name="pkg/policies/group_loader.go">

<violation number="1" location="pkg/policies/group_loader.go:122">
P1: Global cache mutex is held across remote RPC in `Load`, causing head-of-line blocking and serializing concurrent policy loads.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU().
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use check-lock-check pattern in cache loaders to avoid holding the
  mutex during gRPC calls, enabling true parallel cache-miss fetches
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10).
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same
  policy/group ref, ensuring exactly one fetch per key
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Adds race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10).
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same
  policy/group ref, ensuring exactly one fetch per key
- Use context.WithoutCancel inside singleflight callbacks to prevent
  the winning goroutine's cancellation from propagating to waiters
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Adds 5 race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves waveywaves marked this pull request as ready for review March 20, 2026 18:47
@waveywaves waveywaves marked this pull request as draft March 20, 2026 18:48
@waveywaves
Contributor Author

Both issues identified by cubic are resolved:

  1. loader.go (P2) and group_loader.go (P1) — mutex held during gRPC: Replaced defer mutex.Unlock() with singleflight.Group to coalesce concurrent fetches. The mutex is now only held for cache reads/writes (microseconds), never during I/O. context.WithoutCancel is used inside singleflight callbacks to prevent cross-goroutine cancellation propagation.

@waveywaves
Contributor Author

@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.


cubic-dev-ai bot commented Mar 20, 2026

@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.

@waveywaves I have started the AI code review. It will take a few minutes to complete.


@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 7 files

@waveywaves waveywaves marked this pull request as ready for review March 20, 2026 19:27

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 7 files



<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:200">
P2: Using context.WithoutCancel here removes the caller’s deadline/cancellation, so the GetPolicy RPC can outlive canceled/timed-out verification requests and continue running indefinitely under outage conditions.</violation>
</file>


context.WithoutCancel detaches cancellation but also drops the deadline.
Re-apply the original deadline so gRPC calls inside singleflight still
have a bounded lifetime under server outage conditions.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves
Contributor Author

Addressed the P2 finding identified by cubic — context.WithoutCancel now re-applies the original deadline via context.WithDeadline so gRPC calls inside singleflight still have a bounded lifetime. The cancel func is properly deferred.

Fixed in 6846021.

@waveywaves
Contributor Author

@jiparis this is ready for review when you get a chance — parallelizes policy evaluations with errgroup + singleflight for cache coalescing. All CI green.


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).



<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:203">
P2: Singleflight request context inherits only the winner caller’s deadline, so concurrent waiters can fail early with a timeout unrelated to their own context deadlines.</violation>
</file>


Replace context.WithoutCancel + inherited deadline with a fixed 30s
timeout from context.Background(). This ensures all singleflight
waiters get uniform timeout behavior regardless of which goroutine
won the race, preventing one caller's short deadline from failing
the shared RPC.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves
Contributor Author

Addressed the P2 identified by cubic — replaced context.WithoutCancel + inherited deadline with a fixed 30s timeout from context.Background() inside both singleflight callbacks. This ensures all waiters get uniform timeout behavior regardless of which goroutine won the race.

Fixed in d6dd003.

@@ -0,0 +1,233 @@
//
// Copyright 2024-2026 The Chainloop Authors.
Member


2026 :)

// defaultMaxConcurrency is the default number of concurrent policy evaluations.
// Set higher than NumCPU because policy evaluation is I/O-mixed (gRPC calls,
// file loads) rather than purely CPU-bound. Can be overridden via WithMaxConcurrency.
var defaultMaxConcurrency = max(runtime.NumCPU()*2, 10)
Member


I will reduce this to a smaller limit because Chainloop attestations can operate in remote state mode, which means different policy evaluations will contend to store the result using optimistic locking. So, let's ensure it’s smaller. cc/ @jiparis any thoughts?

Member

@jiparis jiparis Mar 23, 2026


In that scenario evaluations will happen all at once and then update the remote state with the aggregated result, but I agree, let's reduce it to something more conservative for now to test the behaviour, something like max(runtime.NumCPU(), 5). We also don't want to hit hard on the AI-enabled policies.

if err != nil {
return nil, NewPolicyError(err)
}
// Load material content once for all policies in this group
Member


👍


Development

Successfully merging this pull request may close these issues.

feat: parallelize policy evaluations
