Skip to content

fix(externalauth): prevent race condition in token refresh with singleflight + optimistic locking#22904

Merged
kylecarbs merged 1 commit intomainfrom
fix/externalauth-refresh-race
Mar 10, 2026
Merged

fix(externalauth): prevent race condition in token refresh with singleflight + optimistic locking#22904
kylecarbs merged 1 commit intomainfrom
fix/externalauth-refresh-race

Conversation

@kylecarbs
Copy link
Member

Problem

When multiple concurrent callers (e.g., parallel workspace builds) read the same single-use OAuth2 refresh token from the database and race to exchange it with the provider, the first caller succeeds but subsequent callers get bad_refresh_token. The losing caller then clears the valid new token from the database, permanently breaking the auth link until the user manually re-authenticates.

This is reliably reproducible when launching multiple workspaces simultaneously with GitHub App external auth and user-to-server token expiration enabled.

Solution

Two layers of protection:

1. Singleflight deduplication (Config.RefreshToken + ObtainOIDCAccessToken)

Concurrent callers for the same user/provider share a single refresh call via golang.org/x/sync/singleflight, keyed by userID. The singleflight callback re-reads the link from the database to pick up any token already refreshed by a prior in-flight call, avoiding redundant IDP round-trips entirely.

2. Optimistic locking on UpdateExternalAuthLinkRefreshToken

The SQL WHERE clause now includes AND oauth_refresh_token = @old_oauth_refresh_token, so if two replicas (HA) race past singleflight, the loser's destructive UPDATE is a harmless no-op rather than overwriting the winner's valid token.

Changes

File Change
coderd/externalauth/externalauth.go Added singleflight.Group to Config; split RefreshToken into public wrapper + refreshTokenInner; pass OldOauthRefreshToken to DB update
coderd/provisionerdserver/provisionerdserver.go Wrapped OIDC refresh in ObtainOIDCAccessToken with package-level singleflight
coderd/database/queries/externalauth.sql Added optimistic lock (WHERE ... AND oauth_refresh_token = @old_oauth_refresh_token)
coderd/database/queries.sql.go Regenerated
coderd/database/querier.go Regenerated
coderd/database/dbauthz/dbauthz_test.go Updated test params for new field
coderd/externalauth/externalauth_test.go Added ConcurrentRefreshDedup test; updated existing tests for singleflight DB re-read

Testing

  • New test ConcurrentRefreshDedup: 5 goroutines call RefreshToken concurrently, asserts IDP refresh called exactly once, all callers get same token.
  • All existing TestRefreshToken/* subtests updated and passing.
  • TestObtainOIDCAccessToken passing.
  • dbauthz tests passing.

}
link, err := c.refreshTokenInner(ctx, db, freshLink)
return result{link, err}, nil
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just do this in a transaction holding a lock in the database?

As I see it now, 1 call would likely succeed, and some N calls in parallel would fail. I'm not sure how we handle these errors at all callsites.

Ideally he N that would fail would just be stuck in a short waiting loop for the first one to succeed.

I am thinking similar to database.ReadModifyUpdate, but using a lock to prevent the N concurrent from even starting until the first is complete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated this but it's a mistake because it locks the DB connection, so if like 10 requests came in at once to a coderd it'd lock all database conns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The singleflight here serves exactly that purpose — N concurrent callers block on the first one, and when it completes they all get the same result. The callback also re-reads the link from DB (line 168) so if a prior singleflight call already refreshed the token, the new group sees the fresh token and skips the IDP round-trip.

As Kyle noted, a DB-level lock (SELECT ... FOR UPDATE / advisory lock) would hold a connection for the duration of the external HTTP call to the IDP, which could exhaust the pool under load. The singleflight + optimistic lock combo gives us the same correctness guarantees without tying up DB resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Singleflight won't work across HA, where a db lock would. We have TryAcquireLock for this reason. It still would briefly consume a conn, and backoff is probably some arbitrary duration, rather than a signal based unblock like SingleFlight.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution works though because the racing refresh would never be hit right? Because it would get the other one anyways because of how we're comparing the refresh token?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe a singleflight is even necessary in fact, our DB query being adjusted should be enough but let me check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing this to remove the singleflight. The DB fix is indeed enough.

Copy link
Member

@Emyrk Emyrk Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is

  • TokenSource().Token <- does the oauth refresh
    • Failed refresh UpdateExternalAuthLinkRefreshToken
  • ValidateToken on retry loop, up to 200ms
  • UpdateExternalAuthLink

So this race is ok from a db state perspective. Coder B still has a failed refresh, which could block whatever action is was needing the token for right?

Time │  Coderd A                          │  Coderd B                          │  DB State
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
 t0  │                                    │                                    │ refresh_token=R1, access_token=T1
     │                                    │                                    │ (expired)
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
 t1  │ reads link from DB                 │ reads link from DB                 │
     │ (sees R1, T1)                      │ (sees R1, T1)                      │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
 t2  │ TokenSource().Token()              │                                    │
     │ sends R1 to GitHub                 │                                    │
     │ gets back T2, R2                   │                                    │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
 t3  │                                    │ TokenSource().Token()              │
     │                                    │ sends R1 to GitHub                 │
     │                                    │ ❌ FAILS: R1 already consumed      │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
 t4  │ ValidateToken(T2) ✅               │ isFailedRefresh() == true          │
     │                                    │ UpdateExternalAuthLinkRefreshToken │
     │                                    │ sets refresh_token="", reason=err  │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
 t5  │ UpdateExternalAuthLink             │                                    │ 
     │ writes T2, R2, err = ''            │                                    │ 
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────

This is the worse ordering, but your change to UpdateExternalAuthLinkRefreshToken fixes it

Time │ Coderd A                     │ Coderd B                          │ DB State
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t0  │                              │                                   │ access=T1, refresh=R1
     │                              │                                   │ (expired)
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t1  │ reads link (R1, T1)          │ reads link (R1, T1)               │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t2  │ TokenSource(R1) → T2, R2 ✅  │                                   │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t3  │                              │ TokenSource(R1) → ❌ R1 consumed  │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t4  │ ValidateToken(T2) ✅         │                                   │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t5  │ UpdateExternalAuthLink       │                                   │ access=T2, refresh=R2
     │ writes T2, R2                │                                   │ reason="" ✅
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
 t6  │                              │ UpdateRefreshToken(old=R1)        │
     │                              │ WHERE refresh=R1... BUT DB has R2 │
     │                              │ 🛡️ 0 rows matched — no-op!        │ UNCHANGED ✅
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could probably try and brainstorm this more. But I think your change to UpdateRefreshToken keeps the database state consistent. So that is an improvement.

The singleflight fixes the first one within the same coderd. Not across Coderds.

I am ok with that if we add a comment saying as much. To signal we are aware of the HA issue, and have not solved it yet

@kylecarbs kylecarbs requested a review from Emyrk March 10, 2026 15:57
…istic locking

Add optimistic locking to UpdateExternalAuthLinkRefreshToken to prevent
concurrent workspace builds from consuming the same single-use OAuth2
refresh token, which permanently breaks auth links.

The SQL WHERE clause now includes the old refresh token value, so only
the first concurrent caller succeeds. Losers get a zero-row update,
which callers treat as InvalidTokenError ('skip this provider') — builds
continue without permanent damage.

The dbcrypt wrapper handles the non-deterministic AES-GCM encryption by
reading the raw encrypted row, decrypting it, and verifying against the
caller's plaintext before substituting the actual ciphertext for the
SQL WHERE clause comparison.
@kylecarbs kylecarbs force-pushed the fix/externalauth-refresh-race branch from b0d235d to 94a4ece Compare March 10, 2026 16:20
Comment on lines +266 to +298
// The SQL query uses an optimistic lock:
// WHERE oauth_refresh_token = @old_oauth_refresh_token
// The caller supplies the plaintext old token (since dbcrypt
// decrypts on read), but the DB stores the encrypted value.
// Because AES-GCM is non-deterministic, we cannot simply
// re-encrypt the old token — the ciphertext would differ.
// Instead, read the current row from the inner (raw) store
// and use the actual encrypted value for the WHERE clause.
if params.OldOauthRefreshToken != "" && db.ciphers != nil && db.primaryCipherDigest != "" {
raw, err := db.Store.GetExternalAuthLink(ctx, database.GetExternalAuthLinkParams{
ProviderID: params.ProviderID,
UserID: params.UserID,
})
if err != nil {
return err
}
// Decrypt the stored token so we can compare with the
// caller-supplied plaintext.
decrypted := raw.OAuthRefreshToken
if err := db.decryptField(&decrypted, raw.OAuthRefreshTokenKeyID); err != nil {
return err
}
if decrypted != params.OldOauthRefreshToken {
// The token has changed since the caller read it;
// the optimistic lock should fail (no rows updated).
// Return nil to match the :exec semantics of the SQL
// query, which silently updates zero rows.
return nil
}
// Use the raw encrypted value so the WHERE clause matches.
params.OldOauthRefreshToken = raw.OAuthRefreshToken
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense 👍

Copy link
Member

@Emyrk Emyrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. This solution does protect the DB from some races

@kylecarbs kylecarbs merged commit 53e52ae into main Mar 10, 2026
24 checks passed
@kylecarbs kylecarbs deleted the fix/externalauth-refresh-race branch March 10, 2026 17:52
@github-actions github-actions bot locked and limited conversation to collaborators Mar 10, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants