Conversation
coderd/externalauth/externalauth.go
Outdated
| } | ||
| link, err := c.refreshTokenInner(ctx, db, freshLink) | ||
| return result{link, err}, nil | ||
| }) |
There was a problem hiding this comment.
Should we just do this in a transaction holding a lock in the database?
As I see it now, 1 call would likely succeed, and some N calls in parallel would fail. I'm not sure how we handle these errors at all callsites.
Ideally he N that would fail would just be stuck in a short waiting loop for the first one to succeed.
I am thinking similar to database.ReadModifyUpdate, but using a lock to prevent the N concurrent from even starting until the first is complete
There was a problem hiding this comment.
I investigated this but it's a mistake because it locks the DB connection, so if like 10 requests came in at once to a coderd it'd lock all database conns.
There was a problem hiding this comment.
The singleflight here serves exactly that purpose — N concurrent callers block on the first one, and when it completes they all get the same result. The callback also re-reads the link from DB (line 168) so if a prior singleflight call already refreshed the token, the new group sees the fresh token and skips the IDP round-trip.
As Kyle noted, a DB-level lock (SELECT ... FOR UPDATE / advisory lock) would hold a connection for the duration of the external HTTP call to the IDP, which could exhaust the pool under load. The singleflight + optimistic lock combo gives us the same correctness guarantees without tying up DB resources.
There was a problem hiding this comment.
Singleflight won't work across HA, where a db lock would. We have TryAcquireLock for this reason. It still would briefly consume a conn, and backoff is probably some arbitrary duration, rather than a signal based unblock like SingleFlight.
There was a problem hiding this comment.
This solution works though because the racing refresh would never be hit right? Because it would get the other one anyways because of how we're comparing the refresh token?
There was a problem hiding this comment.
I don't believe a singleflight is even necessary in fact, our DB query being adjusted should be enough but let me check.
There was a problem hiding this comment.
Fixing this to remove the singleflight. The DB fix is indeed enough.
There was a problem hiding this comment.
It is
TokenSource().Token<- does the oauth refresh- Failed refresh
UpdateExternalAuthLinkRefreshToken
- Failed refresh
ValidateTokenon retry loop, up to 200msUpdateExternalAuthLink
So this race is ok from a db state perspective. Coder B still has a failed refresh, which could block whatever action is was needing the token for right?
Time │ Coderd A │ Coderd B │ DB State
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
t0 │ │ │ refresh_token=R1, access_token=T1
│ │ │ (expired)
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
t1 │ reads link from DB │ reads link from DB │
│ (sees R1, T1) │ (sees R1, T1) │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
t2 │ TokenSource().Token() │ │
│ sends R1 to GitHub │ │
│ gets back T2, R2 │ │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
t3 │ │ TokenSource().Token() │
│ │ sends R1 to GitHub │
│ │ ❌ FAILS: R1 already consumed │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
t4 │ ValidateToken(T2) ✅ │ isFailedRefresh() == true │
│ │ UpdateExternalAuthLinkRefreshToken │
│ │ sets refresh_token="", reason=err │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
t5 │ UpdateExternalAuthLink │ │
│ writes T2, R2, err = '' │ │
─────┼────────────────────────────────────┼────────────────────────────────────┼──────────────────────────────────
This is the worse ordering, but your change to UpdateExternalAuthLinkRefreshToken fixes it
Time │ Coderd A │ Coderd B │ DB State
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t0 │ │ │ access=T1, refresh=R1
│ │ │ (expired)
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t1 │ reads link (R1, T1) │ reads link (R1, T1) │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t2 │ TokenSource(R1) → T2, R2 ✅ │ │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t3 │ │ TokenSource(R1) → ❌ R1 consumed │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t4 │ ValidateToken(T2) ✅ │ │
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t5 │ UpdateExternalAuthLink │ │ access=T2, refresh=R2
│ writes T2, R2 │ │ reason="" ✅
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
t6 │ │ UpdateRefreshToken(old=R1) │
│ │ WHERE refresh=R1... BUT DB has R2 │
│ │ 🛡️ 0 rows matched — no-op! │ UNCHANGED ✅
─────┼──────────────────────────────┼───────────────────────────────────┼─────────────────────────
There was a problem hiding this comment.
I could probably try and brainstorm this more. But I think your change to UpdateRefreshToken keeps the database state consistent. So that is an improvement.
The singleflight fixes the first one within the same coderd. Not across Coderds.
I am ok with that if we add a comment saying as much. To signal we are aware of the HA issue, and have not solved it yet
…istic locking
Add optimistic locking to UpdateExternalAuthLinkRefreshToken to prevent
concurrent workspace builds from consuming the same single-use OAuth2
refresh token, which permanently breaks auth links.
The SQL WHERE clause now includes the old refresh token value, so only
the first concurrent caller succeeds. Losers get a zero-row update,
which callers treat as InvalidTokenError ('skip this provider') — builds
continue without permanent damage.
The dbcrypt wrapper handles the non-deterministic AES-GCM encryption by
reading the raw encrypted row, decrypting it, and verifying against the
caller's plaintext before substituting the actual ciphertext for the
SQL WHERE clause comparison.
b0d235d to
94a4ece
Compare
| // The SQL query uses an optimistic lock: | ||
| // WHERE oauth_refresh_token = @old_oauth_refresh_token | ||
| // The caller supplies the plaintext old token (since dbcrypt | ||
| // decrypts on read), but the DB stores the encrypted value. | ||
| // Because AES-GCM is non-deterministic, we cannot simply | ||
| // re-encrypt the old token — the ciphertext would differ. | ||
| // Instead, read the current row from the inner (raw) store | ||
| // and use the actual encrypted value for the WHERE clause. | ||
| if params.OldOauthRefreshToken != "" && db.ciphers != nil && db.primaryCipherDigest != "" { | ||
| raw, err := db.Store.GetExternalAuthLink(ctx, database.GetExternalAuthLinkParams{ | ||
| ProviderID: params.ProviderID, | ||
| UserID: params.UserID, | ||
| }) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| // Decrypt the stored token so we can compare with the | ||
| // caller-supplied plaintext. | ||
| decrypted := raw.OAuthRefreshToken | ||
| if err := db.decryptField(&decrypted, raw.OAuthRefreshTokenKeyID); err != nil { | ||
| return err | ||
| } | ||
| if decrypted != params.OldOauthRefreshToken { | ||
| // The token has changed since the caller read it; | ||
| // the optimistic lock should fail (no rows updated). | ||
| // Return nil to match the :exec semantics of the SQL | ||
| // query, which silently updates zero rows. | ||
| return nil | ||
| } | ||
| // Use the raw encrypted value so the WHERE clause matches. | ||
| params.OldOauthRefreshToken = raw.OAuthRefreshToken | ||
| } | ||
|
|
Emyrk
left a comment
There was a problem hiding this comment.
Approving. This solution does protect the DB from some races
Problem
When multiple concurrent callers (e.g., parallel workspace builds) read the same single-use OAuth2 refresh token from the database and race to exchange it with the provider, the first caller succeeds but subsequent callers get
bad_refresh_token. The losing caller then clears the valid new token from the database, permanently breaking the auth link until the user manually re-authenticates.This is reliably reproducible when launching multiple workspaces simultaneously with GitHub App external auth and user-to-server token expiration enabled.
Solution
Two layers of protection:
1. Singleflight deduplication (
Config.RefreshToken+ObtainOIDCAccessToken)Concurrent callers for the same user/provider share a single refresh call via
golang.org/x/sync/singleflight, keyed byuserID. The singleflight callback re-reads the link from the database to pick up any token already refreshed by a prior in-flight call, avoiding redundant IDP round-trips entirely.2. Optimistic locking on
UpdateExternalAuthLinkRefreshTokenThe SQL
WHEREclause now includesAND oauth_refresh_token = @old_oauth_refresh_token, so if two replicas (HA) race past singleflight, the loser's destructive UPDATE is a harmless no-op rather than overwriting the winner's valid token.Changes
coderd/externalauth/externalauth.gosingleflight.GrouptoConfig; splitRefreshTokeninto public wrapper +refreshTokenInner; passOldOauthRefreshTokento DB updatecoderd/provisionerdserver/provisionerdserver.goObtainOIDCAccessTokenwith package-level singleflightcoderd/database/queries/externalauth.sqlWHERE ... AND oauth_refresh_token = @old_oauth_refresh_token)coderd/database/queries.sql.gocoderd/database/querier.gocoderd/database/dbauthz/dbauthz_test.gocoderd/externalauth/externalauth_test.goConcurrentRefreshDeduptest; updated existing tests for singleflight DB re-readTesting
ConcurrentRefreshDedup: 5 goroutines callRefreshTokenconcurrently, asserts IDP refresh called exactly once, all callers get same token.TestRefreshToken/*subtests updated and passing.TestObtainOIDCAccessTokenpassing.dbauthztests passing.