Bug Report
Description
When using populate(reserve_jobs=True) with multiple workers (e.g., SLURM array jobs), multiple workers can successfully "reserve" the same key and call make() simultaneously.
I noticed this when submitting multiple SLURM jobs that populate the same table: their output logs confirmed that 6 of 7 jobs were executing make() on the same key. The job reservation system has always been a valuable DataJoint feature for distributed computing, preventing redundant computation and collisions.
Current workaround
I think this is currently a rare occurrence, probably made somewhat more common by the fact that I access my server over a network (it is not hosted locally) and submit sbatch arrays. My workaround has been to add a random sleep (0-30 s) before populate() in the SLURM script to stagger worker start times. With that in place I have not seen the duplicate make() calls.
I'm not very familiar with how DataJoint manages job reservations, but below is what Claude suggested might be the issue. I'm adding it to this post in case it is useful.
Job.reserve() in DataJoint 2.1 uses a non-atomic SELECT-then-UPDATE pattern that allows multiple workers to reserve the same job simultaneously. This is a regression from 0.13.x, where a plain INSERT either succeeded or failed with a duplicate-key error, making the reservation inherently atomic.
Root Cause
Job.reserve() (jobs.py:430-473) performs a check-then-act without atomicity:
```python
def reserve(self, key: dict) -> bool:
    # Step 1: SELECT — check if job is pending
    job = (self & key & "status='pending'"
           & "scheduled_time <= CURRENT_TIMESTAMP(3)").to_dicts()
    if not job:
        return False
    # Step 2: UPDATE — mark as reserved
    pk = self._get_pk(key)
    update_row = {**pk, "status": "reserved", ...}
    try:
        self.update1(update_row)  # UPDATE ... SET status='reserved' WHERE <pk>
        return True
    except Exception:
        return False
```
The UPDATE's WHERE clause matches only on primary key, not on status='pending'. So if two workers both read the row as 'pending' before either updates, both UPDATEs succeed — the second simply overwrites the first worker's reservation.
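The lost-update interleaving can be reproduced deterministically. The sketch below uses sqlite3 as a stand-in for MySQL; the table and column names (`jobs`, `key_hash`) are simplified assumptions, not DataJoint's actual schema. Both workers run their SELECT check before either runs its UPDATE, and both "win":

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (key_hash TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO jobs VALUES ('abc123', 'pending')")

def check_pending():
    # Step 1 of reserve(): SELECT — is the job still pending?
    return db.execute(
        "SELECT 1 FROM jobs WHERE key_hash='abc123' AND status='pending'"
    ).fetchone() is not None

def mark_reserved():
    # Step 2 of reserve(): UPDATE matches on primary key only, so it
    # succeeds regardless of the current status
    db.execute("UPDATE jobs SET status='reserved' WHERE key_hash='abc123'")
    return True

# Interleaving seen with concurrent workers: both SELECTs complete
# before either UPDATE runs, so the check passes for both.
a_saw_pending = check_pending()                 # worker A: True
b_saw_pending = check_pending()                 # worker B: True, window still open
a_reserved = a_saw_pending and mark_reserved()  # True
b_reserved = b_saw_pending and mark_reserved()  # True, silently overwrites A
print(a_reserved, b_reserved)  # True True
```

Run sequentially this always reproduces the window; in production the same interleaving only happens when two workers hit the row within milliseconds of each other, which is why staggered start times hide the bug.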
Comparison with 0.13.x
The old JobTable.reserve() (0.13.x) used an atomic INSERT pattern:
```python
def reserve(self, table_name, key):
    job = dict(key, table_name=table_name, status='reserved',
               host=platform.node(), pid=os.getpid(), ...)
    try:
        self.insert1(job)  # INSERT — fails with DuplicateError if row exists
    except DuplicateError:
        return False  # another worker already has this key
    return True
```
This is inherently atomic: the first INSERT wins, all others get DuplicateError. No window exists between check and action.
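The first-INSERT-wins behavior is easy to verify. A minimal sketch, again using sqlite3 in place of MySQL (sqlite3 raises IntegrityError where DataJoint's MySQL driver raises DuplicateError; the schema is a simplified assumption):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (table_name TEXT, key_hash TEXT, status TEXT, "
           "PRIMARY KEY (table_name, key_hash))")

def reserve(worker):
    # A single INSERT: the primary-key constraint makes "first wins" atomic,
    # with no gap between checking for the row and creating it.
    try:
        db.execute("INSERT INTO jobs VALUES ('my_table', 'abc123', 'reserved')")
    except sqlite3.IntegrityError:  # duplicate key: another worker got there first
        return False
    return True

results = [reserve(w) for w in ("A", "B")]
print(results)  # [True, False]
```

The database's uniqueness constraint does the arbitration, so no interleaving of client-side steps can produce two winners.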
Suggested Fix
Option A — Add a WHERE clause to the UPDATE:
```sql
UPDATE jobs SET status='reserved', ...
WHERE table_name=... AND key_hash=... AND status='pending'
```
Then check affected_rows == 1 to determine success. This is a single atomic operation.
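A sketch of Option A, with sqlite3 standing in for MySQL and a simplified assumed schema: moving status='pending' into the WHERE clause collapses check and act into one statement, and the affected-row count tells each worker whether it won.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (key_hash TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO jobs VALUES ('abc123', 'pending')")

def reserve_atomic():
    # Check and act in one statement: the UPDATE only matches while the
    # row is still pending, and row locking serializes concurrent UPDATEs.
    cur = db.execute(
        "UPDATE jobs SET status='reserved' "
        "WHERE key_hash='abc123' AND status='pending'"
    )
    return cur.rowcount == 1  # only the first worker's UPDATE matches a row

results = [reserve_atomic() for _ in range(2)]
print(results)  # [True, False]
```

Even if two UPDATEs arrive concurrently, the database executes them one at a time against the row; the second finds status='reserved', matches zero rows, and reports failure via its row count.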
Option B — Use SELECT ... FOR UPDATE before the check:
```sql
SELECT * FROM jobs WHERE ... AND status='pending' FOR UPDATE
```
This acquires a row-level lock inside the transaction, so a second worker's locking read blocks until the first commits, and then no longer sees the row as pending.
Option C — Restore the INSERT-based approach from 0.13.x, which was atomic by design.
Environment
- datajoint 2.1.0
- Python 3.12
- MySQL 8.0
- SLURM cluster, 7 concurrent array tasks calling populate(reserve_jobs=True)