My remote queue service doesn’t do much ... it just watches a folder for new notebook jobs and runs them via Papermill with injected parameters, the main one being the run name. Notebook-specific parameters are simply stored and reported in the database. But the simplicity is the win. No Flask, no Celery, no Redis. Just one Python process per node. Each job gets a UUID, a full log, and a clean artifact path. That’s enough to train thousands of models across GPUs without manual babysitting. [[ML]] [[Serendipity]]
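
Roughly, the core loop is something like the sketch below. The folder names, the `run_name` parameter key, and the claim-then-run move are illustrative assumptions, not the actual service code; the only real dependency is Papermill itself.

```python
import shutil
import time
import uuid
from pathlib import Path

import papermill as pm

# Illustrative layout -- a watched inbox plus a per-run artifact tree.
JOBS_DIR = Path("jobs")    # drop .ipynb files here to queue them
RUNS_DIR = Path("runs")    # one subfolder per run, named by its UUID


def run_forever(poll_seconds: float = 5.0) -> None:
    RUNS_DIR.mkdir(exist_ok=True)
    while True:
        for job in sorted(JOBS_DIR.glob("*.ipynb")):
            run_id = uuid.uuid4().hex
            run_dir = RUNS_DIR / run_id
            run_dir.mkdir()

            # Claim the job by moving it out of the inbox first,
            # so a crash mid-run can't cause it to be picked up twice.
            claimed = run_dir / job.name
            shutil.move(str(job), claimed)

            # Papermill injects the parameters dict into the notebook's
            # "parameters"-tagged cell; the executed copy it writes out
            # doubles as the full log, since every cell's output is kept.
            pm.execute_notebook(
                str(claimed),
                str(run_dir / f"output-{job.stem}.ipynb"),
                parameters={"run_name": run_id},  # assumed parameter name
            )
        time.sleep(poll_seconds)


if __name__ == "__main__":
    run_forever()
```

One copy of this per node is the whole "queue": the filesystem is the broker, and the UUID-named run folder gives each job its clean artifact path for free.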