I spent a lot of time today streamlining my workflow around distributed compute. Our environment consists of multiple remote training servers, while my day-to-day machine is a local MacBook or Mac mini, which has nowhere near enough compute for training. My old way of working was to rsync (or git commit → push → pull via actions) and then kick things off remotely via ssh. My new way of working is a remote 'queue': a Python script running on each training server that watches for new notebooks to run. [Papermill](https://github.com/nteract/papermill) gives us headless execution of a notebook with parameters. Now, whenever I copy a working notebook from local to remote using the [Jupyter plugin](https://plugins.jetbrains.com/plugin/22814-jupyter) for IntelliJ, it gets picked up automatically and added to the training queue. [[ML]] [[Serendipity]]
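
A minimal sketch of what such a queue script could look like. The directory names, polling interval, and done-folder convention are my assumptions for illustration, not the exact script; the papermill call itself (`execute_notebook`) is the library's real entry point.

```python
import time
from pathlib import Path

# Hypothetical layout: notebooks dropped into WATCH_DIR get executed,
# and the executed copy lands in DONE_DIR (which doubles as a "seen" set).
WATCH_DIR = Path("~/queue/incoming").expanduser()
DONE_DIR = Path("~/queue/done").expanduser()


def pending_notebooks(watch_dir: Path, done_dir: Path) -> list[Path]:
    """Return notebooks in watch_dir that have no executed copy in done_dir."""
    done = {p.name for p in done_dir.glob("*.ipynb")}
    return sorted(p for p in watch_dir.glob("*.ipynb") if p.name not in done)


def run_notebook(nb_path: Path, done_dir: Path) -> Path:
    import papermill as pm  # headless, parameterised notebook execution

    out = done_dir / nb_path.name
    pm.execute_notebook(str(nb_path), str(out), parameters={})
    return out


def watch(poll_seconds: int = 10) -> None:
    """Poll forever, executing any newly copied notebook in arrival order."""
    WATCH_DIR.mkdir(parents=True, exist_ok=True)
    DONE_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        for nb in pending_notebooks(WATCH_DIR, DONE_DIR):
            print(f"running {nb.name}")
            run_notebook(nb, DONE_DIR)
        time.sleep(poll_seconds)


# watch()  # start the loop on the training server
```

Using the done directory as the record of what has run keeps the script stateless: restarting it (or the server) never re-executes a finished notebook, and deleting a file from `done` is an easy way to requeue it.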