The Jupyter notebook is a powerful tool for interacting with data. It lets you visualize what is happening while developing algorithms that process the data.
At the same time, the Jupyter notebook is a way of sharing a computational narrative: that is, slowly building up pieces of an algorithm and explaining how they work on data.
However, this view skips the entire process in between: the messy, iterative work that turns interactive exploration into a finished narrative. Here Jupyter is limited in two key ways: it doesn't provide a good architecture for farming out parallel jobs when pickle fails, and its version control is not well adapted to the domain of algorithms research.
One key feature of the notebook is that it encourages keeping a lot of information as data structures in memory rather than as files on disk. So you end up with data structures that aren't serialized anywhere, and that perhaps can't be regenerated because the code cells leading to them have been deleted. You also have functions that exist only in the memory of the notebook, and perhaps as source code somewhere inside a cell.
This makes it hard to transmit these data structures to other machines as jobs. Objects in memory that are executable (like functions) or that rely on interpreter state (like generators) are not guaranteed to be portable, and in many cases can't be sent across machines. Details of the Python pickle implementation impose further limitations.
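To make this concrete, here is a small sketch in plain Python (independent of any framework) of two common cases where pickle refuses to serialize an in-memory object:

```python
import pickle

square = lambda x: x * x          # a function object defined interactively
counter = (i for i in range(10))  # a generator, which carries interpreter state

for obj in (square, counter):
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError) as exc:
        print(f"cannot pickle {obj!r}: {exc}")
```

Even a named function defined in a notebook is pickled only by reference to __main__, so it cannot be unpickled on a machine that doesn't have the same source available.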
A notebook in the process of being written may be filled with one-time code to visualize a particular piece of data that will never be needed again. It may also be highly nonlinear: executing the notebook from the top down, starting from a clean slate, may not even run to completion. Sometimes code may be changed in place, obscuring old versions.
Jupyter provides checkpointing as a backup against crashes, but this covers only one use of version control. If a piece of code is deleted and a researcher wants to remember what it did months down the line, such a simple versioning setup offers little support.
Similarly, research often involves considering many different code variations in parallel. Whereas software engineers working in well-understood domains can keep a linear revision history that slowly adds functionality, it is natural for research to have many branching paths. Most of these will be swiftly abandoned, but some need to be dug up at a much later date.
These are symptoms of the same problem: storing code and data in memory instead of on disk. Even if the full history leading to the current state is theoretically available (e.g. in IPython history logs), it is not easy to use.
A well-known solution to this problem is to make sure that jobs are submitted in reproducible source-code form, and to store the code of every job ever submitted. This solves both portability across machines and the versioning of any code that may be of interest. Frameworks that do this typically rely on a combination of existing version control systems, shell scripts, and archive files.
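As a rough sketch of that pattern (not this framework's implementation), a submission step might snapshot the job's source, along with the current repository revision, before handing the job off:

```python
import subprocess
import time
from pathlib import Path

def snapshot_job(source: str, archive_root: str = "job_archive") -> Path:
    """Write a job's source code (plus the current git revision, if any)
    to a timestamped directory, so the exact code of every submitted job
    survives independently of notebook state."""
    job_dir = Path(archive_root) / time.strftime("%Y%m%d-%H%M%S")
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "job.py").write_text(source)
    try:
        rev = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
        (job_dir / "GIT_REVISION").write_text(rev + "\n")
    except (subprocess.CalledProcessError, FileNotFoundError):
        pass  # not in a git repository; the source snapshot still stands on its own
    return job_dir
```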
What makes it hard to adapt the notebook to this kind of workflow out of the box is that a notebook might contain a mix of algorithmic and exploratory code. The algorithmic code is the key to running reproducible jobs, but the exploratory code is there to help the author make sense of the results and find bugs. It is usually this exploratory code that is unreproducible.
The key observation is that the author knows which code is exploratory and which isn't. Moreover, the algorithmic code is typically fairly concise, and more reproducible than the exploratory sections.
So this is the general flavor of the proposal: the author annotates the algorithmic code, runs it in jobs (possibly on remote clusters), and the framework keeps track of versioning the results.
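Purely as a hypothetical illustration of what such an annotation could look like (this is not the framework's actual interface), an IPython cell magic could capture a cell's source, ready to be versioned and shipped to a cluster, before running it as usual:

```python
# Hypothetical sketch only; the framework's real annotation mechanism may differ.
# Run inside an IPython/Jupyter session.
from IPython import get_ipython
from IPython.core.magic import register_cell_magic

captured_jobs = []  # stand-in for an on-disk, versioned store of job sources

@register_cell_magic
def job(line, cell):
    """Archive the exact source of an annotated cell, then execute it."""
    captured_jobs.append(cell)
    get_ipython().run_cell(cell)
```

A cell starting with %%job would then run normally in the notebook while its source is recorded for later submission and versioning.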
Other features we pick up along the way:
This framework, called nbjob, is still very much a work in progress. It is in flux as I adapt it to my personal research needs.
Comments welcome.