Lab IA’s Knowledge Book
Welcome to Lab IA’s Knowledge Book
This Knowledge Book project facilitates knowledge sharing between Etalab’s data scientists and other technical roles, using data formats and tools that make sense in these professions. It lets us publish posts (Markdown and Jupyter Notebook) to better showcase the analyses from our work.
About Notebooks (.ipynb)
Notebooks are great for conducting exploratory analysis and keeping notes, but they do introduce some problems, particularly when maintaining project structure, sensible imports, and version control.
Also refer to Python guidance.
Structure
One problem with notebooks is they encourage bad importing practice. It might be tempting to do something like this:
# ipython notebook:
%run my_notebook_of_handy_functions.py
y = my_function(x)
This should be avoided, as it is akin to from my_notebook_of_handy_functions import *, which doesn’t give the reader of the code much of a clue where the functions are located. It is generally considered best practice to declare explicit imports at the top of the file, with one line per module:
from os import environ
# for multiple functions you can use commas
from numpy import sqrt, correlate, pi
# for very long imports use a tuple
from numpy import (apply_along_axis, arange, argsort, correlate, cos,
                   diagonal, pi, sqrt, tan, tile)
# or use the module.function syntax if you're using a lot of functions
# (note that both approaches are computationally identical)
import numpy
numpy.sqrt(x)
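The import styles above can be compared in a small sketch. It uses the standard-library math module in place of numpy so that it runs without extra dependencies; the namespace dictionary is a stand-in (an assumption for illustration) for a notebook's global scope after a star import or %run.

```python
# Explicit import: the reader can see exactly where sqrt comes from.
from math import sqrt

# Module-qualified import: the origin is visible at every call site.
import math

# Both spellings refer to the very same function object,
# so neither is faster or slower than the other.
same_function = sqrt is math.sqrt

# A star import adds every public name of the module at once.
# The dict below stands in for a notebook's globals.
namespace = {}
exec("from math import *", namespace)
star_names = sorted(name for name in namespace if not name.startswith("_"))

print(same_function)
print(len(star_names), "names pulled in by the star import")
```

The second print gives a sense of how many names appear in scope with no indication of their origin, which is the readability problem described above.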
Version control
Notebooks do not lend themselves well to version control: their verbose file structure, which contains lots of information in addition to the code, makes git diff output and pull requests difficult to interpret. To prevent this from causing difficulties, I’d suggest taking one or both of the following steps:
Use nbstripout to remove .ipynb output
One of the main reasons commits from notebooks are so verbose is that they contain the output of the code. The output can change a lot for a small change in code, producing large git diffs. nbstripout removes the output from the notebook before it is committed, by running as a git filter. To set it up:
pip install -U nbstripout
cd $local_git_repo
nbstripout --install --attributes .gitattributes
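To make concrete what "removing the output" means, here is a minimal sketch that clears the outputs and execution counts of code cells in a notebook represented as a dictionary. It is not nbstripout's implementation. It assumes the nbformat 4 layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"), and the function name strip_outputs is hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed.

    Hypothetical helper for illustration; assumes the nbformat 4 layout.
    """
    # Deep copy via a JSON round trip so the input is left untouched.
    stripped = json.loads(json.dumps(notebook))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# A tiny in-memory notebook with one markdown cell and one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

clean = strip_outputs(example)
print(json.dumps(clean["cells"][1], indent=1))
```

Only the source of each cell survives, so a diff between two commits reflects changes to the code rather than to its results.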
Use a custom pre-commit hook to generate .py files
Git hooks let you run scripts at certain times during the git process. You can use the following script to generate two versions of the code for the repo: one containing only the code (great for git diff) and a rendered version for viewing.
To create a pre-commit git hook that generates a .py and .html file for each .ipynb, create the file repo/.git/hooks/pre-commit (note the file has no extension) containing:
#!/bin/bash
#
## Convert all staged .ipynb files to .py and .html versions for git
files=$(git diff --cached --name-only | grep '\.ipynb$' | grep -v '\.ipynb_checkpoints')
# Split the file list on newlines only, so spaces in file names are not treated as separators
# (the $'\n' syntax requires bash, hence the shebang above)
OIFS="$IFS"
IFS=$'\n'
for f in $files; do
    echo "$f"
    # Get dirnames and create them
    newdir=$(dirname "$f")
    mkdir -p "nb_src/$newdir"
    mkdir -p "nb_render/$newdir"
    # Convert to .py and .html
    pyfile="nb_src/${f%.ipynb}.py"
    htmlfile="nb_render/${f%.ipynb}.html"
    jupyter nbconvert --to script "$f" --stdout > "$pyfile"
    jupyter nbconvert --to html "$f" --stdout > "$htmlfile"
    # Add it back to index
    git add "$pyfile"
    git add "$htmlfile"
done
# Reset spaces
IFS="$OIFS"
Note: The generated .py and .html files will be committed but are not present in the git diff message presented in the terminal. This is a bug in git, as the diff message is generated before the pre-commit script fires.
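The path handling in the hook can be checked without touching git or Jupyter. The sketch below mirrors the two steps of the shell script, filtering the staged file list and deriving the nb_src and nb_render targets, in plain Python. The function names and the sample file list are hypothetical, and the filter is slightly stricter than the original grep because it requires the .ipynb suffix rather than any occurrence of "ipynb" in the path.

```python
from pathlib import PurePosixPath


def is_notebook(path: str) -> bool:
    """True for .ipynb files that are not inside .ipynb_checkpoints."""
    p = PurePosixPath(path)
    return p.suffix == ".ipynb" and ".ipynb_checkpoints" not in p.parts


def derived_paths(path: str) -> tuple[str, str]:
    """Map a notebook path to its (.py, .html) targets, as the hook does."""
    p = PurePosixPath(path)
    py_file = PurePosixPath("nb_src") / p.with_suffix(".py")
    html_file = PurePosixPath("nb_render") / p.with_suffix(".html")
    return str(py_file), str(html_file)


# Hypothetical output of `git diff --cached --name-only`.
staged = [
    "analysis/eda report.ipynb",
    "analysis/.ipynb_checkpoints/eda report-checkpoint.ipynb",
    "README.md",
]

planned = {f: derived_paths(f) for f in staged if is_notebook(f)}
for source, (py_file, html_file) in planned.items():
    print(source, "->", py_file, "and", html_file)
```

Keeping the mapping in a pure function like this makes it easy to confirm that file names with spaces and checkpoint copies are handled as intended before installing the hook.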
Inspired by Airbnb’s Knowledge Repo and built with the amazing Jupyter Book project to generate static content.