Jupyter notebooks have been one of the most controversial tools in the data science community. There are some outspoken critics, as well as passionate fans. Nevertheless, many data scientists will agree that they can be really valuable – if used well. And that’s what we’re going to focus on in this article, which is the second in my series on Software Patterns for Data Science & ML Engineering. I’ll show you best practices for using Jupyter Notebooks for exploratory data analysis.
But first, we need to understand why notebooks were established in the scientific community. When data science was sexy, notebooks weren’t a thing yet. Before them, we had IPython, which was integrated into IDEs such as Spyder that tried to mimic the way RStudio or Matlab worked. These tools gained significant adoption among researchers.
In 2014, Project Jupyter evolved from IPython. Its usage sky-rocketed, driven mainly by researchers who jumped to work in industry. However, approaches for using notebooks that work well for scientific projects don’t necessarily translate well to analyses conducted for the business and product units of enterprises. It’s not uncommon for data scientists hired right out of university to struggle to fulfill the new expectations they encounter around the structure and presentation of their analyses.
In this article, we’ll talk about Jupyter notebooks specifically from a business and product point of view. As I already mentioned, Jupyter notebooks are a polarising topic, so let’s go straight into my opinion.
Jupyter notebooks should be used for purely exploratory tasks or ad-hoc analysis ONLY.
A notebook should be nothing more than a report. The code it contains should not be important at all. It’s only the results it generates that matter. Ideally, we should be able to hide the code in the notebook because it’s just a means to answer questions.
For example: What are the statistical characteristics of these tables? What are the properties of this training dataset? What is the impact of putting this model into production? How can we make sure this model outperforms the previous one? How has this AB test performed?
Jupyter notebook: guidelines for effective storytelling
Writing Jupyter notebooks is basically a way of telling a story or answering a question about a problem you’ve been investigating. But that doesn’t mean you have to show the explicit work you’ve done to reach your conclusion.
Notebooks have to be refined.
They are primarily created for the writer to understand an issue but also for their fellow peers to gain that knowledge without having to dive deep into the problem themselves.
The non-linear and tree-like nature of exploring datasets in notebooks, which typically contain irrelevant sections of exploration streams that didn’t lead to any answer, is not the way the notebook should look at the end. The notebook should contain the minimum content that best answers the questions at hand. You should always comment on and give rationales about each of the assumptions and conclusions. Executive summaries are always advisable as they’re perfect for stakeholders with a vague interest in the topic or limited time. They’re also a great way to prepare peer reviewers for the full notebook delve.
The audience for notebooks is typically quite technical or business-savvy. Hence, you’re expected to use advanced terminology. Nevertheless, executive summaries or conclusions should always be written in simple language and link to sections with further and deeper explanations. If you find yourself struggling to craft a notebook for a non-technical audience, maybe you want to consider creating a slide deck instead. There, you can use infographics, custom visualizations, and broader ways to explain your ideas.
Always provide context for the problem at hand. Data on its own is not sufficient for a cohesive story. We have to frame the whole analysis within the domain we’re working in so that the audience feels comfortable reading it. Use links to the company’s existing knowledge base to support your statements and collect all the references in a dedicated section of the notebook.
How to structure Jupyter notebook’s content
In this section, I will explain the notebook layout I typically use. It may seem like a lot of work, but I recommend creating a notebook template with the following sections, leaving placeholders for the specifics of your task. Such a customized template will save you a lot of time and ensure consistency across notebooks.
- Title: Ideally, the name of the associated JIRA task (or any other issue-tracking software) linked to the task. This allows you and your audience to unambiguously connect the answer (the notebook) to the question (the JIRA task).
- Description: What do you want to achieve in this task? This should be very brief.
- Table of contents: The entries should link to the notebook sections, allowing the reader to jump to the part they’re interested in. (Jupyter creates HTML anchors for each headline that are derived from the original headline through headline.lower().replace(” “, “-“), so you can link to them with plain Markdown links such as [section title](#section-title). You can also place your own anchors by adding <a id=’your-anchor’></a> to markdown cells.)
- References: Links to internal or external documentation with background information or specific information used within the analysis presented in the notebook.
- TL;DR or executive summary: Explain, very concisely, the results of the whole exploration and highlight the key conclusions (or questions) that you’ve come up with.
- Introduction & background: Put the task into context, add information about the key business precedents around the issue, and explain the task in more detail.
- Imports: Library imports and settings. Configure settings for third-party libraries, such as matplotlib or seaborn. Add environment variables such as dates to fix the exploration window.
- Data to explore: Outline the tables or datasets you’re exploring/analyzing and reference their sources or link their data catalog entries. Ideally, you surface how each dataset or table is created and how frequently it is updated. You could link this section to any other piece of documentation.
- Analysis cells
- Conclusion: Detailed explanation of the key results you’ve obtained in the Analysis section, with links to specific parts of the notebooks where readers can find further explanations.
Remember to always use Markdown formatting for headers and to highlight important statements and quotes. You can check the different Markdown syntax options in Markdown Cells — Jupyter Notebook 6.5.2 documentation.
How to organize code in Jupyter notebook
For exploratory tasks, the code to produce SQL queries, pandas data wrangling, or create plots is not important for readers.
However, it is important for reviewers, so we should still maintain a high quality and readability.
My tips for working with code in notebooks are the following:
Move auxiliary functions to plain Python modules
Generally, importing functions defined in Python modules is better than defining them in the notebook. For one, Git diffs within .py files are way easier to read than diffs in notebooks. The reader should also not need to know what a function is doing under the hood to follow the notebook.
For example, you typically have functions to read your data, run SQL queries, and preprocess, transform, or enrich your dataset. All of them should be moved into .py filed and then imported into the notebook so that readers only see the function call. If a reviewer wants more detail, they can always look at the Python module directly.
I find this especially useful for plotting functions, for example. It’s typical that I can reuse the same function to make a barplot several times in my notebook. I’ll need to make small changes, such as using a different set of data or a different title, but the overall plot layout and style will be the same. Instead of copying and pasting the same code snippet around, I just create a utils/plots.py module and create functions that can be imported and adapted by providing arguments.
Here’s a very simple example:
When creating these Python modules, remember that the code is still part of an exploratory analysis. So unless you’re using it in any other part of the project, it doesn’t need to be perfect. Just readable and understandable enough for your reviewers.
Using SQL directly in Jupyter cells
There are some cases in which data is not in memory (e.g., in a pandas DataFrame) but in the company’s data warehouse (e.g., Redshift). In those cases, most of the data exploration and wrangling will be done through SQL.
There are several ways to use SQl wit Jupyter notebooks. JupySQL allows you to write SQL code directly in notebook cells and shows the query result as if it was a pandas DataFrame. You can also store SQL scripts in accompanying files or within the auxiliary Python modules we discussed in the previous section.
Whether it’s better to use one or the other depends mostly on your goal:
If you’re running a data exploration around several tables from a data warehouse and you want to show to your peers the quality and validity of the data, then showing SQL queries within the notebook is usually the best option. Your reviewers will appreciate that they can directly see how you’ve queried these tables, what kind of joins you had to make to arrive at certain views, what filters you needed to apply, etc.
However, if you’re just generating a dataset to validate a machine learning model and the main focus of the notebook is to show different metrics and explainability outputs, then I would recommend to hide the dataset extraction as much as possible and keep the queries in a separate SQL script or Python module.
We will now see an example of how to use both options.
Reading & executing from .sql scripts
We can use .sql files that are opened and executed from the notebook through a database connector library.
Let’s say we have the following query in a select_purchases.sql file:
Then, we could define a function to execute SQL scripts:
Note that we have provided default values for the database connection parameters so that we don’t have to specify them every time. However, remember to never store secrets or other sensitive information within your Python scripts! (Later in the series, we’ll discuss different solutions to this problem.)
Now we can use the following one-liner within our notebook to execute the script:
Traditionally, ipython-sql has been the tool of choice to query SQL from Jupyter notebooks. But it has been sunset by its original creator in April 2023, who recommends switching to JupySQL, which is an actively maintained fork. Going forward, all improvements and new features will only be added to JupySQL.
To install the library for using it with Redshift, we have to do:
In your Jupyter notebook you can now use the %load_ext sql magic command to enable SQL and use the following snippet to create a sqlalchemy Redshift engine:
Then, just pass the engine to the magic command:
And you’re ready to go!
Now it’s just as simple as using the magic command and write any query that you want to execute and you will get the results in the cell’s output:
Make sure cells are executed in order
I recommend you always run all code cells before pushing the notebook to your repository. Jupyter notebooks save the output state of each cell when it is executed. That means that the code you wrote or edited might not correspond to the shown output of the cell.
Running a notebook from top to bottom is also a good test to see if your notebook depends on any user input to execute correctly. Ideally, everything should just run through without your intervention. If not, your analysis is most likely not reproducible by others – or even by your future self.
One way of checking that a notebook has been run in-order is to use the nbcheckorder pre-commit hook. It checks if the cell’s output numbers are sequential. If they’re not, it indicates that the notebook cells haven’t been executed one after the other and prevents the Git commit from going through.
If you’re not using pre-commit yet, I highly recommend you adopt this little tool. I recommend you to start learning about it through this introduction to pre-commit by Elliot Jordan. Later, you can go through its extensive documentation to understand all of its features.
Clear out cells’ output
Even better than the tip before, clear out all cells’ output in the notebook. One benefit you get is that you can ignore the cells states and outputs, but on the other hand, it forces reviewers to run the code in local if they want to see the results. There are several ways to do this automatically.
Produce and share reports with Jupyter notebook
Now, here comes a not very well-solved question in the industry. What’s the best way to share your notebooks with your team and external stakeholders?
In terms of sharing analyses from Jupyter notebooks, the field is divided between three different types of teams that foster different ways of working.
The translator teams
These teams believe that people from business or product units won’t be comfortable reading Jupyter notebooks. Hence, they adapt their analysis and reports to their expected audience.
Translator teams take their findings from the notebooks and add them to their company’s knowledge system (e.g., Confluence, Google Slides, etc.). As a negative side effect, they lose some of the traceability of notebooks, because it is now more difficult to review the report’s version history. But, they’ll argue, they are able to convey their results and analysis more effectively to the respective stakeholders.
If you want to do this, I recommend keeping a link between the exported document and the Jupyter notebook so that they’re always in sync. In this setup, you can keep notebooks with less text and conclusions, focused more on the raw facts or data evidence. You’ll use the documentation system to expand on the executive summary and comments about each of the findings. In this way, you can decouple both deliverables – the exploratory code and the resulting findings.
The all in-house teams
These teams use local Jupyter notebooks and share them with other business units by building solutions tailored to their company’s knowledge system and infrastructure. They do believe that business and product stakeholders should be able to understand the data scientist’s notebooks and feel strongly about the need to keep a fully traceable lineage from findings back to the raw data.
However, it’s unlikely the finance team is going to GitHub or Bitbucket to read your notebook.
I have seen several solutions implemented in this space. For example, you can use tools like nbconvert to generate PDFs from Jupyter notebooks or export them as HTML pages, so that they can be easily shared with anyone, even outside the technical teams.
You can even move these notebooks into S3 and allow them to be hosted as a static website with the rendered view. You could use a CI/CD workflow to create and push an HTML rendering of your notebook to S3 when the code gets merged into a specific branch.
The third-party tool advocates
These teams use tools that enable not just the development of notebooks but also the sharing with other people in the organisation. This typically involves dealing with complexities such as ensuring secure and simple access to internal data warehouses, data lakes, and databases.
Some of the most widely adopted tools in this space are Deepnote, Amazon SageMaker, Google Vertex AI, and Azure Machine Learning. These are all full-fledged platforms for running notebooks that allow spinning-up virtual environments in remote machines to execute your code. They provide interactive plotting, data, and experiments exploration, which simplifies the whole data science lifecycle. For example, Sagemaker allows you to visualise all your experiments information that you have tracked with Sagemaker Experiments, and Deepnote offers also point and click visualization with their Chart Blocks.
On top of that, Deepnote and SageMaker allow you to share the notebook with any of your peers to view it or even to enable real-time collaboration using the same execution environment.
There are also open-source alternatives such as JupyterHub, but the setup effort and maintenance that you need to operate it is not worth it. Spinning up a JupyterHub on-premises can be a suboptimal solution, and only in very few cases does it make sense to do it (e.g: very specialised types of workloads which require specific hardware). By using Cloud services, you can leverage economies of scale which guarantee much better fault-tolerant architectures than other companies which operate in a different business can offer. You have to assume the initial setup costs, delegate its maintenance to a platform operations team to keep it up and running for Data Scientists, and guarantee data security and privacy. Therefore, trust in managed services will avoid endless headaches about the infrastructure that is better not having.
My general advice for exploring these products: If your company is already using a cloud provider like AWS, Google Cloud Platform, or Azure it might be a good idea to adopt their notebook solution, as accessing your company’s infrastructure will likely be easier and seem less risky.
Embracing effective Jupyter notebook practices
In this article, we’ve discussed best practices and advice for optimizing the utility of Jupyter notebooks.
The most important takeaway:
Always approach creating a notebook with the intended audience and final objective in mind. In that way, you know how much focus to put on the different dimensions of the notebook (code, analysis, executive summary, etc).
All in all, I encourage data scientists to use Jupyter notebooks, but exclusively for answering exploratory questions and reporting purposes.
Production artefacts such as models, datasets, or hyperparameters shouldn’t trace back to notebooks. They should have their origin in production systems that are reproducible and re-runnable. For example, SageMaker Pipelines or Airflow DAGs that are well-maintained and thoroughly tested.
These last thoughts about traceability, reproducibility, and lineage will be the starting point for the next article in my series on Software Patterns in Data Science and ML Engineering, which will focus on how to uplevel your ETL skills. While often ignored by data scientists, I believe mastering ETL is core and critical to guarantee the success of any machine learning project.