
Parallelizing Jupyter Notebook Tests


By Michael Li

This article was originally published on February 1, 2017, on The Data Incubator.

 

How we cut our end-to-end test suite runtime by 66% using parallelism

While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator (a Pragmatic Institute company) we believe that mastering the fundamentals of software engineering is important for data science, and we strive to implement rigorous engineering standards at our data science company. We have an extensive curriculum for data science corporate training and online data science courses that leverages the Jupyter (née IPython) notebook format. Last year, we published a post about testing Jupyter notebooks: applying rigorous software engineering testing standards to new technologies popular in data science.

However, as our codebase has grown, we’ve added more and more notebooks to our curriculum material. This led to tests on our curriculum taking ~30 minutes to run! We quickly identified parallelism as low-hanging fruit that would make sense for a first approach, with a few points to keep in mind:


  1. We have curriculum materials that run code in Spark 2.0, and parallelizing runs in that kernel is hard because of how the Spark execution environment spins up. We also have curriculum materials in the Jupyter R kernel.
  2. Subprocess communication in Python (what our testing code is written in) is a pain, so maybe there’s a way to use some other parallelization library to avoid having to reinvent that wheel.
  3. Most of our notebooks are in Python, so those shouldn’t have any issues.

These issues aside, this seemed like a reasonable approach because each Jupyter notebook executes as its own subprocess in our current setup – we just had to take each of those processes and run them at the same time. Taking a stab at point 3 (parallelizing Python tests) while finding a way around point 2 (annoying multiprocess communication issues) yielded great results!

 

The library: nose

Anyone who’s written production-grade Python is probably familiar with nosetests, the ubiquitous test suite runner. In another codebase of ours, we use nose in conjunction with the flaky plugin to rerun some tests whose output can be… less deterministic than would be ideal.

Nose has good support for parallel testing out-of-the-box (incidentally, the flaky plugin and parallel testing don’t play nice together), so it seemed like a clear candidate for handling test-related subprocess communication for us.
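For reference, nose's multiprocess plugin is switched on with the --processes option, and its attribute plugin selects tests with -a. Below is a minimal sketch of assembling such an invocation from Python. The helper name build_nose_command and the 600-second timeout are illustrative assumptions, not part of our actual test harness; long-running notebooks may simply need a larger --process-timeout than nose's default.

```python
def build_nose_command(processes, attribute=None, timeout=600):
    """Build a nosetests argv list (hypothetical helper, for illustration).

    --processes and --process-timeout belong to nose's multiprocess plugin;
    -a belongs to its attribute-selection plugin.
    """
    cmd = [
        "nosetests",
        "--processes={}".format(processes),
        "--process-timeout={}".format(timeout),
    ]
    if attribute is not None:
        # Only run tests tagged with this attribute.
        cmd += ["-a", attribute]
    return cmd


# With nose installed, the command could be launched with, e.g.:
#   subprocess.call(build_nose_command(6, "all"))
print(" ".join(build_nose_command(6, "all")))
```

The printed line is the same command one would type at a shell, so the helper is only a convenience for CI scripts that choose the process count or attribute at runtime.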

 

The hurdle: dynamically adding tests

We run our test suite in many different configurations: on pull requests, for example, we’ll only run modified notebooks to facilitate speedier development – and we save the full build for when we merge into master. Given this, we need to dynamically add tests – one per notebook we want to test in a given run. A popular Python mechanism for this, which we’ve historically employed, looks something like this:

    import unittest

    # Build the suite by hand, one test case per notebook
    suite = unittest.TestSuite()
    for filename in notebook_filenames:
        suite.addTest(NotebookTestCase(filename))

Nose, unfortunately, does not like this and doesn’t play nice with unittest. It insists, instead, on test discovery. So we had to get creative. What did “creative” mean, exactly? Unfortunately for the pythonistas among us, it meant we had to use some of Python’s introspection functionality.

 

The solution: dynamically adding functions to Python classes

The hack we came up with was the following:

  1. Dynamically search out notebooks and add a test function for each to a class. In Python, this involves defining a function, setting its __name__ attribute, and then using setattr on the parent class to add that function with the appropriate name. This took care of adding parallel tests in.
  2. Use the nose attr plugin to specify attributes on the tests, so we can maintain speedy single-notebook PR testing as described above. We have code that keeps track of the current diffed filenames (from master), and adds two sets of tests: one under the all attribute, and another under the change attribute. You can see the @attr decorator being used below.
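Before the full class, here is a stripped-down sketch of the mechanism in step 1, with nose left out entirely. The class name NotebookChecks, the make_test factory, and the notebook paths are all made up for illustration. The factory matters: defining func directly inside the loop would make every generated test see the last filename, because Python closures look up variables when called rather than when defined.

```python
class NotebookChecks(object):
    """Stand-in test class; a real one would execute the notebook."""

    def check_ipynb(self, ipynb):
        # Placeholder check: just echo the path back.
        return ipynb


def make_test(ipynb):
    # Each call creates a fresh scope, so every generated function
    # keeps its own `ipynb` value.
    def func(self):
        return self.check_ipynb(ipynb)
    return func


# Hypothetical notebook paths, for illustration only.
for path in ["intro/pandas.ipynb", "ml/regression.ipynb"]:
    func = make_test(path)
    name = path.split("/")[-1].split(".")[0]
    func.__name__ = "test_{}".format(name)
    # Attach the function to the class under its test_* name.
    setattr(NotebookChecks, func.__name__, func)

checks = NotebookChecks()
print(sorted(n for n in dir(checks) if n.startswith("test_")))
# -> ['test_pandas', 'test_regression']
print(checks.test_pandas())
# -> intro/pandas.ipynb
```

Because the methods exist on the class by the time a test runner inspects it, ordinary name-based discovery finds them like any hand-written test.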

You can see the class below. In a wrapper file, we call the add_tests() function as soon as that file is imported (i.e. before nose attempts any “discovery”) – the ipynb_all and ipynb_change_nbs functions live outside of the class but simply search out appropriate filenames.

    import os

    from nose.plugins.attrib import attr


    class IpynbSelectorTestCase(object):
        """
        Parallelizable TestCase to be used for nose tests.

        To use, inherit and override `check_ipynb` to define how to check each notebook.
        Call `add_tests` in a global call immediately after the class declaration.
        See http://nose.readthedocs.io/en/latest/writing_tests.html#test-generators

        Tests can be invoked via (e.g.):
            nosetests -a 'all'

        Do not inherit `unittest.TestCase`: this will break parallelization
        """

        def check_ipynb(self, ipynb):
            # Subclasses must override this with the actual notebook check
            raise NotImplementedError

        @classmethod
        def add_func(cls, ipynb, prefix):
            # Tag the test with `prefix` so it can be selected via `nosetests -a`
            @attr(prefix)
            def func(self):
                self.check_ipynb(ipynb)

            # Name the test after the notebook so nose discovers it as test_*
            _, nbname = os.path.split(ipynb)
            func.__name__ = 'test_{}_{}'.format(prefix, nbname.split('.')[0])
            func.__doc__ = 'Test {}'.format(nbname)
            setattr(cls, func.__name__, func)

        @classmethod
        def add_tests(cls):
            # One set of tests for every notebook, one for changed notebooks only
            for ipynb in ipynb_all():
                cls.add_func(ipynb, 'all')
            for ipynb in ipynb_change_nbs():
                cls.add_func(ipynb, 'change')
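The ipynb_all and ipynb_change_nbs helpers are not shown in this post. One way they might look is sketched below, purely as an assumption: the root and base parameters, the notebooks_in_diff helper, and the choice of `git diff --name-only` are ours for illustration, and the changed-file lookup assumes the tests are run from the repository root.

```python
import os
import subprocess


def ipynb_all(root="."):
    """Return every .ipynb path under root, skipping checkpoint copies."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Jupyter keeps autosave copies in this directory; do not test them.
        dirnames[:] = [d for d in dirnames if d != ".ipynb_checkpoints"]
        found.extend(os.path.join(dirpath, f)
                     for f in filenames if f.endswith(".ipynb"))
    return sorted(found)


def notebooks_in_diff(diff_output):
    """Pick notebook paths out of `git diff --name-only` output."""
    return [line.strip() for line in diff_output.splitlines()
            if line.strip().endswith(".ipynb")]


def ipynb_change_nbs(base="master"):
    """Return notebooks that differ from `base` and still exist on disk."""
    out = subprocess.check_output(
        ["git", "diff", "--name-only", base], universal_newlines=True)
    # Deleted notebooks show up in the diff but cannot be tested.
    return [p for p in notebooks_in_diff(out) if os.path.exists(p)]
```

With helpers like these in scope, a wrapper file would subclass IpynbSelectorTestCase, override check_ipynb, and call add_tests() at import time so the generated methods exist before nose starts discovery.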

The results

Our full build typically used to take about 30 minutes to run. With added parallelism, that time has dropped to 11 minutes. We tested a few different process counts and continued to see marginal improvement up to 6 processes. We made some plots (with seaborn).

[Figures: runtime vs. number of processes; Jupyter runtime comparison]

Not only is the reduction numerically dramatic, but gains like these add up in terms of curriculum developer productivity and allow us to rapidly iterate on our curriculum.
