
Parallelizing Jupyter Notebook Tests

Author
  • Pragmatic Institute

    Pragmatic Institute is the transformational partner for today’s businesses, providing immediate impact through actionable and practical training for product, design and data teams. Our courses are taught by industry experts with decades of hands-on experience, and include a complete ecosystem of training, resources and community. This focus on dynamic instruction and continued learning has delivered impactful education to over 200,000 alumni worldwide over the last 30 years.


By Michael Li

This article was originally published on February 1, 2017, on The Data Incubator.


How we cut our end-to-end test suite runtime by 66% using parallelism

While there’s a common stereotype that data scientists are poor software engineers, at The Data Incubator (a Pragmatic Institute company) we believe that mastering the fundamentals of software engineering is important for data science, and we strive to implement rigorous engineering standards at our data science company. We have an extensive curriculum for data science corporate training and online data science courses leveraging the Jupyter (née IPython) notebook format. Last year, we published a post about testing Jupyter notebooks, applying rigorous software engineering testing standards to new technologies popular in data science.

However, as our codebase has grown over time, we’ve added more and more notebooks to our curriculum material. This led to tests on our curriculum taking roughly 30 minutes to run! We quickly identified parallelism as low-hanging fruit that made sense as a first approach, with a few points to consider:


  1. We have curriculum materials that run code in Spark 2.0, and parallelizing runs in that kernel is hard because of how the Spark execution environment spins up. We also have curriculum materials in the Jupyter R kernel.
  2. Subprocess communication in Python (what our testing code is written in) is a pain, so maybe there’s a way to use some other parallelization library to avoid having to reinvent that wheel.
  3. Most of our notebooks are in Python, so those shouldn’t have any issues.

These issues aside, this seemed like a reasonable approach because each Jupyter notebook executes as its own subprocess in our current setup; we just had to take each of those processes and run them at the same time. Taking a stab at point 3, parallelizing the Python tests, while finding a way around point 2, the annoying multiprocess communication issues, yielded great results!
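To make “one subprocess per notebook” concrete, here is a minimal sketch of executing a notebook from Python; the helper name is hypothetical and the exact nbconvert flags in our harness may differ:

import subprocess

def run_notebook(path):
    # Hypothetical helper: execute every cell of the notebook in its own
    # subprocess via nbconvert. A failing cell makes the command exit
    # non-zero, which check_call raises as CalledProcessError.
    subprocess.check_call([
        'jupyter', 'nbconvert', '--to', 'notebook',
        '--execute', '--stdout', path,
    ], stdout=subprocess.DEVNULL)

Because each run is an independent process, notebooks can execute concurrently without sharing any interpreter state.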


The library: nose

Anyone who’s written production-grade Python is probably familiar with nosetests, the ubiquitous test suite runner. In another codebase of ours, we use nose in conjunction with the flaky plugin to rerun some tests whose output can be… less deterministic than would be ideal.
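For context, flaky’s decorator API looks like the following; a minimal illustration (the test body and the service call are hypothetical, not our actual code):

from flaky import flaky

@flaky(max_runs=3, min_passes=1)
def test_occasionally_nondeterministic():
    # Rerun up to three times; the test passes if any single run passes.
    assert fetch_from_external_service() == 'ok'  # hypothetical call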

Nose has good support for parallel testing out-of-the-box (incidentally, the flaky plugin and parallel testing don’t play nice together), so it seemed like a clear candidate for handling test-related subprocess communication for us.
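Enabling parallelism is a matter of command-line flags; for example (the process count here is illustrative):

nosetests --processes=6 --process-timeout=600

Each test then runs in a worker process, which maps neatly onto our one-subprocess-per-notebook setup.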


The hurdle: dynamically adding tests

We run our test suite in many different configurations: on pull requests, for example, we’ll only run modified notebooks to facilitate speedier development, and we save the full build for when we merge into master. Given this, we need to add tests dynamically: one per notebook we want to test in a given run. A popular Python mechanism for this, which we’ve historically employed, looks something like this:

suite = unittest.TestSuite()
for filename in notebook_filenames:
    suite.addTest(NotebookTestCase(filename))

Nose, unfortunately, does not like this and doesn’t play nice with unittest; it insists on test discovery instead. So we had to get creative. What did “creative” mean, exactly? Unfortunately for the Pythonistas among us, it meant using some of Python’s introspection functionality.


The solution: dynamically adding functions to Python classes

The hack we came up with was the following:

  1. Dynamically search out notebooks and add a test function for each to a class. In Python, this involves defining a function, setting its __name__ attribute, and then using setattr on the parent class to add that function under the appropriate name. This took care of adding the parallel tests.
  2. Use the nose attr plugin to specify attributes on the tests, so we can maintain speedy single-notebook PR testing as described above. We have code that keeps track of the current diffed filenames (from master), and adds two sets of tests: one under the all attribute, and another under the change attribute. You can see the @attr decorator being used below.

You can see the class below. In a wrapper file, we call the add_tests() function as soon as that file is imported (i.e. before nose attempts any “discovery”); the ipynb_all and ipynb_change_nbs functions live outside of the class and simply search out appropriate filenames.

import os

from nose.plugins.attrib import attr


class IpynbSelectorTestCase(object):
    """
    Parallelizable TestCase to be used for nose tests.

    To use, inherit and override `check_ipynb` to define how to check each
    notebook. Call `add_tests` in a global call immediately after the class
    declaration.

    See http://nose.readthedocs.io/en/latest/writing_tests.html#test-generators

    Tests can be invoked via (e.g.):

        nosetests -a 'all'

    Do not inherit `unittest.TestCase`: this will break parallelization.
    """

    def check_ipynb(self, ipynb):
        raise NotImplementedError

    @classmethod
    def add_func(cls, ipynb, prefix):
        # Tag the generated test with the attrib plugin so runs can select
        # it by attribute, e.g. `nosetests -a 'change'`.
        @attr(prefix)
        def func(self):
            self.check_ipynb(ipynb)

        # Name the test after the notebook so failures are identifiable.
        _, nbname = os.path.split(ipynb)
        func.__name__ = 'test_{}_{}'.format(prefix, nbname.split('.')[0])
        func.__doc__ = 'Test {}'.format(nbname)
        setattr(cls, func.__name__, func)

    @classmethod
    def add_tests(cls):
        for ipynb in ipynb_all():
            cls.add_func(ipynb, 'all')
        for ipynb in ipynb_change_nbs():
            cls.add_func(ipynb, 'change')
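To make the wiring concrete, here is a hedged sketch of such a wrapper file; the import path, subclass name, and nbconvert-based check are illustrative rather than our exact code:

import subprocess

from notebook_testing import IpynbSelectorTestCase  # hypothetical import path

class NotebookTest(IpynbSelectorTestCase):
    def check_ipynb(self, ipynb):
        # Execute the notebook in its own subprocess; any failing cell
        # exits non-zero and fails the test.
        subprocess.check_call([
            'jupyter', 'nbconvert', '--to', 'notebook',
            '--execute', '--stdout', ipynb,
        ], stdout=subprocess.DEVNULL)

# Called at import time, before nose begins discovery, so the generated
# test_* methods are already attached to the class.
NotebookTest.add_tests()

A full run is then something like nosetests -a 'all' --processes=6, while PR builds select only the changed notebooks with -a 'change'.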

The results

Our full build used to take about 30 minutes to run. With parallelism added, that time dropped to 11 minutes. We tested a few different process counts and continued to see marginal improvement up to six processes, plotting the timings with seaborn.


Not only is the reduction numerically dramatic, but gains like these add up in terms of curriculum developer productivity and allow us to iterate rapidly on our curriculum.
