Resources > Articles

# Automating Excel with Python

Previously posted on December 13, 2018, on The Data Incubator.

We know there are a lot of pain points in Excel that make it cumbersome and repetitive for data manipulation. We’ve distilled those pain points into three major themes:

• The first is that it’s awkward to deal with higher-dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data but they’re awkward for representing anything at three or higher dimensions. And while there are many workarounds like pivot tables, this will only get you so far.
• The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time-consuming to iterate over hundreds of sheets or notebooks.
• Finally, data manipulation in Excel is actually very manual and hence very error-prone. So in Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as we update our computations as they become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re going to define as wage times the quantity hours worked plus 1.5 times the overtime?

In Excel, you’d have to squash the two-dimensional years across months into a single-dimensional time. You’d have to repeat this process across every sheet and you’d have to repeat this over every workbook. Plus you’d have to handle missing data on an ad hoc basis. For example, while overtime is only given for three industries, hours worked is given for all 19. And of course, some sheets are missing data for certain dates but this is different for every sheet. On top of this, how would you repeat this analysis each month as the government updates its data? And how would you do similar calculations based on the same data?

Instead, let’s explore this data using Python. Python is an open-source platform that’s used by data scientists and analysts all over the world. It’s free to download and use and its 90,000 packages make manipulating data agree. Specifically, it solves many of our pain points of Excel. It can natively work with high-dimensional data, can easily repeat calculations across sheets and workbooks, and we can automate common tasks like data manipulation and plotting so that processing new data is simple.

So let’s look at the data from the Bureau of Labor Statistics. We can use assert to check for errors. So here we’re able to check that the row above contains the header and the row below contains null as we pull out the data in the middle. We can easily squash this two-dimensional table into a single-dimensional time series and we can iterate this calculation across multiple sheets and multiple workbooks without much trouble. We can write the formulas once without copying and we can easily handle missing data using conventions like fill value. Finally, we can plot and format our plots programmatically.

We can easily produce another calculation looking at the fraction of workers who are women across different industries. We can reuse our code to extract the data, write simple formulas that we don’t have to copy from cell to cell, and then programmatically plot the fraction of women who are working in different industries in December 2015. This took about 15 minutes to write but you can now run this on updated data from the Bureau of Labor Statistics for free.

If you want to see the code that we used in today’s demo, you can view that on our GitHub page. And if you want to learn how to do this yourself, you can check out our online data science training courses.

##### Author

Author:

Other Resources in this Series

## Most Recent

Article

Advancements in data collection and analysis are constantly reshaping the business landscape. This transformation has shifted the role of data management and utilization from a mere support function to a fundamental cornerstone of most business...
Category: Data Science
Article

### The Pragmatic Data Insights Model: A Blueprint for Data Success

In the fast-paced world of data analytics and business intelligence, achieving actionable insights from data can be a challenging endeavor. Many data projects face disconnects between data teams and business leaders, leading to unclear goals...
Category: Data Science
Article

### Crafting Data Stories: The Intersection of Art and Data Science

Editor’s note: This conversation has been lightly edited and condensed for clarity. “Data visualization is about adding a visual channel to make the data more memorable and comprehensible. We remember things in images and stories;...
Category: Data Science
Article

### Demographic Bias in Data

Demographic bias in data occurs when datasets don’t include information from a broad, diverse group of subjects. For example, a company might collect information from 100 people, 90 of whom identify as male and 10...
Category: Data Science
Article

### 3 Emerging Roles at the Intersection of Data and Business

The rise of AI doesn’t have to spell doom for data careers.  In our latest Data Chats podcast episode, Favio Vazquez, senior data scientist at H2O.ai, not only provides reassurance to data professionals but also...
Category: Data Science

## OTHER ArticleS

Article

Advancements in data collection and analysis are constantly reshaping the business landscape. This transformation has shifted the role of data management and utilization from a mere support function to a fundamental cornerstone of most business...
Category: Data Science
Article

### The Pragmatic Data Insights Model: A Blueprint for Data Success

In the fast-paced world of data analytics and business intelligence, achieving actionable insights from data can be a challenging endeavor. Many data projects face disconnects between data teams and business leaders, leading to unclear goals...
Category: Data Science