Resources > Articles

# Automating Excel with Python

##### Author

Previously posted on December 13, 2018, on The Data Incubator.

We know there are a lot of pain points in Excel that make it cumbersome and repetitive for data manipulation. We’ve distilled those pain points into three major themes:

• The first is that it’s awkward to deal with higher-dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data but they’re awkward for representing anything at three or higher dimensions. And while there are many workarounds like pivot tables, this will only get you so far.
• The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time-consuming to iterate over hundreds of sheets or notebooks.
• Finally, data manipulation in Excel is actually very manual and hence very error-prone. So in Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as we update our computations as they become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re going to define as wage times the quantity hours worked plus 1.5 times the overtime?

In Excel, you’d have to squash the two-dimensional years across months into a single-dimensional time. You’d have to repeat this process across every sheet and you’d have to repeat this over every workbook. Plus you’d have to handle missing data on an ad hoc basis. For example, while overtime is only given for three industries, hours worked is given for all 19. And of course, some sheets are missing data for certain dates but this is different for every sheet. On top of this, how would you repeat this analysis each month as the government updates its data? And how would you do similar calculations based on the same data?

Instead, let’s explore this data using Python. Python is an open-source platform that’s used by data scientists and analysts all over the world. It’s free to download and use and its 90,000 packages make manipulating data agree. Specifically, it solves many of our pain points of Excel. It can natively work with high-dimensional data, can easily repeat calculations across sheets and workbooks, and we can automate common tasks like data manipulation and plotting so that processing new data is simple.

So let’s look at the data from the Bureau of Labor Statistics. We can use assert to check for errors. So here we’re able to check that the row above contains the header and the row below contains null as we pull out the data in the middle. We can easily squash this two-dimensional table into a single-dimensional time series and we can iterate this calculation across multiple sheets and multiple workbooks without much trouble. We can write the formulas once without copying and we can easily handle missing data using conventions like fill value. Finally, we can plot and format our plots programmatically.

We can easily produce another calculation looking at the fraction of workers who are women across different industries. We can reuse our code to extract the data, write simple formulas that we don’t have to copy from cell to cell, and then programmatically plot the fraction of women who are working in different industries in December 2015. This took about 15 minutes to write but you can now run this on updated data from the Bureau of Labor Statistics for free.

If you want to see the code that we used in today’s demo, you can view that on our GitHub page. And if you want to learn how to do this yourself, you can check out our online data science training courses.

##### Author

Author:

Other Resources in this Series

## Most Recent

Article

### Case Study: How Spotify Prioritizes Data Projects for a Personalized Music Experience

Spotify, a titan in the realm of audio streaming, has transformed the way we experience music and podcasts. Since its inception in 2008, it’s become a ubiquitous platform, boasting a colossal user base of approximately...
Category: Data Science
Article

### Avoid These Mistakes When Prioritizing Data Projects for Your Company

Even in a world full of data, business decisions often still rely on instinct and emotions. However, when it comes to business, considering all external factors before making a move is essential. This is where...
Category: Data Science
Article

### Harnessing Data to Forge Emotional Bonds with Customers: Insights from Zack Wenthe

Zack Wenthe joined a recent episode of Data Chats to discuss the importance of understanding how consumers interact with your brand, how customers make decisions emotionally and leveraging data to create meaningful decisions.   Wenthe is...
Article

### The Path to Data Democratization

Data democratization isn't easy. Developing a successful data strategy requires a clear vision of the end goal or purpose that an organization wants to achieve within one year to eighteen months. This vision should be comprehensive and ambitious, considering every aspect of the investment and budget.
Category: Data Science
Article

### Communicating Data to Non-Data Teams

Providing data insights to non-data teams can be a challenging task. Non-data teams often have limited knowledge of data and statistics and may not have the skills to interpret and apply insights effectively. Here's what you can do about it.
Category: Data Science

## OTHER ArticleS

Article

### Case Study: How Spotify Prioritizes Data Projects for a Personalized Music Experience

Spotify, a titan in the realm of audio streaming, has transformed the way we experience music and podcasts. Since its inception in 2008, it’s become a ubiquitous platform, boasting a colossal user base of approximately...
Category: Data Science
Article

### Avoid These Mistakes When Prioritizing Data Projects for Your Company

Even in a world full of data, business decisions often still rely on instinct and emotions. However, when it comes to business, considering all external factors before making a move is essential. This is where...
Category: Data Science