Resources > Articles

Automating Excel with Python

Automating Excel with Python

Automating Excel with Python

 

Previously posted on December 13, 2018, on The Data Incubator.

This article was taken from the video Automating Excel with Python.

 

We know there are a lot of pain points in Excel that make it cumbersome and repetitive for data manipulation. We’ve distilled those pain points into three major themes:

  • The first is that it’s awkward to deal with higher-dimensional data in a two-dimensional spreadsheet. Spreadsheets are great for representing two-dimensional data but they’re awkward for representing anything at three or higher dimensions. And while there are many workarounds like pivot tables, this will only get you so far.
  • The second pain point revolves around doing the same calculation over multiple sheets or multiple workbooks. While it’s easy to iterate over rows or columns in Excel, it’s cumbersome and time-consuming to iterate over hundreds of sheets or notebooks.
  • Finally, data manipulation in Excel is actually very manual and hence very error-prone. So in Excel, the convention is to copy data or formulas from cell to cell, but this makes it hard to keep our data up to date as new data arrives or as we update our computations as they become more complex. Errors aren’t always easy to catch before important business decisions are made.

In this video, we look at some data that we can get from the Bureau of Labor Statistics. While it comes in Excel, it comes in a very particular format. The rows iterate through years, the columns iterate through months, the sheets iterate through industries, and the workbooks iterate through wages, hours worked and overtime. So how would you use this to calculate salary, which we’re going to define as wage times the quantity hours worked plus 1.5 times the overtime?

In Excel, you’d have to squash the two-dimensional years across months into a single-dimensional time. You’d have to repeat this process across every sheet and you’d have to repeat this over every workbook. Plus you’d have to handle missing data on an ad hoc basis. For example, while overtime is only given for three industries, hours worked is given for all 19. And of course, some sheets are missing data for certain dates but this is different for every sheet. On top of this, how would you repeat this analysis each month as the government updates its data? And how would you do similar calculations based on the same data?

Instead, let’s explore this data using Python. Python is an open-source platform that’s used by data scientists and analysts all over the world. It’s free to download and use and its 90,000 packages make manipulating data agree. Specifically, it solves many of our pain points of Excel. It can natively work with high-dimensional data, can easily repeat calculations across sheets and workbooks, and we can automate common tasks like data manipulation and plotting so that processing new data is simple.

So let’s look at the data from the Bureau of Labor Statistics. We can use assert to check for errors. So here we’re able to check that the row above contains the header and the row below contains null as we pull out the data in the middle. We can easily squash this two-dimensional table into a single-dimensional time series and we can iterate this calculation across multiple sheets and multiple workbooks without much trouble. We can write the formulas once without copying and we can easily handle missing data using conventions like fill value. Finally, we can plot and format our plots programmatically.

We can easily produce another calculation looking at the fraction of workers who are women across different industries. We can reuse our code to extract the data, write simple formulas that we don’t have to copy from cell to cell, and then programmatically plot the fraction of women who are working in different industries in December 2015. This took about 15 minutes to write but you can now run this on updated data from the Bureau of Labor Statistics for free.

If you want to see the code that we used in today’s demo, you can view that on our GitHub page. And if you want to learn how to do this yourself, you can check out our online data science training courses.

Other Resources in this Series

Most Recent

man looking at screen with data
Article

5 Ways to Turn Your Data Into Actionable Insights

We hear a lot about data these days. Data consists of the facts and statistics collected about people, places, things, business rules and other factors of the business operations. It is also one of an
Category: Data Science
professionals sitting down looking at phone and reports
Article

10 Reasons You Need to Assess the Data Maturity in Your Organization

While most companies want to harness the power of data, the journey to determine where to begin or what are the next steps can be challenging. When an organization is data-driven, they base decisions on
Category: Data Science
predictive analytics on laptop
Article

Staying Ahead of the Competition with Predictive Analytics

Changes in customer behavior, the industry, and competitors’ offerings are why products routinely go out of favor—particularly in the digital space. For example, a digital enterprise product that was well-received when it launched in 2015
Category: Data Science
professionals evaluating reports on computer
Article

How to Pick the Best KPIs for Any Business

Data, data everywhere, not an insight in sight. You probably have encountered this contradiction if you’re a business leader trying to use data to manage your business. It’s not that you don’t have access to
Category: Data Science
data literacy
Article

How Big Data is Revolutionizing Business

Data is revolutionizing the world. IBM estimates that the world is producing 2.5 exabytes of data each day. That’s enough hard disks to cover more than six NFL football fields 

Category: Data Science

OTHER ArticleS

man looking at screen with data
Article

5 Ways to Turn Your Data Into Actionable Insights

We hear a lot about data these days. Data consists of the facts and statistics collected about people, places, things, business rules and other factors of the business operations. It is also one of an
Category: Data Science
professionals sitting down looking at phone and reports
Article

10 Reasons You Need to Assess the Data Maturity in Your Organization

While most companies want to harness the power of data, the journey to determine where to begin or what are the next steps can be challenging. When an organization is data-driven, they base decisions on
Category: Data Science

Sign up to stay up to date on the latest industry best practices.

Sign up to received invites to upcoming webinars, updates on our recent podcast episodes and the latest on industry best practices.

Training on Your Schedule

Fill out the form today and our sales team will help you schedule your private Pragmatic training today.

Subscribe

Subscribe

Training on Your Schedule

Fill out the form today and our sales team will help you schedule your private Pragmatic training today.