
SQLite vs Pandas: Performance Benchmarks

Author
  • Pragmatic Institute

    Pragmatic Institute is the transformational partner for today’s businesses, providing immediate impact through actionable and practical training for product, design and data teams. Our courses are taught by industry experts with decades of hands-on experience, and include a complete ecosystem of training, resources and community. This focus on dynamic instruction and continued learning has delivered impactful education to over 200,000 alumni worldwide over the last 30 years.

SQLite vs Pandas

 

 

by Paul Paczuski

Previously posted on May 23, 2018, on The Data Incubator

 

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech, as a Clinical Data Scientist.

As data scientists, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this post, we'll compare the performance of pandas and SQLite, a simple form of SQL favored by data scientists.

 

Let's find out which tasks each of these excels at. Below, we compare Python's pandas to sqlite for some common data analysis operations: sort, select, load, join, filter, and group by.

[Figure: benchmark runtimes for each task across sample sizes (log-scale axis)]

Note that the axis is logarithmic, so that raw differences are more pronounced.

Analysis details

For the analysis, we ran the six tasks 10 times each, for 5 different sample sizes, for each of 3 programs: pandas, sqlite, and memory-sqlite (where database is in memory instead of on disk). See below for the definitions of each task.
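The benchmark harness boils down to a timing loop over each task. Here is a minimal sketch of the idea (the function and parameter names below are our own, not the benchmark's actual code):

```python
import time

def time_task(task, n_runs=10):
    """Run a zero-argument callable n_runs times and return per-run seconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings

# Example: time a trivial task 10 times
runs = time_task(lambda: sum(range(1000)))
```

Each task would be timed against already-loaded data (except the load task itself), so only the operation under test is measured.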

Our sample data was randomly generated. Here’s what it looks like:

 

sql_vs_pandas$ head -n 5 data/sample.100.csv

 

qqFjQHQc,c,1981,82405.59262172286

vILuhVGz,a,1908,27712.27152250119

mwCjpoOF,f,1992,58974.38538762843

kGbriYAK,d,1927,42258.24179716961

MeoxuJng,c,1955,96907.56416314292

 

This consists of a random string of 8 characters, a random single character (for the filtering operation), a random integer simulating a year (1900-2000), and a uniform random float value between 10000 and 100000.
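Rows with this shape are straightforward to generate with Python's standard library. Here is a rough sketch (the letter pool for the second column and the exact generation scheme are assumptions, not the original generator's code):

```python
import random
import string

def make_row():
    # Random 8-character string, single letter (for the filtering operation),
    # a "year" between 1900 and 2000, and a float between 10000 and 100000.
    name = ''.join(random.choices(string.ascii_letters, k=8))
    dept = random.choice('abcdef')
    birth = random.randint(1900, 2000)
    salary = random.uniform(10000, 100000)
    return f'{name},{dept},{birth},{salary}'

print(make_row())
```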

Results

sqlite or memory-sqlite is faster for the following tasks:

  • select two columns from the data (<0.1 ms for sqlite at any data size; pandas scales with the data, up to just under 0.5 seconds for 10 million records)
  • filter data (10x to 50x faster with sqlite; the difference becomes more pronounced as the data grows)
  • sort by a single column: pandas is always a bit slower, but this was the closest race

pandas is faster for the following tasks:

  • groupby computation of a mean and sum (significantly better for large data, only 2x faster for <10k records)
  • load data from disk (5x faster for >10k records, even better for smaller data)
  • join data (2-5x faster, but slower for smallest dataset of 1000 rows)

Comparing memory-sqlite vs. sqlite, there was no meaningful difference, especially as data size increased.

There is no significant speedup from loading sqlite in its own shell vs. via pandas.
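For context, the only difference between the sqlite and memory-sqlite setups is the connection target; with Python's built-in sqlite3 module, that looks like:

```python
import sqlite3
import os
import tempfile

# In-memory database: the special ':memory:' path keeps everything in RAM
mem_conn = sqlite3.connect(':memory:')

# On-disk database: pass a file path instead (a temp file here)
db_path = os.path.join(tempfile.gettempdir(), 'benchmark.db')
disk_conn = sqlite3.connect(db_path)

# Both connections expose exactly the same API
mem_conn.execute('CREATE TABLE employee (name TEXT, dept TEXT)')
mem_conn.execute("INSERT INTO employee VALUES ('qqFjQHQc', 'c')")
rows = mem_conn.execute('SELECT * FROM employee').fetchall()
```

That the two variants perform alike is plausible: for read-heavy workloads, the on-disk file quickly ends up in the OS page cache anyway.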

Overall, joining and loading data are the slowest operations, whereas select and filter are generally the fastest. Further, pandas seems to be optimized for group-by operations, where it performs really well (group-by is pandas' second-fastest operation for larger data).

Note that this analysis assumes you are equally proficient in writing code with both! But these results could encourage you to learn the tool that you are less familiar with, if the performance gains are significant.


Code

All code is on our GitHub page.

 

Below are the definitions of our six tasks: sort, select, load, join, filter, and group by (see driver/sqlite_driver.py or driver/pandas_driver.py).

sqlite is first, followed by pandas:

sort

def sort(self):
    self._cursor.execute('SELECT * FROM employee ORDER BY name ASC;')
    self._conn.commit()

def sort(self):
    self.df_employee.sort_values(by='name')

 

select

def select(self):
    self._cursor.execute('SELECT name, dept FROM employee;')
    self._conn.commit()

def select(self):
    self.df_employee[['name', 'dept']]

 

load

def load(self):
    self._cursor.execute('CREATE TABLE employee (name varchar(255), dept char(1), birth int, salary double);')
    df = pd.read_csv(self.employee_file)
    df.columns = employee_columns
    df.to_sql('employee', self._conn, if_exists='replace')

    self._cursor.execute('CREATE TABLE bonus (name varchar(255), bonus double);')
    df_bonus = pd.read_csv(self.bonus_file)
    df_bonus.columns = bonus_columns
    df_bonus.to_sql('bonus', self._conn, if_exists='replace')

def load(self):
    self.df_employee = pd.read_csv(self.employee_file)
    self.df_employee.columns = employee_columns

    self.df_bonus = pd.read_csv(self.bonus_file)
    self.df_bonus.columns = bonus_columns

 

join

def join(self):
    self._cursor.execute('SELECT employee.name, employee.salary + bonus.bonus '
                         'FROM employee INNER JOIN bonus ON employee.name = bonus.name')
    self._conn.commit()

def join(self):
    joined = self.df_employee.merge(self.df_bonus, on='name')
    joined['total'] = joined['bonus'] + joined['salary']

 

filter

def filter(self):
    self._cursor.execute('SELECT * FROM employee WHERE dept = "a";')
    self._conn.commit()

def filter(self):
    self.df_employee[self.df_employee['dept'] == 'a']

 

group by

def groupby(self):
    self._cursor.execute('SELECT avg(birth), sum(salary) FROM employee GROUP BY dept;')
    self._conn.commit()

def groupby(self):
    self.df_employee.groupby('dept').agg({'birth': np.mean, 'salary': np.sum})
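As an aside, on recent pandas versions the same aggregation is typically written with string function names rather than NumPy functions. A self-contained example (independent of the benchmark code):

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['a', 'a', 'b'],
    'birth': [1950, 1960, 1970],
    'salary': [100.0, 200.0, 300.0],
})

# Mean birth year and total salary per department
result = df.groupby('dept').agg({'birth': 'mean', 'salary': 'sum'})
```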

 

Details

We used pandas version 0.19.1 and sqlite version 3.13.0. All tests were run on Digital Ocean Ubuntu 14.04 with 16GB memory and 8 core CPU.

References

For resources on becoming a data scientist, check out our blog, particularly this article on preparing for our free data science fellowship and this one on manipulating data like a professional data scientist. And if you're looking for a class, consider our convenient after-work online instructor-led data science foundations course geared toward working professionals, or our free data science fellowship for people with advanced degrees.

To learn more about how sqlite works, check out this awesome blog series.

Here is a syntax comparison between pandas and sql.

