 

Pandas is faster to load CSV than SQL

It seems that loading data from a CSV is faster than from SQL (PostgreSQL) with Pandas. (I have an SSD.)

Here is my test code:

import time

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

start = time.time()
df = pd.read_csv('foo.csv')
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))

engine = create_engine('postgresql://user:password@host:port/schema')
start = time.time()
df = pd.read_sql_query("select * from mytable", engine)
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))

The foo.csv file and the database table contain the same data (4 columns, 100,000 rows of random integers).

CSV takes 0.05s

SQL takes 0.5s

Do you think it's normal that CSV is 10 times faster than SQL? I'm wondering if I'm missing something here...

asked May 09 '17 by Haelle



2 Answers

While Steven G's explanation of the process is fundamentally the answer to your question, and Simon G's COPY solution is the most efficient one I've been able to find, I decided to look into your question in more depth, and actually measure the different aspects relevant to it.

At https://github.com/mikaelhg/pandas-pg-csv-speed-poc there is a project which contains pytest benchmarks for the various alternative solutions.

The CSV for this test is an order of magnitude larger than the one in the question, with a shape of (3742616, 6), to reduce the chance that various buffers happen to be just the right size and skew the results.

Thanks to the Finnish Traffic Safety Bureau Trafi's open data initiative for providing the test data.

As for the PostgreSQL installation, it runs inside the canonical Docker container and was started with increased shared_buffers and work_mem values, with the data files stored under the host machine's /dev/shm mount point in order to take actual disk I/O out of the picture. Its UNIX socket connection point is exposed in a similar way.

version: '3'

services:

  db:
    image: 'postgres:10-alpine'
    command: "postgres -c 'shared_buffers=512MB' -c 'temp_buffers=80MB' -c 'work_mem=256MB'"
    ports:
      - '5432:5432'
    volumes:
      - '/dev/shm/pgtest/data:/var/lib/postgresql/data'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: test

  test:
    image: pandas_speed_poc:temp
    build:
      context: .
      dockerfile: Dockerfile.test-runner
    volumes:
      - '.:/app'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    working_dir: '/app'
    user: '1000'

The test runner is a simple Ubuntu 18.04 container:

FROM ubuntu:18.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && \
    apt-get -y -qq install python3-dev python3-pip python3-psycopg2 \
                    build-essential \
                    bash less nano wait-for-it
RUN pip3 install sqlalchemy numpy pandas \
                pytest pytest-benchmark
WORKDIR /app
CMD wait-for-it db:5432 -- /bin/bash -c "trap : TERM INT; sleep infinity & wait"

The actual benchmark is a Python 3 unittest written for pytest-benchmark:

#!/usr/bin/python3

from sqlalchemy import create_engine
import psycopg2
import psycopg2.extensions

import pandas as pd
import numpy as np

import io
import time
import gzip

import unittest
import pytest

DATA_FILE = 'data/licenses.csv.gz'

DROP_TABLE = "DROP TABLE IF EXISTS licenses"

CREATE_TABLE = """
    CREATE TABLE licenses (
        a VARCHAR(16),
        b CHAR(3),
        c CHAR(6),
        d INTEGER,
        e INTEGER,
        f INTEGER
    )
"""

COPY_FROM = """
    COPY licenses (a, b, c, d, e, f) FROM STDIN
    WITH (FORMAT CSV, DELIMITER ';', HEADER)
"""

COPY_TO = "COPY licenses TO STDOUT WITH (FORMAT CSV, HEADER)"

SELECT_FROM = 'SELECT * FROM licenses'

VACUUM = "VACUUM FULL ANALYZE"

DB_UNIX_SOCKET_URL = 'postgresql://test:test@/test'

DB_TCP_URL = 'postgresql://test:test@db/test'

def my_cursor_factory(*args, **kwargs):
    cursor = psycopg2.extensions.cursor(*args, **kwargs)
    cursor.itersize = 10240
    return cursor

class TestImportDataSpeed(unittest.TestCase):

    @pytest.fixture(autouse=True)
    def setupBenchmark(self, benchmark):
        self.benchmark = benchmark

    # Load the gzipped CSV into PostgreSQL once, via COPY, before any benchmark runs.
    @classmethod
    def setUpClass(cls):
        cls.engine = create_engine(DB_TCP_URL, connect_args={'cursor_factory': my_cursor_factory})
        connection = cls.engine.connect().connection
        cursor = connection.cursor()

        cursor.execute(DROP_TABLE)
        cursor.execute(CREATE_TABLE)

        with gzip.open(DATA_FILE, 'rb') as f:
            cursor.copy_expert(COPY_FROM, file=f, size=1048576)

        connection.commit()

        connection.set_session(autocommit=True)
        cursor.execute(VACUUM)

        cursor.close()
        connection.close()

    # Baseline: read the gzipped CSV directly with pandas (compression inferred from the filename).
    def test_pd_csv(self):

        def result():
            return pd.read_csv(DATA_FILE, delimiter=';', low_memory=False)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    # Fetch all rows through a raw psycopg2 cursor, then build the DataFrame from the tuples.
    def test_psycopg2_cursor(self):

        def result():
            connection = self.engine.connect().connection
            cursor = connection.cursor()
            cursor.itersize = 102400
            cursor.arraysize = 102400
            cursor.execute(SELECT_FROM)
            rows = cursor.fetchall()
            cursor.close()
            connection.close()
            return pd.DataFrame(rows)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    # The approach from the question: pd.read_sql_query through SQLAlchemy.
    def test_pd_sqla_naive(self):

        def result():
            return pd.read_sql_query(SELECT_FROM, self.engine)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    # As above, but streamed in chunks and concatenated at the end.
    def test_pd_sqla_chunked(self):

        def result():
            gen = (x for x in pd.read_sql(SELECT_FROM, self.engine, chunksize=10240))
            return pd.concat(gen, ignore_index=True)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    # Dump the table with COPY ... TO STDOUT into an in-memory CSV buffer, then parse it with pandas.
    def test_pg_copy(self):
        connection = self.engine.connect().connection
        cursor = connection.cursor()

        def result(cursor):
            f = io.StringIO()
            cursor.copy_expert(COPY_TO, file=f, size=1048576)
            f.seek(0)
            return pd.read_csv(f, low_memory=False)

        df = self.benchmark(result, cursor)
        assert df.shape == (3742616, 6)

The final results:

speed_test.py .....


-------------------------------------------------------------------------------- benchmark: 5 tests -------------------------------------------------------------------------------
Name (time in s)            Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pd_csv              1.4623 (1.0)      1.4903 (1.0)      1.4776 (1.0)      0.0110 (1.21)     1.4786 (1.0)      0.0171 (1.15)          2;0  0.6768 (1.0)           5           1
test_pg_copy             3.0631 (2.09)     3.0842 (2.07)     3.0732 (2.08)     0.0091 (1.0)      3.0769 (2.08)     0.0149 (1.0)           2;0  0.3254 (0.48)          5           1
test_psycopg2_cursor     4.5325 (3.10)     4.5724 (3.07)     4.5531 (3.08)     0.0161 (1.77)     4.5481 (3.08)     0.0249 (1.68)          2;0  0.2196 (0.32)          5           1
test_pd_sqla_naive       6.0177 (4.12)     6.0523 (4.06)     6.0369 (4.09)     0.0147 (1.62)     6.0332 (4.08)     0.0242 (1.63)          2;0  0.1656 (0.24)          5           1
test_pd_sqla_chunked     6.0247 (4.12)     6.1454 (4.12)     6.0889 (4.12)     0.0442 (4.86)     6.0963 (4.12)     0.0524 (3.52)          2;0  0.1642 (0.24)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Your results will vary! Run the tests on your own architecture for relevant results.
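
For reference, the fastest SQL-based path above (test_pg_copy) can be wrapped into a small standalone helper. This is a minimal sketch only, assuming a psycopg2-backed SQLAlchemy engine and a result set that fits in memory; the function name and the example connection URL are mine, not part of the benchmark repository:

import io

import pandas as pd
from sqlalchemy import create_engine

def read_table_via_copy(engine, table_name):
    # Stream the table out of PostgreSQL as CSV via COPY ... TO STDOUT,
    # then hand the in-memory buffer to pandas' C parser.
    connection = engine.connect().connection  # raw psycopg2 connection
    cursor = connection.cursor()
    try:
        buf = io.StringIO()
        cursor.copy_expert(
            "COPY {} TO STDOUT WITH (FORMAT CSV, HEADER)".format(table_name),
            file=buf, size=1048576)
        buf.seek(0)
        return pd.read_csv(buf, low_memory=False)
    finally:
        cursor.close()
        connection.close()

# Usage (placeholder credentials):
# engine = create_engine('postgresql://test:test@db/test')
# df = read_table_via_copy(engine, 'licenses')

Note that table_name is interpolated directly into the COPY statement, so it should come from trusted code, not user input.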

answered Sep 25 '22 by Mikael Gueck


This is normal behavior: reading a CSV file is always one of the quickest ways to simply load data.

A CSV is very naive and simple, so loading directly from it is very quick. For a massive database with a complex structure, CSV is not an option. SQL is super fast at selecting data from a table and returning that data to you. Naturally, being able to select, modify and manipulate the data adds an overhead time cost to your call.

Imagine that you have a time series in a CSV from 1920 to 2017, but you only want data from 2010 to today.

The CSV approach would be to load the entire CSV and then select the years 2010 to 2017.

The SQL approach would be to pre-select the years via a SQL WHERE clause (as sketched below).

In that scenario, SQL would be MUCH faster.
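
To make that concrete, here is a rough sketch of the two approaches; the file name, table name and date column are made up for illustration:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:port/schema')

# CSV approach: parse every row from 1920 onwards, then throw most of them away.
df_csv = pd.read_csv('timeseries.csv', parse_dates=['date'])
df_csv = df_csv[df_csv['date'] >= '2010-01-01']

# SQL approach: let the database do the filtering and only transfer the rows you need.
df_sql = pd.read_sql_query(
    "SELECT * FROM timeseries WHERE date >= '2010-01-01'", engine)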

answered Sep 22 '22 by Steven G