Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do quickly search through a .csv file in Python

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.

Are there any tricks to search the entire file? Should you read the whole thing into a dictionary or should you perform a search every time? I tried loading it into a dictionary but that took ages so I'm currently searching through the whole file every time which seems wasteful.

Could I possibly utilize that the list is alphabetically ordered? (e.g. if the search word starts with "b" I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b")

I'm using import csv.

(a side question: it is possible to make csv go to a specific line in the file? I want to make the program start at a random line)

Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?

like image 989
Iceland_jack Avatar asked Dec 17 '22 02:12

Iceland_jack


2 Answers

If the csv file isn't changing, load in it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that though.

Here is a rough example of inserting from a csv into a sqlite table. Example csv is ';' delimited, and has 2 columns.

import csv
import sqlite3

con = sqlite3.Connection('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')

f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')

cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)
cur.close()
con.commit()
con.close()
f.close()
like image 195
JimB Avatar answered Dec 19 '22 16:12

JimB


you can use memory mapping for really big files

import mmap,os,re
reportFile = open( "big_file" )
length = os.fstat( reportFile.fileno() ).st_size
try:
    mapping = mmap.mmap( reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
    mapping = mmap.mmap( reportFile.fileno(), 0, None, mmap.ACCESS_READ )
data = mapping.read(length)
pat =re.compile("b.+",re.M|re.DOTALL) # compile your pattern here.
print pat.findall(data)
like image 31
ghostdog74 Avatar answered Dec 19 '22 14:12

ghostdog74