 

Python in memory table data structures for analysis (dict, list, combo)

I'm trying to replicate some code that I have working in SQL, but using pure Python instead, with some help from this question: CSV to Python Dictionary with all column names?

I can now read my zipped CSV file into a dict, but I only end up with one line: the last one. (How do I get a sample of lines, or the whole data file?)

I am hoping to have a memory-resident table that I can manipulate much like SQL when I'm done. For example: clean the data by matching bad entries against another table of bad values and their corrections, then sum by type, average by time period, and the like (as sketched below). The total data file is about 500,000 rows. I'm not fussed about getting it all in memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.
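For concreteness, the kind of group-by aggregation I mean might look something like this with plain dicts. This is only a sketch: the field names come from my sample row below, and rows is assumed to be a list of per-row dicts.

from collections import defaultdict

# Sum and average LIST_PRICE grouped by PROP_TYPE (field names taken
# from the sample row below); rows is assumed to be a list of row dicts
totals = defaultdict(float)
counts = defaultdict(int)
for row in rows:
    if row['LIST_PRICE']:  # skip rows with a blank price
        totals[row['PROP_TYPE']] += float(row['LIST_PRICE'])
        counts[row['PROP_TYPE']] += 1

for prop_type in totals:
    print '%s: total=%.0f, average=%.0f' % (
        prop_type, totals[prop_type], totals[prop_type] / counts[prop_type])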

import csv, sys, zipfile

# Path to the zipped CSV (stored in sys.argv[0] here; a plain variable would be clearer)
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file    = zipfile.ZipFile(sys.argv[0])
items_file  = zip_file.open('AllListing1RES.txt', 'rU')
# Read the file row by row; each row is a dict keyed by the header line
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
My result is then:

>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM 
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>

I think I'm barking up a few wrong trees here. One is that I only get one line of my roughly 500,000-line data file. Two is that a dict may not be the right structure, since I don't think I can just load all 500,000 lines and do various operations on them, like summing by group and date. Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.

I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading the whole thing (which I also haven't figured out). I have read over the Python docs and several reference books, but it just isn't clicking yet.

It seems that most of the answers I can find suggest various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python. In some cases I think it will be easier and faster, as well as expanding my tool set, but I'm having a hard time finding relevant examples.

One answer that hints at what I'm getting at is:

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in a CSV, such as how dates are represented and which columns are dates.

An example of getting a column-oriented data structure (however, involving loading the whole file):

import csv
allrows = list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns = dict((x[0], x[1:]) for x in zip(*allrows))

The intermediate steps of going to a list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:

# csv values are strings, so convert before comparing numerically
matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if int(value) > 2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.

via Yann Vernier
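Applied to the date fields in my sample row, that parsing step might look like this (a sketch, assuming the formats shown above):

from datetime import datetime

# Formats inferred from the sample row: LIST_DATE like '3/3/2011',
# STATUS_DATE like '3/3/2011 11:54:56 PM'
list_date   = datetime.strptime('3/3/2011', '%m/%d/%Y')
status_date = datetime.strptime('3/3/2011 11:54:56 PM', '%m/%d/%Y %I:%M:%S %p')
print list_date, status_date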

Surely there is some good reference for this general topic?

asked Apr 17 '11 by dartdog



2 Answers

You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:

rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)

# rows[0]
{'keyA': 13, 'keyB': 'dataB' ... }
# rows[1]
{'keyA': 5, 'keyB': 'dataB' ... }

Then, to do aggregations and calculations:

sum(row['keyA'] for row in rows)  # assumes keyA has already been converted to a number

You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
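For example, one pass that converts types and groups rows up front can save rescanning everything later. This is a sketch only; 'keyA' and 'keyB' are the illustrative field names from above.

from collections import defaultdict

# Convert numeric fields once and bucket rows by a grouping key
by_group = defaultdict(list)
for row in rows:
    row['keyA'] = int(row['keyA'])  # DictReader gives strings, so convert once
    by_group[row['keyB']].append(row)

# Later aggregations only touch the relevant group
print sum(row['keyA'] for row in by_group['dataB'])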

As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer CSV data into a sqlite database.

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")

# csv.DictReader uses the first line in the file as column headings by default
# (note: delimiter is an argument to DictReader, not to open)
dr = csv.DictReader(open('data.csv'), delimiter=',')
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
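Once the rows are loaded, the SQL-style operations from the question work directly against the in-memory table (a sketch using the hypothetical col1/col2 schema above):

conn.commit()

# Group-by aggregation, just as you would in SQL
for col1, total in c.execute("select col1, sum(col2) from t group by col1"):
    print col1, total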
answered by Josh Smeaton


You say """I can now read my zipped CSV file into a dict, but I only end up with one line: the last one. (How do I get a sample of lines, or the whole data file?)"""

Your code does this:

for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass

I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).

To "get" the whole file, change pass to do_something_useful_with(row).

If you want to read the whole file into memory, simply do this:

rows = list(csv.DictReader(.....))

To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:

for row_index, row in enumerate(csv.DictReader(.....)):
    if row_index % N != M: continue
    do_something_useful_with(row)
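If you just want the first 10 or 100 rows to test with, itertools.islice also works (a sketch reusing items_file and do_something_useful_with from above):

from itertools import islice

# Take only the first 10 rows, without reading the rest of the file
for row in islice(csv.DictReader(items_file, delimiter='\t'), 10):
    do_something_useful_with(row)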

By the way, you don't need dialect='excel'; that's the default.

answered by John Machin