List with many dictionaries VS dictionary with few lists?

Tags:

I am doing some exercises with datasets like so:

List with many dictionaries

users = [     {"id": 0, "name": "Ashley"},     {"id": 1, "name": "Ben"},     {"id": 2, "name": "Conrad"},     {"id": 3, "name": "Doug"},     {"id": 4, "name": "Evin"},     {"id": 5, "name": "Florian"},     {"id": 6, "name": "Gerald"} ]

Dictionary with few lists

users2 = {     "id": [0, 1, 2, 3, 4, 5, 6],     "name": ["Ashley", "Ben", "Conrad", "Doug","Evin", "Florian", "Gerald"] }

Pandas dataframes

import pandas as pd pd_users = pd.DataFrame(users) pd_users2 = pd.DataFrame(users2) print pd_users == pd_users2

Questions:

Should I structure the datasets like users or like users2?
Are there performance differences?
Is one more readable than the other?
Is there a standard I should follow?
I usually convert these to pandas dataframes. When I do that, both versions are identical... right?
The output is true for each element so it doesn't matter if I work with panda df's right?

214

asked May 29 '15 06:05

megashigger

1 Answers

This relates to column oriented databases versus row oriented. Your first example is a row oriented data structure, and the second is column oriented. In the particular case of Python, the first could be made notably more efficient using slots, such that the dictionary of columns doesn't need to be duplicated for every row.

Which form works better depends a lot on what you do with the data; for instance, row oriented is natural if you only ever access all of any row. Column oriented meanwhile makes much better use of caches and such when you're searching by a particular field (in Python, this may be reduced by the heavy use of references; types like array can optimize that). Traditional row oriented databases frequently use column oriented sorted indices to speed up lookups, and knowing these techniques you can implement any combination using a key-value store.

Pandas does convert both your examples to the same format, but the conversion itself is more expensive for the row oriented structure, simply because every individual dictionary must be read. All of these costs may be marginal.

There's a third option not evident in your example: In this case, you only have two columns, one of which is an integer ID in a contiguous range from 0. This can be stored in the order of the entries itself, meaning the entire structure would be found in the list you've called users2['name']; but notably, the entries are incomplete without their position. The list translates into rows using enumerate(). It is common for databases to have this special case also (for instance, sqlite rowid).

In general, start with a data structure that keeps your code sensible, and optimize only when you know your use cases and have a measurable performance issue. Tools like Pandas probably means most projects will function just fine without finetuning.

170

answered Oct 07 '22 15:10

Yann Vernier

Related questions
                            
                                Fastest way to generate delimited string from 1d numpy array
                            
                                Emitting namespace specifications with ElementTree in Python
                            
                                Writing response body with BaseHTTPRequestHandler
                            
                                Python Multiple users append to the same file at the same time
                            
                                Python string 'in' operator implementation algorithm and time complexity
                            
                                Tuple or list when using 'in' in an 'if' clause?
                            
                                How to add extra whitespace between section header and a paragraph
                            
                                Modify the default font in Python Tkinter
                            
                                Pycharm exit code 0
                            
                                Ansible. override single dictionary key [duplicate]
                            
                                Wait for all multiprocessing jobs to finish before continuing
                            
                                Getting an error - AttributeError: 'module' object has no attribute 'run' while running subprocess.run(["ls", "-l"])
                            
                                S3 Connection timeout when using boto3
                            
                                How to peek front of deque without popping?
                            
                                Python 2.x - Write binary output to stdout?
                            
                                How can I make a simple 3D line with Matplotlib?
                            
                                How dangerous is setting self.__class__ to something else?
                            
                                Need to execute a function after returning the response in Flask
                            
                                Integer division by negative number [duplicate]
                            
                                Finding red color in image using Python & OpenCV

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

List with many dictionaries VS dictionary with few lists?

Tags:

python

pandas

dataset

megashigger

People also ask

1 Answers

Yann Vernier

Recent Activity

Donate For Us