I am doing some exercises with datasets like so:
List with many dictionaries
users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
    {"id": 4, "name": "Evin"},
    {"id": 5, "name": "Florian"},
    {"id": 6, "name": "Gerald"},
]
Dictionary with few lists
users2 = {
    "id": [0, 1, 2, 3, 4, 5, 6],
    "name": ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"],
}
Pandas dataframes
import pandas as pd
pd_users = pd.DataFrame(users)
pd_users2 = pd.DataFrame(users2)
print(pd_users == pd_users2)
Questions:
Dictionaries are more efficient for looking up elements by key, because a list has to be scanned element by element. Lists are ordered sequences addressed by position. Dictionaries are addressed by key, and they have preserved insertion order only since Python 3.7. So it is wise to use a list when the position of the data elements is what matters.
Looking up an element by key in a dictionary takes roughly constant time on average, whereas finding it in a list requires a scan whose cost grows with the length of the list. The difference matters for, say, a data set with 5,000,000 elements in a machine learning model that relies on fast data retrieval.
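To make the lookup difference concrete, here is a minimal sketch using a shortened version of the users list from the question. The helper names find_in_list and find_in_index are made up for illustration, and the sketch assumes the ids are unique.

```python
# Shortened copy of the row oriented data from the question.
users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
]


def find_in_list(rows, user_id):
    """Linear scan: the cost grows with the number of rows."""
    for row in rows:
        if row["id"] == user_id:
            return row
    return None


# Build the index once (assumes ids are unique); each later lookup
# is a single hash probe on average.
users_by_id = {row["id"]: row for row in users}


def find_in_index(index, user_id):
    """Dictionary lookup by key; returns None when the id is absent."""
    return index.get(user_id)


print(find_in_list(users, 2)["name"])
print(find_in_index(users_by_id, 2)["name"])
```

Building the index costs one pass over the data, so it only pays off when you look things up more than a handful of times.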
A list is an ordered collection of values addressed by integer index, much like an array in C++. A dictionary is a hashed structure of key-value pairs. We can create a list by placing the elements inside square brackets [ ] and separating them with commas.
Use a dictionary when you have a set of unique keys that map to values. Use a list if you have an ordered collection of items. Use a set to store an unordered collection of unique items.
This relates to column oriented databases versus row oriented. Your first example is a row oriented data structure, and the second is column oriented. In the particular case of Python, the first could be made notably more efficient by using a class with __slots__, so that the dictionary of column names doesn't need to be duplicated for every row.
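A minimal sketch of the slots idea, assuming a hypothetical User class with the two fields from the question (the class and variable names are mine, not part of any library):

```python
import sys


class User:
    # __slots__ fixes the attribute names on the class, so instances
    # do not carry a per-instance __dict__ holding the column names.
    __slots__ = ("id", "name")

    def __init__(self, id, name):
        self.id = id
        self.name = name


rows = [{"id": 0, "name": "Ashley"}, {"id": 1, "name": "Ben"}]
slotted_users = [User(**row) for row in rows]

print(slotted_users[1].name)
print(hasattr(slotted_users[0], "__dict__"))  # False

# Rough per-row container sizes; the exact numbers depend on the
# interpreter version and do not include the stored values themselves.
print(sys.getsizeof(rows[0]), sys.getsizeof(slotted_users[0]))
```

The trade-off is that a slotted instance cannot grow new attributes at runtime, which is usually fine for fixed-schema rows.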
Which form works better depends a lot on what you do with the data; for instance, row oriented is natural if you always access whole rows. Column oriented meanwhile makes much better use of caches and such when you're searching by a particular field (in Python, this benefit may be reduced because lists hold references to objects rather than the values themselves; types like array can optimize that). Traditional row oriented databases frequently use column oriented sorted indices to speed up lookups, and knowing these techniques you can implement any combination using a key-value store.
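As a rough sketch of moving between the two layouts and searching a single column, something like the following would do. The function names rows_to_columns and columns_to_rows are hypothetical, and the code assumes a non-empty list in which every row has the same keys.

```python
from array import array

users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
]


def rows_to_columns(rows):
    """Row oriented (list of dicts) to column oriented (dict of lists)."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


def columns_to_rows(columns):
    """Column oriented (dict of lists) back to row oriented (list of dicts)."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


columns = rows_to_columns(users)

# Searching by one field only touches that field's list.
position = columns["name"].index("Conrad")
print(columns["id"][position])

# array stores the integers themselves instead of references to int objects.
ids = array("q", columns["id"])
print(ids[position])
```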
Pandas does convert both your examples to the same format, but the conversion itself is more expensive for the row oriented structure, simply because every individual dictionary must be read. All of these costs may be marginal.
There's a third option not evident in your example: in this case, you only have two columns, one of which is an integer ID in a contiguous range from 0. This can be stored in the order of the entries itself, meaning the entire structure would be found in the list you've called users2['name']; but notably, the entries are incomplete without their position. The list translates into rows using enumerate(). It is common for databases to have this special case also (for instance, sqlite rowid).
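For illustration, rebuilding the rows from the bare name list with enumerate() could look like this; it only works because the ids are assumed to be exactly 0, 1, 2, ... in list order.

```python
names = ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"]

# The position in the list plays the role of the id column.
rows = [{"id": i, "name": name} for i, name in enumerate(names)]

print(rows[3])   # {'id': 3, 'name': 'Doug'}
print(names[3])  # direct lookup by id is just indexing
```

If an entry is ever deleted from the middle, every later id shifts, so this layout suits append-only data best.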
In general, start with a data structure that keeps your code sensible, and optimize only when you know your use cases and have a measurable performance issue. Tools like Pandas probably mean most projects will function just fine without fine-tuning.