The elementary data science Python courses I have attended focus on practical execution and not so much on theory. When I follow along, it makes sense, but when I have to do an unguided exercise, I am lost. Is this common among beginners like me? It can get quite disheartening.
I learned about the characteristics of lists, Series, dictionaries and DataFrames, but I don't understand when to use which, and why. Sometimes it calls for a list, sometimes a Series, sometimes an array. It seems that the ultimate aim is to have everything in DataFrames. Is that right?
I am not even sure if my question makes sense.
This question is perfectly valid, but the answer is "often it depends". I will try to outline it somewhat: first there are the basic Python types (list, dictionary), and then there are the types from the pandas library (Series, DataFrame). Generally, the Python types are more general and multi-purpose, while the pandas data types cater to the needs of data scientists.
Use a list if you have a number of related items which need to be accessed without a key, e.g. a list of person names:
names = ["John", "Peter"]
A list is ordered and can be easily filtered using list comprehensions or functions like filter(), map() etc. A list is a Swiss Army knife suitable for a lot of data, but it should not be used if you need to access your data by an ID. For that use case, use a dictionary.
Nothing stops you from adding objects of different types to a list, e.g. [1, "A", {}], but that's often a bad idea.
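As a small sketch of the filtering mentioned above (the extra names and the variable names are made up for illustration), the same filter can be written as a list comprehension or with filter():

```python
# Hypothetical sample data, extending the names list from above.
names = ["John", "Peter", "Paula", "Anna"]

# List comprehension: keep only the names that start with "P".
p_names = [name for name in names if name.startswith("P")]

# The same filter with the built-in filter(). It returns a lazy iterator,
# so it is wrapped in list() to get an actual list back.
p_names_filtered = list(filter(lambda name: name.startswith("P"), names))

# map() applies a function to every element.
upper_names = list(map(str.upper, names))

# Lists are ordered, so access by position and slicing both work.
first_name = names[0]
last_two = names[-2:]
```

Which form to use is mostly a matter of taste; comprehensions tend to be easier to read once a condition is involved.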
A dictionary offers the ability to store various objects and to access them by a known key, e.g.:
data = {"John": {"Age": 16, "Stupid": False}, "Peter": {"Age": 20, "Stupid": True}}
john = data["John"]
This is extremely handy if you need to get an object by such a value. It's also possible to iterate over the values using data.values(), or over the key-value pairs using data.items(), but if you only need to iterate the data, keep it as a list.
It's often a matter of design whether you keep your data in a list or a dictionary. Both can work, but usually one style shows itself to be preferable: prefer a list if you need to iterate the data, and take a dictionary if you need random access via an ID.
Since Python 3.7, dictionaries preserve insertion order, so iterating them yields the items in the order they were added. That is not guaranteed in older Python versions; use collections.OrderedDict in that case, or use a list.
Nothing stops you from adding objects of different types to a dict, but that's often a bad idea.
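To make the trade-off concrete, here is a minimal sketch using the same data as above. The variable names and the age threshold are only illustrative:

```python
data = {
    "John": {"Age": 16, "Stupid": False},
    "Peter": {"Age": 20, "Stupid": True},
}

# Random access by key: the main reason to pick a dictionary.
john = data["John"]

# .get() returns None instead of raising KeyError when the key is absent.
missing = data.get("Mary")

# Iterating over key-value pairs with .items().
adults = [name for name, person in data.items() if person["Age"] >= 18]

# Iterating over the values only with .values().
ages = [person["Age"] for person in data.values()]

# The same data as a list of records, which suits pure iteration better.
people = [{"Name": name, **person} for name, person in data.items()]
```

If the code only ever loops over the records, the list form is usually enough; the dictionary pays off as soon as lookups by name are needed.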
Lastly, there are also sets in Python. Sets behave a lot like dictionaries that have only keys: every element is unique and membership tests are fast. They support operations from set theory such as intersection and issubset, which can be extremely handy if you have to compare or subtract groups of data.
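A short sketch of those set operations, using invented attendance data purely for illustration:

```python
# Hypothetical example: who attended on which day.
attended_monday = {"John", "Peter", "Anna"}
attended_tuesday = {"Peter", "Anna", "Paula"}

# Intersection: people present on both days.
both_days = attended_monday & attended_tuesday

# Difference ("subtracting" one group from another): Monday only.
only_monday = attended_monday - attended_tuesday

# Union: everyone who showed up at least once.
either_day = attended_monday | attended_tuesday

# Subset test.
is_subset = {"Peter"}.issubset(attended_monday)

# Building a set from a list removes duplicates.
unique_names = set(["John", "Peter", "John"])
```

Note that sets are unordered, so they are not a replacement for a list when the order of the items matters.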
Series are a pure pandas library construct. They view data fundamentally like a column in a table: a "list" of data points of a certain type and a certain length. The column also has a name.
Technically, a Series is not backed by a list internally but by a NumPy array, which is typically both faster and smaller (memory-wise) than a Python list. So for many elements, a Series has better performance.
A Series also offers methods to manipulate and describe data which a list does not have. I use a Series if I need to do something with the data that is supported only by a Series, e.g. plotting a histogram.
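A minimal sketch of what a Series adds over a plain list, assuming pandas is installed. The ages are made-up sample values:

```python
import pandas as pd

# A Series is a named, one-dimensional column of values.
ages = pd.Series([16, 20, 35, 41], name="Age")

# Descriptive methods that a plain list does not have.
mean_age = ages.mean()
summary = ages.describe()  # count, mean, std, min, quartiles, max

# Boolean filtering without writing a loop.
adults = ages[ages >= 18]

# Arithmetic is applied element-wise to the whole column.
ages_in_months = ages * 12
```

Plotting the histogram mentioned above would be ages.plot.hist(), which additionally requires matplotlib and is therefore left out of the sketch.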
A DataFrame is also a pandas type. It contains a tabular view of data: basically a collection of Series, one per column, that share the same index. It offers rich functionality to view and manipulate data. It is well suited for the analysis of tabular data, but it is not really a general-purpose data format (although it is extremely handy). Use it for data you want to analyze, not as the default container for, say, raw data you get from an API.
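As a rough sketch of how the pieces fit together, assuming pandas is installed, the nested dictionary from the dictionary example can be turned into a DataFrame for analysis:

```python
import pandas as pd

data = {
    "John": {"Age": 16, "Stupid": False},
    "Peter": {"Age": 20, "Stupid": True},
}

# orient="index" turns the outer keys into row labels and the inner
# keys into columns.
df = pd.DataFrame.from_dict(data, orient="index")

# Selecting a single column gives back a Series.
age_column = df["Age"]

# Filtering rows with a boolean condition.
adults = df[df["Age"] >= 18]

# Column-wise statistics.
mean_age = df["Age"].mean()
```

So the aim is not to have everything in a DataFrame: the plain dictionary stays the convenient form for lookups, and the conversion happens at the point where the analysis starts.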