I have a Pandas dataframe:
    type(original)
    pandas.core.frame.DataFrame

which includes the series object original['user']:

    type(original['user'])
    pandas.core.series.Series

original['user'] points to a number of dicts:

    type(original['user'].ix[0])
    dict
Each dict has the same keys:
    original['user'].ix[0].keys()
    [u'follow_request_sent',
     u'profile_use_background_image',
     u'profile_text_color',
     u'id',
     u'verified',
     u'profile_location',
     # ... keys removed for brevity
    ]

Above is (part of) the key list of one of the user dicts in a tweet from the Twitter API. I want to build a data frame from these dicts.
When I try to make a data frame directly, I get only one column for each row and this column contains the whole dict:
    pd.DataFrame(original['user'][:2])
                                                    user
    0  {u'follow_request_sent': False, u'profile_use_...
    1  {u'follow_request_sent': False, u'profile_use_..
When I try to create a data frame using from_dict() I get the same result:
    pd.DataFrame.from_dict(original['user'][:2])
                                                    user
    0  {u'follow_request_sent': False, u'profile_use_...
    1  {u'follow_request_sent': False, u'profile_use_..
Next I tried a list comprehension which returned an error:
    item = [[k, v] for (k,v) in users]
    ValueError: too many values to unpack
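A likely reason for that error: iterating over a Series yields its values, so each loop item is a whole dict, and unpacking a dict with many keys into (k, v) fails. Below is a minimal sketch of a comprehension that keeps only selected keys. The sample data and the `wanted` list are hypothetical stand-ins, not the real tweet data.

```python
import pandas as pd

# Hypothetical stand-in for original['user']: a Series whose values are dicts.
users = pd.Series([
    {"id": 1, "screen_name": "alice", "verified": False, "lang": "en"},
    {"id": 2, "screen_name": "bob", "verified": True, "lang": "de"},
])

# Assumed subset of keys to keep.
wanted = ["id", "screen_name"]

# Each item `d` is a full dict, so build a smaller dict per row
# instead of trying to unpack the dict itself into (k, v).
rows = [{k: d.get(k) for k in wanted} for d in users]

# Reuse the original index so rows still line up with the source Series.
filtered = pd.DataFrame(rows, index=users.index)
```

Using `d.get(k)` rather than `d[k]` means a dict that lacks one of the wanted keys produces a missing value instead of raising a KeyError.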
When I create a data frame from a single row, it nearly works:
    df = pd.DataFrame.from_dict(original['user'].ix[0])
    df.reset_index()

    index contributors_enabled created_at default_profile default_profile_image description entities favourites_count follow_request_sent followers_count following friends_count geo_enabled id id_str is_translation_enabled is_translator lang listed_count location name notifications profile_background_color profile_background_image_url profile_background_image_url_https profile_background_tile profile_image_url profile_image_url_https profile_link_color profile_location profile_sidebar_border_color profile_sidebar_fill_color profile_text_color profile_use_background_image protected screen_name statuses_count time_zone url utc_offset verified
    0 description False Mon May 26 11:58:40 +0000 2014 True False {u'urls': []} 0 False 157

It works almost like I want it to, except it sets the description field as the default index.
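One plausible explanation for the index, sketched below: when a dict mixes scalar values with a nested dict, pandas broadcasts the scalars and takes the row labels from the nested dict's keys. The trimmed-down `user` dict is hypothetical, and the shape of its `entities` value is an assumption inferred from the output above.

```python
import pandas as pd

# Hypothetical, trimmed-down user dict. The nested "entities" value is an
# assumption inferred from the output shown in the question.
user = {
    "id": 1,
    "followers_count": 157,
    "entities": {"description": {"urls": []}},
}

# Scalars are broadcast, while the nested dict supplies the row labels,
# so its key "description" ends up as the index.
as_columns = pd.DataFrame.from_dict(user)

# Wrapping the dict in a list treats it as one record instead:
# one row, default integer index, nested dict kept intact in its cell.
as_record = pd.DataFrame([user])
```

If that assumption holds for the real data, passing a list of dicts avoids the stray index entirely.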
Each of the dicts has 40 keys, but I only need about 10 of them, and the data frame has 28734 rows.
How can I filter out the keys which I do not need?
You can create a pandas Series from a dictionary by passing the dictionary to pandas.Series(). In this article, you will learn about the different ways of configuring pandas.Series() to make a Series from a dictionary, followed by a few practical tips for using them.

Method 1: Create a DataFrame from a dictionary using the default constructor of the pandas.DataFrame class.
Method 2: Create a DataFrame from a dictionary with user-defined indexes.
Method 3: Create a DataFrame from a simple dictionary, i.e., a dictionary whose values are simple types such as integers or strings.

Create a DataFrame from a list of dicts with custom indexes. Because all the dictionaries have similar keys, the keys become the column names. For each key, the values associated with that key across all the dictionaries become the column values.
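As a small illustration of the list-of-dicts case with custom indexes, here is a sketch using made-up records and hypothetical index labels:

```python
import pandas as pd

# Made-up records that share the same keys.
records = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "de"},
]

# The dict keys become column names; `index` supplies custom row labels
# (the labels here are hypothetical).
df = pd.DataFrame(records, index=["row_a", "row_b"])
```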
What I would try is the following:

    new_df = pd.DataFrame(list(original['user']))

This converts the Series to a list of dicts and passes it to the DataFrame constructor, which turns each dict key into a column.
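To also drop the unneeded keys, the constructor's `columns` argument can select a subset when the input is a list of dicts. A minimal sketch, assuming hypothetical sample data and an assumed `wanted` list:

```python
import pandas as pd

# Hypothetical stand-in for the real frame: one "user" column of dicts,
# with a non-default index to show that it can be preserved.
original = pd.DataFrame(
    {
        "user": [
            {"id": 1, "screen_name": "alice", "followers_count": 157, "lang": "en"},
            {"id": 2, "screen_name": "bob", "followers_count": 42, "lang": "de"},
        ]
    },
    index=[10, 20],
)

# Assumed subset of keys to keep.
wanted = ["id", "screen_name", "followers_count"]

# `columns` restricts the result to the wanted keys; `index` keeps the
# rows aligned with `original`, since list() discards the Series index.
new_df = pd.DataFrame(list(original["user"]), columns=wanted, index=original.index)
```

Because the index matches, the result could be joined back onto `original` if needed.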
df = original['user'].apply(pd.Series)
works well
credit
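For reference, here is a sketch of the apply(pd.Series) approach followed by column selection, next to pd.json_normalize as an alternative. The sample data is hypothetical, and json_normalize at the top level of pandas assumes pandas 1.0 or later. apply(pd.Series) builds one Series per row, so it can be slow on tens of thousands of rows.

```python
import pandas as pd

# Hypothetical stand-in for original['user'].
users = pd.Series([
    {"id": 1, "screen_name": "alice", "followers_count": 157},
    {"id": 2, "screen_name": "bob", "followers_count": 42},
])

# Expand each dict into columns, then keep only the wanted ones.
wanted = ["id", "screen_name"]  # assumed subset of keys
expanded = users.apply(pd.Series)
subset = expanded[wanted]

# Alternative (assumes pandas >= 1.0): json_normalize takes a list of
# dicts and also flattens nested dicts into dotted column names.
flat = pd.json_normalize(users.tolist())
```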