I have the following code:
businessdata = ['Name of Location','Address','City','Zip Code','Website','Yelp',
'# Reviews', 'Yelp Rating Stars','BarRestStore','Category',
'Price Range','Alcohol','Ambience','Latitude','Longitude']
business = pd.read_table('FL_Yelp_Data_v2.csv', sep=',', header=1, names=businessdata)
print '\n\nBusiness\n'
print business[:6]
It reads my file and creates a pandas DataFrame I can work with. What I need is to count how many categories are in each row of the 'Category' column and store this number in a new column named '# Categories'. Here is a sample of the target column:
Category
French
Adult Entertainment , Lounges , Music Venues
American (New) , Steakhouses
American (New) , Beer, Wine & Spirits , Gastropubs
Chicken Wings , Sports Bars , American (New)
Japanese
Desired output:
Category                                              # Categories
French                                                1
Adult Entertainment , Lounges , Music Venues          3
American (New) , Steakhouses                          2
American (New) , Beer, Wine & Spirits , Gastropubs    4
Chicken Wings , Sports Bars , American (New)          3
Japanese                                              1
EDIT 1:
The raw input is a CSV file and the target column is "Category". I can't post screenshots yet. I don't think the values to be counted are lists.
This is my code:
business = pd.read_table('FL_Yelp_Data_v2.csv', sep=',', header=1, names=businessdata, skip_blank_lines=True)
#business = pd.read_csv('FL_Yelp_Data_v2.csv')
business['Category'].str.split(',').apply(len)
#not sure where to declare the df part in the suggestions that use it.
print business[:6]
but I keep getting the following error:
TypeError: object of type 'float' has no len()
EDIT 2:
I GIVE UP. Thanks for all your help, but I'll have to figure out something else.
Assuming that Category is actually a list, you can use apply (per @EdChum's suggestion):
business['# Categories'] = business.Category.apply(len)
If it is a comma-separated string instead, you first need to parse it and convert it into a list:
df['Category'] = df.Category.map(lambda x: [i.strip() for i in x.split(",")])
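To make the parsing step concrete, here is a minimal sketch in plain Python 3, assuming each cell is a comma-separated string like the samples in the question. The helper name parse_categories is made up for this illustration and is not part of pandas.

```python
def parse_categories(text):
    """Split a comma-separated string into a list of stripped category names."""
    return [part.strip() for part in text.split(",")]


# Sample values copied from the question's 'Category' column.
samples = [
    "French",
    "Adult Entertainment , Lounges , Music Venues",
    "American (New) , Beer, Wine & Spirits , Gastropubs",
]

# Parse each string into a list, then count the items in each list.
parsed = [parse_categories(s) for s in samples]
counts = [len(p) for p in parsed]
```

Note that 'Beer, Wine & Spirits' is split into two items here, which is consistent with the count of 4 shown in the desired output.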
Can you show some sample output of EXACTLY what this column looks like (including correct quotations)?
P.S. @EdChum Thank you for your suggestions. I appreciate them. I believe the list comprehension method may be faster, based on a test I ran on some text data with 30k+ rows:
%%timeit
df.Category.str.strip().str.split(',').apply(len)
10 loops, best of 3: 44.8 ms per loop
%%timeit
df.Category.map(lambda x: [i.strip() for i in x.split(",")])
10 loops, best of 3: 28.4 ms per loop
Even accounting for the len function call:
%%timeit
df.Category.map(lambda x: len([i.strip() for i in x.split(",")]))
10 loops, best of 3: 30.3 ms per loop
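For anyone who wants to repeat this comparison outside IPython, the following is a rough sketch using the standard timeit module (Python 3). The data is a hypothetical stand-in (the question's six sample rows repeated to 30,000 rows), not the data I actually timed, and the function names are made up for the example.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data: the question's sample rows, repeated.
sample = [
    "French",
    "Adult Entertainment , Lounges , Music Venues",
    "American (New) , Steakhouses",
    "American (New) , Beer, Wine & Spirits , Gastropubs",
    "Chicken Wings , Sports Bars , American (New)",
    "Japanese",
]
df = pd.DataFrame({"Category": sample * 5000})  # 30,000 rows, no missing values


def with_str_methods():
    # Vectorized string accessor, then len() applied to each resulting list.
    return df.Category.str.strip().str.split(",").apply(len)


def with_map():
    # Plain Python list comprehension applied to each string.
    return df.Category.map(lambda x: len([i.strip() for i in x.split(",")]))


# Both approaches should agree before their speed is compared.
assert with_str_methods().tolist() == with_map().tolist()

# Timings depend on the machine and pandas version, so none are quoted here.
t_str = timeit.timeit(with_str_methods, number=10)
t_map = timeit.timeit(with_map, number=10)
print("str methods: %.3f s, map: %.3f s (10 runs each)" % (t_str, t_map))
```

Neither version handles missing values, so this only applies to a column with no empty cells.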
This works:
business['# Categories'] = business['Category'].apply(lambda x: len(x.split(',')))
If you need to handle NA values and similar cases, you can pass a more elaborate function instead of the lambda.
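As one possible shape for such a function, here is a minimal sketch in Python 3. The name count_categories is made up, and returning 0 for a missing value is my assumption about what you want; adjust it if missing rows should be treated differently. On Python 2 the isinstance check would need basestring instead of str.

```python
def count_categories(value):
    """Count comma-separated items; return 0 for missing or non-string values."""
    # pandas reads an empty CSV cell as float('nan'), which has no split()
    # and no len(). That is the likely source of the TypeError in the question.
    if not isinstance(value, str):
        return 0  # assumption: a missing cell counts as zero categories
    # Skip empty pieces, such as the one left by a trailing comma.
    return len([part for part in value.split(",") if part.strip()])


# Plain-Python check on values like those a 'Category' column might hold.
values = ["French", "American (New) , Steakhouses", float("nan"), ""]
counts = [count_categories(v) for v in values]
```

It can then be passed to apply in place of the lambda:
business['# Categories'] = business['Category'].apply(count_categories)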