I have a text file from Amazon containing the following info:
# user item time rating review text (I added this header for explanation; it is not in the text file)
disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use
hjf2329ccc TGjsk123 14423321 3 Suck restaurant
As you can see, the data is separated by spaces, and the number of columns differs from row to row because the review text itself also contains spaces. Here is the code I have tried:
pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part
This raises the following error:
ValueError: Passed header names mismatches usecols
When I tried to read all the columns:
pd.read_csv(filename, sep = " ", header = None)
And the error this time is:
Error tokenizing data. C error: Expected 229 fields in line 3, saw 320
Since the review text is very long in many rows, the method of adding header names for each column described in this question cannot work.
How can I read the file both when I want to keep the review text and when I want to skip it? Thank you in advance!
EDIT:
The problem has been solved perfectly by Martin Evans. But now I am working with another data set in a similar but different format, where the order of the columns is reversed:
# review text user item time rating (I added this header for explanation; it is not in the text file)
I love this phone as it is easy to use disjiad123 TYh23hs9 13160032 5
Suck restaurant hjf2329ccc TGjsk123 14423321 3
Do you have any idea how to read it properly? Any help would be appreciated!
Using the len() function: with this method, we read the CSV file using the pandas library and then call len() on the resulting DataFrame, which returns the number of rows read from the CSV file as an int.
A CSV file stores data in rows, and the values in each row are separated by a separator, also known as a delimiter. Although the name stands for Comma-Separated Values, the delimiter can be any character.
To get the number of rows, and columns we can use len(df.
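As a minimal sketch of the len() approach (the inline sample data and the names sample and df are made up for illustration), len(df) gives the number of rows, and df.shape gives rows and columns together:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file.
sample = "user,item,rating\nu1,i1,5\nu2,i2,3\n"
df = pd.read_csv(io.StringIO(sample))

n_rows = len(df)          # number of data rows (the header line is not counted)
n_cols = len(df.columns)  # number of columns
shape = df.shape          # (rows, columns) as one tuple
print(n_rows, n_cols, shape)
```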
I think the best approach is using pandas read_csv:
import pandas as pd
import io

temp = u"""disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use
hjf2329ccc TGjsk123 14423321 3 Suck restaurant so I love cooking pizza with onion ham garlic tomatoes"""

# estimated maximum number of columns in any row
N = 20

# after testing, replace io.StringIO(temp) with filename
df = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",      # separator is arbitrary whitespace
                 header=None,     # first row is not a header, read all data into df
                 names=range(N))  # N column names, so shorter rows are padded with NaN
print(df)
            0         1         2  3     4           5     6      7     8  \
0  disjiad123  TYh23hs9  13160032  5     I        love  this  phone    as
1  hjf2329ccc  TGjsk123  14423321  3  Suck  restaurant    so      I  love

         9     10    11     12   13      14        15   16   17   18   19
0       it     is  easy     to  use     NaN       NaN  NaN  NaN  NaN  NaN
1  cooking  pizza  with  onion  ham  garlic  tomatoes  NaN  NaN  NaN  NaN
# select the wanted columns by position
df = df.iloc[:, [0, 1, 2]]
# rename columns
df.columns = ['user', 'item', 'time']
print(df)
         user      item      time
0  disjiad123  TYh23hs9  13160032
1  hjf2329ccc  TGjsk123  14423321
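If guessing N feels fragile, one alternative is to skip read_csv and split each line yourself. This is only a sketch and not part of the approach above: str.split(maxsplit=4) stops after the first four fields, so the whole review stays in the fifth piece. The helper name read_reviews, the keep_review flag and the inline sample lines are made up for illustration.

```python
import io
import pandas as pd

COLUMNS = ["user", "item", "time", "rating", "review"]

def read_reviews(handle, keep_review=True):
    """Parse 'user item time rating review...' lines (hypothetical helper)."""
    rows = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        # split on whitespace at most 4 times: 4 fixed fields + the rest of the line
        parts = line.strip().split(maxsplit=4)
        parts += [""] * (len(COLUMNS) - len(parts))  # pad rows without review text
        rows.append(parts)
    df = pd.DataFrame(rows, columns=COLUMNS)
    df["time"] = df["time"].astype(int)
    df["rating"] = df["rating"].astype(int)
    return df if keep_review else df.drop(columns="review")

sample = io.StringIO(
    "disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use\n"
    "hjf2329ccc TGjsk123 14423321 3 Suck restaurant\n"
)
df = read_reviews(sample)
print(df[["user", "item", "rating"]])  # the columns the question wanted
```

With a real file, pass an open file object (for example open(filename)) instead of the StringIO object.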
If you need all columns, you first need a preprocessing step to find the maximum number of columns to pass to read_csv, and then a postprocessing step to join the trailing columns into one:
import pandas as pd

# preprocessing: find the largest number of whitespace-separated fields in any row
def get_max_len():
    with open('file1.csv', 'r') as f:
        # split each raw line on whitespace, matching sep=r"\s+" below
        return max(len(line.split()) for line in f)

df = pd.read_csv('file1.csv',
                 sep=r"\s+",     # separator is arbitrary whitespace
                 header=None,    # first row is not a header, read all data into df
                 names=range(get_max_len()))  # one column per field of the widest row
print(df)
            0         1         2  3     4           5     6      7    8  \
0  disjiad123  TYh23hs9  13160032  5     I        love  this  phone   as
1  hjf2329ccc  TGjsk123  14423321  3  Suck  restaurant   NaN    NaN  NaN

     9   10    11   12   13
0   it   is  easy   to  use
1  NaN  NaN   NaN  NaN  NaN
# columns from position 4 to the last
print(df.iloc[:, 4:])
      4           5     6      7    8    9   10    11   12   13
0     I        love  this  phone   as   it   is  easy   to  use
1  Suck  restaurant   NaN    NaN  NaN  NaN  NaN   NaN  NaN  NaN
# concatenate the trailing columns into one review text
df['review text'] = df.iloc[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, str)]), axis=1)
df = df.rename(columns={0: 'user', 1: 'item', 2: 'time', 3: 'rating'})
# get the columns whose names are strings
cols = [x for x in df.columns if isinstance(x, str)]
# keep only those columns
print(df[cols])
         user      item      time  rating  \
0  disjiad123  TYh23hs9  13160032       5
1  hjf2329ccc  TGjsk123  14423321       3

                              review text
0  I love this phone as it is easy to use
1                         Suck restaurant
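For the reversed layout in the question's EDIT, where the review comes first and the four fixed fields come last, one option is to split each line from the right with str.rsplit(maxsplit=4). This is a sketch only: parse_reversed is a hypothetical helper name, and the sample lines are adapted from the question.

```python
import io
import pandas as pd

def parse_reversed(handle):
    """Parse 'review... user item time rating' lines (hypothetical helper)."""
    rows = []
    for line in handle:
        # split from the right at most 4 times: whatever is left over is the review
        parts = line.strip().rsplit(maxsplit=4)
        if len(parts) < 4:
            continue  # blank or malformed line: fewer than the 4 fixed fields
        if len(parts) == 4:
            parts = [""] + parts  # no review text on this line
        rows.append(parts)
    df = pd.DataFrame(rows, columns=["review", "user", "item", "time", "rating"])
    return df.astype({"time": int, "rating": int})

sample = io.StringIO(
    "I love this phone as it is easy to use disjiad123 TYh23hs9 13160032 5\n"
    "Suck restaurant hjf2329ccc TGjsk123 14423321 3\n"
)
df = parse_reversed(sample)
print(df[["user", "item", "time", "rating", "review"]])
```

Splitting from the right works here because the last four fields never contain spaces, so the review can be any length without an estimate of the column count.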