
How to read a csv file properly if each row contains a different number of fields (and the number is quite big)?

Tags: python, pandas, csv

I have a text file from amazon, containing the following info:

 #      user        item     time   rating     review text (the header is added by me for explanation, it is not in the text file)
  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant

As you can see, the fields are separated by spaces, and each row has a different number of columns because the review text itself also contains spaces. Here is the code I have tried:

pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part

This raises the following error:

ValueError: Passed header names mismatches usecols

When I tried to read all the columns:

pd.read_csv(filename, sep = " ", header = None)

And the error this time is:

Error tokenizing data. C error: Expected 229 fields in line 3, saw 320

Since the review text is very long in many rows, the method of adding header names for each column in this question does not work.

I would like to know how to read the csv file both when I want to keep the review text and when I want to skip it. Thank you in advance!

EDIT:

The problem has been solved perfectly by Martin Evans. But now I am working with another data set that has a similar but different format. The order of the fields is reversed:

     # review text                          user        item     time   rating      (the header is added by me for explanation, it is not in the text file)
   I love this phone as it is easy to use   disjiad123    TYh23hs9     13160032    5
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3

Do you have any idea how to read it properly? Any help would be appreciated!


asked Feb 11 '16 16:02

user5779223

1 Answer

I think the best approach is using pandas read_csv:

import pandas as pd
import io

temp=u"""  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant so I love cooking pizza with onion ham garlic tomatoes """


#estimated max number of columns
N = 20

#after testing, replace io.StringIO(temp) with filename
df = pd.read_csv(io.StringIO(temp), 
                 sep = r"\s+", #separator is arbitrary whitespace 
                 header = None, #first row is not header, read all data to df
                 names=range(N)) #N column names, shorter rows are filled with NaN
print(df)
           0         1         2   3     4           5     6      7     8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone    as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant    so      I  love   

        9      10    11     12   13      14        15  16  17  18  19  
0       it     is  easy     to  use     NaN       NaN NaN NaN NaN NaN  
1  cooking  pizza  with  onion  ham  garlic  tomatoes NaN NaN NaN NaN

#select wanted columns by position
df = df.iloc[:, [0,1,2]]
#rename columns
df.columns = ['user','item','time']
print(df)
         user      item      time
0  disjiad123  TYh23hs9  13160032
1  hjf2329ccc  TGjsk123  14423321
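
If estimating N feels fragile, one alternative is to split each line yourself with a limited number of splits, so that everything after the fourth field stays together as the review. This is only a sketch and not part of the approach above: the function name parse_leading_fields and the in-memory sample are hypothetical, and it assumes the first four fields never contain whitespace.

```python
import io

# Hypothetical in-memory sample modelled on the data in the question.
sample = io.StringIO(
    "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n"
    "  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant\n"
)

def parse_leading_fields(lines, n_fixed=4):
    """Split each line into n_fixed leading fields plus the remaining text."""
    rows = []
    for line in lines:
        # sep=None splits on runs of whitespace; at most n_fixed splits are made,
        # so the review text stays in one piece.
        parts = line.split(None, n_fixed)
        if len(parts) < n_fixed:
            continue  # skip blank or malformed lines
        fixed, rest = parts[:n_fixed], parts[n_fixed:]
        review = rest[0].strip() if rest else ""
        rows.append(fixed + [review])
    return rows

rows = parse_leading_fields(sample)
for row in rows:
    print(row)
```

The resulting rows can be passed to pd.DataFrame(rows, columns=['user', 'item', 'time', 'rating', 'review']). All values are strings at that point, so the numeric columns would still need converting, and the review column can simply be dropped if it is not wanted.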

If you need all columns, you need preprocessing to find the max number of columns for parameter usecols, and then postprocessing to join the last columns into one:

import pandas as pd

#preprocessing: count whitespace-separated fields in the longest row
def get_max_len():
    with open('file1.csv', 'r') as f:
        return max(len(line.split()) for line in f)


df = pd.read_csv('file1.csv', 
                         sep = r"\s+", #separator is arbitrary whitespace 
                         header = None, #first row is not header, read all data to df
                         usecols = range(get_max_len())) #read every column up to the longest row
print(df)
           0         1         2   3     4           5     6      7    8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone   as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant   NaN    NaN  NaN   

    9    10    11   12   13  
0   it   is  easy   to  use  
1  NaN  NaN   NaN  NaN  NaN

#columns from position 4 to the last
print(df.iloc[:, 4:])
     4           5     6      7    8    9    10    11   12   13
0     I        love  this  phone   as   it   is  easy   to  use
1  Suck  restaurant   NaN    NaN  NaN  NaN  NaN   NaN  NaN  NaN

#concatenate columns to one review text (NaN values are not strings, so they are skipped)
df['review text'] = df.iloc[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, str)]), axis=1)
df = df.rename(columns={0:'user', 1:'item', 2:'time',3:'rating'})

#get string column names
cols = [x for x in df.columns if isinstance(x, str)]

#keep only columns with string names
print(df[cols])
         user      item      time  rating  \
0  disjiad123  TYh23hs9  13160032       5   
1  hjf2329ccc  TGjsk123  14423321       3   

                              review text  
0  I love this phone as it is easy to use  
1                         Suck restaurant
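
For the reversed layout from the EDIT in the question (review text first, then user, item, time and rating), a possible sketch is to split each line from the right instead. The function name parse_trailing_fields and the sample lines are hypothetical, and the sketch assumes the last four fields never contain whitespace.

```python
import io

# Hypothetical in-memory sample modelled on the reversed data in the question.
sample = io.StringIO(
    "   I love this phone as it is easy to use   disjiad123    TYh23hs9     13160032    5    \n"
    "  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3\n"
)

def parse_trailing_fields(lines, n_fixed=4):
    """Split each line into the leading review text plus n_fixed trailing fields."""
    rows = []
    for line in lines:
        # rsplit with sep=None splits on whitespace starting from the right;
        # at most n_fixed splits are made, so the review stays in one piece.
        parts = line.rsplit(None, n_fixed)
        if len(parts) < n_fixed:
            continue  # skip blank or malformed lines
        fixed = parts[-n_fixed:]
        review = parts[0].strip() if len(parts) > n_fixed else ""
        rows.append([review] + fixed)
    return rows

rows = parse_trailing_fields(sample)
for row in rows:
    print(row)
```

As before, the rows can be handed to pd.DataFrame(rows, columns=['review', 'user', 'item', 'time', 'rating']), with every value still a string until converted.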

answered Oct 14 '22 02:10

jezrael

