
pandas read_csv() for multiple delimiters

Tags:

pandas

I have a file which has data as follows

1000000 183:0.6673;2:0.3535;359:0.304;363:0.1835
1000001 92:1.0
1000002 112:1.0
1000003 154435:0.746;30:0.3902;220:0.2803;238:0.2781;232:0.2717
1000004 118:1.0
1000005 157:0.484;25:0.4383;198:0.3033
1000006 277:0.7815;1980:0.4825;146:0.175
1000007 4069:0.6678;2557:0.6104;137:0.4261
1000009 2:1.0

I want to read the file into a pandas DataFrame, split on the multiple delimiters \t, :, and ;

I tried

df_user_key_word_org = pd.read_csv(filepath+"user_key_word.txt", sep='\t|:|;', header=None, engine='python')

It gives me the following error.

pandas.errors.ParserError: Error could be due to quotes being ignored when a multi-char delimiter is used.

Why am I getting this error?

So I thought I would try a raw regex string, but I am not sure how to write a split regex. r'\t|:|;' doesn't work either.

What is the best way to read a file to a pandas data frame with multiple delimiters?

user77005 asked Jan 02 '18 15:01





1 Answer

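The rows in your file contain different numbers of fields, which is what triggers the ParserError. From this question, Handling Variable Number of Columns with Pandas - Python: one workaround for errors such as pandas.errors.ParserError: Expected 29 fields in line 11, saw 45. is to tell read_csv in advance how many columns to expect.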

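import pandas as pd

# Create more column names than any row has fields (the sample rows need at most 11);
# rows with fewer fields are padded with NaN.
my_cols = [str(i) for i in range(45)]
df_user_key_word_org = pd.read_csv(filepath + "user_key_word.txt",
                                   sep=r"\s+|;|:",  # raw string, so \s is not treated as a string escape
                                   names=my_cols,
                                   header=None,
                                   engine="python")
# I tested with s = StringIO(text_from_OP) on my computer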
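To try the workaround without the original file, here is a minimal self-contained sketch. It assumes the first field is tab-separated (the separator is not visible in the pasted sample) and uses three of the sample rows held in memory. The names `sample`, `sep` and `n_cols` are illustrative only. Instead of guessing 45, it counts the fields in the widest row first, so no all-NaN columns are created.

```python
import re
from io import StringIO

import pandas as pd

# Three rows copied from the question; the tab after the id is an assumption.
sample = (
    "1000000\t183:0.6673;2:0.3535;359:0.304;363:0.1835\n"
    "1000001\t92:1.0\n"
    "1000003\t154435:0.746;30:0.3902;220:0.2803;238:0.2781;232:0.2717\n"
)

sep = r"\t|:|;"  # regex separator: tab, colon or semicolon

# Count the fields in the widest row so every row fits.
n_cols = max(len(re.split(sep, line)) for line in sample.splitlines())

df = pd.read_csv(
    StringIO(sample),
    sep=sep,
    names=list(range(n_cols)),  # declare the column count up front
    header=None,
    engine="python",            # regex separators need the python engine
)

print(df.shape)  # (3, 11)
```

Rows shorter than the widest one are padded with NaN on the right. For a large file, the pre-scan costs one extra pass over the data.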


Hope this works.
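If a wide table with many NaN cells is awkward to work with, an alternative is to split only on the tab and then unpack the key:weight pairs into one row per pair. This is a sketch of a different technique from the one above, not part of the original answer. The column names `user_id`, `pairs`, `key` and `weight` are invented for illustration, and the tab separator is again an assumption.

```python
from io import StringIO

import pandas as pd

# Three rows copied from the question; the tab after the id is an assumption.
sample = (
    "1000000\t183:0.6673;2:0.3535;359:0.304;363:0.1835\n"
    "1000001\t92:1.0\n"
    "1000003\t154435:0.746;30:0.3902;220:0.2803;238:0.2781;232:0.2717\n"
)

# Step 1: two columns only, so every row has the same number of fields.
df = pd.read_csv(StringIO(sample), sep="\t", header=None, names=["user_id", "pairs"])

# Step 2: turn "k:w;k:w" into a list, then one row per "k:w" item.
long_df = (
    df.assign(pairs=df["pairs"].str.split(";"))
      .explode("pairs")
      .reset_index(drop=True)
)

# Step 3: split each "k:w" item into typed key and weight columns.
parts = long_df["pairs"].str.split(":", expand=True)
long_df["key"] = parts[0].astype(int)
long_df["weight"] = parts[1].astype(float)
long_df = long_df.drop(columns="pairs")

print(long_df.head())
```

This yields a long-format frame with one (user_id, key, weight) row per pair, which avoids having to know the maximum number of pairs in advance.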

Tai answered Sep 18 '22 13:09
