
python pandas read_csv delimiter in column data

I have a CSV file like this:

12012;My Name is Mike. What is your's?;3;0 
1522;In my opinion: It's cool; or at least not bad;4;0
21427;Hello. I like this feature!;5;1

I want to load this data into a pandas.DataFrame, but read_csv(sep=";") throws an exception because of the semicolon in the user-generated message column on line 2 (In my opinion: It's cool; or at least not bad). All the remaining columns always have numeric dtypes.

What is the most convenient way to handle this?

Tomas Pazur asked Jun 17 '15 17:06



1 Answer

Dealing with unquoted delimiters is always a nuisance. In this case, since the broken text is known to be surrounded by three correctly encoded columns (one before it and two after), we can recover. TBH, I'd just use the standard Python csv reader and build the DataFrame from that:

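import csv
import pandas as pd

with open("semi.dat", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=";")
    # Keep the first field and the last two; re-join everything in between,
    # since any extra semicolons must belong to the message column.
    rows = [x[:1] + [";".join(x[1:-2])] + x[-2:] for x in reader]

df = pd.DataFrame(rows)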

which produces

       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1
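
Note that csv.reader yields every field as a string, so the numeric columns are still text at this point. Here is a minimal sketch of the same slicing trick wrapped in a helper that also casts the numeric fields. The helper name repair_row, its n_before/n_after parameters, and the in-memory RAW sample are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io

# Sample data from the question, held in memory instead of "semi.dat".
RAW = (
    "12012;My Name is Mike. What is your's?;3;0\n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)


def repair_row(fields, n_before=1, n_after=2, sep=";"):
    """Re-join the pieces of the free-text column that the reader split apart.

    Hypothetical helper: assumes exactly one free-text column, preceded by
    n_before clean columns and followed by n_after clean columns.
    """
    if len(fields) < n_before + n_after + 1:
        raise ValueError(f"too few fields to repair: {fields!r}")
    split_at = len(fields) - n_after
    head = fields[:n_before]
    text = sep.join(fields[n_before:split_at])
    tail = fields[split_at:]
    return head + [text] + tail


reader = csv.reader(io.StringIO(RAW, newline=""), delimiter=";")
rows = [repair_row(fields) for fields in reader]

# Cast every column except the text column (index 1) to int.
typed_rows = [
    [value if i == 1 else int(value) for i, value in enumerate(row)]
    for row in rows
]
```

Passing typed_rows to pd.DataFrame instead of rows should give integer columns directly, without a round trip through a file.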

Then we can immediately save it and get something quoted correctly:

In [67]: df.to_csv("fixedsemi.dat", sep=";", header=None, index=False)

In [68]: more fixedsemi.dat
12012;My Name is Mike. What is your's?;3;0
1522;"In my opinion: It's cool; or at least not bad";4;0
21427;Hello. I like this feature!;5;1

In [69]: df2 = pd.read_csv("fixedsemi.dat", sep=";", header=None)

In [70]: df2
Out[70]: 
       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1
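
If you don't want to touch the disk, the same round trip can be sketched in memory. This assumes the rows have already been repaired as above; the io.StringIO buffer and the variable names are just for illustration. By default to_csv only quotes fields that need it, which is why just the one message gets wrapped in quotes, and read_csv then infers the numeric dtypes on the way back in.

```python
import io

import pandas as pd

# Rows as they look after the repair step: all values are still strings.
rows = [
    ["12012", "My Name is Mike. What is your's?", "3", "0"],
    ["1522", "In my opinion: It's cool; or at least not bad", "4", "0"],
    ["21427", "Hello. I like this feature!", "5", "1"],
]
df = pd.DataFrame(rows)

# Write to an in-memory buffer instead of "fixedsemi.dat".
buffer = io.StringIO()
df.to_csv(buffer, sep=";", header=False, index=False)
fixed_text = buffer.getvalue()

# Reading the quoted text back splits the columns correctly.
df2 = pd.read_csv(io.StringIO(fixed_text), sep=";", header=None)
```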
DSM answered Oct 20 '22 05:10