Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pd.read_csv ignore comma if it is inside parenthesis

Tags:

I have a very simple file:

[Name]
Streamline 1


[Data]
X [ m ], Y [ m ], Z [ m ], Velocity [ m s^-1 ]
2.66747564e-01, 0.00000000e+00, 2.03140453e-01, (0.00000000e+00, 8.17744827e+00, 0.00000000e+00)
2.66958952e-01, 0.00000000e+00, 2.07407191e-01, (0.00000000e+00, 6.77392197e+00, 0.00000000e+00)
2.63460875e-01, 0.00000000e+00, 2.06593186e-01, (0.00000000e+00, 7.04168701e+00, 0.00000000e+00)
2.65424699e-01, 0.00000000e+00, 2.00831652e-01, (0.00000000e+00, 8.93691921e+00, 0.00000000e+00)
2.70607203e-01, 0.00000000e+00, 2.02286631e-01, (0.00000000e+00, 8.45830917e+00, 0.00000000e+00)
2.68299729e-01, 0.00000000e+00, 1.97365344e-01, (0.00000000e+00, 1.00771456e+01, 0.00000000e+00)
...

I need to load the velocity as a vector, into a single row.

My basic code:

df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5)

But this attempt leads to 1st 2 cols becoming index and the rest splits into 4 columns. index_col=False can solve the issue with index, but leads to index out of range. I need a delimiter that implicitly tells pandas to ignore whatever is in brackets. I thought python ignore the separator withing brackets while reading a csv file might work but yes, I have spaces everywhere. I found some solutions that use extended functions to load files and handle them by lines, such as CSV file containing column with occasional comma in parentheses crashes pandas.read_csv and Load CSV with data surrounded by parentheses into a pandas dataframe . I however believe that this is a very easy scenario, as all lines are similar and can be solved by one-liner adding delimiter='some_regex'. I however cannot figure out, how this regex should look. It should look for delimiter , but not (.*,.*).

I have tried with following, but this results in a single column:

df = pd.read_csv("C:/Users/Marek/Downloads/0deg-5ms.csv", skiprows=5,  delimiter=',^(\(.*,.*\))')

EDIT: got to something like this - ,|(?:(\(.*,.*\))), but this adds an empty column after each comma.

like image 348
Ruli Avatar asked Oct 12 '21 13:10

Ruli


People also ask

What is Index_col in read_csv?

index_col: This is to allow you to set which columns to be used as the index of the dataframe. The default value is None, and pandas will add a new column start from 0 to specify the index column. It can be set as a column name or column index, which will be used as the index column.

When a .CSV file is read with pandas read_csv () what is returned by this function?

A comma-separated values (csv) file is returned as two-dimensional data structure with labeled axes. Write DataFrame to a comma-separated values (csv) file. Read a comma-separated values (csv) file into DataFrame.

Is read_csv faster than Read_excel?

Idea #2: Use CSVs rather than Excel FilesImporting csv files in Python is 100x faster than Excel files. We can now load these files in 0.63 seconds. That's nearly 10 times faster!


1 Answers

You can parse manually the file:

data = []
with open('data.csv') as fp:
    [next(fp) for i in range(5)]  # skiprows=5
    headers = [c.strip() for c in next(fp).split(',')]
    for line in fp:
        data.append([i.strip() for i in re.split(r',(?![^\(]*[\)])', line)])

df = pd.DataFrame(data, columns=headers).apply(pd.eval)

Output:

>>> df
    X [ m ]  Y [ m ]   Z [ m ]     Velocity [ m s^-1 ]
0  0.266748      0.0  0.203140  [0.0, 8.17744827, 0.0]
1  0.266959      0.0  0.207407  [0.0, 6.77392197, 0.0]
2  0.263461      0.0  0.206593  [0.0, 7.04168701, 0.0]
3  0.265425      0.0  0.200832  [0.0, 8.93691921, 0.0]
4  0.270607      0.0  0.202287  [0.0, 8.45830917, 0.0]
5  0.268300      0.0  0.197365  [0.0, 10.0771456, 0.0]

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   X [ m ]              6 non-null      float64
 1   Y [ m ]              6 non-null      float64
 2   Z [ m ]              6 non-null      float64
 3   Velocity [ m s^-1 ]  6 non-null      object 
dtypes: float64(3), object(1)
memory usage: 320.0+ bytes

>>> type(df.iloc[0, 3])  # [0.0, 8.17744827, 0.0]
list

>>> type(df.iloc[0, 3][1])  # 8.17744827
float
like image 87
Corralien Avatar answered Oct 13 '22 01:10

Corralien