Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Load .csv with unknown delimiter into Pandas DataFrame

I have many .csv files that are to be loaded into pandas data-frames, there are at a minimum two delimiters comma and semi colon, and I am unsure of the rest of the delimiters. I understand that the delimeter can be set using

dataRaw = pd.read_csv(name,sep=",")

and

dataRaw = pd.read_csv(name,sep=";")

unfortunately if I was to not specify a delimiter the default is comma which results in a single column data frame for other delimiters. thus is there a dynamic way to allocate a delimiter so that any csv can be passed to pandas? such as try comma or semicolon. The pandas documentation doesn't allude to the use of logic in the csv read

like image 364
Tyler Cowan Avatar asked Dec 18 '15 16:12

Tyler Cowan


People also ask

How do you read a CSV file to a DataFrame with custom delimiter in pandas?

pandas. read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, ....) It reads the content of a csv file at given path, then loads the content to a Dataframe and returns that. It uses comma (,) as default delimiter or separator while parsing a file.


2 Answers

There is actually an answer in pandas documentation (at least, for pandas 0.20.1)

sep : str, default ‘,’

Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used automatically. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

This means you can read your files just with

dataRaw = pd.read_csv(name, sep = None, engine = 'python')

This should also work if there are other separators than ';' or '.' among your .csv files (for example, tab-separators).

like image 169
Anaidel Avatar answered Sep 18 '22 10:09

Anaidel


If you have different separators you can use:

dataRaw = pd.read_csv(name,sep=";|,")

is a Regular expression that can handle multiple separators divided by the OR (|) operator.

like image 38
Fabio Lamanna Avatar answered Sep 19 '22 10:09

Fabio Lamanna