Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I import a CSV file and automatically infer the delimiter?

I want to import two kinds of CSV files, some use ";" for delimiter and others use ",". So far I have been switching between the next two lines:

reader=csv.reader(f,delimiter=';') 

or

reader=csv.reader(f,delimiter=',') 

Is it possible not to specify the delimiter and to let the program check for the right delimiter?

The solutions below (Blender and sharth) seem to work well for comma-separated files (generated with Libroffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file:

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes 1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 
like image 275
rom Avatar asked May 01 '13 02:05

rom


People also ask

How do I find the delimiter in a CSV file?

Mac/Windows Select the CSV file that has the data clustered into one column. Select Delimited, then make sure the File Origin is Unicode UTF-8. Select Comma (this is Affinity's default list separator). The preview will show the columns being separated.

Can a CSV file take any delimiter character?

A CSV file stores data in rows and the values in each row is separated with a separator, also known as a delimiter. Although the file is defined as Comma Separated Values, the delimiter could be anything.

Which function is used to specify the delimiter in CSV?

csv file format has data items separated by a delimiter other than a comma. This includes semicolon, colon, tab space, vertical bars, etc. In such cases, we need to use the sep parameter inside the read. csv() function.


1 Answers

The csv module seems to recommend using the csv sniffer for this problem.

They give the following example, which I've adapted for your case.

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""     dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")     csvfile.seek(0)     reader = csv.reader(csvfile, dialect)     # ... process CSV file contents here ... 

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example  #!/usr/bin/env python import csv  def parse(filename):     with open(filename, 'rb') as csvfile:         dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')         csvfile.seek(0)         reader = csv.reader(csvfile, dialect)          for line in reader:             print line  def main():     print 'Comma Version:'     parse('comma_separated.csv')      print     print 'Semicolon Version:'     parse('semicolon_separated.csv')      print     print 'An example from the question (kingdom.csv)'     parse('kingdom.csv')  if __name__ == '__main__':     main() 

And our sample inputs

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv  test,box,foo round,the,bend  [9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv  round;the;bend who;are;you  [9:22am][wlynch@watermelon /tmp] cat kingdom.csv  ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes 1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 

And if we execute the example program:

[9:14am][wlynch@watermelon /tmp] ./example  Comma Version: ['test', 'box', 'foo'] ['round', 'the', 'bend']  Semicolon Version: ['round', 'the', 'bend'] ['who', 'are', 'you']  An example from the question (kingdom.csv) ['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes'] ['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document'] ['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document'] 

It's also probably worth noting what version of python I'm using.

[9:20am][wlynch@watermelon /tmp] python -V Python 2.7.2 
like image 114
Bill Lynch Avatar answered Sep 28 '22 11:09

Bill Lynch