How to determine the delimiter in CSV file

Tags:

java

csv

I have a scenario in which I have to parse CSV files from different sources. The parsing code is very simple and straightforward:

        String csvFile = "/Users/csv/country.csv";
        String line = "";
        String cvsSplitBy = ",";
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            while ((line = br.readLine()) != null) {
                // use comma as separator
                String[] country = line.split(cvsSplitBy);
                System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

My problem comes from the CSV delimiter character. I have many different formats: sometimes it is a , and sometimes it is a ;

Is there any way to determine the delimiter character before parsing the file?


asked Mar 12 '18 12:03

2 Answers

Yes, but only if the delimiter characters are not allowed to exist as regular text

The simplest approach is to keep a list of all the possible delimiter characters and try to identify which one is being used. Even so, you have to place some limitations on the files or on the people who create them. Look at the following two scenarios:

Case 1 - Contents of file.csv

test1,test2,test3

Case 2 - Contents of file.csv

test1|test2,3|test4

If you have prior knowledge of the delimiter characters, you would split the first string using , and the second one using |, getting three fields in each case. But if you try to identify the delimiter by inspecting the file, both strings can be split using the , character, and you would end up with this:

Case 1 - Result of split using ,

test1
test2
test3

Case 2 - Result of split using ,

test1|test2
3|test4

Without prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting how often a character appears will not save you.

Worst case

test1,2|test3,4|test5

Looking at the text, a person would tokenize it using | as the delimiter. But , and | appear equally often (twice each). So, from an algorithm's perspective, both results are equally plausible:

Correct result

test1,2
test3,4
test5

Wrong result

test1
2|test3
4|test5
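The tie in the worst case can be checked directly. This is only a minimal sketch: the class name DelimiterFrequency and the method countCandidates are made up for illustration and are not part of the answer's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DelimiterFrequency {

    // Counts how often each candidate character occurs in the given line.
    static Map<Character, Integer> countCandidates(String line, char[] candidates) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char candidate : candidates) {
            int count = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == candidate) {
                    count++;
                }
            }
            counts.put(candidate, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The "worst case" line from above: both candidates occur twice.
        Map<Character, Integer> counts =
                countCandidates("test1,2|test3,4|test5", new char[] {',', '|'});
        System.out.println(counts);
    }
}
```

Since both counts come out as 2, a frequency-based guess has nothing to prefer one candidate over the other.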

If you impose a set of guidelines, or can otherwise control how the CSV files are generated, you can simply look for the delimiter with the String.contains() method, checking each character in the aforementioned list. For example:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MyClass {

    // Candidate delimiters, checked in this order. Static so the static methods can use it.
    private static final List<String> delimiterList = Arrays.asList(
        ",",
        ";",
        "\t"
        // etc...
    );

    // Returns the first candidate that occurs in the text, or "" if none does.
    private static String determineDelimiter(String text) {
        for (String delimiter : delimiterList) {
            if (text.contains(delimiter)) {
                return delimiter;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String csvFile = "/Users/csv/country.csv";
        String line;
        String delimiter = "";
        boolean firstLine = true;
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            while ((line = br.readLine()) != null) {
                if (firstLine) {
                    // Detect the delimiter once, from the first line only
                    delimiter = determineDelimiter(line);
                    if (delimiter.isEmpty()) {
                        System.out.println("No supported delimiter found in: " + line);
                        return;
                    }
                    firstLine = false;
                }
                // split() takes a regex, so quote the delimiter to match it literally
                String[] country = line.split(Pattern.quote(delimiter));
                System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
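One caveat when extending the candidate list: String.split() interprets its argument as a regular expression, so a delimiter such as | only splits as expected when it is matched literally. A small sketch of the difference, using a hypothetical helper named splitLiteral:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitQuoting {

    // Splits on the delimiter as a literal string rather than as a regex.
    static String[] splitLiteral(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String line = "test1|test2,3|test4";

        // Unquoted, "|" is regex alternation between two empty patterns,
        // so the line is not split into the three intended fields.
        System.out.println(Arrays.toString(line.split("|")));

        // Quoted, the pipe is matched literally and three fields come back.
        System.out.println(Arrays.toString(splitLiteral(line, "|")));
    }
}
```

Comma, semicolon and tab happen to have no special meaning in a regex, which is why the problem only shows up with characters like | or a dot.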

Update

For a more compact version, the determineDelimiter() method can use a regular expression instead of the for-each loop.
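The answer does not show the regular-expression version, so the following is only one possible sketch of it. The class name RegexDelimiterDetector and the exact character class are assumptions; the candidates mirror the comma, semicolon and tab used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDelimiterDetector {

    // Character class holding the supported delimiters: comma, semicolon, tab.
    private static final Pattern DELIMITERS = Pattern.compile("[,;\\t]");

    // Returns the first supported delimiter that occurs in the text, or "" if none does.
    static String determineDelimiter(String text) {
        Matcher matcher = DELIMITERS.matcher(text);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(determineDelimiter("GR;Greece;Athens"));
    }
}
```

Note one behavioural difference: the regex returns whichever candidate appears first in the text, while the for-each loop returns whichever appears first in the list. For a line that contains more than one candidate, the two versions can disagree.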

answered Oct 16 '22 04:10

Lefteris008

univocity-parsers supports automatic detection of the delimiter (as well as line endings and quotes). Just use it instead of fighting with your own code:

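import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.List;

// let the parser work out delimiter, line endings and quotes from the input
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();

CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));

// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();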

Disclaimer: I'm the author of this library, and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license).

Hope this helps.

answered Oct 16 '22 04:10

Jeronimo Backes

Asked by Melad Basilius