Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Handle duplicate header names in CSV file

I use a simulation program which generates a CSV file in the background containing the simulation data (such as the length of the road, how long is the simulation etc)

One example of how the file looks like:

example

So, in this case, I know that there is only one car ( because of the carId). Now if another CSV gets generated and the simulation contains 2 cars the columns changes:

two different cars

So as you can see different columns where added ( a lot of them are irrelevant as I only need carId and mph).

If there are more and more cars the number of columns increases and I never know at which position is the carId with the mph.

What I want is to create a CarObject for each carId with mph appearance.

Is there a way in java to create like an own table for each (carId, mph) column. So I have for eg. a List<CarDataRows> and CarDataRows is like a List where each Item contains the carId with the specific mph.

EDIT

In this case the collection should contain two items. And the value from the Item should also smth like a list which contains the carId with the different mph.:

enter image description here

myList.get(0)
-> 1,3
   1,4
   1,7


myList.get(1)
-> 2,12
   2,23
   2,0

In phyton for example you can use panda and create foreach pattern(carId with mph) and own dataframe. This is what I am trying to do here

like image 331
plshm Avatar asked Aug 21 '18 11:08

plshm


People also ask

Can CSV have duplicate headers?

Because the csv document is treated as a tabular data the header can not contain duplicate entries. If the header contains duplicates an exception will be thrown on usage. Starting with 9.7. 0 the SyntaxError exception thrown will return the list of duplicate column names.

What is a duplicate header CSV?

In most cases, the duplicate headers error is referring to duplicate blank column headers. The solution is to remove any additional blank columns in your . csv file. Also, ensure when you are uploading that all data types are set to the “string” data type. This should resolve the issue for you.

How do I find duplicates in a CSV file?

To find duplicate values in a column, click the column header and select Histogram. This will count how many many times each value appears in the dataset. You can search the Histogram for values that show up more than once.


1 Answers

Your CSV looks quite odd, I wouldn't expect people to append columns, but rather rows, especially when you have an ID column to tell the rows apart.

That said, I'd parse the data into Car objects and then work on those. This involves three steps.

On the first line of the file, find the header fields that indicate the interesting columns, here carId and mph. Apache's CSV library chokes on duplicated header names (and I cannot really blame it for that), so you'll need to find a more lenient library or parse the header yourself.

With the information from the header, you create Car objects, telling them their columns. Each Car is in charge of its own column numbers as well as all the speeds (the information you really want).

Then you read the rest of the file, and assign the speeds to their cars. Since the cars own their data (their column numbers and speeds), you just pass each record field to all cars and ask them to act on it if it's their's (Tell Don't Ask principle).

Your Car class could look like this:

import java.util.List;
import java.util.ArrayList;

public class Car {
    private int id;
    private final int idColumn;
    private final int speedColumn;
    private List<Integer> speeds;

    public Car(int idColumn, int speedColumn) {
        id = 0;
        this.idColumn = idColumn;
        this.speedColumn = speedColumn;
        speeds = new ArrayList<>();
    }

    public void add(int column, String value) {
        if (this.idColumn == column) {
            int i = Integer.parseInt(value);
            if (id != 0 && id != i) {
                throw new IllegalStateException("changed ID from "  + id + " to " + i);
            }
            id = i;
        }
        if (this.speedColumn == column) {
            speeds.add(Integer.parseInt(value));
        }
    }

    public List<Integer> getSpeeds() {
        return speeds;
    }
}

Then you work the CSV input like this, using the Apache Common CSV parser:

import java.io.*;
import java.util.*;
import org.apache.commons.csv.*;

public class YourCarCsvParser {
    /** Column name for a car's ID. */
    private static String CAR_ID = "carId";

    public static void main(String[] args) throws Exception {
        App app = new App();
        for (String a : args) {
            app.parse(new InputStreamReader(new FileInputStream(a)));
        }
    }

    public List<Car> parse(Reader input) throws IOException {
        // Initialize the list so you'll get an empty list not null if there is no CSV data.
        List<Car> cars = new ArrayList<>(); 
        CSVParser parser = CSVParser.parse(input, CSVFormat.RFC4180);
        boolean hadHeaders = false;
        for (CSVRecord record : parser) {
            if (!hadHeaders) {
                cars = makeCars(record);
                hadHeaders = true;
            }
            else {
                addSpeedsToCars(cars, record);
            }
        }
        return cars;
    }

    private List<Car> makeCars(CSVRecord record) {
        List<Car> cars = new ArrayList<>();
        for (int i = 0; i < record.size(); i++) {
            String field = record.get(i);
            if (CAR_ID.equals(field)) {
                cars.add(new Car(i, speedColumnFor(i)));
            }
        }
        return cars;
    }

    private int speedColumnFor(int idColumn) {
        // Assumes speed is always right of car ID.
        return idColumn + 1;
    }

    private void addSpeedsToCars(List<Car> cars, CSVRecord record) {
        for (int i = 0; i < record.size(); i++) {
            for (Car c : cars) {
                 c.add(i, record.get(i));
            }
        }
    }
}

This will turn your CSV into a list of Cars, each with a list of their speeds.

This code is based on a few assumptions, but you can probably fix them easily as needed:

  • Column names in your CSV are case-sensitive. If they are not, change the equals to equalsIgnoreCase when checking the CAR_ID.

  • The speed column is right next to the car ID column: if that's not the case, adjust the speedColumnFor method to return the appropriate column. This gets a bit more tricky when you don't have a fixed offset but need to look at the headers, e.g., to pick the following mph header after you saw an ID column. In that case you'll have to adjust the logic makeCars to remember the ID column position and only create the Car when you encounter an mph column, using the remembered ID column position.

  • Car IDs are Integers, like in your example data. You could also change them to be Strings.

  • Speeds are Integers. If you need floats, adjust the Integer.parseInt to Float.parseFloat when converting from Strings.

like image 156
Robert Avatar answered Oct 26 '22 16:10

Robert