
Can you represent CSV data in Google's Protocol Buffer format?

I've recently found out about protocol buffers and was wondering if they could be applied to my specific problem.

Basically, I have some CSV data that I need to convert to a more compact format for storage, as some of the files are several gigabytes in size.

Each field in the CSV has a header, and there are only two types: strings and decimals (because sometimes there are a lot of significant digits and I need to handle all numbers the same way). However, each file will have different column names.

As well as capturing the original CSV data, I need to be able to add extra information to the file before saving. I was also hoping to make this future-proof by handling different file versions.

So, is it possible to use protocol buffers to capture an arbitrary number of arbitrarily named columns of data, like a CSV file?

Cameron MacFarland asked Dec 16 '08 14:12




1 Answer

Well, it's certainly representable. Something like:

syntax = "proto2";

message CsvFile {
    repeated CsvHeader header = 1;
    repeated CsvRow row = 2;
}

message CsvHeader {
    required string name = 1;
    required ColumnType type = 2;
}

enum ColumnType {
    DECIMAL = 1;
    STRING = 2;
}

message CsvRow {
    repeated CsvValue value = 1;
}

// Note that the column is implicit based on position within row    
message CsvValue {
    optional string string_value = 1;
    optional Decimal decimal_value = 2;
}

message Decimal {
    // However you want to represent it (there are various options here)
}
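
One of the "various options" for the Decimal message is to store an unscaled integer together with a scale, so that the value equals unscaled * 10^-scale. The Python sketch below only illustrates that idea; the names encode_decimal, decode_decimal, unscaled and scale are hypothetical and do not come from the schema above, and it assumes finite values only (no NaN or infinity, and the sign of negative zero is lost).

```python
from decimal import Decimal


def encode_decimal(text):
    """Split a decimal string into (unscaled, scale), where
    value == unscaled * 10 ** -scale. Finite values only (assumption)."""
    sign, digits, exponent = Decimal(text).as_tuple()
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent


def decode_decimal(unscaled, scale):
    """Rebuild the exact Decimal, including trailing zeros."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple is exact and ignores the context precision.
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    unscaled, scale = encode_decimal("123.4500")
    print(unscaled, scale)
    print(decode_decimal(unscaled, scale))
```

Because Python integers are unbounded, the sketch keeps every significant digit. In an actual message the unscaled part would need a field type wide enough for the data, for example bytes or a string rather than a 64-bit integer, if the values can exceed 64 bits.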

I'm not sure how much benefit it will provide, mind you... You can certainly add more information (by adding fields to the CsvFile message), and future-proofing works the "normal PB way": only add optional fields, etc.
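
To make the positional layout concrete, here is a rough Python sketch that loads CSV text into plain dataclasses shaped like the messages above. These are stand-ins, not protoc-generated classes. The load_csv and is_decimal helpers are hypothetical, and the rule "a column is DECIMAL only if every cell parses as a finite decimal" is an assumption made for illustration. It also assumes every row has as many cells as the header and reads everything into memory, which would not suit multi-gigabyte files as written.

```python
import csv
import io
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import List, Optional


@dataclass
class CsvHeader:
    name: str
    type: str  # "DECIMAL" or "STRING", mirroring the ColumnType enum


@dataclass
class CsvValue:
    string_value: Optional[str] = None
    decimal_value: Optional[Decimal] = None


@dataclass
class CsvRow:
    value: List[CsvValue] = field(default_factory=list)


@dataclass
class CsvFile:
    header: List[CsvHeader] = field(default_factory=list)
    row: List[CsvRow] = field(default_factory=list)


def is_decimal(text: str) -> bool:
    """True if the cell parses as a finite decimal number."""
    try:
        return Decimal(text).is_finite()
    except InvalidOperation:
        return False


def load_csv(text: str) -> CsvFile:
    reader = csv.reader(io.StringIO(text))
    names = next(reader)      # first line holds the column names
    records = list(reader)    # whole file in memory (sketch only)
    result = CsvFile()
    for i, name in enumerate(names):
        numeric = bool(records) and all(is_decimal(r[i]) for r in records)
        result.header.append(CsvHeader(name, "DECIMAL" if numeric else "STRING"))
    for record in records:
        row = CsvRow()
        # The column is implied by position: value[i] belongs to header[i].
        for head, cell in zip(result.header, record):
            if head.type == "DECIMAL":
                row.value.append(CsvValue(decimal_value=Decimal(cell)))
            else:
                row.value.append(CsvValue(string_value=cell))
        result.row.append(row)
    return result


if __name__ == "__main__":
    parsed = load_csv("id,price\nA1,1.50\nB2,20\n")
    print([(h.name, h.type) for h in parsed.header])
    print(parsed.row[0].value)
```

Column names live only in the header list, so each file can carry its own set of names without any change to the schema.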

Jon Skeet answered Sep 21 '22 07:09