How to read a huge CSV file from Google Cloud Storage line by line using Java?

I'm new to Google Cloud Platform. I'm trying to read a CSV file of around 1 GB, line by line, from Google Cloud Storage (a non-public bucket accessed via a service account key).

I couldn't find any option to read a file in Google Cloud Storage (GCS) line by line; I only see options to read by chunk size or byte size. Since the file is a CSV, I don't want to read by chunk size, because a chunk boundary may fall in the middle of a record.

Solutions tried so far: I copied the contents of the CSV file in GCS to a temporary local file and read the temp file, using the code below. The code works as expected, but I don't want to copy a huge file to my local instance. Instead, I want to read line by line directly from GCS.

    // Build a Storage client authenticated with the service account credentials.
    StorageOptions options = StorageOptions.newBuilder()
            .setProjectId(GCP_PROJECT_ID)
            .setCredentials(gcsConfig.getCredentials())
            .build();
    Storage storage = options.getService();
    Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
    // Copy the whole object to a local temp file; try-with-resources closes both ends.
    try (ReadChannel readChannel = blob.reader();
         FileOutputStream fileOutputStream = new FileOutputStream(TEMP_FILE_NAME)) {
        fileOutputStream.getChannel().transferFrom(readChannel, 0, Long.MAX_VALUE);
    }

Please suggest an approach.

Tech Guy asked Mar 18 '19 15:03



3 Answers

Since I'm doing batch processing, I use the code below in my ItemReader's init() method, which is annotated with @PostConstruct. In my ItemReader's read() method, I build a List whose size equals the chunk size. This way I read as many lines as my chunk size at a time instead of reading all the lines at once.

    StorageOptions options = StorageOptions.newBuilder()
            .setProjectId(GCP_PROJECT_ID)
            .setCredentials(gcsConfig.getCredentials())
            .build();
    Storage storage = options.getService();
    Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
    ReadChannel readChannel = blob.reader();
    BufferedReader br = new BufferedReader(Channels.newReader(readChannel, "UTF-8"));
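
The read() side described above is not shown, so here is a minimal sketch of how a chunk-sized list could be built from that BufferedReader. The class name ChunkedLineReader and the method readChunk are hypothetical, not part of Spring Batch or the GCS client; a StringReader stands in for the GCS channel so the sketch runs without cloud access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: hands out lines from a BufferedReader in fixed-size chunks. */
final class ChunkedLineReader {
    private final BufferedReader reader;

    public ChunkedLineReader(BufferedReader reader) {
        this.reader = reader;
    }

    /**
     * Reads up to chunkSize lines. Returns a shorter (possibly empty) list
     * once the underlying reader is exhausted.
     */
    public List<String> readChunk(int chunkSize) throws IOException {
        List<String> chunk = new ArrayList<>(chunkSize);
        String line;
        // Check the size first so no line is consumed and then dropped.
        while (chunk.size() < chunkSize && (line = reader.readLine()) != null) {
            chunk.add(line);
        }
        return chunk;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in the real reader this would be
        // new BufferedReader(Channels.newReader(readChannel, "UTF-8")).
        BufferedReader br = new BufferedReader(new StringReader("a,1\nb,2\nc,3\n"));
        ChunkedLineReader chunks = new ChunkedLineReader(br);
        System.out.println(chunks.readChunk(2)); // first two lines
        System.out.println(chunks.readChunk(2)); // the remaining line
    }
}
```

Because BufferedReader reassembles lines across the channel's internal buffer boundaries, each list element is a complete line regardless of how the bytes arrive from GCS.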
Tech Guy answered Oct 29 '22 07:10



One of the easiest ways might be to use the google-cloud-nio package, part of the google-cloud-java library that you're already using: https://github.com/googleapis/google-cloud-java/tree/v0.30.0/google-cloud-contrib/google-cloud-nio

It incorporates Google Cloud Storage into Java's NIO, and so once it's up and running, you can refer to GCS resources just like you'd do for a file or URI. For example:

    Path path = Paths.get(URI.create("gs://bucket/lolcat.csv"));
    try (Stream<String> lines = Files.lines(path)) {
        lines.forEach(s -> System.out.println(s));
    } catch (IOException ex) {
        // do something or re-throw...
    }
Brandon Yarbrough answered Oct 29 '22 09:10



Brandon Yarbrough is right, and to add to his answer:

If you use gcloud to log in with your credentials, Brandon's code will work: google-cloud-nio will use your login to access the files (and that works even if they are not public).

If you prefer to do it all in software, you can use this code to read credentials from a local file and then access your file from Google Cloud:

    String myCredentials = "/path/to/my/key.json";
    CloudStorageFileSystem fs =
        CloudStorageFileSystem.forBucket(
            "bucket",
            CloudStorageConfiguration.DEFAULT,
            StorageOptions.newBuilder()
                .setCredentials(ServiceAccountCredentials.fromStream(
                    new FileInputStream(myCredentials)))
                .build());
    Path path = fs.getPath("/lolcat.csv");
    List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);

Edit: you don't want to read all the lines at once, so don't use readAllLines. Once you have the Path, though, you can use any of the other techniques discussed above to read just the part of the file you need: you can read one line at a time or get a Channel object.
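
As a rough sketch of the "one line at a time" option, the helper below streams any Path through Files.newBufferedReader. The class name LineByLine and the method forEachLine are hypothetical; the example uses a local temp file in place of fs.getPath("/lolcat.csv"), on the assumption that a google-cloud-nio Path behaves like any other Path here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

/** Hypothetical helper: streams a Path line by line without loading the whole file. */
final class LineByLine {

    /** Passes each line to handler and returns the number of lines read. */
    public static long forEachLine(Path path, Consumer<String> handler) throws IOException {
        long count = 0;
        // try-with-resources closes the reader (and the underlying channel).
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a local temp file stands in for the GCS path.
        Path path = Files.createTempFile("lolcat", ".csv");
        Files.write(path, "id,name\n1,tom\n".getBytes(StandardCharsets.UTF_8));
        long n = forEachLine(path, System.out::println);
        System.out.println(n + " lines");
        Files.delete(path);
    }
}
```

Only the reader's buffer is held in memory at any time, which is the property the question asks for with a 1 GB file.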

TubesHerder answered Oct 29 '22 07:10
