
Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains a single record with a quoted field spanning two lines. I am trying to make Apache Beam read the input as one record, but cannot get it to work.

import apache_beam

def print_each_line(line):
    print(line)

path = './input/testfile.csv'
# Here are the contents of testfile.csv
# foo,bar,"blah blah
# more blah blah",baz

p = apache_beam.Pipeline()

(p
 | 'ReadFromFile' >> apache_beam.io.ReadFromText(path)
 | 'PrintEachLine' >> apache_beam.Map(print_each_line)
 )

p.run()

# Here is the output:
# foo,bar,"blah blah
# more blah blah",baz

The above code parses the input as two lines even though the standard for multi-line csv files is to wrap multi-line elements within double-quotes.

Brandon asked Apr 19 '18 05:04


2 Answers

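ReadFromText is not CSV-aware: it splits the input on newlines regardless of quoting, so it cannot keep a quoted multi-line field together. You can, however, open the file yourself and parse it with Python's csv.reader. Here's an example: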

import apache_beam
import csv

def print_each_line(line):
  print(line)

p = apache_beam.Pipeline()

(p
 | apache_beam.Create(["test.csv"])
 # Note: FileSystems.open returns a binary stream. On Python 3, csv.reader
 # needs text, so the handle must be wrapped in io.TextIOWrapper first.
 | apache_beam.FlatMap(lambda filename:
     csv.reader(apache_beam.io.filesystems.FileSystems.open(filename)))
 | apache_beam.Map(print_each_line))

p.run()

Output:

['foo', 'bar', 'blah blah\nmore blah blah', 'baz']
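
To see why csv.reader helps, here is a minimal sketch that uses only the standard library, with no Beam involved. It assumes the file contents from the question, held in a hypothetical `data` string, and compares plain newline splitting (roughly what a line-oriented reader does) with csv.reader:

```python
import csv
import io

# Same contents as testfile.csv in the question (illustrative in-memory copy).
data = 'foo,bar,"blah blah\nmore blah blah",baz\n'

# Splitting on newlines ignores quoting, so the quoted field is cut in two.
naive_lines = data.splitlines()

# csv.reader tracks quoting, so the embedded newline stays inside one field.
# newline='' keeps line endings untouched, as the csv docs recommend.
rows = list(csv.reader(io.StringIO(data, newline='')))

print(len(naive_lines))  # number of physical lines
print(rows)              # parsed records
```

The newline split produces two fragments, while csv.reader yields a single four-field record.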
Udi Meiri answered Oct 25 '22 18:10


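None of the other answers worked for me, but this did. The key change is the io.TextIOWrapper around the file handle: FileSystems.open returns a binary stream, while csv.reader in Python 3 expects text.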

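# Assumes p is a beam.Pipeline, known_args holds the parsed input/output
# arguments, and csv, io, apache_beam (as beam) and WriteToText are imported.
(
  p
  | beam.Create([known_args.input])
  | beam.FlatMap(lambda filename:
    csv.reader(io.TextIOWrapper(beam.io.filesystems.FileSystems.open(filename))))
  | "Take only name" >> beam.Map(lambda x: x[0])
  | WriteToText(known_args.output)
)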
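
The role of the wrapper can be checked in isolation. The sketch below is not part of the pipeline above: it uses io.BytesIO as a stand-in for the binary handle that FileSystems.open returns, and `iter_csv_rows` is a hypothetical helper name introduced only for illustration.

```python
import csv
import io

def iter_csv_rows(binary_stream):
    """Yield parsed CSV rows from a binary file-like object (hypothetical helper)."""
    # newline='' leaves line endings alone so csv.reader handles quoted newlines.
    # The UTF-8 encoding is an assumption about the input file.
    with io.TextIOWrapper(binary_stream, encoding='utf-8', newline='') as text_stream:
        for row in csv.reader(text_stream):
            yield row

# Stand-in for the bytes a binary file handle would produce.
raw = b'foo,bar,"blah blah\nmore blah blah",baz\n'

# Passing the binary stream straight to csv.reader fails under Python 3.
try:
    next(csv.reader(io.BytesIO(raw)))
    bytes_error = None
except csv.Error as exc:
    bytes_error = str(exc)

# Wrapping the stream first yields the expected single record.
rows = list(iter_csv_rows(io.BytesIO(raw)))

print(bytes_error)
print(rows)
```

In a pipeline, a helper like this could presumably be passed to beam.FlatMap in place of the inline lambda, for example `beam.FlatMap(lambda f: iter_csv_rows(FileSystems.open(f)))`.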
Juan Acevedo answered Oct 25 '22 18:10