Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains a single record whose quoted field spans two lines. I am trying to make Apache Beam read the input as one record, but cannot get it to work.
import apache_beam

def print_each_line(line):
    print(line)

path = './input/testfile.csv'
# Here are the contents of testfile.csv
# foo,bar,"blah blah
# more blah blah",baz
p = apache_beam.Pipeline()
(p
 | 'ReadFromFile' >> apache_beam.io.ReadFromText(path)
 | 'PrintEachLine' >> apache_beam.FlatMap(lambda line: print_each_line(line))
)
p.run()
# Here is the output:
# foo,bar,"blah blah
# more blah blah",baz
The above code parses the input as two lines, even though the convention for multi-line CSV files is to wrap fields that contain line breaks in double quotes.
Beam doesn't support parsing CSV files: ReadFromText splits the input on newlines and pays no attention to quoting. You can, however, use Python's csv.reader. Here's an example:
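import apache_beam
import csv

def print_each_line(line):
    print(line)

p = apache_beam.Pipeline()
(p
 | apache_beam.Create(["test.csv"])
 # Open each file and emit one element per CSV record. On Python 3, wrap the
 # file in io.TextIOWrapper first, because csv.reader expects text, not bytes.
 | apache_beam.FlatMap(lambda filename:
       csv.reader(apache_beam.io.filesystems.FileSystems.open(filename)))
 | apache_beam.FlatMap(print_each_line))
p.run()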
Output:
['foo', 'bar', 'blah blah\nmore blah blah', 'baz']
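To see why csv.reader helps here, the following is a minimal sketch using only the standard library and no Beam at all. It assumes Python 3, and the helper names split_on_newlines and parse_csv_records are made up for illustration. The first helper mimics what a line-based reader sees, the second lets csv.reader honor the quotes.

```python
import csv
import io

# Same contents as testfile.csv in the question.
CSV_TEXT = 'foo,bar,"blah blah\nmore blah blah",baz\n'

def split_on_newlines(text):
    """Roughly what a line-based reader sees: one element per physical line."""
    return text.splitlines()

def parse_csv_records(text):
    """csv.reader keeps a quoted line break inside its field."""
    return list(csv.reader(io.StringIO(text, newline='')))

lines = split_on_newlines(CSV_TEXT)      # two physical lines
records = parse_csv_records(CSV_TEXT)    # one logical record with four fields

print(len(lines), lines)
print(len(records), records)
```

The line-based split yields two elements, while csv.reader yields a single four-field record, which matches the output shown above.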
None of the other answers worked for me, but this did:
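import csv
import io

import apache_beam as beam
from apache_beam.io import WriteToText

# p (the Pipeline) and known_args (the parsed arguments) are defined elsewhere.
(
    p
    | beam.Create([known_args.input])
    | beam.FlatMap(lambda filename:
        csv.reader(io.TextIOWrapper(beam.io.filesystems.FileSystems.open(filename))))
    | "Take only name" >> beam.Map(lambda x: x[0])
    | WriteToText(known_args.output)
)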
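The likely reason the io.TextIOWrapper call matters is that, on Python 3, csv.reader wants text while the file object from FileSystems.open yields bytes. Here is a small sketch of that difference, assuming Python 3 and using io.BytesIO as a stand-in for the Beam file object (an assumption, the real object may differ in other respects). The helper names are hypothetical.

```python
import csv
import io

# Bytes as they would come from a file opened in binary mode.
RAW = b'foo,bar,"blah blah\nmore blah blah",baz\n'

def parse_without_decoding(data):
    """Return the csv.Error message raised when csv.reader is fed bytes."""
    try:
        list(csv.reader(io.BytesIO(data)))
    except csv.Error as exc:
        return str(exc)
    return None

def decode_and_parse(data):
    """Wrap the binary stream as UTF-8 text, then parse it as CSV."""
    text_stream = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8', newline='')
    return list(csv.reader(text_stream))

error_message = parse_without_decoding(RAW)   # a csv.Error message, not rows
rows = decode_and_parse(RAW)                  # one record with four fields
first_column = [row[0] for row in rows]       # same idea as "Take only name"

print(error_message)
print(rows)
print(first_column)
```

Passing encoding explicitly avoids depending on the platform default. UTF-8 is an assumption here, so adjust it if the input file uses another encoding.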