I read a CSV file into an RDD in Jupyter and want to turn each line into a tuple holding the pair of words on that line, rather than ending up with individual words, but I have no idea how I should do it. The CSV file looks something like this:
Afghanistan, AFG
Albania, ALB
Algeria, ALG
American Somoa, ASA
Anguilla, AIA
I've tried this:
lines = sc.textFile(...)
words = lines.flatMap(lambda line: line.split(" "))
but it doesn't return Albania, ALB as one tuple. Instead it treats Albania as one element and ALB as another. Help please!
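You need to use map instead of flatMap. flatMap flattens the list produced for each line into separate elements, while map keeps exactly one element per line. You can create your list of tuples as follows: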
# map yields one tuple per line, built from its comma-separated fields
result = sc.textFile("...").map(lambda line: tuple(line.split(",")))
result.collect()
then returns:
[(u'Afghanistan', u' AFG'),
(u'Albania', u' ALB'),
(u'Algeria', u' ALG'),
(u'American Somoa', u' ASA'),
(u'Anguilla', u' AIA')]
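To see why map keeps each line together while flatMap does not, here is a minimal sketch in plain Python that only mimics the two operations on a small list. It does not use Spark at all; the names lines, mapped and flat_mapped are made up for illustration, and the list comprehension and itertools.chain stand in for map and flatMap respectively.

```python
from itertools import chain

# Two sample lines in the same "name, code" shape as the CSV file
lines = ["Afghanistan, AFG", "Albania, ALB"]

# map-like behaviour: one result per input line, so each line stays a tuple
mapped = [tuple(line.split(",")) for line in lines]

# flatMap-like behaviour: the per-line lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(line.split(",") for line in lines))

print(mapped)       # [('Afghanistan', ' AFG'), ('Albania', ' ALB')]
print(flat_mapped)  # ['Afghanistan', ' AFG', 'Albania', ' ALB']
```

Note that the second field of every tuple still carries the space that followed the comma.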
Looking at this output, you may want to add unicode.strip to remove the leading spaces:
# Parentheses let the method chain span several lines
(sc.textFile("....")
   .map(lambda line: tuple(map(unicode.strip, line.split(","))))
   .collect())
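The unicode type only exists in Python 2. On Python 3 the same idea can be written with str.strip, for example through a small named function. The sketch below assumes Python 3; parse_line is a hypothetical helper name, not part of Spark, and the sample list stands in for the RDD so the function can be tried without a cluster.

```python
def parse_line(line):
    """Split one 'name, code' line into a tuple of whitespace-stripped fields."""
    return tuple(field.strip() for field in line.split(","))

# With a live SparkContext `sc`, the helper would be passed to map:
#   sc.textFile("...").map(parse_line).collect()

# Local check on plain strings, no Spark required
sample = ["Afghanistan, AFG", "American Somoa, ASA"]
parsed = [parse_line(line) for line in sample]
print(parsed)  # [('Afghanistan', 'AFG'), ('American Somoa', 'ASA')]
```

A named function is also easier to test on its own than a nested lambda, which helps if the line format later gains more columns.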