 

How do I split a line into pairs of words rather than single words?

I read a CSV file into an RDD in Jupyter and want to turn each line into a tuple holding a pair of words, rather than a list of single words, but I have no idea how to do it. The CSV file looks something like this:

Afghanistan, AFG
Albania, ALB
Algeria, ALG
American Somoa, ASA
Anguilla, AIA

I've tried this:

lines = sc.textFile(...)
words = lines.flatMap(lambda line: line.split(" "))

but it doesn't return Albania, ALB as one tuple. Instead it treats Albania as one element and ALB as another. Help please!

Asked by Beatrice Cheung, Oct 17 '16 09:10




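1 Answer

You need to use map instead of flatMap, and split on the comma rather than the space. flatMap flattens the list produced for each line into one long sequence of words, whereas map keeps exactly one result per line. You can create your list of tuples as follows: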

result = sc.textFile("...").map(lambda line: tuple(line.split(",")))

result.collect() then returns:

[(u'Afghanistan', u' AFG'), 
 (u'Albania', u' ALB'), 
 (u'Algeria', u' ALG'), 
 (u'American Somoa', u' ASA'), 
 (u'Anguilla', u' AIA')]
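
To see why map keeps the pairs together and flatMap does not, here is a minimal plain-Python sketch with no Spark involved. The list `lines` stands in for the RDD, and a list comprehension and `itertools.chain` stand in for map and flatMap; these stand-ins are assumptions made for illustration, not Spark's implementation.

```python
from itertools import chain

# Stand-in for the RDD of lines; sample rows copied from the question.
lines = ["Afghanistan, AFG", "Albania, ALB"]

# map-like: exactly one output element per input line,
# so each (name, code) pair stays together as a tuple.
mapped = [tuple(line.split(",")) for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat
# sequence, so the pairing between name and code is lost.
flat_mapped = list(chain.from_iterable(line.split(",") for line in lines))

print(mapped)
print(flat_mapped)
```

The flat result is what the flatMap attempt in the question produces, only split on a space instead of a comma.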

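The country codes in the collect() result keep the leading space that followed each comma. To remove it, apply unicode.strip to every field (unicode exists only in Python 2; on Python 3 the equivalent is str.strip):

# Parentheses let the method chain span several lines.
(sc.textFile("....")
   .map(lambda line: tuple(map(unicode.strip, line.split(","))))
   .collect())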
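
On Python 3, where `unicode` is not defined, the same idea could be written as a small named function. This is only a sketch: `parse_line` is a hypothetical helper name, not part of Spark, and it is exercised here on a plain list so that it runs without a Spark context.

```python
def parse_line(line):
    """Split one line on commas and strip whitespace from each field."""
    return tuple(field.strip() for field in line.split(","))


# Plain-Python check on rows from the question. With Spark, the function
# would be passed to map in place of the lambda: rdd.map(parse_line)
rows = [parse_line(line) for line in ["Afghanistan, AFG", "American Somoa, ASA"]]
print(rows)
```

Splitting on "," is enough for the two-column file shown in the question. If a field could itself contain a comma, a real CSV parser such as Python's csv module would be the safer choice.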

Answered by Alex, Oct 04 '22 22:10