How do I split a line into pairs of words rather than single words?

Tags: split, jupyter, pyspark

I read a CSV file into an RDD in Jupyter and want to convert each line into a pair of words rather than single words, then create tuples from those pairs, but I have no idea how to do it. The CSV file looks something like this:

Afghanistan, AFG
Albania, ALB
Algeria, ALG
American Somoa, ASA
Anguilla, AIA

I've tried this:

lines = sc.textFile(...)
words = lines.flatMap(lambda line: line.split(" "))

but it doesn't return Albania, ALB as one tuple. Instead, it treats Albania as one element and ALB as another. Help please!


asked Oct 17 '16 09:10

Beatrice Cheung

1 Answer

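You need to use map instead of flatMap. flatMap flattens the list returned by split into individual elements, whereas map emits exactly one result per line. You should also split on the comma rather than on a space, since a name such as American Somoa contains a space itself. You can create your list of tuples as follows: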

result = sc.textFile("...").map(lambda line: tuple(line.split(",")))

result.collect() then returns:

[(u'Afghanistan', u' AFG'), 
 (u'Albania', u' ALB'), 
 (u'Algeria', u' ALG'), 
 (u'American Somoa', u' ASA'), 
 (u'Anguilla', u' AIA')]
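
To see why map keeps the pairs together and flatMap does not, here is a minimal plain-Python sketch that needs no Spark. It only imitates the two operations on an ordinary list (a list comprehension standing in for map, itertools.chain.from_iterable standing in for flatMap), so treat it as an illustration of the semantics rather than as Spark code. The sample lines are copied from the question.

```python
from itertools import chain

# Sample lines copied from the question; in Spark these would be the RDD's elements.
lines = ["Afghanistan, AFG", "Albania, ALB", "Algeria, ALG"]

# map-like behaviour: exactly one output element per input line,
# so each country stays paired with its code.
mapped = [tuple(line.split(",")) for line in lines]

# flatMap-like behaviour: the per-line results are concatenated into
# one flat sequence, so the pairing is lost.
flat_mapped = list(chain.from_iterable(line.split(",") for line in lines))

print(mapped)
print(flat_mapped)
```

The first list has one tuple per line, while the second has one string per field, which is the behaviour described in the question.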

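Looking at this output, you may want to strip each field to remove the leading spaces. On Python 2 (as in the output above) unicode.strip does this; on Python 3, where the unicode type no longer exists, use str.strip instead:

(sc.textFile("....")
 .map(lambda line: tuple(map(unicode.strip, line.split(","))))
 .collect())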
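
If you would rather not depend on the Python version, one option is to move the split-and-strip logic into a small named function. The sketch below is an assumption-laden illustration: parse_line is a hypothetical helper name, not part of PySpark, and it is exercised here on a plain list so it can be run without a Spark context.

```python
def parse_line(line):
    """Split one 'Country, CODE' line into a tuple of whitespace-stripped fields."""
    # Calling .strip() as a method works on Python 2 unicode/str and on Python 3 str,
    # so no reference to unicode.strip or str.strip is needed.
    return tuple(field.strip() for field in line.split(","))


# Two sample lines copied from the question.
sample = ["Afghanistan, AFG", "American Somoa, ASA"]
pairs = [parse_line(line) for line in sample]
print(pairs)
```

With an RDD, the same function can be passed to map in place of the lambda, for example sc.textFile("....").map(parse_line).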


answered Oct 04 '22 22:10

Alex
