creating spark data structure from multiline record

Tags: python, apache-spark, pyspark

I'm trying to read a Retrosheet event file into Spark. The event file is structured like this:

id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX

As you can see, the records start over with a new id line for each game.

I've read the file into an RDD and then, in a separate for loop, added a key to each line that increments with every new game. This appears to work, but I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.

logFile = '2014TEX.EVA'
# Pull every line back to the driver as a plain Python list
event_data = (sc
              .textFile(logFile)
              .collect())

idKey = 0
newevent_list = []
for line in event_data:
    # An 'id' line marks the start of a new game, so bump the key
    if line.startswith('id'):
        idKey += 1
    newevent_list.append((idKey, line))

# Redistribute the (game key, line) pairs as an RDD
event_data = sc.parallelize(newevent_list)

asked Jul 05 '15 05:07

user1136149

1 Answer

PySpark has supported Hadoop input formats since version 1.1. You can set the textinputformat.record.delimiter option to use a custom record delimiter, as below:

from operator import itemgetter

# Split the file on '\nid,' so that each record holds one whole game
retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)
# Drop empty records, then restore the 'id,' prefix that the delimiter
# consumed (only the first record in the file still has it)
(retrosheet
    .filter(itemgetter(1))
    .values()
    .filter(lambda x: x)
    .map(lambda v: (
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
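
To make the effect of the delimiter easier to see, here is a minimal pure-Python sketch of the same idea that runs without Spark. The helper name split_games is hypothetical and only imitates what the input format and the map step do together. It is not how Hadoop implements the split.

```python
def split_games(text, delimiter='\nid,'):
    """Imitate splitting on a custom record delimiter, one list of lines per game."""
    games = []
    for chunk in text.split(delimiter):
        if not chunk:
            continue  # skip empty records
        # The delimiter is consumed by the split, so put 'id,' back where needed
        record = chunk if chunk.startswith('id') else 'id,{0}'.format(chunk)
        games.append(record.splitlines())
    return games


sample = (
    'id,TEX201403310\n'
    'version,2\n'
    'info,visteam,PHI\n'
    'id,TEX201404010\n'
    'version,2\n'
    'info,visteam,PHI\n'
)
games = split_games(sample)
```

Each element of games is the list of lines belonging to one game, which is the shape the RDD above ends up with.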

Since Spark 2.4 you can also read the data into a DataFrame using the text reader:

spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
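
With either approach each record is the raw text of one game, and every record after the first has lost its leading "id," to the separator. As a rough sketch of turning that text into a structure, the hypothetical helper parse_game below (not part of Spark) uses the standard csv module so that quoted player names are handled. The dict layout is an assumption chosen for illustration, and only the id, info and start line types from the sample are handled.

```python
import csv


def parse_game(record):
    """Parse the text of one game into a dict (hypothetical helper)."""
    # Restore the prefix consumed by the '\nid,' separator
    if not record.startswith('id,'):
        record = 'id,' + record
    game = {'id': None, 'info': {}, 'start': []}
    for row in csv.reader(record.splitlines()):
        if not row:
            continue  # ignore blank lines
        kind = row[0]
        if kind == 'id':
            game['id'] = row[1]
        elif kind == 'info':
            game['info'][row[1]] = row[2]
        elif kind == 'start':
            # Keep the remaining fields as a tuple of strings
            game['start'].append(tuple(row[1:]))
    return game


record = (
    'TEX201403310\n'
    'version,2\n'
    'info,visteam,PHI\n'
    'start,reveb001,"Ben Revere",0,1,8\n'
)
game = parse_game(record)
```

On the RDD of record strings this could be applied with .map(parse_game). On the DataFrame, the text reader puts each record in a column named value, so something like df.rdd.map(lambda row: parse_game(row.value)) should work.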

answered Oct 23 '22 01:10

zero323
