Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interpreting this raw text - a strategy?

I have this raw text:

________________________________________________________________________________________________________________________________
Pos Car  Competitor/Team                Driver                   Vehicle              Cap   CL Laps     Race.Time Fastest...Lap

1     6  Jason Clements                 Jason Clements           BMW M3               3200       10     9:48.5710   3 0:57.3228*
2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409 
3    37  Bruce Cook                     Bruce Cook               Ford  Escort         3759       10     9:56.4388   4 0:58.3359 
4    18  Troy Marinelli                 Troy Marinelli           Nissan  Silvia       3396       10     9:56.7758   2 0:58.4443 
5    75  Anthony Gilbertson             Anthony Gilbertson       BMW M3               3200       10    10:02.5842   3 0:58.9336 
6    26  Trent Purcell                  Trent Purcell            Mazda RX7            2354       10    10:07.6285   4 0:59.0546 
7    12  Scott Hunter                   Scott Hunter             Toyota  Corolla      2000       10    10:11.3722   5 0:59.8921 
8    91  Graeme Wilkinson               Graeme Wilkinson         Ford  Escort         2000       10    10:13.4114   5 1:00.2175 
9     7  Justin Wade                    Justin Wade              BMW M3               4000       10    10:18.2020   9 1:00.8969 
10   55  Greg Craig                     Grag Craig               Toyota  Corolla      1840       10    10:18.9956   7 1:00.7905 
11   46  Kyle Orgam-Moore               Kyle Organ-Moore         Holden VS Commodore  6000       10    10:30.0179   3 1:01.6741 
12   39  Uptiles Strathpine             Trent Spencer            BMW Mini Cooper S    1500       10    10:40.1436   2 1:02.2728 
13  177  Mark Hyde                      Mark Hyde                Ford  Escort         1993       10    10:49.5920   2 1:03.8069 
14   34  Peter Draheim                  Peter Draheim            Mazda RX3            2600       10    10:50.8159  10 1:03.4396 
15    5  Scott Douglas                  Scott Douglas            Datsun  1200         1998        9     9:48.7808   3 1:01.5371 
16   72  Paul Redman                    Paul Redman              Ford  Focus          2lt         9    10:11.3707   2 1:05.8729 
17    8  Matthew Speakman               Matthew Speakman         Toyota  Celica       1600        9    10:16.3159   3 1:05.9117 
18   74  Lucas Easton                   Lucas Easton             Toyota  Celica       1600        9    10:16.8050   6 1:06.0748 
19   77  Dean Fuller                    Dean Fuller              Mitsubishi  Sigma    2600        9    10:25.2877   3 1:07.3991 
20   16  Brett Batterby                 Brett Batterby           Toyota  Corolla      1600        9    10:29.9127   4 1:07.8420 
21   95  Ross Hurford                   Ross Hurford             Toyota  Corolla      1600        8     9:57.5297   2 1:12.2672 
DNF  13  Charles Wright                 Charles Wright           BMW 325i             2700        9     9:47.9888   7 1:03.2808 
DNF  20  Shane Satchwell                Shane Satchwell          Datsun  1200 Coupe   1998        1     1:05.9100   1 1:05.9100 

Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012                     Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended 

I need to parse it into an object with the obvious Position, Car, Driver etc fields. The issue is I have no idea on what sort of strategy to use. If I split it on whitespace, I would end up with a list like so:

["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]

Can you see the issue. I cannot just interpret this list, because people may have just 1 name, or 3 words in a name, or many different words in a car. It makes it impossible to just reference the list using indexes alone.

What about using the offsets defined by the column names? I can't quite see how that could be used though.

Edit: So the current algorithm I am using works like this:

  1. Split the text on new line giving a collection of lines.
  2. Find the common whitespace characters FURTHEST RIGHT on each line. I.e. the positions (indexes) on each line where every other line contains whitespace. EG:
  3. Split the lines based on those common characters.
  4. Trim the lines

Several issues exist:

If the names contain the same lengths like so:

Jason Adams
Bobby Sacka
Jerry Louis

Then it will interpret that as two separate items: (["Jason" "Adams", "Bobby", "Sacka", "Jerry", "Louis"]).

Whereas if they all differed like so:

Dominic Bou
Bob Adams
Jerry Seinfeld

Then it would correctly split on the last 'd' in Seinfeld (and thus we'd get a collection of three names(["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]).

It's also quite brittle. I am looking for a nicer solution.

like image 213
Dominic Bou-Samra Avatar asked May 28 '12 23:05

Dominic Bou-Samra


1 Answers

This is not a good case for regex, you really want to discover the format and then unpack the lines:

lines = str.split "\n"

# you know the field names so you can use them to find the column positions
fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
header = lines.shift until header =~ /^Pos/
positions = fields.map{|f| header.index f}

# use that to construct an unpack format string
format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join
# A4A5A31A25A21A6A12A10

lines.each do |line|
  next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
  data = line.unpack(format).map{|x| x.strip}
  puts data.join(', ')
  # or better yet...
  car = Hash[fields.zip data]
  puts car['Driver']
end
like image 70
pguardiario Avatar answered Sep 30 '22 12:09

pguardiario