Ruby extract data from string using regex

Tags: regex, ruby, jruby

I'm doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string that I receive has the following form:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

I am interested in the Course_Code, Course_Name, and Grade. In this example the values would be:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

Is there some way to use a regular expression, or some other technique, to extract this information easily instead of manually parsing the string? I'm using JRuby in 1.9 mode.

asked Jun 05 '12 21:06 by nikhil


1 Answer

Let's use Ruby's named captures and a self-describing regex!

course_line = /
    ^                  # Starting at the beginning of the line
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
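
When scraping, some lines (a header row, blank lines) will not match, and `match` then returns nil, so `parts['Code']` would raise an error. The following is a minimal sketch of applying the same idea to several lines while skipping the ones that do not match. The helper name `parse_course_line` and the sample `scraped` array are made up for illustration, and the pattern is a compact form of the one above, anchored with `\A` and `\z` instead of `^` and `$`.

```ruby
# Hypothetical helper: returns a hash for a course row, or nil if the
# line does not look like one.
def parse_course_line(line)
  # Compact form of the commented pattern; only the wanted fields are named.
  pattern = /\A\d+\s+(?<Code>\S+)\s+(?<Name>.+\S)\s+\S+\s+(?<Grade>\S+)\s+\S+\z/
  parts = pattern.match(line.strip)
  return nil unless parts
  { code: parts['Code'], name: parts['Name'], grade: parts['Grade'] }
end

# Made-up sample input: the header row, the row from the question, a blank line.
scraped = [
  "Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade",
  "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M",
  ""
]

# Drop the nils produced by non-matching lines.
courses = scraped.map { |line| parse_course_line(line) }.compact
courses.each { |c| puts "#{c[:code]} | #{c[:name]} | #{c[:grade]}" }

#=> CA727 | PRINCIPLES OF COMPILER DESIGN | A
```

The header row is rejected here only because it does not start with a digit; if the real data contains other kinds of lines, the pattern may need tightening.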

answered Oct 31 '22 09:10 by Phrogz
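
Since the question also asks about techniques other than a regular expression, here is a rough sketch using `String#split`. It assumes that the first two columns and the last three columns never contain whitespace, so everything in between belongs to the course name. The helper name `split_course_line` is hypothetical.

```ruby
# Hypothetical helper: split on whitespace and pick fields by position.
# Assumes Sr.No. and Course_Code come first, and Credit, Grade and
# Attendance_Grade are the last three whitespace-free tokens.
def split_course_line(line)
  fields = line.split                 # splits on runs of whitespace
  return nil if fields.length < 6     # too few columns to be a course row
  {
    code:  fields[1],
    name:  fields[2...-3].join(' '),  # everything between code and credit
    grade: fields[-2]
  }
end

row = split_course_line("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts row[:code], row[:name], row[:grade]

#=> CA727
#=> PRINCIPLES OF COMPILER DESIGN
#=> A
```

Unlike a regex, this does not check that the first column is a number, so a header row would be parsed as if it were data, and runs of whitespace inside the name are collapsed to single spaces.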