
Differentiating between slashes in a string using a regular expression

Tags: java, regex

A program that I'm writing (in Java) gets input data made up of three kinds of parts, separated by a slash /. The parts can be one of the following:

  1. A name matching the regular expression \w*
  2. A call matching the expression \w*\(.*\)
  3. A path matching the expression <.*>|\".*\". A path can contain slashes.

An example string could look like this:

bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()

which has the following structure

name/call/call/path/name/path/call

I want to split this string into its parts using a regular expression. My current expression matches the slashes after calls and paths, but I can't get it to also match the slashes after names without matching the slashes that may exist inside paths. The current expression, which only matches slashes after paths and calls, looks like this:

(?<=[\)>\"])/
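For reference, a minimal Java sketch of how this expression behaves with String.split on the example string. The class and method names are only illustrative.

```java
import java.util.Arrays;
import java.util.List;

final class CurrentSplit {
    // Splits on every slash that is directly preceded by ')', '>' or '"'.
    static List<String> split(String input) {
        return Arrays.asList(input.split("(?<=[\\)>\"])/"));
    }

    public static void main(String[] args) {
        String input = "bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()";
        // Prints one part per line.
        split(input).forEach(System.out::println);
    }
}
```

The names stay glued to the part that follows them (bar/foo() and bar/"foo/bar"), because the slash after a name is not preceded by ), > or ".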

How can I expand this expression to also capture slashes after names without including slashes within paths?

Mia Clarke asked May 25 '11 12:05


4 Answers

(\w+|\w+\([^/]*\)(?:/\w+\([^/]*\))*|<[^>]*>|"[^"]*")(?=/|$)

captures this from the string 'bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()'

  • 'bar'
  • 'foo()/foo(bar)'
  • '<foo/bar>'
  • 'bar'
  • '"foo/bar"'
  • 'foo()'

It does not capture the separating slashes, but it does not need to: a slash can simply be assumed between each pair of adjacent parts.

The simpler (\w+|\w+\([^/]*\)|<[^>]*>|"[^"]*")(?=/|$) would capture calls separately:

  • "foo()"
  • "foo(bar)"
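A minimal sketch of how both expressions could be applied in Java with a Matcher loop. The class name, constant names and the parts helper are made up for illustration; only the two regular expressions come from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartMatcher {
    // Variant that keeps runs of adjacent calls together in one match.
    static final Pattern GROUPED = Pattern.compile(
        "(\\w+|\\w+\\([^/]*\\)(?:/\\w+\\([^/]*\\))*|<[^>]*>|\"[^\"]*\")(?=/|$)");

    // Simpler variant that yields every call as its own match.
    static final Pattern SEPARATE = Pattern.compile(
        "(\\w+|\\w+\\([^/]*\\)|<[^>]*>|\"[^\"]*\")(?=/|$)");

    // Collects group 1 of every match, in order of appearance.
    static List<String> parts(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()";
        System.out.println(parts(GROUPED, input));   // 6 parts
        System.out.println(parts(SEPARATE, input));  // 7 parts
    }
}
```

Note that [^/]* means a call whose arguments contain a slash is not matched as a call by either variant.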

EDIT: As usual, here is a breakdown of the regex (the simpler version):

(           # begin group 1 (for alternation)
  \w+       #   at least one word character
|           # or...
  \w+       #   at least one word character
  \(        #   a literal "("
  [^/]*     #   anything but a "/", as often as possible
  \)        #   a literal ")"
|           # or...
  <         #   a "<"
  [^>]*     #   anything but a ">", as often as possible
  >         #   a ">"
|           # or...
  "         #   a '"'
  [^"]*     #   anything but a '"', as often as possible
  "         #   a '"'
)           # end group 1
(?=/|$)     # look-ahead: ...followed by a slash or the end of string
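
In Java, a breakdown in this style can be kept in the source code by compiling with Pattern.COMMENTS, which tells the engine to ignore whitespace and # comments inside the pattern. A sketch, with an illustrative class name and slightly shortened comments:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentedPattern {
    // Same expression as the simpler variant, written in free-spacing form.
    static final Pattern PART = Pattern.compile(
          "(            # begin group 1\n"
        + "  \\w+       #   a name\n"
        + "|            # or...\n"
        + "  \\w+       #   a call: word characters,\n"
        + "  \\(        #   an opening parenthesis,\n"
        + "  [^/]*      #   anything but a slash,\n"
        + "  \\)        #   a closing parenthesis\n"
        + "|            # or...\n"
        + "  <[^>]*>    #   a path in angle brackets\n"
        + "|            # or...\n"
        + "  \"[^\"]*\" #   a path in double quotes\n"
        + ")            # end group 1\n"
        + "(?=/|$)      # followed by a slash or the end of the string\n",
        Pattern.COMMENTS);

    // Collects group 1 of every match, in order of appearance.
    static List<String> parts(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PART.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parts("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()"));
    }
}
```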

Tomalak answered Oct 06 '22 00:10


My first thought was to match slashes that have an even number of quotes to their left, i.e., to use a positive look-behind containing something like (".*")*, but this ends in an exception saying

Look-behind group does not have an obvious maximum length

Honestly, I think you'd be better off with a Matcher, using an alternation of your components (something like \w*|\w*\(.*\)|(<.*>|\".*\")) and looping with while (matcher.find()).
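
A rough sketch of that idea, which also reports the kind of each part. It does not use the expression above verbatim: taken literally, the leading \w* can match the empty string and would win over the call alternative, and the greedy .* can run past the end of a part. The tightened pattern, the group names and the class name below are therefore assumptions, not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartClassifier {
    // Calls are tried before names so that "foo" is not taken out of "foo()".
    // Assumptions: call arguments contain no ')' and paths do not nest.
    static final Pattern PART = Pattern.compile(
        "(?<call>\\w+\\([^)]*\\))|(?<path><[^>]*>|\"[^\"]*\")|(?<name>\\w+)");

    // Returns the kind of each part, in order of appearance.
    static List<String> kinds(String input) {
        List<String> kinds = new ArrayList<>();
        Matcher matcher = PART.matcher(input);
        while (matcher.find()) {
            if (matcher.group("call") != null) {
                kinds.add("call");
            } else if (matcher.group("path") != null) {
                kinds.add("path");
            } else {
                kinds.add("name");
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        String input = "bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()";
        // Prints name/call/call/path/name/path/call
        System.out.println(String.join("/", kinds(input)));
    }
}
```

The loop only looks for parts and does not check that they are actually separated by slashes, so malformed input is skipped over silently.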

aioobe answered Oct 06 '22 00:10


Leaving the delimiter unescaped when it also appears inside your input might not be the best choice. However, you do have the luxury that the "false" slashes only occur inside a regular pattern. What I suggest:

  1. Split the whole string on "/"
  2. Parse each part until you get to the start of the path
  3. Put the path elements into a list until the end of the path
  4. Rejoin the path back on "/"
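
The steps above could be sketched roughly as follows. This assumes a path always starts with < or " at the beginning of a piece and ends with the matching > or " at the end of a piece. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitAndRejoin {
    static List<String> parse(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder path = null; // non-null while we are inside a path
        char closer = 0;           // character that ends the current path

        for (String piece : input.split("/")) {
            if (path != null) {
                // Inside a path: put the slash back and keep collecting.
                path.append('/').append(piece);
                if (endsWith(piece, closer)) {
                    parts.add(path.toString());
                    path = null;
                }
            } else if (piece.startsWith("<") || piece.startsWith("\"")) {
                closer = piece.startsWith("<") ? '>' : '"';
                if (piece.length() > 1 && endsWith(piece, closer)) {
                    parts.add(piece); // path without any slash in it
                } else {
                    path = new StringBuilder(piece);
                }
            } else {
                parts.add(piece); // a name or a call
            }
        }
        if (path != null) {
            parts.add(path.toString()); // unterminated path: keep what we have
        }
        return parts;
    }

    private static boolean endsWith(String piece, char c) {
        return !piece.isEmpty() && piece.charAt(piece.length() - 1) == c;
    }

    public static void main(String[] args) {
        System.out.println(parse("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()"));
    }
}
```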

I highly recommend you consider escaping the "/" in your paths to make your life easier.

Andrew White answered Oct 06 '22 01:10


This pattern captures all parts of your example string separately without including the delimiter in the results:

\w+\(.*?\)|<.*>|\".*\"|\w+
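
A quick Java check of this pattern (class and method names are illustrative). It works on the example string, but because <.*> and \".*\" are greedy, an input with two paths of the same kind would presumably be swallowed as one match. The second input below is made up to show that case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyCheck {
    static final Pattern PART = Pattern.compile("\\w+\\(.*?\\)|<.*>|\".*\"|\\w+");

    // Collects every match of the pattern, in order of appearance.
    static List<String> parts(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PART.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The example from the question: seven separate parts.
        System.out.println(parts("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()"));
        // Hypothetical input with two angle-bracket paths: one single match.
        System.out.println(parts("<a/b>/x/<c/d>"));
    }
}
```

Replacing the greedy parts with negated character classes such as <[^>]*> and "[^"]*" avoids this.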

Stephan answered Oct 06 '22 00:10