Regular expression to split by forward slash

Tags: java, regex

I have a parse tree that contains some information. To extract what I need, I split the string on the forward slash (/), but that approach is not reliable. Details below:

I used this code in an earlier project, where it worked perfectly. The parse trees in my new dataset are more complicated, though, and the code sometimes splits in the wrong place.

The parse tree is something like this:

(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) )

As you see, the leaves of the tree are the words right before the forward slashes. To get these words, I have used this code before:

parse_tree.split("/");

But now, in my new data, I see instances like these:

1) (TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )

where there are multiple slashes because of a website address (in this case, only the last slash separates the word from its tag).

2) (NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )

where the slash is itself a word.

Could you please help me replace my current simple regular expression with one that can handle these cases?

To summarize, I need a regular expression that splits on the forward slash but handles two exceptions: 1) if there is a website address, it must split on the last slash only. 2) If there are two consecutive slashes, it must split on the second slash (the first slash must NOT be treated as a separator; it is a WORD).

asked May 08 '15 10:05

user1419243

2 Answers

I achieved what you requested following this article:

http://www.rexegg.com/regex-best-trick.html

To summarize, here is the overall strategy:

First, create a regex in this format:

NotThis | NeitherThis | (IWantThis)

Capture group 1 ($1) will then be set only for the slashes you actually want to split on.

You can replace those slashes with a placeholder that is unlikely to occur in the text, and then split on that placeholder.

So, having this strategy in mind, here's the code:

Regex:

\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)

Explanation:

The NotThis term matches the first of two consecutive slashes, using a lookahead so that only the first slash is consumed:

\\/(?=\\/)

The NeitherThis term is a basic URL check, with a lookahead so that the last slash is not consumed:

(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)

The IWantThis term is simply the slash:

(\\/)

In the Java code you can put this all together doing something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Alternatives 1 and 2 match slashes that must be kept; only alternative 3 captures (group 1).
Pattern p = Pattern.compile("\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

Matcher m = p.matcher("(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) )\n(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )\n(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )");
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only for a separator slash: swap it for the placeholder.
    if (m.group(1) != null) m.appendReplacement(b, "Superman");
    // Otherwise copy the match back unchanged (quoted so '$' and '\' stay literal).
    else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println("\n" + "*** Replacements ***");
System.out.println(replaced);

// Split on the placeholder instead of on the raw slash.
String[] splits = replaced.split("Superman");
System.out.println("\n" + "*** Splits ***");
for (String split : splits) System.out.println(split);

Output:

*** Replacements ***                                                                                                                                                                                  
(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 ISupermanPRP ) (VP~did~3~1 didSupermanVBD notSupermanRB (VP~read~2~1 readSupermanVB (NPB~article~2~2 theSupermanDT articleSupermanNN .SupermanPUNC. ) ) ) ) )      
(TOP SourceSupermanNN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htmSupermanX .Superman. )                                                                                    
(NPB~sister~2~2 YourSupermanPRP$ sisterSupermanNN /SupermanPUNC: )                                                                                                                                           

*** Splits ***                                                                                                                                                                                        
(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I                                                                                                                                                                  
PRP ) (VP~did~3~1 did                                                                                                                                                                                 
VBD not                                                                                                                                                                                               
RB (VP~read~2~1 read                                                                                                                                                                                  
VB (NPB~article~2~2 the                                                                                                                                                                               
DT article                                                                                                                                                                                            
NN .                                                                                                                                                                                                  
PUNC. ) ) ) ) )                                                                                                                                                                                       
(TOP Source                                                                                                                                                                                           
NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm                                                                                                                             
X .                                                                                                                                                                                                   
. )
(NPB~sister~2~2 Your                                                                                                                                                                                  
PRP$ sister                                                                                                                                                                                           
NN /
PUNC: )
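
If the placeholder word could ever occur in the data itself, the replace-then-split step can be skipped. The following is only a sketch of that idea, not part of the original answer: it reuses the same pattern and cuts the input at every match where group 1 is set. The class name `PlaceholderFreeSplit` and the method `split` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: same NotThis | NeitherThis | (IWantThis) pattern, no placeholder.
class PlaceholderFreeSplit {
    private static final Pattern SLASH = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = SLASH.matcher(input);
        int start = 0;
        while (m.find()) {
            // Only a match with group 1 set is a separator; other matches are skipped.
            if (m.group(1) != null) {
                parts.add(input.substring(start, m.start(1)));
                start = m.end(1);
            }
        }
        // Whatever follows the last separator is the final piece.
        parts.add(input.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )")) {
            System.out.println(part);
        }
    }
}
```

Unlike `String.split`, this sketch keeps a trailing empty piece when the input ends with a separator slash.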

answered Sep 18 '22 23:09

Rodrigo López

You should be able to use a negative lookbehind in the regex. This would need a bigger sample of inputs to be sure, but here is how it behaves on your two cases:

    // Split on a slash that is NOT preceded by ':' or '/'.
    String pattern = "(?<![:/])/";

    String s1 = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
    System.out.println("first case:");
    System.out.println(String.join(",\n", s1.split(pattern)));
    System.out.println("\n");

    String s2 = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
    System.out.println("second case");
    System.out.println(String.join(",\n", s2.split(pattern)));

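This outputs the following. Note that the slashes inside the URL path are still treated as separators, and in the second case the split happens at the first of the two slashes: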

first case:
(TOP Source,
NN http://www.alwatan.com.sa,
daily,
2007-01-31,
first_page,
first_page01.htm,
X .,
. )


second case
(NPB~sister~2~2 Your,
PRP$ sister,
NN ,
/PUNC: )
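
Since the output above still splits inside the URL, one possible variation is to use a lookahead instead of a lookbehind and accept only the last slash of each whitespace-delimited token. This is a sketch under two assumptions that the question does not confirm: every word/tag pair is separated from its neighbours by whitespace, and a tag never contains a slash or whitespace. The class name `LastSlashSplit` and the method `splitTags` are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: split on a slash only when no other slash follows in the same token.
class LastSlashSplit {
    // '/' followed by non-slash, non-space characters up to whitespace or end of input.
    // Assumes tags contain neither '/' nor whitespace.
    private static final Pattern LAST_SLASH = Pattern.compile("/(?=[^/\\s]*(?:\\s|$))");

    static String[] splitTags(String parseTree) {
        return LAST_SLASH.split(parseTree);
    }

    public static void main(String[] args) {
        String s1 = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
        String s2 = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
        System.out.println(String.join(",\n", splitTags(s1)));
        System.out.println();
        System.out.println(String.join(",\n", splitTags(s2)));
    }
}
```

With these two inputs, the URL should stay in one piece and the first of the two consecutive slashes should stay attached to the text before `PUNC:`. A tag that itself contains a slash would break the assumption and split in the wrong place.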

answered Sep 18 '22 23:09

ncoronges
