
How can I use UIMA Ruta to match all the words between line breaks?

Tags:

uima

ruta

Thanks in advance for any help!

I have some text like the following

aaaaa aaaa aaaaa aaaaaa
bbbbb bbbbb bbbb bbbbbb
cccccc ccccc ccccc cccccc

I want to use Ruta to create annotations that cover all the text between line breaks. The annotations should produce the following three matches:

1. aaaaa aaaa aaaaa aaaaaa
2. bbbbb bbbbb bbbb bbbbbb
3. cccccc ccccc ccccc cccccc

I tried to match everything between line breaks with the following rule:

BREAK #{-> MARK(Stuff)} BREAK;

But no luck. Could anyone please make a suggestion?

Thank you very much!

Cheung Brian asked Dec 12 '25 19:12


1 Answer

The problem with your rule is probably the current filtering setting. By default, whitespace, line breaks and markup are not visible to the rules, so the rule probably cannot find any anchor to start the matching process. You need to make breaks visible to the rules, e.g., with RETAINTYPE:

Document{-> RETAINTYPE(BREAK)};
BREAK #{-> MARK(Stuff)} BREAK;
Document{-> RETAINTYPE}; // for restoring the default setting

There is also an analysis engine that can create these annotations: PlainTextAnnotator. However, the annotations it creates also include whitespace at the beginning and end of each line. This whitespace can be removed with something like:

Document{-> RETAINTYPE(SPACE)};
Line{->TRIM(SPACE)};

In UIMA Ruta 2.2.1 (the next release) you can also write something like:

Document{-> RETAINTYPE(BREAK)};
(#{-> Stuff} BREAK)+;
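To check the result of any of the rules above, it can help to know which spans you expect. The following is only a sketch in plain Python, not Ruta and not part of the UIMA API: the helper name `line_spans` is made up for illustration, and it assumes that "line" means a maximal run of characters without `\r` or `\n`, trimmed of spaces and tabs. It computes the begin/end offsets that the annotations should have, which is handy for spotting a missed first or last line when the text does not start or end with a line break.

```python
import re

def line_spans(text):
    """Return (begin, end, covered_text) for each non-blank line, trimmed of spaces and tabs."""
    spans = []
    # Each match is a maximal run of characters that are not line breaks.
    for match in re.finditer(r"[^\r\n]+", text):
        line = match.group()
        stripped = line.strip(" \t")
        if not stripped:
            continue  # skip lines that contain only whitespace
        # Shift the begin offset past any leading spaces or tabs.
        begin = match.start() + (len(line) - len(line.lstrip(" \t")))
        spans.append((begin, begin + len(stripped), stripped))
    return spans

text = "aaaaa aaaa aaaaa aaaaaa\nbbbbb bbbbb bbbb bbbbbb\ncccccc ccccc ccccc cccccc"
for begin, end, covered in line_spans(text):
    print(begin, end, covered)
```

The offsets can then be compared with the begin and end values of the Stuff (or trimmed Line) annotations shown in your CAS viewer.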

(I am a developer of UIMA Ruta)

Peter Kluegl answered Dec 16 '25 22:12


