Getting text around a specific element reference

Tags: javascript, html, jquery, regex

Having an HTML snippet like this:

<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again
   and <mark>dolor</mark></p>

I can select the <mark> elements using $("mark"). For each one, I want a string made of the marked word plus 5 characters on its left side and 5 characters on its right side, prefixed and suffixed with [...].

For this example it would be:

[
   "[...] psum dolor sit [...]",
   "[...] met. Lorem ipsu [...]",
   "[...] and dolor [...]",
]

Currently I'm doing something like this:

var $highlightMarks = $("mark");
var results = [];

for (var i = 0; i < $highlightMarks.length; ++i) {
  var $c = $highlightMarks.eq(i);
  var text = $c.parent().text().trim().replace(/\n/g, " ");
  var indexStart = new RegExp($c.html(), "gim").exec(text).index;
  text = "[...] " + text.substring(indexStart - 5, $c.html().length + indexStart + 5) + " [...]";
  results.push(text);
}

alert(JSON.stringify(results))

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</p>

But this fails when two words are the same in the same paragraph (in this example: the dolor case).

Instead of showing "psum dolor sit" at the end of the array, it should show "and dolor.".

So, having a reference to the <mark> element, what's the correct way to get some text on the right side and some text on the left side?

asked Dec 31 '15 09:12

Ionică Bizău

1 Answer

This is a two-step, bulletproof implementation using only regex (counterexamples are welcome).

Its greatest virtue is that it works independently of any container tag (just like the ... to extract the text around marks).

var filter = /<(?![/]?mark)[^><]*>/gi;

var regex  = /((?:(?!<[/]mark\s*>).){0,5})<mark\s*>([^<]*)<[/]mark\s*>(?=((?:(?!<mark\s*>).){0,5}))/ig;
var subst  = "$1 $2 $3";

var tests  = [
    '<p>Lorem ipsum mark> <MARK  >dolor</MARK > < mark sitamet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</p>',
    '<P style="margin: 0 15px 15px 0;">um <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</P>',
    '<p>um <mark>dolor</mark> <span>sit</ span> <test amet. <mark>Lorem</mark> <b>i</b>psum again and <mark>dolor</mark>.</p>',
    '<p style="margin: 0 15px 15px 0;" another_tag="123">Lorem ipsum <MARK  >dolor</MARK > sit <mark>amet.</mark><mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</P>'
];


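// Declare the loop variables instead of leaking them as implicit globals
var t, r, pre, marked, post;

while ((t = tests.pop())) {

    document.write('<b>INPUT</b> <xmp>' + t + '</xmp>');

    // Step 1: strip every tag that is not an opening or closing mark tag
    t = t.replace(filter, '');
    document.write('<b>Filtered:</b> <xmp>' + t + '</xmp>');

    // Step 2: collect each mark with up to 5 characters of context per side
    while ((r = regex.exec(t)) !== null) {

        pre = r[1]; marked = r[2]; post = r[3];
        document.write('<b>Match:</b> "' + pre + ' <mark>' + marked + '</mark> ' + post + '"<hr/>');
    }
}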

How it works

Filter out every tag that is not a <mark> or a </mark> tag. The match is case-insensitive and tolerant of whitespace, in line with what Chrome and Firefox accept: the regex also treats the variations <mark > and </mark > as valid tags, but not < mark> or </ mark>:

```
/<(?![/]?mark)[^><]*>/gi
```
NOTE: this filter handles the single chars '<' and '>' correctly (with or without text after/before them).

This behaves differently from a browser with regard to the opening tag character <: in a browser, anything after <someText up to the end of the next valid tag is removed (breaking valid HTML tags). I prefer not to do it this way, and instead treat an opening '<' that is never closed as a plain character.

e.g.: Some text <notAtag other text <mark>marked</mark>. Chrome or Firefox will output "Some text marked" (with "marked" not actually marked, because the <mark> tag has been filtered out together with <notAtag other text).
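
The filter's behaviour on these edge cases can be checked with a minimal sketch like the one below. The helper name stripNonMarkTags and the sample strings are made up for illustration; only the filter regex comes from the answer.

```javascript
// Same filter as in the answer: removes every tag except <mark ...> and </mark ...>.
var filter = /<(?![/]?mark)[^><]*>/gi;

// Hypothetical helper name, used only for this illustration.
function stripNonMarkTags(html) {
    return html.replace(filter, '');
}

// Ordinary tags are removed; mark tags with trailing whitespace are kept.
var kept = stripNonMarkTags('<p>a <b>bold</b> <mark >x</mark ></p>');

// A tag with a leading space is not treated as a mark tag, so it is removed.
var leadingSpace = stripNonMarkTags('a < mark>b');

// An opening '<' that is never closed is left alone as a plain character.
var unclosed = stripNonMarkTags('Some text <notAtag other text <mark>marked</mark>');

console.log(kept);
console.log(leadingSpace);
console.log(unclosed);
```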

Select the marked text and its context (up to 5 characters)

/((?:(?!<[/]mark\s*>).){0,5}) #* 0 to 5 chars, none of which starts a closing '</mark\s*>'
                              #  the round brackets save them in group $1
<mark\s*>                     #* literal string '<mark' followed by 
                              #  0 or more whitespace chars, then a literal '>'
([^<]*)                       #* 0 or more chars that are not '<'
                              #  the round brackets save them in group $2
<[/]mark\s*>                  #* literal string '</mark' followed by 
                              #  0 or more whitespace chars, then a literal '>'
(?=((?:(?!<mark\s*>).){0,5})) #* 0 to 5 chars, none of which starts an opening '<mark\s*>'
                              #  the lookahead (?=...) is used to not consume them
                              #  the round brackets save them in group $3

/ig                           #* i: case-insensitive, g: global search
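
As a rough sketch of how the two steps could be combined to produce the list asked for in the question: the helper name markContexts is hypothetical, and trimming the snippet before adding the [...] markers is an assumption made here to match the question's expected output. Note that `.` does not match line breaks, so the context stops at a newline; the single-line paragraph from the question is used below.

```javascript
var filter = /<(?![/]?mark)[^><]*>/gi;
var regex  = /((?:(?!<[/]mark\s*>).){0,5})<mark\s*>([^<]*)<[/]mark\s*>(?=((?:(?!<mark\s*>).){0,5}))/ig;

// Hypothetical helper: returns one "[...] pre+marked+post [...]" string per mark.
function markContexts(html) {
    var text = html.replace(filter, ''); // step 1: keep only the mark tags
    var results = [];
    var r;
    regex.lastIndex = 0; // the regex is global, so reset its state before reuse
    while ((r = regex.exec(text)) !== null) {
        // r[1] = pre context, r[2] = marked text, r[3] = post context
        // Trimming is an assumption, not part of the original answer.
        results.push('[...] ' + (r[1] + r[2] + r[3]).trim() + ' [...]');
    }
    return results;
}

var html = '<p>Lorem ipsum <mark>dolor</mark> sit amet. <mark>Lorem</mark> ipsum again and <mark>dolor</mark>.</p>';
var contexts = markContexts(html);
console.log(JSON.stringify(contexts));
// ["[...] psum dolor sit [...]","[...] met. Lorem ipsu [...]","[...] and dolor. [...]"]
```

Because each match consumes its own closing tag, the second "dolor" is found at its real position rather than at the first occurrence of the word, which is the case the question's code gets wrong.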

NOTE: The regex is smart enough to select up to 5 chars from both the previous and the next <mark> where applicable (e.g. in </mark>12345<mark>, 12345 is both the post context of the closing tag and the pre context of the opening tag).

In addition, the context selection never extends across <mark> tags, so:

where there are two adjacent <mark>...</mark><mark>...</mark> tags, nothing is selected as post/pre context;
</mark>123<mark>: only 123 is selected as post/pre context.
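
These three cases can be verified with a small sketch. The helper name contextTriples and the input strings are illustrative assumptions; the regex is the one from the answer.

```javascript
var regex = /((?:(?!<[/]mark\s*>).){0,5})<mark\s*>([^<]*)<[/]mark\s*>(?=((?:(?!<mark\s*>).){0,5}))/ig;

// Hypothetical helper: returns a [pre, marked, post] triple for every mark.
function contextTriples(text) {
    var out = [];
    var r;
    regex.lastIndex = 0; // reset the global regex before each run
    while ((r = regex.exec(text)) !== null) {
        out.push([r[1], r[2], r[3]]);
    }
    return out;
}

// 5 chars between two marks: shared as post context of the first and pre context of the second.
var shared   = contextTriples('<mark>a</mark>12345<mark>b</mark>');
// Adjacent marks: no context is selected between them.
var adjacent = contextTriples('<mark>a</mark><mark>b</mark>');
// Fewer than 5 chars between marks: only those chars are selected.
var partial  = contextTriples('<mark>a</mark>123<mark>b</mark>');

console.log(JSON.stringify(shared));   // [["","a","12345"],["12345","b",""]]
console.log(JSON.stringify(adjacent)); // [["","a",""],["","b",""]]
console.log(JSON.stringify(partial));  // [["","a","123"],["123","b",""]]
```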

answered Oct 31 '22 13:10

Giuseppe Ricupero
