I'm trying to remove whitespace from an HTML fragment between <p>
tags
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
as you can see, there always is a blank space between the <p> </p>
tags.
The problem is that the blank spaces create <br>
tags when saving the string into my database.
Methods like strip
or gsub
only remove the whitespace in the nodes, resulting in:
<p>FooBar</p> <p>barbarbar</p> <p>bla</p>
whereas I'd like to have:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
I'm using:
Occasionally there are children nodes of the <p>
Tags that generate the same problem: white space between
Sample Code
Note: the Code normally is in one Line, I reformatted it because it would be unbearable otherwise...
<p>
<p>
<strong>Selling an Appartment</strong>
</p>
<ul>
<li>
<p>beautiful apartment!</p>
</li>
<li>
<p>near the train station</p>
</li>
.
.
.
</ul>
<ul>
<li>
<p>10 minutes away from a shopping mall </p>
</li>
<li>
<p>nice view</p>
</li>
</ul>
.
.
.
</p>
How would I strip those white spaces aswell?
It turns out that I messed up using the gsub
method and didn't further investigate the possibility of using gsub
with regex
...
The simple solution was adding
data = data.gsub(/>\s+</, "><")
It deleted whitespace between all different kinds of nodes... Regex ftw!
This is how I'd write the code:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT
doc.search('p, ul, li').each { |node|
next_node = node.next_sibling
next_node.remove if next_node && next_node.text.strip == ''
}
puts doc.to_html
It results in:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
Breaking it down:
doc.search('p')
looks for only the <p>
nodes in the document. Nokogiri returns a NodeSet from search
, or a nil if nothing matched. The code loops over the NodeSet, looking at each node in turn.
next_node = node.next_sibling
gets the pointer to the next node following the current <p>
node.
next_node.remove if next_node && next_node.text.strip == ''
next_node.remove
removes the current next_node
from the DOM if the next node isn't nil and its text isn't empty when stripped, in otherwords, if the node has only whitespace.
There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With