I'm trying to remove hard spaces (from
entities in the HTML). I can't remove it with .trim()
or .replace(" ", "")
, etc! I don't get it.
I even found on Stackoverflow to try with \\u00a0
but didn't work neither.
I tried this (since text()
returns actual hard space characters, U+00A0):
System.out.println( "'"+fields.get(6).text().replace("\\u00a0", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().replace(" ", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().trim()+"'"); //'94,00 '
System.out.println( "'"+fields.get(6).html().replace(" ", "")+"'"); //'94,00' works
But I can't figure out why I can't remove the white space with .text()
.
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
Where. document − document object represents the HTML DOM. Jsoup − main class to parse the given HTML String. html − HTML String. sampleDiv − Element object represent the html node element identified by id "sampleDiv".
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML specification, and parses HTML to the same DOM as modern browsers do.
Your first attempt was very nearly it, you're quite right that Jsoup maps
to U+00A0. You just don't want the double backslash in your string:
System.out.println( "'"+fields.get(6).text().replace("\u00a0", "")+"'" ); //'94,00'
// Just one ------------------------------------------^
replace
doesn't use regular expressions, so you aren't trying to pass a literal backslash through to the regex level. You just want to specify character U+00A0 in the string.
The question has been edited to reflect the true problem.
New answer;
The hardspace, ie. entity (Unicode character NO-BREAK SPACE U+00A0 ) can in Java be represented by the character \u00a0,
thus code becomes, where str
is the string gotten from the text()
method
str.replaceAll ("\u00a0", "");
Old answer; Using the JSoup library,
import org.jsoup.parser.Parser;
String str1 = Parser.unescapeEntities("last week, Ovokerie Ogbeta", false);
String str2 = Parser.unescapeEntities("Entered » Here", false);
System.out.println(str1 + " " + str2);
Prints out:
last week, Ovokerie Ogbeta Entered » Here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With