How to remove hard spaces with Jsoup?

Tags: java, jsoup

I'm trying to remove hard spaces (from &nbsp; entities in the HTML). I can't remove them with .trim() or .replace(" ", ""), etc! I don't get it.

I even found a suggestion on Stack Overflow to try \\u00a0, but that didn't work either.

I tried this (since text() returns actual hard space characters, U+00A0):

System.out.println( "'"+fields.get(6).text().replace("\\u00a0", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().replace(" ", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().trim()+"'"); //'94,00 '
System.out.println( "'"+fields.get(6).html().replace("&nbsp;", "")+"'"); //'94,00' works

But I can't figure out why I can't remove the white space with .text().

Carlos Goce asked Jan 15 '14 12:01



2 Answers

Your first attempt was very nearly right, and you're correct that Jsoup maps &nbsp; to U+00A0. You just don't want the double backslash in your string:

System.out.println( "'"+fields.get(6).text().replace("\u00a0", "")+"'" ); //'94,00'
// Just one ------------------------------------------^

replace doesn't use regular expressions, so you aren't trying to pass a literal backslash through to the regex level. You just want to specify character U+00A0 in the string.
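
To make the difference between the two escape levels concrete, here is a minimal sketch using only the standard library. The class name, method names, and the sample value are made up for illustration; the sample stands in for what text() returned in the question.

```java
class EscapeLevels {
    // Hypothetical sample value: "94,00" followed by one NO-BREAK SPACE (U+00A0).
    static final String SAMPLE = "94,00\u00a0";

    // Double backslash + replace(): searches for six literal characters, finds nothing.
    static String literalSearch(String s) {
        return s.replace("\\u00a0", "");
    }

    // Single backslash + replace(): the compiler turns the escape into U+00A0.
    static String characterSearch(String s) {
        return s.replace("\u00a0", "");
    }

    // Double backslash + replaceAll(): the regex engine decodes the escape itself.
    static String regexSearch(String s) {
        return s.replaceAll("\\u00a0", "");
    }

    public static void main(String[] args) {
        System.out.println("\\u00a0".length());                  // 6
        System.out.println("\u00a0".length());                   // 1
        System.out.println("'" + literalSearch(SAMPLE) + "'");   // hard space still present
        System.out.println("'" + characterSearch(SAMPLE) + "'"); // '94,00'
        System.out.println("'" + regexSearch(SAMPLE) + "'");     // '94,00'
    }
}
```

The double-backslash form is most likely advice meant for replaceAll, which takes a regular expression, and that is why it has no effect when passed to replace.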

T.J. Crowder answered Oct 11 '22 15:10


The question has been edited to reflect the true problem.

New answer: the hard space, i.e. the entity &nbsp; (Unicode character NO-BREAK SPACE, U+00A0), can be represented in Java by the escape \u00a0. The code therefore becomes the following, where str is the string returned by the text() method:

str = str.replaceAll("\u00a0", ""); // strings are immutable, so keep the returned value
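
As for why trim() made no difference in the question: trim() only strips characters up to U+0020 from the ends of a string, and U+00A0 lies above that range. A small sketch of a cleanup helper follows, assuming the goal is simply to drop hard spaces from a value such as the question's '94,00 '. The class and method names are hypothetical, and the \h character class needs Java 8 or later.

```java
class HardSpaceCleaner {
    // Removes every NO-BREAK SPACE (U+00A0), then trims ordinary leading/trailing whitespace.
    static String clean(String s) {
        return s.replace("\u00a0", "").trim();
    }

    // Removes all horizontal whitespace (spaces, tabs, U+00A0 and similar) via the regex class \h.
    static String removeHorizontalSpace(String s) {
        return s.replaceAll("\\h", "");
    }

    public static void main(String[] args) {
        String raw = " 94,00\u00a0"; // hypothetical cell text: space, digits, hard space
        System.out.println("'" + raw.trim() + "'");                  // hard space survives trim()
        System.out.println("'" + clean(raw) + "'");                  // '94,00'
        System.out.println("'" + removeHorizontalSpace(raw) + "'");  // '94,00'
    }
}
```

Note that removeHorizontalSpace also deletes spaces inside the value, so it only suits fields like numbers where no internal spacing should be kept.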

Old answer: using the jsoup library,

import org.jsoup.parser.Parser;

String str1 = Parser.unescapeEntities("last week,&nbsp;Ovokerie Ogbeta", false);
String str2 = Parser.unescapeEntities("Entered &raquo; Here", false);
System.out.println(str1 + " " + str2);

Prints out:

last week, Ovokerie Ogbeta Entered » Here 
Ovokerie Ogbeta answered Oct 11 '22 16:10