
Eliminate whitespace around single letters

Tags: regex, perl

I frequently receive PDFs that, when converted with pdftotext, contain spaces between the letters of some arbitrary words:

This i s a n example t e x t that c o n t a i n s strange spaces.

For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:

This isan example text that contains strange spaces.

I tried to achieve this with a simple perl regex:

s/ (\w) (\w) / $1$2 /g

This of course does not work: once the first and second standalone letters have been joined, the second one is no longer standalone (its trailing space was consumed by the match), so the space before the third letter does not match:

This is a n example te x t that co n ta i ns strange spaces.

So I tried lookahead assertions, but failed to achieve anything (partly because I did not find any example that uses them in a substitution).

As usual with Perl regexes, my feeling is that there must be a very simple and elegant solution for this...

asked Jan 24 '23 04:01 by Daniel


2 Answers

Just match a continuous series of single letters separated by whitespace, then delete the spaces from that match with a nested substitution in the replacement (the /e modifier evaluates the replacement as Perl code).

s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
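
For anyone who wants to try this outside a one-liner, here is a minimal sketch that wraps the same idea in a sub. The name collapse_letter_runs is made up for illustration, and the pattern is a slight variation on the one above: it uses a non-capturing group and strips any whitespace (\s+) inside the run, not only plain spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): joins runs of
# single word characters that are separated by whitespace.
sub collapse_letter_runs {
    my ($line) = @_;
    $line =~ s{\b ((?:\w\s)+\w) \b}{
        my $run = $1;        # the whole run, e.g. "t e x t"
        $run =~ s/\s+//g;    # drop the whitespace inside the run
        $run;                # value of the block becomes the replacement
    }xge;
    return $line;
}

print collapse_letter_runs('This i s a n example t e x t that c o n t a i n s strange spaces.'), "\n";
# prints: This isan example text that contains strange spaces.
```

Because the run is delimited by \b on both sides, a run that ends right before punctuation (for example "e n d.") should be joined as well.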
answered Feb 01 '23 04:02 by Dave Mitchell


Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:

$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.

Note that i s a n cannot be distinguished from a normal four-letter word that has been spaced out. Splitting it into "is an" requires human correction or some language module.
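
A small sketch of that caveat, using a made-up input line in which "a" is a genuine one-letter word. The substitution is the one from the command above; it has no way to tell that "a" should stay separate.

```perl
use strict;
use warnings;

# Made-up input: "a" is a real word here, "b i r d" is the damaged one.
my $line = 'I saw a b i r d today.';

# Copy the line, then apply the same substitution as in the one-liner.
(my $joined = $line) =~ s/(?<!\S)(\S) (?=\S )/$1/g;

print "$joined\n";
# prints: I saw abird today.
```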

Explanation:

  • (?<!\S) is a negative look-behind assertion: the preceding character must not be a non-whitespace character, so we are either after whitespace or at the start of the string.
  • (\S) then matches a single non-whitespace character, which we capture with parens. It is followed by a literal space, which we will remove (or rather, not put back).
  • (?=\S ) is a look-ahead assertion that checks that what follows is a non-whitespace character followed by a space. The look-ahead does not consume anything, so that part of the string is left unchanged and the next match can start there.
  • The replacement puts back the captured character with $1.
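
The same regex can be spread out with the /x modifier so that each piece carries its own comment. This is only a sketch: the sub name join_spaced_letters is made up, and the literal spaces are written as [ ] because /x ignores bare spaces in the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration only).
sub join_spaced_letters {
    my ($line) = @_;
    $line =~ s{
        (?<!\S)      # look-behind: after whitespace or at the start
        (\S)         # one non-whitespace character, captured
        [ ]          # the space that gets dropped
        (?=\S[ ])    # look-ahead: one character, then a space
    }{$1}gx;
    return $line;
}

print join_spaced_letters('This i s a n example t e x t that c o n t a i n s strange spaces.'), "\n";
# prints: This isan example text that contains strange spaces.
```

Since the look-ahead insists on a space after the next character, a run that ends directly before punctuation or at the end of the line keeps its last gap ("e n d." becomes "en d.").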

It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to make that change if you feel it is appropriate.
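
A sketch of that variant, with a made-up sub name. Only the plain space counts as a separator here; tabs and newlines are treated like any other character.

```perl
use strict;
use warnings;

# Hypothetical helper: same substitution with [^ ] in place of \S.
sub join_on_spaces_only {
    my ($line) = @_;
    $line =~ s/(?<![^ ])([^ ]) (?=[^ ] )/$1/g;
    return $line;
}

print join_on_spaces_only('This i s a n example t e x t that c o n t a i n s strange spaces.'), "\n";
# prints: This isan example text that contains strange spaces.
```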

answered Feb 01 '23 04:02 by TLP