Repair sentences that have line breaks in the middle of them: Python is \n is fun

Question

I am currently extracting text from PDFs using Apache Tika and using NLTK for named entity recognition and other tasks. The problem is that sentences in the PDF document are extracted with line breaks in the middle of them. For example:

I am a sentence that has a python line \nbreak in the middle of it.

The pattern is usually a space followed by a line break, <space>\n, or sometimes <space>\n<space>. I want to repair these sentences so I can use a sentence tokenizer on them.

I am trying to use the regular expression pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) and replace each match with \g<1>\g<2>.

Issues:

A sentence that starts on the same line after another sentence ends does not match.

How do I match sentences that have line breaks across more than one line? In other words, how do I allow multiple occurrences of (?:\r\n|\n)?

text = """
Random Data, Company
2015

This is a sentence that has line 
break in the middle of it due to extracting from a PDF.

How do I support
3 line sentence 
breaks please?

HEADER HERE

The first sentence will 
match. However, this line will not match
for some reason 
that I cannot figure out.

Portfolio: 
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 

Full Name 
San Francisco, CA  
94000

1500 testing a number as the first word in
a broken sentence.

Match sentences with capital letters on the next line like 
Wi-Fi.

This line has 
trailing spaces after exclamation mark!       
"""
import re
new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
print(new_text)

expected_result = """
Random Data, Company
2015

This is a sentence that has line break in the middle of it due to extracting from a PDF.

How do I support 3 line sentence breaks please?

HEADER HERE

The first sentence will match. However, this line will not match for some reason that I cannot figure out.

Portfolio: 
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 

Full Name 
San Francisco, CA  
94000

1500 testing a number as the first word in a broken sentence.

Match sentences with capital letters on the next line like Wi-Fi.

This line has trailing spaces after exclamation mark!       
"""

Demo at regex101.com

svsd · Accepted Answer

The regex does not match lines that have a space at the end, which was the case with the sentence that was split into 3 lines. As a result, the sentence was not combined into one.

Here's an alternate regex which joins all lines between two empty lines into one, ensuring that there's just one space between the joined lines:

# The new regex
(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
# The replacement string: \1 \2

Explanation: This searches for any non-whitespace character \S, followed by optional spaces or tabs, a newline, more optional spaces or tabs, and then another \S. It replaces the newline and the surrounding spaces between the two \S characters with a single space. Space and tab are given explicitly because \s matches newlines as well.
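As a rough sketch of how this regex might be applied in Python: the line-break alternation is assumed to be \r\n|\n (the escapes are not legible in the regex as posted), and the names JOIN_BREAK, join_broken_lines and sample are illustrative only, not part of the original answer.

```python
import re

# Regex from the answer. The \r\n|\n alternation is an assumption about
# how the line breaks were originally written.
JOIN_BREAK = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_broken_lines(text):
    """Join consecutive non-empty lines with a single space.

    Blank lines are left alone, because the character after the
    line break must be non-whitespace for the pattern to match.
    """
    return JOIN_BREAK.sub(r'\1 \2', text)


sample = (
    "How do I support\n"
    "3 line sentence \n"
    "breaks please?\n"
    "\n"
    "HEADER HERE\n"
)
print(join_broken_lines(sample))
```

Because every pair of adjacent non-empty lines is joined, short lines such as "Random Data, Company" and "2015" are merged as well, which differs from the expected result in the question. Text where those lines must stay separate would need an extra condition on top of this pattern.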

Tags: python, regex, nltk

scottwernervt
