I am currently extracting text from PDFs using Apache Tika, and I am using NLTK for named entity recognition and other tasks. The problem is that sentences in the PDF document are extracted with line breaks in the middle of them. For example,
I am a sentence that has a python line \nbreak in the middle of it.
The pattern is usually a space followed by a line break, <space>\n or sometimes <space>\n<space>. I want to repair these sentences so I can use a sentence tokenizer on them.
I am trying to use the regular expression pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to remove the line break, that is, to replace \n with an empty string.
Issues:
How do I match sentences that have line breaks across more than one line? In other words, how do I allow multiple occurrences of (?:\r\n|\n)?
text = """
Random Data, Company
2015
This is a sentence that has line
break in the middle of it due to extracting from a PDF.
How do I support
3 line sentence
breaks please?
HEADER HERE
The first sentence will
match. However, this line will not match
for some reason
that I cannot figure out.
Portfolio:
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com
Full Name
San Francisco, CA
94000
1500 testing a number as the first word in
a broken sentence.
Match sentences with capital letters on the next line like
Wi-Fi.
This line has
trailing spaces after exclamation mark!
"""
import re
new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
print(new_text)
expected_result = """
Random Data, Company
2015
This is a sentence that has line break in the middle of it due to extracting from a PDF.
How do I support 3 line sentence breaks please?
HEADER HERE
The first sentence will match. However, this line will not match for some reason that I cannot figure out.
Portfolio:
http://DoNotMatchMeBecauseIHaveAPeriodInMe.com
Full Name
San Francisco, CA
94000
1500 testing a number as the first word in a broken sentence.
Match sentences with capital letters on the next line like Wi-Fi.
This line has trailing spaces after exclamation mark!
"""
Demo at regex101.com
The regex does not match lines that have a space at the end, which was the case with the sentence that was split into 3 lines. As a result, the sentence was not combined into one.
Here's an alternate regex which joins all lines between two empty lines into one, ensuring that there's just one space between the joined lines:
# The new regex
(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
# The replacement string: \1 \2
Explanation: This matches a non-whitespace character \S, then optional spaces or tabs, a line break, more optional spaces or tabs, and finally another \S. Everything between the two \S characters is replaced with a single space. Space and tab are listed explicitly because \s also matches newlines, which would cause lines to be joined across empty lines.
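For reference, here is a minimal sketch of how that regex could be applied with re.sub in Python. The function name join_broken_lines and the sample text are made up for illustration, and the sample assumes that paragraphs in the extracted text are separated by empty lines, as the regex expects.

```python
import re

# Pattern from the answer: a non-space character, optional spaces/tabs,
# a single line break, optional spaces/tabs, then another non-space character.
LINE_BREAK = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_broken_lines(text):
    """Join lines that are not separated by an empty line (hypothetical helper)."""
    # Keep the two surrounding characters and put exactly one space between them.
    return LINE_BREAK.sub(r'\1 \2', text)


# Illustrative sample: trailing spaces, a leading space, and empty lines
# between the paragraphs.
sample = (
    "How do I support \n"
    "3 line sentence \n"
    "breaks please?\n"
    "\n"
    "HEADER HERE\n"
    "\n"
    "This line has\n"
    " trailing text!\n"
)

print(join_broken_lines(sample))
```

Lines separated by an empty line stay apart because the character after the first line break is another newline, which \S does not match. One caveat: each match consumes the \S on both sides, so a line consisting of a single character can only take part in one join. A lookaround variant such as (?<=\S)[ \t]*\r?\n[ \t]*(?=\S), with a plain space as the replacement, should avoid that.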