
Repair sentences that have line breaks in the middle of them: Python is \n is fun

Tags: python, regex, nltk

I am currently extracting text from PDFs using Apache Tika, and I am using NLTK to do named entity recognition and other tasks. I am encountering an issue where sentences in the PDF document are extracted with line breaks in the middle of them. For example,

I am a sentence that has a python line \nbreak in the middle of it.

The pattern is usually a space followed by a line break, <space>\n or sometimes <space>\n<space>. I want to repair these sentences so I can use a sentence tokenizer on them.

I am trying to use the regular expression pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to remove the \n.

Issues:

  1. A sentence that starts on the same line after another sentence ends does not match.
  2. How do I match sentences that have line breaks across more than one line? In other words, how do I allow multiple occurrences of (?:\r\n|\n)?

    text = """
    Random Data, Company
    2015
    
    This is a sentence that has line 
    break in the middle of it due to extracting from a PDF.
    
    How do I support
    3 line sentence 
    breaks please?
    
    HEADER HERE
    
    The first sentence will 
    match. However, this line will not match
    for some reason 
    that I cannot figure out.
    
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    
    Full Name 
    San Francisco, CA  
    94000
    
    1500 testing a number as the first word in
    a broken sentence.
    
    Match sentences with capital letters on the next line like 
    Wi-Fi.
    
    This line has 
    trailing spaces after exclamation mark!       
    """
    import re
    new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
    print(new_text)
    
    expected_result = """
    Random Data, Company
    2015
    
    This is a sentence that has line break in the middle of it due to extracting from a PDF.
    
    How do I support 3 line sentence breaks please?
    
    HEADER HERE
    
    The first sentence will match. However, this line will not match for some reason that I cannot figure out.
    
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    
    Full Name 
    San Francisco, CA  
    94000
    
    1500 testing a number as the first word in a broken sentence.
    
    Match sentences with capital letters on the next line like Wi-Fi.
    
    This line has trailing spaces after exclamation mark!       
    """
    

Demo at regex101.com

scottwernervt asked Dec 03 '25 02:12



1 Answer

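The regex joins a line to the following one only when the following line contains sentence-ending punctuation (., ! or ?) followed by whitespace. In the sentence that was split into 3 lines, the middle line has no such punctuation, so the first line is never joined to it. As a result, the sentence was not combined into one.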

Here's an alternate regex which joins all lines between two empty lines into one, ensuring that there's just one space between the joined lines:

    # The new regex
    (\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
    # The replacement string: \1 \2

Explanation: This searches for any non-whitespace character \S, followed by optional spaces or tabs, a line break, more optional spaces or tabs, and then another \S. It replaces the line break and the surrounding spaces between the two \S characters with a single space. Space and tab are listed explicitly because \s matches line breaks as well. Here's the demo link.
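For reference, here is a minimal sketch of how this regex might be applied with Python's re.sub. The helper name join_broken_lines and the shortened sample text are assumptions made for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: a non-whitespace character, optional spaces/tabs,
# one line break, optional spaces/tabs, then another non-whitespace character.
LINE_BREAK = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_broken_lines(text):
    """Join consecutive non-empty lines with one space (hypothetical helper)."""
    return LINE_BREAK.sub(r'\1 \2', text)


# Shortened sample based on the question's text (assumed for illustration).
sample = (
    "How do I support\n"
    "3 line sentence \n"
    "breaks please?\n"
    "\n"
    "HEADER HERE\n"
)
print(join_broken_lines(sample))
```

Note that this joins every pair of adjacent non-empty lines, so lines such as "Random Data, Company" and "2015" in the question's text would also be merged, which differs from the expected result shown there. Blank lines are preserved because the second \S cannot match a line break. A line made up of a single character can be left unjoined with the line after it, since the second group consumes that character; replacing the second group with a lookahead such as (?=\S) and the replacement with \1 followed by a space is one possible way around that.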

svsd answered Dec 05 '25 15:12



