Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for re-wrapping hard-wrapped text?

Let's say that I have written a custom e-mail management application for the company that I work for. It reads e-mails from the company's support account and stores cleaned-up, plain text versions of them in a database, doing other neat things like associating it with customer accounts and orders in the process. When an employee replies to a message, my program generates an e-mail that is sent to the customer with a formatted version of the discussion thread. If the customer responds, the app looks for a unique number in the subject line to read the incoming message, strip out the previous discussion, and add it as a new item in the thread. For example:

This is a message from Contoso customer service.

Recently, you requested customer support. Below is a summary of your 
request and our reply.

--------------------------------------------------------------------
Contoso (Fred) on Tuesday, December 30, 2008 at 9:04 a.m.
--------------------------------------------------------------------
John:

I've modified your address. You can confirm my work by logging into
"Your Account" on our Web site. Your order should ship out today.

Thanks for shopping at Contoso.

--------------------------------------------------------------------
You on Tuesday, December 30, 2008 at 8:03 a.m.
--------------------------------------------------------------------
Oops, I entered my address incorrectly. Can you change it to

Fred Smith
123 Main St
Anytown, VA 12345

Thanks!

--
Fred Smith
Contoso Product Lover

Generally, this all works great, but there's one area that I've kind of putting off cleaning up for a while now, and it deals with text wrapping. In order to generate the pretty e-mail format like the one above, I need to re-wrap the text that the customer originally sent.

I've written an algorithm that does this (though looking at the code, I'm not entirely sure how it works anymore--it could use some refactoring). But it can't distinguish between a hard-wrap newline, an "end of paragraph" newline, and a "semantic" newline. For example, a hard-wrap newline is one that the e-mail client inserted within a paragraph to wrap a long line of text, say, at 79 columns. An end of paragraph newline is one that the user added after the last sentence in a paragraph. And a semantic newline would be something like the br tag, such as the address that the Fred typed above.

My algorithm instead only sees two newlines in a row as indicating a new paragraph, so it would make the customer's e-mail be formatted something like the following:

Oops, I entered my address incorrectly. Can you change it to

Fred Smith 123 Main St Anytown, VA 12345

Thanks!

-- Fred Smith Contoso Product Lover

Whenever I try to write a version that would re-wrap this text as intended, I basically hit a wall in that I need to know the semantics of the text, the difference between a "hard-wrap" newline and a "I really meant it like a br"-type newline, such as in the customer's address. (I use two newlines in a row to determine when to start a new paragraph, which coincides with how the majority of people seem to actually type e-mails.)

Anyone have an algorithm that can re-wrap the text as intended? Or is this implementation "good enough" when weighing the complexity of any given solution?

Thanks.

like image 550
Nicholas Piasecki Avatar asked Dec 30 '08 14:12

Nicholas Piasecki


3 Answers

You could try to check if a newline has been inserted to keep the line length below a maximum (aka hard wrap): Just check for the longest line in the text. Then, for any given line, you append the first word of the following line to it. If the resulting line exceeds the maximum length, the line break probably was a hard wrap.

Even simpler you might just consider all breaks in (maxlength - 15) <= length <= maxlength as being hardwraps (with 15 just being an educated guess). This would certainly filter out intentional breaks as in addresses and stuff, and any missed break in this range wouldn't influence the result too badly.

like image 169
Ole Avatar answered Sep 20 '22 23:09

Ole


I have two suggestions, as follows.

  • Pay attention to punctuation: this will help you to distinguish between a "hard-wrap" newline and an "end of paragraph" newline (because, if the line ends with a full stop, then it's more likely that the user intended it to be an end-of-paragraph.

  • Pay attention to whether a line is much shorter than the maximum line length: in the example above, you might have text that's being "hard-wrapped" at 79 characters, plus you have address lines which are only 30 characters long; because 30 is much less than 79, you know that the address lines were broken by the user and not by the user's text-wrap algorithm.

Also, pay attention to indents: lines which are indented with whitespace from the left may be supposed to be new paragraphs, broken from the previous lines, as they are on this forum.

like image 29
ChrisW Avatar answered Sep 19 '22 23:09

ChrisW


Following Ole's advice above, I re-worked my implementation to look at a threshold. It seems to handle most scenarios I throw at it well enough without me having to go nuts and write code that actually understand the English language.

Basically, I first scan through the input string and record the longest line length in the variable inputMaxLineLength. Then as I'm rewrapping, if I encounter a newline that has an index between inputMaxLineLength and 85% of inputMaxLineLength, then I replace that newline with a space because I think it's a hard wrap newline--unless it's immediately followed by another newline, because then I assume that it's just a one-line paragraph that just happens to within that range. This can happen if someone types out a short bulleted list, for example.

Certainly not perfect, but "good enough" for my scenario, considering the text is usually half-mangled by a previous e-mail client to begin with.

Here's some code, my a-few-hours-old implementation that probably still underwraps in a few edge cases (using C#). It's a lot less complicated than my previous solution, which is nice.

Source Code

And here's some unit tests that exercise that code (using MSTest):

Test Code

If anyone has a better implementation (and no doubt a better implementation exists), I'll be happy to read your thoughts! Thanks.

like image 24
Nicholas Piasecki Avatar answered Sep 21 '22 23:09

Nicholas Piasecki