Discovering "templates" in a given text?

I have a large amount of text and want to discover the templates that occur most frequently. I was thinking of using an N-gram approach, which was also suggested as a solution in this question, but my requirement is slightly different. To clarify, I have some text like this:

I wake up every day morning and read the newspaper and then go to work
I wake up every day morning and eat my breakfast and then go to work
I am not sure that this is the solution but I will try
I am not sure that this is the answer but I will try
I am not feeling well today but I will get the work done and deliver it tomorrow
I was not feeling well yesterday but I will get the work done and let you know by tomorrow

and am trying to extract "templates" like this:

I wake up every day morning and ... and then go to work
I am not sure that this is the ... but I will try
I ... not feeling well ... but I will get the work done and ... tomorrow

I am looking for an approach that can scale to millions of lines of text, so I was wondering whether I can adapt the same N-gram approach to solve this problem, or whether there are any alternatives.

Legend asked Jun 29 '11 21:06

1 Answer

Millions of lines of text isn't a really big number :)

What you're looking for is at least similar to collocation finding. You could try to compute pointwise mutual information on n-grams. See Manning & Schütze (1999) for this and other approaches to the problem.
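
As a rough illustration of the PMI idea, here is a minimal sketch in plain Python that scores adjacent word pairs (bigrams) by pointwise mutual information, PMI(a, b) = log2(P(a, b) / (P(a) P(b))). The function name `bigram_pmi`, the whitespace tokenization, and the `min_count` cutoff are all assumptions made for the example, not something taken from Manning & Schütze or from the question. High-scoring pairs are candidates for the fixed parts of a template; turning them into full templates with gaps would need further work.

```python
import math
from collections import Counter

def bigram_pmi(lines, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, and min_count is an assumed cutoff that drops rare pairs,
    because PMI tends to overrate pairs seen only once.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.lower().split()
        unigrams.update(tokens)
        # Pair each token with its successor within the same line.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

if __name__ == "__main__":
    sample = [
        "I wake up every day morning and read the newspaper and then go to work",
        "I wake up every day morning and eat my breakfast and then go to work",
        "I am not sure that this is the solution but I will try",
        "I am not sure that this is the answer but I will try",
    ]
    ranked = sorted(bigram_pmi(sample).items(), key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked[:10]:
        print(pair, round(score, 2))
```

Because the counters are built in a single pass over the lines, the same approach should extend to millions of lines, although the bigram table may need pruning or an on-disk store for very large vocabularies.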

Fred Foo answered Sep 25 '22 15:09