 

Clean text coming from PDFs

This is more of an algorithmic question than a language-specific one, so I am happy to receive an answer in any language - even pseudocode, even just an idea.

Here is my problem: I need to work on a large dataset of papers that come from articles in PDF and were brutally copied/pasted into .txt. I only have the result of this abomination, which is around 16k papers, or 3.5 GB of text (the corpus I am using is the ACL Anthology Network, http://clair.si.umich.edu/clair/aan/DatasetContents.html ).

The "junk" comes from things like formulae, images, tables, and so on. It just pops in the middle of the running text, so I can't use regular expressions to clean it, and I can't think of any way to use machine learning for it either. I already spent a week on it, and then I decided to move on with a quick&dirty fix. I don't care about cleaning it completely anymore, I don't care about false negatives and positives as long as the majority of this areas of text is removed.

Some examples of the text: note that formulae contain junk characters, but tables and captions don't (they still make my sentences very long, and thus unparsable). Junk in bold.

Easy one:

The experiments were repeated while inhibiting specialization of first the scheme with the most expansions, and then the two most expanded schemata. Measures of coverage and speedup are important 1 As long as we are interested in preserving the f-structure assigned to sentences, this notion of coverage is stricter than necessary. The same f-structure can in fact be assigned by more than one parse, so that in some cases a sentence is considered out of coverage even if the specialized grammar assigns to it the correct f-structure. 2'VPv' and 'VPverb[main]' cover VPs headed by a main verb. 'NPadj' covers NPs with adjectives attached. 205 The original rule: l/Pperfp --+ ADVP* SE (t ADJUNCT) ($ ADV_TYPE) = t,padv ~/r { @M_Head_Perfp I@M_Head_Passp } @( Anaph_Ctrl $) { AD VP+ SE ('~ ADJUNCT) ($ ADV_TYPE) = vpadv is replaced by the following: ADVP,[.E (~ ADJUNCT) (.l. ADV_TYPE) = vpadv l/'Pperfp --+ @PPadjunct @PPcase_obl {@M.Head_Pevfp [@M..Head_Passp} @( Anaph_Ctrl ~ ) V { @M_Head_Perfp I@M_Head_Passp } @( Anaph_Ctrl ~) Figure 1: The pruning of a rule from the actual French grammar. The "*" and the "+" signs have the usual interpretation as in regular expressions. A sub-expression enclosed in parenthesis is optional. Alternative sub-expressions are enclosed in curly brackets and separated by the "[" sign. An "@" followed by an identifier is a macro expansion operator, and is eventually replaced by further functional descriptions. Corpus --.. ,, 0.1[ Disambiguated Treebank treebank Human expert Grammar specialization Specialized grammar Figure 2: The setting for our experiments on grammar specialization. indicators of what can be achieved with this form of grammar pruning. However, they could potentially be misleading, since failure times for uncovered sentences might be considerably lower than their sentences times, had they not been out of coverage.

Hard one:

Table 4 summarizes the precision results for both English and Romanian coreference. The results indicate that the English coreference is more indicate than the Romanian coreference, but SNIZZLE improves coreference resolution in both languages. There were 64% cases when the English coreference was resolved by a heuristic with higher priority than the corresponding heuristic for the Romanian counterpart. This result explains why there is better precision enhancement for English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal Pronominal 73% 89% 66% 78% 76% 93% 71°/o 82% Table 4: Coreference precision Total 84% 72% 87% 76% English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal 69% 63% 66% 61% Pronominal Total 89% 78% 83% 72% 87% 77% 80% 70% Table 5: Coreference recall the English coreference. Table 5 also illustrates the recall results. The advantage of the data-driven coreference resolution over other methods is based on its better recall performance. This is explained by the fact that this method captures a larger variety of coreference patterns. Even though other coreference resolution systems perform better for some specific forms of systems, their recall results are surpassed by the systems approach. Multilingual coreference in turn improves more the precision than the recall of the monolingual data-driven coreference systems. In addition, Table 5 shows that the English coref- erence results in better recall than Romanian coref- erence. However, the recall shows a decrease for both languages for SNIZZLE because imprecise coreference links are deleted. As is usually the case, deleting data lowers the recall. All results were obtained by using the automatic scorer program developed for the MUC evaluations.

Note how the table does not contain strange characters and lands right in the middle of the sentence: "This result explains why there is better precision enhancement for -TABLE HERE- the English coreference." I can't know where the table will be relative to the running text. It may occur before a sentence, after it, or within it, like in this case. Also note that the table junk does not end with a full stop (most captions in papers don't...), so I can't rely on punctuation to spot it. I am happy with non-accurate boundaries, of course, but I still need to do something with these tables. Some of them contain words rather than numbers, and I don't have enough information in those cases: no junky characters, nothing. It is obvious only to humans :S

asked May 02 '12 by Tex



1 Answer

(I hate crappy copy&pastes.)

A few ideas that you might find helpful (I have used each and every one of them myself at one point or another):

  1. (Very brute force): use a tokenizer and a dictionary (a real dictionary, not the data structure), parse the words out, and remove any word that is not a dictionary word. It might prove problematic if your text contains a lot of company/product names, but that too can be solved with the right word lists (there are a few on the web; I'm using some proprietary ones, so I can't share them, sorry). There is a minimal sketch of this after the list.

  2. Given a set of clean documents (let's say 2K of them), build a tf/idf index of them and use it as a dictionary: remove every term in the other documents that doesn't appear in the index (or appears with a very low tf/idf score). This should give you rather clean documents. See the second sketch after the list.

  3. Use Amazon's Mechanical Turk: set up a task where the person reading the document marks the paragraphs that don't make sense. This should be rather easy on the Mechanical Turk platform (16.5K documents is not that much). It will probably cost you a couple of hundred dollars, but you'll likely get a rather nice cleanup of the text (so if it's on corporate money, that can be your way out - they need to pay for their mistakes :) ).

  4. Considering your documents come from the same domain (same topics, all in all) and the problems are much the same (same table headers, roughly the same formulas): break all the documents into sentences and try clustering the sentences with ML. If the table headers/formulas are relatively similar, they should cluster nicely away from the rest of the sentences, and then you can clean the documents sentence by sentence (take a document, break it into sentences, and for each sentence that falls into the "weird" cluster, remove it). The last sketch after the list shows one way to set this up.
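
A minimal sketch of idea 1 in Python, assuming a plain word list such as /usr/share/dict/words (swap in whatever dictionary you actually have); the tokenizer is a crude regex, so treat this as a starting point rather than a polished filter:

    import re

    def load_dictionary(path="/usr/share/dict/words"):
        # One word per line; lowercase everything for case-insensitive lookup.
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def clean_line(line, dictionary):
        # Keep only alphabetic tokens that the dictionary knows about;
        # everything else (formula debris, mangled identifiers) is dropped.
        tokens = re.findall(r"[A-Za-z]+", line)
        return " ".join(t for t in tokens if t.lower() in dictionary)

    if __name__ == "__main__":
        words = load_dictionary()
        print(clean_line("l/Pperfp --+ ADVP* SE (t ADJUNCT) and the grammar", words))

Note that this throws away punctuation and numbers along with the junk; if you need to keep sentence structure, replace rejected tokens with a placeholder instead of dropping them.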
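
For idea 2, here is a sketch that stands in for the tf/idf index with a simple document-frequency threshold over the clean subset; min_doc_freq is an arbitrary starting value, and you could just as easily compute full tf/idf scores and cut on those:

    import re
    from collections import Counter

    TOKEN = re.compile(r"[A-Za-z]+")

    def build_doc_freq(clean_docs):
        # Count, for each term, how many of the clean documents contain it.
        df = Counter()
        for doc in clean_docs:
            df.update({t.lower() for t in TOKEN.findall(doc)})
        return df

    def clean_doc(noisy_doc, df, min_doc_freq=2):
        # Drop tokens that the clean corpus has (almost) never seen.
        kept = [t for t in TOKEN.findall(noisy_doc)
                if df.get(t.lower(), 0) >= min_doc_freq]
        return " ".join(kept)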
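
And a sketch of idea 4, assuming scikit-learn is acceptable (any vectorizer/clusterer would do). The sentence splitter is deliberately crude, the number of clusters is a guess, and deciding which clusters are "weird" is still a manual step: inspect a few sentences per cluster and list the junk ones yourself.

    import re
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def split_sentences(text):
        # Crude splitter; a real sentence tokenizer would do better.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def cluster_sentences(sentences, n_clusters=20):
        # Vectorize sentences and assign each one a cluster label.
        vec = TfidfVectorizer(lowercase=True, token_pattern=r"[A-Za-z]+")
        X = vec.fit_transform(sentences)
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    def drop_junk(sentences, labels, junk_labels):
        # Keep only sentences whose cluster was not flagged as junk.
        return [s for s, lab in zip(sentences, labels) if lab not in junk_labels]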

answered Oct 01 '22 by Yossale