 

Clean text coming from PDFs

This is more of an algorithmic question than a language-specific one, so I am happy to receive an answer in any language - even pseudocode, even just an idea.

Here is my problem: I need to work on a large dataset of papers that come from articles in PDF and were brutally copied/pasted into .txt. I only have the result of this abomination, which is around 16k papers, or 3.5 GB of text (the corpus I am using is the ACL Anthology Network, http://clair.si.umich.edu/clair/aan/DatasetContents.html ).

The "junk" comes from things like formulae, images, tables, and so on. It just pops in the middle of the running text, so I can't use regular expressions to clean it, and I can't think of any way to use machine learning for it either. I already spent a week on it, and then I decided to move on with a quick&dirty fix. I don't care about cleaning it completely anymore, I don't care about false negatives and positives as long as the majority of this areas of text is removed.

Some examples of the text: note that formulae contain junk characters, but tables and captions don't (they still make my sentences very long, and thus unparsable). Junk in bold.

Easy one:

The experiments were repeated while inhibiting specialization of first the scheme with the most expansions, and then the two most expanded schemata. Measures of coverage and speedup are important 1 As long as we are interested in preserving the f-structure assigned to sentences, this notion of coverage is stricter than necessary. The same f-structure can in fact be assigned by more than one parse, so that in some cases a sentence is considered out of coverage even if the specialized grammar assigns to it the correct f-structure. 2'VPv' and 'VPverb[main]' cover VPs headed by a main verb. 'NPadj' covers NPs with adjectives attached. 205 The original rule: l/Pperfp --+ ADVP* SE (t ADJUNCT) ($ ADV_TYPE) = t,padv ~/r { @M_Head_Perfp I@M_Head_Passp } @( Anaph_Ctrl $) { AD VP+ SE ('~ ADJUNCT) ($ ADV_TYPE) = vpadv is replaced by the following: ADVP,[.E (~ ADJUNCT) (.l. ADV_TYPE) = vpadv l/'Pperfp --+ @PPadjunct @PPcase_obl {@M.Head_Pevfp [@M..Head_Passp} @( Anaph_Ctrl ~ ) V { @M_Head_Perfp I@M_Head_Passp } @( Anaph_Ctrl ~) Figure 1: The pruning of a rule from the actual French grammar. The "*" and the "+" signs have the usual interpretation as in regular expressions. A sub-expression enclosed in parenthesis is optional. Alternative sub-expressions are enclosed in curly brackets and separated by the "[" sign. An "@" followed by an identifier is a macro expansion operator, and is eventually replaced by further functional descriptions. Corpus --.. ,, 0.1[ Disambiguated Treebank treebank Human expert Grammar specialization Specialized grammar Figure 2: The setting for our experiments on grammar specialization. indicators of what can be achieved with this form of grammar pruning. However, they could potentially be misleading, since failure times for uncovered sentences might be considerably lower than their sentences times, had they not been out of coverage.

Hard one:

Table 4 summarizes the precision results for both English and Romanian coreference. The results indicate that the English coreference is more indicate than the Romanian coreference, but SNIZZLE improves coreference resolution in both languages. There were 64% cases when the English coreference was resolved by a heuristic with higher priority than the corresponding heuristic for the Romanian counterpart. This result explains why there is better precision enhancement for English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal Pronominal 73% 89% 66% 78% 76% 93% 71°/o 82% Table 4: Coreference precision Total 84% 72% 87% 76% English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal 69% 63% 66% 61% Pronominal Total 89% 78% 83% 72% 87% 77% 80% 70% Table 5: Coreference recall the English coreference. Table 5 also illustrates the recall results. The advantage of the data-driven coreference resolution over other methods is based on its better recall performance. This is explained by the fact that this method captures a larger variety of coreference patterns. Even though other coreference resolution systems perform better for some specific forms of systems, their recall results are surpassed by the systems approach. Multilingual coreference in turn improves more the precision than the recall of the monolingual data-driven coreference systems. In addition, Table 5 shows that the English coref- erence results in better recall than Romanian coref- erence. However, the recall shows a decrease for both languages for SNIZZLE because imprecise coreference links are deleted. As is usually the case, deleting data lowers the recall. All results were obtained by using the automatic scorer program developed for the MUC evaluations.

Note how the table does not contain strange characters and lands right in the middle of the sentence: "This result explains why there is better precision enhancement for -TABLE HERE- the English coreference." I can't know where the table will be relative to the running text. It may occur before a sentence, after it, or within it, like in this case. Also note that the table junk does not end with a full stop (most captions in papers don't...), so I can't rely on punctuation to spot it. I am happy with non-accurate boundaries, of course, but I still need to do something with these tables. Some of them contain words rather than numbers, and I don't have enough information in those cases: no junky characters, nothing. It is obvious only to humans :S

asked May 02 '12 by Tex



1 Answer

(I hate crappy copy&pastes.)

A few ideas that you might find helpful (I have used each and every one of them myself at one point or another):

  1. (Very brute force): use a tokenizer and a dictionary (a real dictionary, not the data structure), parse the words out, and remove any word that is not a dictionary word. It might prove problematic if your text contains a lot of company/product names, but that too can be solved with the right word lists (there are a few on the web; I'm using some proprietary ones, so I can't share them, sorry). There is a minimal sketch of this after the list.

  2. Given a set of clean documents (let's say 2K of them), build a tf/idf index of them and use it as a dictionary: remove every term in the other documents that doesn't appear in the index (or appears with a very low tf/idf score). This should give you rather clean documents. See the second sketch after the list.

  3. Use Amazon's Mechanical Turk: set up a task where the person reading the document marks the paragraphs that don't make sense. This should be rather easy on the Mechanical Turk platform (16.5K documents is not that much). It will probably cost you a couple of hundred dollars, but you'll likely get a rather nice cleanup of the text (so if it's on corporate money, that can be your way out - they need to pay for their mistakes :) ).

  4. Considering your documents come from the same domain (same topics, all in all) and the problems are much the same (same table headers, roughly the same formulas): break all the documents into sentences and try clustering the sentences with ML. If the table headers/formulas are relatively similar, they should cluster nicely away from the rest of the sentences, and then you can clean the documents sentence by sentence (take a document, break it into sentences, and for each sentence that falls into the "weird" cluster, remove it). The last sketch after the list shows one way to set this up.
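
A minimal sketch of idea 1 in Python, assuming a plain word list such as /usr/share/dict/words (swap in whatever dictionary you actually have); the tokenizer is a crude regex, so treat this as a starting point rather than a polished filter:

    import re

    def load_dictionary(path="/usr/share/dict/words"):
        # One word per line; lowercase everything for case-insensitive lookup.
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def clean_line(line, dictionary):
        # Keep only alphabetic tokens that the dictionary knows about;
        # everything else (formula debris, mangled identifiers) is dropped.
        tokens = re.findall(r"[A-Za-z]+", line)
        return " ".join(t for t in tokens if t.lower() in dictionary)

    if __name__ == "__main__":
        words = load_dictionary()
        print(clean_line("l/Pperfp --+ ADVP* SE (t ADJUNCT) and the grammar", words))

Note that this throws away punctuation and numbers along with the junk; if you need to keep sentence structure, replace rejected tokens with a placeholder instead of dropping them.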
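
For idea 2, here is a sketch that stands in for the tf/idf index with a simple document-frequency threshold over the clean subset; min_doc_freq is an arbitrary starting value, and you could just as easily compute full tf/idf scores and cut on those:

    import re
    from collections import Counter

    TOKEN = re.compile(r"[A-Za-z]+")

    def build_doc_freq(clean_docs):
        # Count, for each term, how many of the clean documents contain it.
        df = Counter()
        for doc in clean_docs:
            df.update({t.lower() for t in TOKEN.findall(doc)})
        return df

    def clean_doc(noisy_doc, df, min_doc_freq=2):
        # Drop tokens that the clean corpus has (almost) never seen.
        kept = [t for t in TOKEN.findall(noisy_doc)
                if df.get(t.lower(), 0) >= min_doc_freq]
        return " ".join(kept)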
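
And a sketch of idea 4, assuming scikit-learn is acceptable (any vectorizer/clusterer would do). The sentence splitter is deliberately crude, the number of clusters is a guess, and deciding which clusters are "weird" is still a manual step: inspect a few sentences per cluster and list the junk ones yourself.

    import re
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def split_sentences(text):
        # Crude splitter; a real sentence tokenizer would do better.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def cluster_sentences(sentences, n_clusters=20):
        # Vectorize sentences and assign each one a cluster label.
        vec = TfidfVectorizer(lowercase=True, token_pattern=r"[A-Za-z]+")
        X = vec.fit_transform(sentences)
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    def drop_junk(sentences, labels, junk_labels):
        # Keep only sentences whose cluster was not flagged as junk.
        return [s for s, lab in zip(sentences, labels) if lab not in junk_labels]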

answered Oct 01 '22 by Yossale