I have been using the XML package successfully for extracting HTML tables, but I want to extend to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there had been any recent developments.
Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?
Docparser is PDF-scraper software that automatically pulls data from recurring PDF documents at scale. Like web scraping (collecting data by crawling the internet), scraping PDF documents is a powerful way to automatically convert semi-structured text documents into structured data.
The most commonly used web-scraping package for R is rvest. Install rvest in RStudio using the code below. Some knowledge of HTML and CSS is an added advantage, although in practice many data scientists are not deeply familiar with either.
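A minimal sketch (the URL is a placeholder for illustration):

    # Install rvest from CRAN (one-time) and load it
    install.packages("rvest")
    library(rvest)

    # Example: read an HTML page and extract any tables it contains
    page   <- read_html("https://example.com/page-with-tables.html")
    tables <- html_table(page)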
We need to install and load the pdftools package to do the extraction. To read a PDF as text, use pdf_text(), which returns one string per page, so we can then pull out a particular page. In this example the PDF file contains a table.
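A short sketch, assuming a local file called report.pdf (the file name and page number are placeholders):

    library(pdftools)

    # pdf_text() returns a character vector with one element per page
    txt <- pdf_text("report.pdf")

    # Pull out the page holding the table, e.g. page 3
    page3 <- txt[3]

    # Split the page into lines; a fixed-width table can then be parsed
    # with read.fwf() or by splitting each line on runs of whitespace
    rows <- strsplit(page3, "\n")[[1]]
    head(rows)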
Extracting text from PDFs is hard, and nearly always requires lots of care.
I'd start with command-line tools such as pdftotext and see what they spit out. The problem is that PDFs can store text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined-up 'ff' and 'ij' that you see in proper typesetting) to throw you off.
pdftotext is installable on any Linux system...
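If pdftotext is on your PATH, you can drive it from R and read the result back in. A sketch, assuming a hypothetical report.pdf; the -layout flag asks pdftotext to preserve the original column alignment, which helps with tables:

    # Convert the PDF to plain text, keeping the physical layout
    system2("pdftotext", args = c("-layout", "report.pdf", "report.txt"))

    # Read the converted text back into R for further parsing
    rows <- readLines("report.txt")
    head(rows)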