How to extract text from a PDF file in Python?

How can I extract text from a PDF file in Python?

I tried the following:

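import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # Open the PDF in binary mode and read it page by page
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    for i in range(pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + " \n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# Write the extracted text, escaping non-ASCII characters as XML character references
f = open('a.txt', 'w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()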

But the result is as follows, rather than readable text:

728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&('%$&))$$+%#,-.+&&˝())˝)˝+,,-./012)(˝)*˝+,-3˙ˆ/0245)6#57+82,55)6#57+,+2,+ /!#!!&˘˘1"%˘20˛˛3ˆ07%4!˘"6 ˛ˆ ˝ˆ ˆ˘&/&4"9ˆ %6ˇ%4%4&5˘2)˘˘˛%:6(

asked Mar 23 '13 04:03

lost

1 Answer

If you are running Linux or macOS, you can call the ps2ascii command (it ships with Ghostscript) from your code:

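import subprocess

input_path = "someFile.pdf"
output_path = "out.txt"
# Passing the arguments as a list avoids shell quoting problems with spaces in file names
subprocess.call(["ps2ascii", input_path, output_path])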
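
A slightly more defensive variant of the same idea is sketched below. It assumes Python 3.5 or newer and that ps2ascii is installed with Ghostscript. The helper names build_ps2ascii_command and pdf_to_text are made up for this illustration. The sketch checks that the tool is on the PATH before calling it and reports failure through the return value instead of silently ignoring it.

```python
import shutil
import subprocess


def build_ps2ascii_command(pdf_path, txt_path):
    """Return the argument list for: ps2ascii <pdf_path> <txt_path>.

    Hypothetical helper; keeping the arguments in a list means file names
    with spaces need no shell quoting.
    """
    return ["ps2ascii", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run ps2ascii; return True on success, False if it is missing or fails."""
    # shutil.which returns None when the executable is not on the PATH
    if shutil.which("ps2ascii") is None:
        return False
    result = subprocess.run(build_ps2ascii_command(pdf_path, txt_path))
    # A zero exit status is treated as success
    return result.returncode == 0
```

Calling pdf_to_text("someFile.pdf", "out.txt") would then write the extracted text to out.txt when the tool is available. Whether the text comes out readable still depends on how the PDF encodes its fonts.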

answered Sep 21 '22 10:09

Moj
