How can I read a PDF file from inline raw_bytes (not from file)?

I am trying to build a PDF puller for the Australian Stock Exchange website that will let me go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.

So far I am using requests and PyPDF2 to download the PDF file, write it to my drive and then read it back. However, I want to skip writing the file to disk and go straight from downloading the PDF to converting it to a string. What I have so far is:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

# Write the downloaded bytes to disk (the step I would like to skip)
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

# Read the file back and extract the text page by page
with open("my_pdf.pdf", 'rb') as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    if read_pdf.isEncrypted:
        # Try the empty password once, before reading any page
        read_pdf.decrypt("")
    num_pages = read_pdf.getNumPages()

    ann_text = []
    for page_num in range(num_pages):
        page_text = read_pdf.getPage(page_num).extractText()
        print(page_text)
        # Collect the words of every page, encrypted or not
        ann_text.append(page_text.split())
print(ann_text)

This prints a list of the strings in the PDF file from the URL provided.

Just wondering if I can convert the my_raw_data variable to a readable string directly?

Thanks so much in advance!

asked Nov 08 '17 10:11

James Ward

1 Answer

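You can use io.BytesIO, which wraps the downloaded bytes in an in-memory, file-like object that PdfFileReader accepts in place of an open file: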

import requests, PyPDF2, io

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)

with io.BytesIO(response.content) as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()
    print(num_pages)
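
The snippet above only prints the page count. As a rough sketch of how the same in-memory stream could be used to pull out the text itself, the hypothetical helper below (`extract_page_texts` is an illustrative name, not part of PyPDF2) uses the `PdfReader` spellings available in PyPDF2 2.0 and later; the older `PdfFileReader`, `getNumPages()` and `extractText()` names were removed in PyPDF2 3.0. It assumes, as the question does, that an encrypted announcement opens with an empty password.

```python
import io


def extract_page_texts(raw_bytes):
    """Return one text string per page of a PDF held in memory as bytes."""
    # Imported here so the helper can be defined without PyPDF2 installed.
    from PyPDF2 import PdfReader

    # BytesIO wraps the bytes in a seekable, file-like object; nothing touches the disk.
    with io.BytesIO(raw_bytes) as stream:
        reader = PdfReader(stream)
        if reader.is_encrypted:
            # Assumption: the file opens with an empty user password.
            reader.decrypt("")
        # Extract inside the with-block, while the stream is still open.
        return [page.extract_text() or "" for page in reader.pages]
```

With the `response` from the code above, `extract_page_texts(response.content)` should give a list with one string per page. How clean that text is depends on how the PDF was produced.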

PS. To open files, always use a context manager (a with-statement), so the file is closed even if an error occurs.
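
For the keyword search the question is ultimately after, something along these lines might be a starting point once the text of each page is available as a string. `find_keywords` and the sample page texts are made up for illustration; the match is a plain case-insensitive substring test, so it will also hit keywords inside longer words.

```python
def find_keywords(page_texts, keywords):
    """Map each keyword to the 1-based page numbers on which it appears."""
    hits = {keyword: [] for keyword in keywords}
    for page_number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        for keyword in keywords:
            # Case-insensitive substring match
            if keyword.lower() in lowered:
                hits[keyword].append(page_number)
    return hits


# Stand-in page texts; in practice these would be the strings extracted from the PDF.
pages = [
    "Quarterly Activities Report",
    "Dividend declared for shareholders",
    "No further dividend news",
]
print(find_keywords(pages, ["dividend", "takeover"]))
# prints {'dividend': [2, 3], 'takeover': []}
```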

answered Oct 24 '22 01:10

Maarten Fabré

Tags: python-3.x, pdf, python-requests
