
Python: how to read a LaTeX-generated PDF with equations

Consider the following article:

https://arxiv.org/pdf/2101.05907.pdf

It's a typically formatted academic paper, with only two pictures in the PDF file.

I used the following code to extract the text and equations from the paper:

# Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests

# URL of the article linked above
url = "https://arxiv.org/pdf/2101.05907.pdf"
r = requests.get(url)
f = io.BytesIO(r.content)  # wrap the downloaded bytes in a file-like object

# Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)

# Related code explanation: https://automatetheboringstuff.com/chapter13/
# Print the extracted text of the first page
print(fileReader.getPage(0).extractText())

However, the result was not quite correct:

Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
1
InstitutoNacionaldeAstrof´
´
OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,
Puebla,72840,Mexico.
2
FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3
DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile.
CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantum
mechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthe
caseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillators
totimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingin
dielectricmedia[4],Casimire

ect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinorder
totraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,
10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedunder
variousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsof
invariantmethodshavebeenstudiedindi

erentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise
[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmore
generalsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr
¨
odingerequationintheBohm-
Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr
¨
odingerequationfor
di

erentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,
weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictime
dependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperator
techniques.Asanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthatthe
Bohmpotentialhasdi

erentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,
namelythefunctionsthatsolvestheErmakovequation,presentstwodi

erentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V
(
x
;
t
)
iswrittenas(forsimplicity,weset
}
=
1)
i
@ 
(
x
;
t
)
@
t
=

1
2
m
@
2
 
(
x
;
t
)
@
x
2
+
V
(
x
;
t
)
 
(
x
;
t
)
(1)
arXiv:2101.05907v1  [quant-ph]  14 Jan 2021

As shown:

  1. The spaces between words disappeared (for example in the title), leaving meaningless run-together strings.
  2. The LaTeX equations were extracted incorrectly, and it got worse on the second page.

How can I fix this and correctly extract the text and equations from a PDF file that was generated from LaTeX?

ShoutOutAndCalculate asked Oct 19 '25 10:10


1 Answer

In the meantime, PyPDF2 has been deprecated. Use pypdf instead (I'm the maintainer of both; see the migration guide).

We don't have anything specific for equations, but general text extraction works like this:

import io
import requests
from pypdf import PdfReader

# Download content
url = "https://arxiv.org/pdf/2101.05907.pdf"
r = requests.get(url)
f = io.BytesIO(r.content)

# Extract text
reader = PdfReader(f)
print(reader.pages[0].extract_text())

For the last paragraph of the page (shown as a screenshot in the original answer, not reproduced here), pypdf==3.16.4 gives:

The main equation in quantum mechanics is the Schrodinger equation, that in one dimension and for a potential V(x,t)
is written as (for simplicity, we set ℏ=1)
i∂ψ(x,t)
∂t=−1
2m∂2ψ(x,t)
∂x2+V(x,t)ψ(x,t) (1)

You can see that the plain text is fine, but the equation is not represented well: the individual math characters come through, while the structure (fractions, superscripts, line layout) is lost.
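For comparison, the four extracted lines above appear to correspond to the following equation. This is a hand-written LaTeX reconstruction from the extracted characters, not something pypdf produces:

```latex
\begin{equation}
  i\,\frac{\partial \psi(x,t)}{\partial t}
    = -\frac{1}{2m}\,\frac{\partial^{2} \psi(x,t)}{\partial x^{2}}
    + V(x,t)\,\psi(x,t)
\end{equation}
```

The extraction splits each fraction into a numerator line and a denominator line, and the exponent 2 is flattened into the running text (`∂2ψ`, `∂x2`), which is why the structure cannot be recovered from the text alone.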

Math text extraction will certainly stay suboptimal for a long time, but I've opened a ticket to improve the extraction of individual symbols (the partial, phi, maybe also the hbar): https://github.com/py-pdf/pypdf/issues/2009

See also: Why text extraction is hard. In summary: pypdf will hopefully get better at extracting Greek letters.

Full extraction with pypdf==3.16.4:

Bohm potential for the time dependent harmonic oscillator
Francisco Soto-Eguibar1, Felipe A. Asenjo2, Sergio A. Hojman3and H ´ector M.
Moya-Cessa1
1Instituto Nacional de Astrof´ ısica, ´Optica y Electr´ onica, Calle Luis Enrique Erro No. 1, Santa Mar´ ıa Tonanzintla,
Puebla, 72840, Mexico.
2Facultad de Ingenier´ ıa y Ciencias, Universidad Adolfo Ib´ a˜ nez, Santiago 7491169, Chile.
3Departamento de Ciencias, Facultad de Artes Liberales, Universidad Adolfo Ib´ a˜ nez, Santiago 7491169, Chile.
Departamento de F´ ısica, Facultad de Ciencias, Universidad de Chile, Santiago 7800003, Chile.
Centro de Recursos Educativos Avanzados, CREA, Santiago 7500018, Chile.
Abstract. In the Madelung-Bohm approach to quantum mechanics, we consider a (time dependent) phase that depends quadrati-
cally on position and show that it leads to a Bohm potential that corresponds to a time dependent harmonic oscillator, provided the
time dependent term in the phase obeys an Ermakov equation.
Introduction
Harmonic oscillators are the building blocks in several branches of physics, from classical mechanics to quantum
mechanical systems. In particular, for quantum mechanical systems, wavefunctions have been reconstructed as is the
case for quantized fields in cavities [1] and for ion-laser interactions [2]. Extensions from single harmonic oscillators
to time dependent harmonic oscillators may be found in shortcuts to adiabaticity [3], quantized fields propagating in
dielectric media [4], Casimir e ffect [5] and ion-laser interactions [6], where the time dependence is necessary in order
to trap the ion.
Time dependent harmonic oscillators have been extensively studied and several invariants have been obtained [7, 8, 9,
10, 11]. Also algebraic methods to obtain the evolution operator have been shown [12]. They have been solved under
various scenarios such as time dependent mass [12, 13, 14], time dependent frequency [15, 11] and applications of
invariant methods have been studied in di fferent regimes [16]. Such invariants may be used to control quantum noise
[17] and to study the propagation of light in waveguide arrays [18, 19]. Harmonic oscillators may be used in more
general systems such as waveguide arrays [20, 21, 22].
In this contribution, we use an operator approach to solve the one-dimensional Schr ¨odinger equation in the Bohm-
Madelung formalism of quantum mechanics. This formalism has been used to solve the Schr ¨odinger equation for
different systems by taking the advantage of their non-vanishing Bohm potentials [23, 24, 25, 26]. Along this work,
we show that a time dependent harmonic oscillator may be obtained by choosing a position dependent quadratic time
dependent phase and a Gaussian amplitude for the wavefunction. We solve the probability equation by using operator
techniques. As an example we give a rational function of time for the time dependent frequency and show that the
Bohm potential has di fferent behavior for that functionality because an auxiliary function needed in the scheme,
namely the functions that solves the Ermakov equation, presents two di fferent solutions.
One-dimensional Madelung-Bohm approach
The main equation in quantum mechanics is the Schrodinger equation, that in one dimension and for a potential V(x,t)
is written as (for simplicity, we set ℏ=1)
i∂ψ(x,t)
∂t=−1
2m∂2ψ(x,t)
∂x2+V(x,t)ψ(x,t) (1)arXiv:2101.05907v1  [quant-ph]  14 Jan 2021
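Some of the remaining defects in the output above are regular enough to patch after extraction, for example the gap before an "ff" ligature ("di fferent", "e ffect") and words hyphenated across a line break ("quadrati-" / "cally"). A minimal sketch, assuming only Python's standard `re` module; `tidy_extracted_text` is a hypothetical helper and not part of pypdf, and both rules are heuristics that may misfire on other documents (a genuine lowercase compound split at its hyphen would be merged, for instance):

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Apply two heuristic repairs to text extracted from a LaTeX PDF.

    Hypothetical helper for illustration; not part of pypdf.
    """
    # Rejoin lowercase words hyphenated across a line break:
    # "quadrati-\ncally" -> "quadratically".
    # "Bohm-\nMadelung" is left alone because "M" is uppercase.
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Close the gap left before an "ff" ligature:
    # "di fferent" -> "different", "e ffect" -> "effect".
    text = re.sub(r"(?<=[a-z]) (?=ff)", "", text)
    return text


sample = "invariant methods have been studied in di fferent regimes"
print(tidy_extracted_text(sample))
# prints: invariant methods have been studied in different regimes
```

This does nothing for the equations or the detached accents ("Astrof´ ısica"); those need better handling in the extractor itself.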
Martin Thoma answered Oct 21 '25 23:10