I'm currently trying to read text from pdf file using itextsharp using the following code and assigning to a textbox (MultiLine) - (Windows Desktop App)
Note: This code works fine.
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
BUT My pdf file has an equation
and all i'm getting is the follwing output
what could be added here to achieve the following text? Any sort of help would really be appreciated!
I used itextsharp and i am 100% sure its not possible. Problem is within pdf format itself. It does not contains any tags refered to some text. Pdf contains specific graphical representation of content which has its position on pdf page. Without OCR its even impossible to detect bolded text. Pdf isnt good format to parse.
My problem was even easier than yours and it was hell to read from pdf. It was just text, but it was formated as 2 pages in one(2 column text). Itextsharp read content by coordinates, so my text got mixed up as he read first line of first column than first line of second column (not as text flows). As for latex, after latex code is converted to pdf there is no reverse to latex code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With