
Read a math equation using iTextSharp

Tags:

c#

pdf

itext

I'm trying to read text from a PDF file with iTextSharp using the following code, and assigning the result to a multiline TextBox (Windows desktop app).

Note: This code works fine.

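using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);

        // iTextSharp page numbers start at 1.
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Orders the extracted text chunks by their position on the page.
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            // Round-trips the text through the system default encoding (Encoding.Default);
            // characters that encoding cannot represent do not survive this step.
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}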

But my PDF file contains an equation:

[image: the equation as it appears in the PDF]

and all I'm getting is the following output:

[image: the extracted output]

What could I add here to extract the equation as text? Any help would really be appreciated!

Aimal Khan asked Aug 06 '16 13:08


1 Answer

I have used iTextSharp, and I am 100% sure this is not possible. The problem lies in the PDF format itself: it does not contain any tags that describe the text. A PDF stores a specific graphical representation of the content, with each piece placed at a position on the page. Without OCR it is impossible even to detect bold text. PDF is not a good format to parse.

My problem was even simpler than yours, and reading from the PDF was still hell. It was just text, but laid out as two pages in one (two-column text). iTextSharp reads content by coordinates, so my text got mixed up: it read the first line of the first column, then the first line of the second column, instead of following the flow of the text. As for LaTeX, once LaTeX code has been converted to PDF, there is no way to get the LaTeX code back.
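The two-column mix-up described above can be reproduced without any PDF library. The following is only a minimal sketch: `Chunk`, `ColumnDemo`, `Sample` and the split position `300f` are hypothetical names and values made up for illustration, not iTextSharp types or anything taken from the original files. It assumes each piece of text comes with an (x, y) position, and that y grows upward as it does on a PDF page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; NOT an iTextSharp type.
public sealed class Chunk
{
    public Chunk(string text, float x, float y) { Text = text; X = x; Y = y; }
    public string Text { get; }
    public float X { get; }
    public float Y { get; } // larger Y means higher on the page
}

public static class ColumnDemo
{
    // Made-up sample: two lines in a left column (A) and two in a right column (B).
    public static List<Chunk> Sample()
    {
        return new List<Chunk>
        {
            new Chunk("A1", 50f, 700f),
            new Chunk("B1", 320f, 700f),
            new Chunk("A2", 50f, 680f),
            new Chunk("B2", 320f, 680f),
        };
    }

    // Pure position order: top to bottom, then left to right.
    // Lines that share a Y value are joined across both columns.
    public static string ByPosition(IEnumerable<Chunk> chunks)
    {
        return string.Join(" ", chunks
            .OrderByDescending(c => c.Y)
            .ThenBy(c => c.X)
            .Select(c => c.Text));
    }

    // Column-aware order: everything left of columnSplitX first, then the rest.
    // The split position has to be known in advance (an assumption of this sketch).
    public static string ByColumn(IEnumerable<Chunk> chunks, float columnSplitX)
    {
        return string.Join(" ", chunks
            .OrderBy(c => c.X < columnSplitX ? 0 : 1)
            .ThenByDescending(c => c.Y)
            .ThenBy(c => c.X)
            .Select(c => c.Text));
    }
}
```

With the sample data, `ColumnDemo.ByPosition(ColumnDemo.Sample())` interleaves the columns, while `ColumnDemo.ByColumn(ColumnDemo.Sample(), 300f)` reads the left column before the right one. In iTextSharp 5.x the same idea is usually applied by extracting one rectangle per column, for example by wrapping the strategy in a `FilteredTextRenderListener` with a `RegionTextRenderFilter`; the column boundaries still have to be supplied by hand, and none of this recovers the structure of an equation.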

Djuro answered Nov 14 '22 23:11
