Extracting text from PDFs in C# [closed]

Tags: c#, text, pdf, extract

Pretty simply, I need to rip text out of multiple PDFs (quite a lot actually) in order to analyse the contents before sticking it in an SQL database.

I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors, some characters are scrambled, and a lot of the time there are spaces (' ') EVERYWHERE: inside words, between every letter, in huge blocks taking up several lines. It all seems a bit random.

Is there any easy way of doing this that I'm completely overlooking (quite likely!) or is it a bit of an arduous task that involves converting the extracted byte values into letters reliably?

924

asked Jan 22 '10 10:01

Duncan Tait

1 Answer

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as 2 separate calls, as shown in this pseudo-code:

    moveto (x1, y); output ("T")
    moveto (x2, y); output ("ap")

This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing some micro-space between characters to get a fully justified line. The end result is that the text fragments actually stored in a PDF are very often not full words, but pieces of them, so an extractor has to decide from the positions alone which fragments belong together.
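To make the consequence concrete, here is a minimal sketch of how an extractor might stitch such fragments back into words. It is not iTextSharp's API: the `TextFragment` record, the `FragmentMerger` class, and both threshold values are hypothetical names and numbers chosen for illustration, and the record syntax assumes C# 9 or later. The idea is simply to group fragments by baseline, sort each line left to right, and insert a space only when the horizontal gap between two neighbours is larger than a kerning adjustment would plausibly be.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one positioned piece of text as a PDF parser might report it.
// X is the left edge, Y the baseline, Width the rendered width (all in points).
public sealed record TextFragment(double X, double Y, double Width, string Text);

public static class FragmentMerger
{
    // Rebuilds lines of text from positioned fragments.
    // spaceThreshold and lineTolerance are assumed values; real documents
    // usually need them tuned, e.g. relative to the font size.
    public static List<string> MergeIntoLines(
        IEnumerable<TextFragment> fragments,
        double spaceThreshold = 2.0,
        double lineTolerance = 1.0)
    {
        return fragments
            // Fragments whose baselines round to the same bucket share a line.
            .GroupBy(f => Math.Round(f.Y / lineTolerance))
            // PDF y coordinates grow upward, so the highest bucket is the first line.
            .OrderByDescending(g => g.Key)
            .Select(g => JoinLine(g.OrderBy(f => f.X), spaceThreshold))
            .ToList();
    }

    private static string JoinLine(IEnumerable<TextFragment> line, double spaceThreshold)
    {
        var sb = new StringBuilder();
        double? previousRight = null; // right edge of the fragment just appended

        foreach (var f in line)
        {
            // A small (or negative) gap is kerning; a large one is a word break.
            if (previousRight is double right && f.X - right > spaceThreshold)
            {
                sb.Append(' ');
            }
            sb.Append(f.Text);
            previousRight = f.X + f.Width;
        }
        return sb.ToString();
    }
}
```

For example, feeding in "T" at x = 10 (width 6), "ap" at x = 15.5 (width 11) and "water" at x = 30 (width 25), all on the same baseline, gives the single line "Tap water": the first gap is slightly negative, so the pieces are joined, while the second gap of 3.5 points exceeds the threshold and becomes a space. A fixed threshold like this is exactly where the stray or missing spaces described in the question tend to come from, which is why it needs tuning per document.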

123

answered Sep 23 '22 18:09

Tarydon
