How would you get count of a given word in a given PDF?

Tags:

pdf

Interview Question

I was asked this question in an interview. The answer doesn't have to be specific to any programming language, platform, or tool.

The question was phrased as follows:

How would you get the instance count of a given word in a PDF? The answer doesn't have to be programming language, platform, or tool specific. Just let me know how you would do it in a memory- and speed-efficient way.

I am posting this question for the following reasons:

  1. To better understand the context - I still don't understand what the interviewer might be looking for by asking this question.
  2. To get diverse opinions - I tend to answer such questions based on my skills in one programming language (C#), but there might be other valid ways to get this done.

Thanks for your interest.

Manish Basantani asked Jan 24 '12 03:01


2 Answers

If I had to write a program to do it, I'd find a PDF library capable of extracting text from PDF files, such as Xpdf, and then count the words. If this were a one-off task, or something that needed to be automated without production-quality requirements, I'd just feed the file into the pdftotext program and then parse the output file with Python: split it into words, put them in a dictionary, and count the number of occurrences.
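As a rough illustration of the one-off approach, here is a minimal sketch. It assumes the `pdftotext` command-line tool is installed and on the PATH (passing `-` as the output file sends the text to stdout). The function names and the definition of a "word" (runs of letters, digits, and apostrophes, case-insensitive) are my own assumptions, not part of the question.

```python
import re
import subprocess
from collections import Counter


def extract_text(pdf_path):
    """Run pdftotext on the file and return the extracted text.

    Assumes pdftotext is installed; "-" writes the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_words(text):
    """Lowercase the text, split it into words, and tally each word."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def count_word_in_pdf(pdf_path, word):
    """Return how many times `word` occurs in the PDF (case-insensitive)."""
    return count_words(extract_text(pdf_path))[word.lower()]
```

A call would look like `count_word_in_pdf("report.pdf", "invoice")`, where the file name is only a placeholder. If only one word matters and memory is a concern, the dictionary can be skipped and the matches counted while reading the output line by line.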

If I were asking this interview question, I'd be looking for a couple of things:

  1. understanding the difference between the settings for this task: a one-off script vs. production code
  2. not attempting to implement a PDF renderer yourself, and trying to find a library instead.

Now, I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what a PDF is and what a "word" is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Often a word is split into a couple of completely separate strings, which are absolutely positioned in the document so that together they look like a single word. This is why searching for words in a PDF document sometimes gives strange-looking results. So to implement word searching in a document you'd have to glue these strings back together (pdftotext takes care of that for you).
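To make the "gluing" idea concrete, here is a toy sketch. The fragment format `(x, y, width, text)` and the `max_gap` threshold are hypothetical, chosen only for illustration. Real PDF text extraction, including whatever pdftotext does internally, has to deal with fonts, rotation, kerning, and much more.

```python
def glue_fragments(fragments, max_gap=1.0):
    """Merge positioned text fragments into words.

    fragments: iterable of (x, y, width, text) tuples (hypothetical format).
    Two fragments on the same baseline (same y) are treated as one word
    when the horizontal gap between them is at most max_gap.
    """
    words = []
    prev_y = prev_end = None
    # Read in order: by baseline first, then left to right.
    for x, y, width, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        same_line = prev_y is not None and y == prev_y
        if same_line and x - prev_end <= max_gap:
            words[-1] += text   # fragment touches the previous one: same word
        else:
            words.append(text)  # new line or a visible gap: start a new word
        prev_y, prev_end = y, x + width
    return words


fragments = [(10, 0, 12, "ment"), (0, 0, 10, "docu"), (30, 0, 8, "text")]
print(glue_fragments(fragments))  # ['document', 'text']
```

Here "docu" and "ment" are stored as separate strings, but they sit right next to each other on the same line, so they are joined into "document" before any counting happens.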

It's not a bad question at all.

MK. answered Oct 29 '22 20:10


You can use a trie. With one, it is very easy to get the count of a given word.
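A minimal sketch of that idea, assuming the text has already been extracted from the PDF and split into words. The class and method names are my own.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # maps a character to the next TrieNode
        self.count = 0      # number of inserted words that end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert one occurrence of word."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def count(self, word):
        """Return how many times word was inserted (0 if never)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


trie = Trie()
for w in "the cat and the hat".split():
    trie.add(w)
print(trie.count("the"))  # 2
```

For counting a single word, a plain counter does the same job with less code. A trie mainly pays off when many different words are looked up or when the words share long prefixes.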

Sandeep answered Oct 29 '22 20:10