Getting the lines of each paragraph of a docx with Apache-POI

I am using the library Apache-POI for my app. Specifically, POIshadow-all (ver. 3.17) for reading a Word document. I am successfully extracting every paragraph as follows:

[image]

What I actually need is to extract every line, as follows:

[image]

The code to extract every paragraph is this:

    try {
        val fis = FileInputStream(path.path + "/" + document)
        val xdoc = XWPFDocument(OPCPackage.open(fis))

        // All body paragraphs of the document, in order
        val paragraphList: MutableList<XWPFParagraph> = xdoc.paragraphs

        // `paragraph` is not defined in this excerpt
        val newParagraph = paragraph.createRun()

            ...

        for (par in paragraphList) {
            // Full text of the paragraph; it carries no line-wrapping information
            val currentParagraph = par.text
            Log.i("TAG", "current: $currentParagraph")

                ...

The variable currentParagraph holds a whole paragraph, as expected. However, I need a variable, say currentLine, that holds a single line.

I've researched this issue on Stack Overflow and other sites. I found some proposals, but none of them work for me. I also tried getting the data via ctr and using XWPFRun, without any success.

I would be grateful for any recommendation on how to proceed.

Thanks in advance for your help.

Sergio76 asked Sep 13 '20 21:09


1 Answer

A .docx file does not store how many lines a given paragraph has, because that depends on how the document is rendered or viewed. Think of a Word document: with a larger font size, a given paragraph spans more lines; with a smaller font size, it spans fewer. The number of lines in each paragraph is therefore not a fixed property of the document but varies with the rendering.

However, if your application has a hard requirement for an estimate, you can implement logic such as “start a new line after X characters (X being a constant), rounding to the end of the current word”. The result would still change with screen size, font size, zoom level, etc. My suggestion is therefore to design your application so that it does not explicitly measure the number of lines in a paragraph, but rather the number of words or characters, and uses that as a yardstick for inserting a line break if absolutely necessary.
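The character-count heuristic described above could be sketched roughly as follows. This is only a minimal sketch, not something Apache POI provides: the helper name wrapAtWordBoundaries, the sample text, and the limit of 15 characters are assumptions made for illustration, and the resulting lines will generally not match what Word actually displays.

```kotlin
// Hypothetical helper (not part of Apache POI): greedily wraps text at word
// boundaries so that no line exceeds maxChars. A single word longer than
// maxChars is kept whole on its own line.
fun wrapAtWordBoundaries(text: String, maxChars: Int): List<String> {
    require(maxChars > 0) { "maxChars must be positive" }
    val lines = mutableListOf<String>()
    val current = StringBuilder()
    for (word in text.trim().split(Regex("\\s+"))) {
        if (word.isEmpty()) continue
        // Start a new line when appending this word (plus a space) would overflow.
        if (current.isNotEmpty() && current.length + 1 + word.length > maxChars) {
            lines.add(current.toString())
            current.setLength(0)
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(word)
    }
    if (current.isNotEmpty()) lines.add(current.toString())
    return lines
}

fun main() {
    // Stand-in for the text of one paragraph (par.text in the question).
    val currentParagraph = "The quick brown fox jumps over the lazy dog"
    for (currentLine in wrapAtWordBoundaries(currentParagraph, 15)) {
        println(currentLine)
    }
}
```

With the sample text and a limit of 15, this yields the three lines "The quick brown", "fox jumps over" and "the lazy dog". In the loop from the question, the text of each paragraph could be passed in the same way.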

Another potential approach is to split the paragraph into sentences at sentence-ending punctuation, e.g. “start a new sentence after each ‘?’, ‘!’ or ‘.’ character within a paragraph”. This too can get rather tricky, depending on the structure of certain sentences: abbreviations and decimal numbers, for instance, also contain a ‘.’.
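A rough sketch of the sentence-splitting idea is shown below. The helper name splitIntoSentences and the sample strings are assumptions for illustration. The regular expression simply splits on whitespace that follows one of the three punctuation characters, so it is a naive approximation rather than a robust sentence detector.

```kotlin
// Hypothetical helper: splits a paragraph wherever '.', '!' or '?' is
// followed by whitespace. The punctuation stays with its sentence.
fun splitIntoSentences(paragraph: String): List<String> =
    paragraph.trim()
        .split(Regex("(?<=[.!?])\\s+"))
        .filter { it.isNotEmpty() }

fun main() {
    val currentParagraph = "Is this a line? No! It is a sentence."
    for (sentence in splitIntoSentences(currentParagraph)) {
        println(sentence)
    }

    // Shows the weakness of the heuristic: the abbreviation "Dr." is
    // treated as the end of a sentence.
    println(splitIntoSentences("Dr. Smith arrived."))
}
```

The second call illustrates why this can get tricky: "Dr. Smith arrived." is split into "Dr." and "Smith arrived.", so real text would likely need extra rules for abbreviations.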

Therefore, the answer to your question is that there is no “out of the box” way to detect the number of lines in a given paragraph using Apache POI. You would have to program your own logic (perhaps using one of the approaches outlined above), if absolutely necessary.

Mikaal Anwar answered Oct 29 '22 10:10