PDF was designed for printing, not for storing structured information. Internally, every text object in a PDF exists only to be drawn at a position on the page and has no logical connection to the others.
Our extractor tries to reproduce the page layout by analyzing text objects and generating lines, spaces, and line breaks similar to what you see on the page.
- One way is to export to plain text so the extractor will add line breaks.
- Another way is to export to CSV/JSON, where it will try to output every paragraph as a separate cell.
- JSON output also arranges the text in a virtual grid of rows and columns.
For better paragraph output, you may need to use the lineGrouping option to control how lines are merged into paragraphs.
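
To illustrate why paragraph detection is heuristic, here is a minimal Python sketch of the general idea: positioned text fragments are first clustered into lines by their vertical position, then consecutive lines are merged into paragraphs when the gap between them is small. This is not the SDK's actual algorithm. The Fragment class, the function names, and the y_tolerance and max_gap thresholds are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text object (x, y measured from the top-left)."""
    x: float
    y: float
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments into lines.

    A fragment joins the current line when its y is within y_tolerance
    of the y of the first fragment that started that line.
    """
    lines = []  # each entry: [line_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    result = []
    for y, frags in lines:
        frags.sort(key=lambda f: f.x)  # left-to-right reading order
        result.append((y, " ".join(f.text for f in frags)))
    return result


def merge_into_paragraphs(lines, max_gap=14.0):
    """Merge consecutive lines whose vertical distance is at most max_gap."""
    paragraphs = []
    prev_y = None
    for y, text in lines:
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs


fragments = [
    Fragment(50, 100.5, "world"),
    Fragment(10, 100, "Hello"),
    Fragment(10, 112, "second line"),
    Fragment(10, 150, "New paragraph"),
]
paragraphs = merge_into_paragraphs(group_into_lines(fragments))
print(paragraphs)  # ['Hello world second line', 'New paragraph']
```

Changing max_gap in this sketch changes where paragraphs split, which is the kind of trade-off a line grouping setting has to make on real documents.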
We recommend trying this with PDF Multitool, a desktop app that comes with the SDK. It lets you test your file more quickly: you only need to load the file and choose the output format you want to extract it to.
To open it, simply search for PDF Multitool on your machine.