History
History (changes log)
ByteScout PDF Extractor SDK history of changes.

Legend:
-------------------------
- - bug fixed
+ - new feature
= - changed
! - critical
-------------------------

13.4.0.4659 (April 10, 2023)
================================
+ Added support for WEBP image format in 'RasterRenderer' and 'HTMLExtractor'
+ Adding Variant methods to extractors
- Improved fonts rendering
- fixing crash on text object where contentLength
= Performance improvements
= Other minor fixes and improvements.

13.3.0.4514 (September 27, 2022)
================================
+ DocumentSplitter: added support for "**" split range that splits document into pairs of pages.
+ Added methods to all extractors that support Variant datatype for input and output. They allow to perform in-memory processing when using the SDK as COM/ActiveX object from Delphi, VC++, VBScript, etc.
- Fixed text search for RTL languages.
- Input photo images are now rotated according to EXIF information.
= Improved parsing of PDF documents.
= Other minor fixes and improvements.

13.2.0.4485 (June 7, 2022)
==========================
= 'DocumentRotator' now can automatically fix rotation of PDF files using OCR.
= Improved line removal algorithm.
= Improved loading of embedded fonts.
= Performance improvements.
- Rotated text objects were combined with unrotated ones. Fixed now.
- Fixed parsing of names of file attachments.
- 'SearchablePDFMaker': fixed coordinates of transparent text in the output document when the input is an image.
= Suppressed junk console message.
= Improved parsing of PDF documents.
= Other minor fixes and improvements.

13.1.0.4386 (January 24, 2022)
==============================
+ DocumentMerger: Added property 'MergedDocumentTitle' allowing to override the title of merged document.
+ XLSExtractor: Added property 'CustomColumnWidths' allowing to specify exact column widths in generated Excel spreadsheet.
= JSONExtractor: The mode 'OutputStructure.Full' is renamed to 'OutputStructure.LegacyFixed' and made maximally compatible in field names with the mode 'OutputStructure.Legacy'.
+ Added support for UniKS-UCS2-H text encoding.
+ InfoExtractor: Added method 'GetFormFields()' returning information about form fields in PDF document.
= Improved COM/ActiveX interfaces for in-memory processing without file operations.
+ Extractors and SearchablePDFMaker: Added property 'OCRDisableAutoSegmentation' to solve OCR engine's segmentation issues.
= .NET Core min required version is 2.1 now (was 2.0).
- Line grouping was not affected by 'ConsiderFontSizes' and 'ConsiderFontColors' properties. Fixed now.
- Fixed disposing issue in 'SearchablePDFMaker'.
= Improved parsing of PDF documents.
= Other minor fixes and improvements.

13.0.0.4253 (October 4, 2021)
=============================
+ New column detection mode 'ColumnDetectionMode.ContentGroupsAI' that works better on tables without borders and on pages with multiple tables.
= Greatly improved tables detection in 'TableDetector2'.
= Improved filtering of shadow-like text ('ExtractShadowLikeText' option).
= Improved the 'LineGroupingMode.JoinOrphanedRows'.
= 'DocumentMerger': Improved merging of PDF forms. Now it can link fields with matching names or rename them to avoid unwanted linking. See the property 'RenameMatchingFieldsDuringMerge'.
= 'JSONExtractor' and 'XMLExtractor' now output the page size for each page.
= All extractor classes now support extraction of page ranges.
+ Added properties 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' to 'CSVExtractor' and 'XLSExtractor'. They help to prevent underlined text affecting the line grouping in table cells.
= Improved background color detection for the option 'ConsiderBackgroundColors'.
+ Added property 'NormalizeText' to all extractors. It replaces Unicode spaces and hyphens in the extracted text with normal ' ' and '-' characters.
- 'Remover2': fixed handling of PDF page rotation.
- 'Remover2': making unsearchable now performed only for edited pages.
+ 'XMLExtractor': Added property 'IndentedXML' to control indentation.
+ 'JSONExtractor': Added property 'IndentedJSON' to control indentation.
- 'Stamper': fixed stamping of rotated pages.
+ Added new OCR mode - 'OCRMode.AutoRepairFonts'. It automatically tries to detect PDF documents with corrupted text and forces OCR font repair for them. Works only for English texts.
+ Added property 'PageSeparator' to CSV and XLS extractors.
= 'XLSExtractor': improved negative numbers detection.
- 'TextExtractor.FindAll()' method was ignoring the case sensitivity option. Fixed now.
+ Added property 'OCRDetectLines' that helps to detect table structure in scanned documents.
+ 'JSONExtractor' and 'XMLExtractor' now output the number of pages in the result and the number of pages for which OCR was performed.
+ Added property 'OCRPageCount' to extractors that contains number of pages for which OCR was performed during the last extraction.
+ 'JSONExtractor': Added property 'OutputStructure' that allows to select structure of output JSON.
+ 'JSONExtractor': Added property 'OutputTransformation' that allows to apply JSONPath expression to the output JSON.
= Performance improvements.
= Improved parsing of PDF documents.
= Other minor fixes and improvements.

12.1.0.4136 (May 18, 2021)
==========================
+ Added property 'TextExtractor.FuzzySearch' that enables 'fuzzy' text search algorithm. It allows to find 'approximately equal' strings.
+ Added 'DocumentSplitter2' class that splits document by found text.
+ Added 'CSVExtractor.NormalizeCSV' property. It makes CSV data produced from different document pages to contain the same number of columns.
+ Added property 'JSONExtractor.OutputStructure' that allows to change the structure of the generated JSON to one of predefined variants for easier postprocessing.
+ Added property 'JSONExtractor.OutputTransformation' that allows to apply JSONPath expression to the generated JSON.
+ Added property 'OCRPageCount' to extractor classes that contains number of pages for which OCR was performed.
+ 'JSONExtractor' and 'XMLExtractor' now add to the generated JSON and XML result the number of processed pages and the number of pages for which OCR was performed.
+ Added property 'OCRDetectLines' to extractor classes that improves column detection in scanned documents.
+ Added property 'ConsiderBackgroundColors' to extractor classes that enables detection of background color under text objects. It may help to improve row and column detection in tables without borders but with color stripes.
+ Added properties 'DocumentMerger.GenerateBookmarks' and 'DocumentMerger.BookmarkTitles' to enable automatic generation of bookmarks pointing to the merged parts.
= Improved PDF optimization in 'DocumentSplitter'.
= 'DocumentMerger' now uses the first input document as the base for the merged document. This allows to keep document information properties and outlines.
= DocumentMerger: added support for profiles.
= MultimediaExtractor: added support for more media types.
- 'TextExtractor.FindAll()' method was ignoring the case sensitivity option.
- Fixed issue with junk empty temporary files generated during OCR.
= Improved parsing of PDF documents.
= Other minor fixes and improvements.

12.0.0.4062 (February 8, 2021)
==============================
+ Added public 'BaseExtractor.ExtractionArea' property (in addition to 'SetExtractionArea()' method) for more intuitive use.
= Added the new property 'ColumnDetectionByTextAlignment' to extractors that affects the detection of table columns without separating lines between.
+ Added support for simplified profiles.
+ DocumentOptimizer: Added the property 'OptimizationOptions.GrayscaleImages' that converts all color images to grayscale.
+ UnsearchablePDFMaker: Added the new property 'KeepSkippedPages' that keeps pages excluded from the processing in the output document.
+ UnsearchablePDFMaker: Added the new property 'Grayscale' that converts all processed pages to grayscale.
+ Added the property 'BaseTextExtractor.TextAnalysisCorruptedTextThreshold' to fine-tune the text analysis.
= Member names in profiles are case-insensitive now.
= Improved filtering of invisible objects.
= Improved detection of bold fonts.
= Improved OCR rotation detection.
= Added missing OCR mode 'OCRMode.TextFromVectorsAndRepairedFonts'.
= RTL fonts detection is now enabled by default.
= JSON extractor now generates clean JSON (without the @ and # characters for attributes).
= Improved support for external Chinese fonts.
= Improved positioning of rotated PDF objects.
= Now the damaged CCITT and JBIG2 images are skipped from rendering avoiding crashes.
= SearchablePDFMaker: improved OCR when 'DiscardExistingDocumentText' is enabled.
= 'SearchablePDFMaker.GetPageOCRCells()' now detects text color.
= OCR in all extractors now detects text color if the 'ConsiderFontColors' property is enabled.
= 'LineGroupingMode.JoinOrphanedRows' now separates rows of different color if 'ConsiderFontColors' property is enabled.
- InfoExtractor: Fixed a crash if the input document is an image.
- Fixed OCR crash on rotated text.
- 'IsOCRRecommendedForPage()' now skips text objects outside the page crop box.
= Improved parsing of PDF documents.
= Other minor fixes and improvements.

11.3.0.3983 (October 26, 2020)
==============================
+ DocumentSplitter: Added support for regions with inverted page numbers. For example, "!1" means "the last page", "!1-!3" or "!3-" means "last three pages".
+ DocumentSplitter: Added support for "*" split range that means "split every single page".
+ Added 'InfoExtractor.Metadata' property that gets XMP metadata from the document.
= Improved joining of multi-line cells in tables without borders ('LineGroupingMode.JoinOrphanedRows' mode).
= Improved detection of OCR language file versions.
= Improved .NET Core 2.0 compatibility.
= Improved unwrapping of multi-line cell text.
- Fixed issue when invisible vector drawings were causing unwanted separation of text objects.
- Fixed extraction from area when running OCR against image file (not PDF!).
= Improved parsing of PDF documents.
- Other minor fixes and improvements.

11.2.0.3919 (June 20, 2020)
===========================
+ 'MultimediaExtractor' now supports extraction of 3D-animation objects.
- 'TextExtractor.Find()' now keeps original font names in found object information.
= Improved column detection in 'ColumnDetectionMode.Borders' mode.
- 'SearchablePDFMaker' did not process vector-only pages. Fixed now.
= Improved regex text search in 'TextExtractor'.
+ Added 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' properties to 'JSONExtractor' and 'XMLExtractor'.
+ Added 'OCRWhiteList' and 'OCRBlackList' properties to extractors.
+ Added 'Invert' OCR preprocessing filter.
+ Added 'Scale' OCR preprocessing filter.
= Improved joining of multi-line cells in tables without borders ('LineGroupingMode.JoinOrphanedRows' mode).
= Improved performance of 'ImageExtractor'.
+ Added page rectangles to 'InfoExtractor'.
= Improved 'OCRAnalyzer'.
= Improved automatic deletion of duplicated text objects during the extraction.
- Fixed extraction issues in .NET Core version.
= Improved parsing of PDF documents.
- Other minor fixes and improvements.

11.1.0.3845 (March 19, 2020)
============================
+ Added 'OCROverallConfidence' property in all extractors.
+ SearchablePDFMaker: Added 'KeepOriginalRotation' property.
- SearchablePDFMaker: fixed crash on mixed English-Arabic text recognition.
+ PDF Multitool: Added "Developer Tools" sub-menu to the context menu.
= Improved parsing of PDF documents.
- Other minor fixes and improvements.
11.0.0.3805 (February 11, 2020)
===============================
+ Added support for new revision of PDF encryption (ISO 32000-2:2017 compliance).
+ Added 'LicenseInfo' property providing detailed information about your license.
+ Added 'Grayscale' filter to OCRImagePreprocessingFilters.
= Dramatically improved column extraction for multiple tables on a page. Works only in 'ColumnDetectionMode.Borders' mode for tables with borders between columns and rows.
= Greatly improved 'ColumnDetectionMode.BorderedTables'. As in the table detection, it now uses optical recognition to detect bordered tables and their columns on scanned documents.
= Improved 'InfoExtractor' to return the encrypted and password-protected states without asking for a password or throwing an exception.
= Added document permissions information to 'InfoExtractor'.
= DocumentSplitter: added zero-padding to page numbers in generated file names.
= Improved extraction of duplicated text (shadow-like effect).
= Improved 'MultimediaExtractor'.
- Fixed text search issues on some documents.
- Fixed bug that damaged extracted text only during multi-thread processing.
- Fixed crash on subsequent extractions with different OCR modes.
- Fixed .NET Core compatibility issue.
= Improved parsing of PDF documents.
- Other minor fixes and improvements.

10.8.0.3732 (December 4, 2019)
==============================
+ Remover2: Added 'MaskColor' property that allows to change color of masking rectangle.
- Remover & Remover2: Fixed incomplete removal of the text in some cases.
- XMLExtractor and XFDFExtractor: fixed missing control types.
- Fixed parsing of combobox items that consist of value+label pairs.
= Improved handling of Arabic fonts and charsets.
= Improved handling of CJK fonts and charsets.
= Improved parsing of PDF documents.
- Other minor fixes and improvements.

10.7.0.3697 (November 1, 2019)
==============================
= Improved extraction of embedded images.
= Improved table columns detection.
- Remover2: fixed crash on sequential Add*() method calls.
- PDF Multitool: fixed crash on multimedia extraction.
= Improved parsing and processing of PDF documents.
- Other minor fixes and improvements.

10.6.0.3659 (October 1, 2019)
=============================
+ Added methods to remove vector objects to 'Remover' and 'Remover2' classes.
+ Added experimental 'TableDetector2' class demonstrating new table detection method.
= Improved replacement of non-embedded PDF fonts.
= Improved splitting of text objects when using CustomExtractionColumns.
- Fixed text search on some documents.
- Added 'CreateProfile()' method to all extractors that creates profile from current object.
+ PDF Multitool: Added tools to remove text, image, and vector objects.
- PDF Multitool: Fixed "Save Vectors" option in XML extraction.
= Improved parsing and processing of PDF documents.
- Other minor fixes and improvements.

10.5.0.3637 (September 2, 2019)
===============================
+ DocumentMerger: Added "MergeFolder()" method allowing to merge all PDF files in folder.
= Improved extraction by CustomExtractionColumns.
= Remover: Improved appearance of partially removed text objects.
= Renderer and Viewer: Improved rendering of small fonts with stroke.
+ PDF Multitool: Added Full Screen mode.
+ PDF Multitool: Added "Night Mode".
- PDF Multitool: Fixed selection reset on switching a tool.
= Improved parsing and processing of PDF documents.
- Other minor fixes and improvements.

10.4.0.3600 (August 6, 2019)
============================
+ Added extracted text analysis. See "EnableTextAnalysis" property.
= Improved columns detection.
+ Implemented replacement filters allowing to replace extracted text before analysis of table structure. See "AddFilter()" method.
+ Added "SensitiveDataDetector" class allowing to detect sensitive data in PDF documents.
+ Added new "Remover2" class: improved version of "Remover" with better interface.
+ PDF Multitool: Added "Save vector objects" option to XML and JSON converters.
= PDF Multitool: Improved "Detect Tables" dialog.
= PDF Multitool: Improved conversion to HTML format.
+ PDF Multitool: Added set of tools "Sensitive Data Suite" allowing to detect and remove sensitive data in PDF documents.
= PDF Multitool: Reduced memory consumption on extraction from very large documents.
- Other minor fixes and improvements.

10.3.0.3566 (July 2, 2019)
==========================
+ Added property 'OCRMaximizeCPUUtilization' that allows to improve OCR performance at the cost of maximized CPU utilization.
= Improved OCR rotation detection.
- Fixed OCR crash on systems with CPU without AVX and AVX2 extensions.
- Fixed OCR crash when working under limited system accounts.
= Improved the detection of the visibility of text objects when they are hidden by an overlying opaque vector object.
= Improved extraction from cropped PDF pages.
- Fixed 'OutOfMemoryException' on tiling patterns with very large step or bounding box.
= Improved extraction of embedded images.
= Improved extraction of multimedia files.
- Fixed decoding of UTF-8 encoded text objects.
= Improved Japanese fonts decoding.
- Fixed 'LineGroupingMode.JoinOrphanedRows' mode for multiple single-cell lines.
= PDF Multitool: Replaced legacy 'FolderBrowserDialog' with modern 'FolderSelectDialog' everywhere.
= PDF Multitool: Added Ctrl-Shift-O hot key to open recent document.
- Other minor fixes and improvements.

10.2.0.3512 (May 28, 2019)
==========================
= Improved OCR engine stability when working in strict environments.
= Improved columns separation by 'CustomExtractionColumns'.
+ Added parameter for 'TextExtractor.Find()' method that allows to specify RegexOptions.
+ Added support for streams to 'DocumentSplitter' and 'DocumentMerger'.
+ Added property 'TableDetector.EnhanceTableBorders' affecting the table detection in 'Bordered Tables' mode.
= Improved parsing and processing of PDF documents.
= PDF Multitool: Visited pages are now displayed much faster.
= PDF Multitool: Improved keyboard navigation.
= PDF Multitool: Improved CSV preview.
+ PDF Multitool: All tools now show elapsed time in the status bar.
= PDF Multitool: Changed default OCR grade to 'Best'.
- Other minor fixes and improvements.

10.1.0.3439 (April 4, 2019)
===========================
- Fixed detection of rotation of scanned documents ('OCRDetectPageRotation' property in extractors).
+ PDF Multitool: Added "Detect rotation" to OCR options of basic extractors.
= Improved parsing and processing of PDF documents.
- Other minor fixes and improvements.

10.0.0.3420 (March 21, 2019)
============================
+ Greatly improved OCR quality and performance.
+ PDF Multitool: New option to select OCR grade.
- PDF Multitool: Fixed behavior of "Remove" button in "Merge documents" tool.
= PDF Multitool: Reduced excessive painting in selection mode.
= Improved parsing and rendering of PDF documents.
- Other minor fixes and improvements.

9.4.0.3398 (March 12, 2019)
===========================
+ Added TextExtractor.FindAll() and TextExtractor.FindAllToJSON() methods.
+ Added 'AnnotationExtractor' class.
= Improved handling of embedded PDF fonts.
= Improved parsing of PDF documents.
+ PDF Multitool can now be set as default PDF viewer application in Windows.
+ PDF Multitool: Added the ability to preview the conversion.
+ PDF Multitool: Reworked converters' options dialogs. Removed weird options, added actual ones.
= PDF Multitool: Now Ctrl-PageUp and Ctrl-PageDown keys switch pages even if PDFViewerControl is not focused.
= PDF Multitool: Improved handling of PDF extraction permissions.
- Fixed unwanted byte order mark (BOM) when writing extracted text to MemoryStream.
- Fixed line grouping in table cells.
- Fixed crash in XMLExtractor when input document is image.
= Improved parsing of XFA forms.
= Improved Deskew image preprocessing filter.
+ Added 'ShrinkMultipleSpaces' property improving column detection if text in a table contains multiple spaces between words.
- Fixed column detection in rotated pages.
+ Improved support of Microsoft Excel formats.
- Other minor fixes and improvements.

9.3.0.3352 (January 31, 2019)
=============================
+ Added OCRCorrections property to all extractors that implement OCR.
+ Added .NET Core compatible assemblies.
= Improved support of Korean fonts.
= Improved parsing of PDF documents.
= Improved columns detection.
+ XMLExtractor, JSONExtractor: Added 'SaveVectors' property.
= OCRExtension: Suppressed unwanted console messages.
- Removed C++ runtime dependencies.
- Fixed merging of PDF forms containing fields with the same name.
- Other minor fixes and improvements.

9.2.0.3254 (October 22, 2018)
=============================
= Changed font rendering engine to improve text rendering and to circumvent Windows GDI font processing issues.
= Improved extraction of embedded media files.
= Improved detection of columns when extracting tabular data.
= PDF Renderer SDK: Property 'RenderingOptions.PreferSystemFonts' made obsolete due to change of font rendering engine.
= XLSExtractor: improved Excel format support.
= Embedded default fonts to fallback to if a font is missing in Windows.
= Improved support of cropped PDF documents.
= Improved extraction of text from rotated pages.
+ PDF Multitool: Added "OCR Analyzer" tool.
= Performance improvements.
- Other minor fixes and improvements.

9.1.0.3163 (July 18, 2018)
==========================
+ Added new line grouping mode 'LineGroupingMode.JoinOrphanedRows'.
+ Added new OCRAnalyzer class that can help to find optimal combination of OCR image preprocessing filters. See source code examples.
+ Added new LineDetector class allowing to find all vertical and horizontal lines in document.
+ Added public methods GetPreprocessedPagePreview() and SavePreprocessedPagePreview() allowing to preview the result of OCR image preprocessing filters work.
= Greatly improved the line removing OCR image preprocessing filters.
- SearchablePDFMaker: fixed hanging on processing PDF documents with large count of vector objects.
- Fixed bug in RotationAngle property when processing already rotated PDF documents.
- ImageExtractor now correctly handles the rotation of embedded images.
+ PDF Multitool: added new feature "Optimize PDF document".
- PDF Multitool: fixed resolution selection in "Make PDF unsearchable".
- PDF Multitool: fixed rotation angle selection in "Rotate Document".
- Other minor fixes and improvements.

9.0.0.3079 (April 11, 2018)
===========================
+ Added RotationAngle property to rotate document pages before the extraction.
= TextExtractor: Improved plain text columns alignment.
= XLSExtractor: Improved numbers detection.
= DocumentOptimizer: Greatly improved optimization effectiveness.
= Greatly improved Deskew algorithm for OCR of rotated scans.
= Remover: more accurate deletion of text objects.
- SearchablePDFMaker: Fixed processing of rotated scans.
- SearchablePDFMaker: Fixed resolution issues when the input is image.
- Other minor fixes and improvements.

8.8.1.3025 (January 29, 2018)
=============================
= Improved formatting of extracted plain text (TextExtractor). Now columns look better.

8.8.0.3015 (January 22, 2018)
=============================
- Fixed: OCR preprocessing filters were not applied if input document is image.
- PDF Multitool: Fixed image preprocessing filters in "Find Text" dialog.
+ TableDetector now provides detected cells information for ColumnDetectionMode.BorderedTables (see 'FoundTableCells' property).
+ XMLExtractor: Added annotations extraction.
= XMLExtractor: Object coordinates in XML are fractional now for better precision (were integer).
= Improved support of encrypted PDF documents.
- Other minor fixes and improvements.

8.7.0.2980 (November 8, 2017)
=============================
+ DocumentOptimizer: added automatic resampling of high resolution images.
+ Added 'ParsingError' event allowing to handle parsing errors and interrupt or continue the processing.
+ SearchablePDFMaker: Added DiscardExistingDocumentText property allowing to overwrite previous OCR.
+ Added AllowStandalonePunctuation property to tabular extractors (CSV, XML, JSON, XLS).
= Performance improvements.
- SearchablePDFMaker: Invisible text dimensions now match recognized text pieces.
- DocumentSplitter: Fixed 'outputFolder' parameter in SplitCOM() method.
- Made IBaseTextExtractor interface public.
- Other minor fixes and improvements.

8.6.0.2911 (August 1, 2017)
===========================
+ XMLExtractor, JSONExtractor, HTMLExtractor: Added KeepOriginalFontNames property.
+ TextComparer: Added GetChanges() method to get comparison results in form convenient for programmatic analysis.
+ DocumentRotator: It is now possible to specify pages to rotate.
= TextExtractor.ExtractColumnByColumn property now affects Find() method.
- Fixed font names in SearchResult elements.
- Fixed Contrast preprocessing filter.
- Extraction: subscript and superscript text objects were merged with normal text. Fixed now.
= Other minor fixes and improvements.

8.5.0.2855 (June 1, 2017)
=========================
= Improved Japanese text extraction.
= Removed obsolete ClientProfile builds.
= Improved multimedia files extraction.
- Other minor fixes and improvements.

8.4.0.2820 (March 29, 2017)
===========================
+ New event ProgressChanged in all time-consuming classes. The event reports the progress in percents and also allows to interrupt the processing.
+ SearchablePDFMaker now supports single and multi-page images as the input and produces a PDF document at the output.
= Performance improvements.
- Fixed crash when the input document is image and it's loading from stream.
- Other minor fixes and improvements.

8.3.0.2792 (March 06, 2017)
===========================
+ Added new Remover class allowing to remove text from PDF documents.
+ InfoExtractor now able to read custom document properties (see CustomProperties property).
+ XMLExtractor and JSONExtractor now able to extract document images and put them to outer files or embed as Base64 string.
= Text extraction: Unwrap property now affects the text in table cells.
= Text extraction: Improved lines grouping in table cells.
- AttachmentExtractor: Fixed extraction of attachments and portfolio created with Microsoft Outlook.
- DocumentSplitter: Fixed document optimization (OptimizeSplittedDocuments property).
= Performance improvements.
= Other minor improvements and bug fixes.

8.2.0.2697 (January 11, 2017)
=============================
- Fixed Unwrap option.
= Improved bordered tables detection.
= Improved attachments extraction.
+ Added support for profiles - quick way to apply multiple settings at once.
+ OCR: Implemented rotation detection of wrongly oriented scanned PDF pages.
+ SearchablePDFMaker now able to automatically rotate wrongly oriented scanned PDF pages.
- Fixed exception in SearchablePDFMaker when loading document from stream.
- Fixed memory leaks in OCR.
+ TextExtractor and CSVExtractor: Added Save* methods overrides allowing to specify the characters encoding.
= Improved media files extraction.
= Improved Vertical Line Remover OCR preprocessing filter.
= Other minor improvements and bug fixes.

8.1.1.2606 (October 25, 2016)
=============================
- Fixed OCR preprocessing filters in SearchablePDFMaker.
- Fixed OCR preprocessing filters in PDF Multitool demo app.
+ Added Gamma Correction preprocessing filter.
+ Added Horizontal Lines Remover preprocessing filter.
= Improved Dilate preprocessing filter.

8.1.0.2600 (October 21, 2016)
=============================
+ Added OCR preprocessing filters to improve the recognition quality on low-quality scanned documents.
+ Added new DocumentOptimizer class able to recompress all document images with JPEG or CCITT compression.
+ Added text removal filters.
+ All extraction classes (TextExtractor, XMLExtractor, etc.) now able to load image files and extract text from them using OCR.
+ PDF Multitool demo app now able to load image files and extract text from them using OCR.
- Fixed extraction of text in Korean charset (KSCms-UHC-H / Code Page 949).
= Improved text extraction from specified rectangular area.
- Improved extraction of invisible text.
- Fixed transparent color representation in XML extraction.
= Other minor improvements and bug fixes.

8.0.0.2523 (August 19, 2016)
============================
+ Added filtering of extracted content by font name, font size and color.
! Updated OCR engine to the latest version. Update language files from "tessdata" folder.
= Improved text extraction.
= Improved lines grouping in tabular data.
= Improved performance.
= Improved XFA forms extraction.
= Improved TableDetector.
- Fixed PDF parsing issues.
- Fixed JBIG images decoding.
- ImageExtractor: fixed per-page image extraction.
- MultimediaExtractor: fixed extraction on embedded MPEG audio.
- TextExtractor: fixed non-working RemoveHyphenation property.
= Other minor improvements and bug fixes.

7.0.0.2474 (May 26, 2016)
==========================
+ Added new JSONExtractor class.
+ Added override for DocumentSplitter.Split() method allowing to specify the output folder for generated files.
- Fixed multi-threading bug in DocumentSplitter.
- TableDetector now respects extraction area set by SetExtractionArea() method.
+ New properties in extraction classes: ExtractionColumns - contains coordinates of detected columns; CustomExtractionColumns - allows to override the column detection.
- GetPageRect* methods did not take the page rotation into account.
- Fixed bug in installer causing some files from previous installation were interfering with updates.
= Reworked the registration checking.
Now the library will not throw an exception, but will work in demo mode if you omitted or entered a wrong RegistrationName and RegistrationKey.
+ PDF Multitool: Added recent document list to "Open PDF Document" button.
+ PDF Multitool: Selection can be resized now.
+ PDF Multitool: Added Extract JSON feature.
= PDF Multitool: Improved Table Detector UI.
= PDF Multitool: Greatly improved font rendering quality.
+ PDF Multitool: Added debug option "Show Detected Extraction Columns" to the context menu to display the detected columns on the current page. Becomes visible only after running any extraction against the current displayed page.
- PDF Multitool: Fixed font rendering issue on 32-bit Windows.
= Other minor improvements and bug fixes.

6.30.0.2421 (March 23, 2016)
============================
+ Added TextComparer utility class (available in .NET 4.0 assemblies only) allowing to compare text in two PDF documents and generate report.
= Improved support of ICC color profiles.
= Improved handling of embedded fonts.
= Improved AttachmentExtractor.
- Fixed XMLExtractor.SaveXMLToStream() method.
- Fixed extracted text duplication when using OCRCacheMode.WholePage option.
= Other bug fixes and improvements.
6.20.0.2354 (January 20, 2016)
==============================
PDF To Text, PDF To CSV, PDF To XML functions improved
New Extract Video, Extract Audio examples
CSV and XML extractors improved support for tables with empty columns inside
new MultimediaExtractor to extract video and audio from PDF
new property PageDataCaching
new "MemoryCareProcessingOfHugeFiles" example
fixed null exception when trying to dispose already disposed pages
XLSExtractor: improves fonts support
SkipInvisibleText now skips clipped text (which is not visible)
text output rendering improved
XFDF Extractor: added support for checkboxes
Images output improved to support more sub-formats
Unicode text handling improved

6.11.2193 (August 3, 2015)
==========================
Batch Processing samples updated to show the use of Reset() method
C++ source code sample added for Pages Extraction
DocumentMerger adds Merge2(inputfile1, inputfile2, outputfile) method to merge 2 files
XLS Extractor minor bug-fixes
PDF Multitool now allows to enable/disable text, image, vector layers, adds advanced settings for text extraction
XML, CSV, Table extraction improves support for tables with empty cells inside columns

6.10.2136 (June 16, 2015)
=========================
improved PDF to Text extraction
.ExtractShadowLikeText property improved: better filtering for shadow-like text
improved stability and PDF text support

6.00.2071 (May 14, 2015)
========================
PDF to XML, PDF To CSV, PDF To Text functionality improved
PDF To XLS command line sample added (based on vbscript)
PDF To HTML SDK adds new .DetectHyperLinks property (TRUE by default) to enable/disable automated links detection in the text
New SearchablePDFMaker (available for PRO licenses) to convert PDF into searchable PDF files
new properties in extractor: ConsiderFontNames, ConsiderFontSizes, ConsiderFontColors, ConsiderVerticalBorders in CFG files header columns detection (when AutoAlighHeaderToColumns = true) improved
.DetectLinesInsteadOfParagraphs replaced with new .LineGroupingMode to control how lines are merged into paragraphs
IMPORTANT: PDF To XML fixes long-standing issue with incorrect Y coordinate for text objects (it pointed to the bottom left instead of the top left)
.TableXMinIntersectionRequiredInPercents and .TableYMinIntersectionRequiredInPercents properties added
C++ source code sample added
XML Extractor fixes missing empty columns in PreserveFormatting=true mode
Minor fixes in colors in some PDF files
support for multiple OCR languages added
PDF Multitool GUI: adds Copy to Clipboard button to TXT, CSV, XML and raster renderer dialogs
XLSExtractor: adds PageToWorksheet property to enable/disable generation of separate worksheets per page.
new .TextEncodingCodePage property
PDFViewerControl: adds ValidateContextMenu allowing user to add custom items to context menu
PDF Viewer control: adds properties ShowTextObjects, ShowImageObjects, ShowVectorObjects.
XMLExtractor now adds "OCRConfidence" attribute for recognized text
PDF/A checking functionality (in beta)
improving controls and text checking and alignment according to the original layout. The issue was caused by the shift of Y coordinates in controls while parsing: that was incorrect. The correct way is to shif...
XML Extractor updated: now produces <CONTROL> tag for checkboxes and text fields
now uses the temp directory instead of the current directory.
checkboxes, radioboxes, editboxes, comboboxes are better supported now
allows partial trust callers.

5.20.1781 (January 27, 2015)
PDF to XML, PDF to CSV, PDF to Text functionality improved
OCRMode now provides 9 modes
.DetectLineInsteadOfParagraph now works much better. Set it to False to capture multiline text in table cells!
PDF controls support improved
FDF and XFDF data extraction added
Table detection improved to support multiline text in cells and tables with absent rows
beta version of PDF/A validator added
minor fixes and improvements

5.10.1747 (November 25, 2014)
PDF to XML, PDF to CSV, PDF to Text functions improved
now supports text extraction from text controls
XML extractor now adds font style, size, name, text coordinates into <text> tags
ASP.NET sample for OCR usage added
new property OCRLanguageDataFolder to specify the location of "tessdata" folder
improved support of PDF files
improved support for rotated text
updated source code samples
updated documentation
minor improvements and fixes

5.00.1626 (August 14, 2014)
OCR (text from images) functionality added: now you may extract text from embedded images and repair damaged text
issue fixed with CSV and XML extractor missing last columns with some settings
improved support for damaged PDF files
multiline search
text search with word matching modes is now supported
now may search text with hyphens and on different lines: see new source code sample Find Text With Hyphens
new property .RTLTextAutoDetectionEnabled (false by default) to auto detect RTL languages
PDF Viewer GUI demo improved
minor improvements and fixes

4.00.1487 (May 30, 2014)
improved pdf to text, pdf to csv, pdf to xml
issue with extraction area fixed
Improved Unicode handling
new .ContentType to check if PDF is PDF, Portfolio or XFAForm
new properties: Unwrap, ExtractionAreaUsageMode
new AttachmentInfo class to obtain details about attachment
new XFA Form XML extraction support (see XFAFormExtractor and XFAFormToXML samples)
new ZUGFeRD PDF support added
Multithreading performance improved
Licensing updated: Now Licensing is per developer
new "match whole word" parameter to TextExtractor.Find()
improved XLS and XLSX output

3.40.1349 (March 10, 2014)
improved stability of the text extraction
issue with the very last text line missing in some PDF files fixed
tables with empty cells are handled better now
issue with incorrect extraction of overlapped text objects fixed
issue with missing spaces between words in some files fixed
issue with incorrect X coordinate returned while searching with extraction area defined fixed
minor bug-fixes and improvements

3.30.1240 (November 27, 2013)
improved support for PDF files in old formats
image flipping issue in some PDF files fixed
improved text rendering in PDF files
minor bug-fixes

3.20.1209 (October 31, 2013)
table detection was not returning proper coordinates for 2nd and further tables, fixed
minor source code samples updates
DocumentSplitter now works with multipage TIF files
minor bug-fixes

3.20.1200 (October 28, 2013)
minor rotated text issues fixed
table detection was not returning proper coordinates, fixed
minor bug-fixes

3.20.1179 (October 22, 2013)
pdf to text and pdf data extraction improved
new .AutoAlignColumnsToHeader (true by default) property to automatically align cells to the header column or not (switching this setting will help if you are getting some shifted cells)
new DocumentRotator class to rotate pages in PDF documents
new ExtractRawImages property in Images Extractor to define whether to extract raw images or images with rotation and transformation applied
improved support of PDF files with rotated objects and pages
new source code sample showing how to extract page found by a keyword "Find Keyword And Extract Page"
Images Extractor: SetExtractionArea() method added to define a rectangle area to extract images from
improved Splitting Pages example
improved pages extraction from PDF
new RemoveUnusedResources method to remove unused resources from PDF to reduce file size
minor bug-fixes and improvements

3.20.1100 (August 22, 2013)
new method: DocumentSplitter.Split(sourcefile, splitPages) to extract multiple ranges of pages from the same PDF file
minor bug-fixes in pdf to text engine

3.20.1093 (August 5, 2013)
pdf to text minor functionality fixes
x64 installer improvements
minor fixes for error messages
PDFDocument.Dispose() no longer disposes the source stream with PDF if this stream was supplied by the user (so the user should dispose it)
improved PDF format support
minor bug-fixes

3.20.1075 (July 11, 2013)
improved PDF To CSV, PDF To XLS, PDF To XML extraction
improved PDF reading speed and stability
minor bug-fixes

3.10.1051 (June 29, 2013)
improved table extraction support
improved pdf files support

3.10.1038 (June 26, 2013)
improved text extraction support
issues fixed related to incorrect extraction area coordinates for some PDF files with scanned images
speed improvements
improved support for various PDF files

3.10.942 (May 30, 2013)
improved pdf text extraction support
minor bug-fixes and improvements

3.10.899 (May 14, 2013)
improved pdf to text conversion
improved PDF reading support
more Visual Basic .NET, C# and VBScript source code samples added
documentation updated

3.00.864 (April 11, 2013)
improved PDF extraction support
improved PDF handling
pdf splitting and merging: new property DocumentSplitter.OptimizeSplittedDocuments to optimize PDF files after splitting (may decrease file size when needed)
improved PDF fonts handling
demo utility updated
source code samples updated to run on any .NET framework by default
minor bug-fixes

3.00.825 (March 12, 2013)
improved pdf to text, pdf to csv
demo utility PDF Viewer reworked and updated for better UI experience
minor improvements and fixes in PDF support
improved stability while working with PDF files with high density vector graphics inside
improved support for indexed color palettes
improved embedded fonts rendering
better support for Unicode fonts
new .Version property to read exact version of the dll
minor updates and improvements

2.50.708 (November 11, 2012)
PDF data extraction speed improved
Windows 8 support
improved PDF images and colors support
improved PDF to CSV, PDF to XML, PDF to XLS/XLSX: now skips leading rows if they are empty
pdf text search now works better and provides more intelligent support for regular expressions
ActiveX support and installation improved and now provides single batches to run on Windows x86/x64 for Windows XP to 8 Pro
new property: .ExtractShadowLikeText to enable/disable extraction of shadowed text (where it is used as effect to create visual shadows)
minor bug-fixes and improvements

2.40.650 (November 1, 2012)
improved support for Unicode text extraction
improved support for PDF/A pdf files
issues with white stripes appearing on multiple images combined fixed
data extraction internal optimizations
improved support for 8 bit images inside PDF
vector drawings improved to provide better support for multiple small objects
Color representation in images with indexed colors fixed
Type2 fonts support improved
Improved support for embedded fonts in PDF produced by Ghostscript engine
CCITT image compression related issues fixed
LZW compressed PDF support improved
improved support for shading objects
improved PDF fonts support
improved support for PDF with 4 bit images

2.30.594 (September 18, 2012)
PDF data extraction improved
memory and speed optimizations
fixed issue with empty data while extracting data from some PDF files
improved images extraction support (more image encoding variations are supported)
minor updates in examples
minor bug-fixes

2.30.568 (June 21, 2012)
pdf to text conversion quality improved
multithreading usage stability has been improved
hanging issue on some PDF fixed
PDF Extractor SDK: updated sample for StructuredExtractor (previously known as TableExtractor interface)
minor fixes and improvements

2.20.0.539 (May 4, 2012)
improved stability
demo utility improved
important security fixes

2.20.525 (April 14, 2012)
improved speed (up to x2 faster on some documents)
Tables detection improved
updated PDF Viewer utility
improved support for structured text extraction (CSV and XML data extraction)
minor bug-fixes

2.20.458 (February 2, 2012)
minor fixes in TableDetector class (.TableDetectionMinNumberOfColumns and .TableDetectionMinNumberOfRows were working incorrectly)
improved text extraction for PDF files generated from text files
improved support for PDF files produced by Adobe Acrobat
PDF Viewer: CSV, XML and Text extractor forms updated to show .PreserveFormattingOnTextExtraction option
minor fixes in .NET 4.0 assemblies
Renderer SDK adds /Visual Basic/PDF To BMP using streams/ sample
improved support for PDF with forms objects
improved leading spaces format detection in text extraction
.SetExtractionArea() added to define area on a page to work with in PDF Renderer SDK
improved fonts information reading support in PDF files
new .PageSeparator property in TextExtractor to define a separator string for pages if you need one
fixed issue with indexed colorspaces in PDF
improved PDF format support

2.20.415 (December 21, 2011)
PDF Extractor SDK: minor update for PDF to XLS sample
rendering: improved fonts support
text extraction with formatting improved
new source code sample to show how to save extracted text to a stream
performance optimized and pdf processing speed improved
improved support for PDF format

2.20.396 (November 30, 2011)
fixed issues with CSV, XML and XLS extraction on long tables
PDF Viewer now provides ability to turn on/off text formatting support on extraction
PDF support improved
minor bug-fixes

2.20.392 (November 25, 2011)
NEW table detection implemented, see new Bytescout.PDFExtractor.TableDetector interface and source code samples in /Find Table And Extract As CSV/ sub-folder in examples
NEW regular expressions support for text search in TextExtractor (see .RegexSearch property)
Text search functionality improved
minor bug-fixes

2.10.303 (October 4, 2011)
NEW: DocumentMerger and DocumentSplitter interfaces and classes to merge and split PDF documents
improved support for PDF documents
PDF processing speed increased
minor bug-fixes

2.10.276 (August 26, 2011)
NEW: AttachmentExtractor interface to extract file attachments and embedded files from PDF (see /Examples/Extract Attachments/ for sample source code)
NEW: XLSExtractor interface to extract tables from PDF as XLS and XLSX Excel files (including font formatting)
improved text extraction functionality
improved output image quality
improved support of Unicode text
improved support of damaged PDF files (not hanging on damaged files anymore)

2.00.228 (12 July 2011)
CSVExtractor: SeparationSymbol and QuotationSymbol properties were added
TrimValues property for CSVExtractor and XMLExtractor: turned on by default to trim detected cell values automatically
Default properties for CSV extraction improved
fixed incorrect default space ratio in text extractor: it is now 0.4; the previous value, 1.2, caused some words to be joined into a single one
TextExtractor.detectNewColumnBySpacesRatio renamed into .SpaceRatioBetweenWords property
PDFViewer now shows options dialog to adjust SpaceRatioBetweenWords if needed
minor bug-fixes

2.00.217 (21 June 2011)
CSV and XML extraction speed greatly improved
CSVExtractor and XMLExtractor classes add new .DetectNewColumnBySpacesRatio property: use this property to control space between detected columns of text
XML and CSV Extractor adds .SkipCellsWithEmptyValues property (true by default to skip cells with empty values)
PDF Viewer now shows extraction options dialog for XML and CSV export functions
PDF To CSV to XLS source code sample added
PDF To CSV\Delphi\ source code sample added
minor bug-fixes and improvements

2.00.206 (6 June 2011)
support for .NET 3.5, .NET 4.00 added
Delphi source code sample has been added
minor bug-fixes and improvements

2.00.186 (May 16, 2011)
pdf processing speed increased up to x10 times
minor bug-fixes and improvements

1.10.168 (May 6, 2011)
support for password protected PDF documents improved (was not working properly in previous release)
minor bug-fixes and improvements

1.10.160 (12 April 2011)
XML comments are available now to show hints for methods, classes and properties in Visual Studio
New property: .ExtractColumnByColumn (false by default), set to True to extract text column by column instead of line by line
PDF Viewer freeware utility updated to feature "Extract Text (line by line)" and "Extract Text (column by column)" buttons
improved support for single paged PDF documents produced by Acrobat Distiller software
clipping issues were fixed
fixed hanging on some broken PDF documents
improved text decoding support
minor bug-fixes

1.10.150 (10 March 2011)
* PDF files support improved
+ now handles PDF files from Google Docs without errors
* minor bug-fixes

1.10.144 (26 February 2011)
+ now works with secured documents (provide password if needed in .Password property)
+ minor bug-fixes and improvements
+ updated GUI demo application

1.10.121 (11 February 2011)
+ PDF to CSV extractor added
+ PDF to XML extractor added
+ support for invisible text extraction added
+ minor bug-fixes and improvements

1.00.30 (9 November 2010)
+ new version