
HTMLExtractor Class

Extracts text and images from a PDF document and creates a formatted HTML page from the extracted data.
Inheritance Hierarchy
System.Object
  Bytescout.PDF2HTML.BaseExtractor
    Bytescout.PDF2HTML.HTMLExtractor

Namespace: Bytescout.PDF2HTML
Assembly: Bytescout.PDF2HTML (in Bytescout.PDF2HTML.dll) Version: 13.3.1.4759-master
Syntax
public class HTMLExtractor : BaseExtractor, 
	IHTMLExtractor

The HTMLExtractor type exposes the following members.

Constructors
Name | Description
Public method HTMLExtractor
Initializes a new instance of the HTMLExtractor class.
Public method HTMLExtractor(String, String)
Initializes a new instance of the HTMLExtractor class.
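The following is a minimal sketch of how the constructors might be used together with members listed further down this page (LoadDocumentFromFile, SaveHtmlToFile(String), Dispose). It assumes that the two String arguments of the second overload are the registration name and key, which this page does not state, and that Dispose can be called without arguments. The "demo" credentials and the file names are placeholders.

```csharp
using Bytescout.PDF2HTML;

class BasicConversion
{
    static void Main()
    {
        // Assumption: the (String, String) overload takes the registration
        // name and key, in that order. "demo" values are placeholders.
        HTMLExtractor extractor = new HTMLExtractor("demo", "demo");
        try
        {
            // Placeholder input path.
            extractor.LoadDocumentFromFile("sample.pdf");

            // Convert the entire document to a single HTML file.
            extractor.SaveHtmlToFile("output.html");
        }
        finally
        {
            // Release resources held by the extractor.
            extractor.Dispose();
        }
    }
}
```

With the parameterless constructor, the same registration values would presumably be supplied through the RegistrationName and RegistrationKey properties instead.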
Properties
Name | Description
Public property AddFontStyleHTMLTagsToText
Controls whether the HTML output adds font style information to text objects. Default is true. Set to false to output text objects as plain text without font size and style.
Public property AdditionalCssStyles
Sets additional CSS styles. Works only with HTMLWithCSS (see ExtractionMode). Example: "#canvas { zoom: 50%; }" scales the div that contains all generated HTML pages to 50%.
Public property CheckPermissions (Inherited from BaseExtractor.)
Public property ColumnDetectionMode
Column detection mode.
Public property ControlsAsText
Sets whether to render form controls as plain text objects. Default is false. Set to true to display controls as text.
Public property DetectHyperLinks
Sets whether to detect URLs and present them as clickable links. Default is true.
Public property DetectLinesInsteadOfParagraphs (Obsolete.)
Gets or sets a value indicating whether to detect single lines or multiple lines of text.
Public property DetectNewColumnBySpacesRatio
Table column detection option: defines the space between columns at which text is detected as a new column.
Public property DetectStrikeoutTextStyle
Gets or sets whether to detect the "strikeout" text style. Default is false.
Public property DetectUnderlineTextStyle
Gets or sets whether to detect the "underline" text style. Default is false.
Public property ExtractAnnotations
Gets or sets a value indicating whether to extract text from annotation objects. Default is true.
Public property ExtractColumnByColumn
Gets or sets a value indicating whether to extract text column by column or to follow the visual layout of the text while extracting. False by default. If you are processing PDF newspapers with text columns, set this property to True to get the text column by column instead of line by line.
Public property ExtractInvisibleText
Gets or sets a value indicating whether to extract invisible text from the PDF document.
Public property ExtractionArea (Inherited from BaseExtractor.)
Public property ExtractionAreaRect (Inherited from BaseExtractor.)
Public property ExtractionMode
Extraction mode: plain HTML or formatted HTML with CSS.
Public property ExtractShadowLikeText
Gets or sets a value indicating whether to include characters used to create a "shadow" effect (when the same character appears with some offset) in the PDF document. True by default (includes all encoded characters regardless of their real appearance).
Public property FontSubstitutionMap
Map to substitute fonts. You can add new mappings to match a font to another font in the output HTML code.
Public property HighPrecisionTextPositioning
Sets whether to use high-precision text positioning. Every symbol is positioned individually, which gives a better look but makes the HTML less convenient to parse.
Public property IsDocumentLoaded (Inherited from BaseExtractor.)
Public property KeepOriginalFontNames
By default, HTMLExtractor replaces the names of embedded fonts with standard (or "descendant") fonts that are similar in metrics and typeface. This is because embedded fonts may differ from the fonts installed on your system, or may be absent from it altogether. Set this property to true if you want to keep the original font names.
Public property LicenseInfo (Inherited from BaseExtractor.)
Public property LineGroupingMode
Sets how lines are grouped into paragraphs. Default: no line grouping is performed.
Public property OptimizeImages
Gets or sets whether images are optimized. Default is true.
Public property OutputImageFormat
Defines the format for output images. Default is PNG (with transparency). If you do not need transparency support and want smaller images (so the page loads faster), set this property to OutputImageFormat.JPEG.
Public property OutputPageWidth
Gets or sets the width (in pixels) of the output pages rendered into HTML. Default output width is 1024 (the height is calculated from the aspect ratio of the original PDF pages).
Public property PageDataCaching (Inherited from BaseExtractor.)
Public property Password (Inherited from BaseExtractor.)
Public property PreserveFormattingOnTextExtraction
Gets or sets a value indicating whether to preserve the text formatting on extraction.
Public property Profiles
Comma-separated list of profiles to apply to the extractor. Profiles must be previously loaded.
(Inherited from BaseExtractor.)
Public property RegistrationKey (Inherited from BaseExtractor.)
Public property RegistrationName (Inherited from BaseExtractor.)
Public property RemoveHyphenation
Gets or sets a value indicating whether to automatically remove hyphenation at the ends of lines (works when Unwrap is true).
Public property SaveImages
Gets or sets the image handling mode (skip, embed, or save to an external file).
Public property TrimSpaces
Gets or sets a value indicating whether to remove leading and trailing spaces from table cell values.
Public property Unwrap
Gets or sets a value indicating whether to unwrap lines into single lines (can be especially useful in the column layout mode; see the ExtractColumnByColumn property). Default is False.
Public property Version (Inherited from BaseExtractor.)
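As an illustration of how several of these properties might be combined, the sketch below configures an extractor for a multi-column document. It rests on several assumptions that this page does not confirm: that the enumeration behind ExtractionMode is named HTMLExtractionMode, that AdditionalCssStyles is a String and OutputPageWidth an Int32, and that properties should be set before the document is loaded. The file names, the page width, and the "demo" registration values are placeholders.

```csharp
using Bytescout.PDF2HTML;

class ConfiguredConversion
{
    static void Main()
    {
        HTMLExtractor extractor = new HTMLExtractor();
        // Placeholder registration values.
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";
        try
        {
            // Assumption: the enum type is named HTMLExtractionMode.
            // AdditionalCssStyles only works in the HTMLWithCSS mode.
            extractor.ExtractionMode = HTMLExtractionMode.HTMLWithCSS;
            extractor.AdditionalCssStyles = "#canvas { zoom: 50%; }";

            // Smaller images at the cost of transparency support.
            extractor.OutputImageFormat = OutputImageFormat.JPEG;

            // Assumption: the width is an integer number of pixels.
            extractor.OutputPageWidth = 800;

            // Read text column by column, join wrapped lines, and
            // drop end-of-line hyphenation (requires Unwrap = true).
            extractor.ExtractColumnByColumn = true;
            extractor.Unwrap = true;
            extractor.RemoveHyphenation = true;

            extractor.LoadDocumentFromFile("newspaper.pdf");
            extractor.SaveHtmlToFile("newspaper.html");
        }
        finally
        {
            extractor.Dispose();
        }
    }
}
```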
Methods
Name | Description
Public method CreateProfile(String, Boolean, Boolean, Boolean)
Creates a JSON profile with all extractor properties and their current values.
(Inherited from BaseExtractor.)
Public method CreateProfile(String, String, Boolean, Boolean, Boolean)
Creates a JSON profile with all extractor properties and their current values.
(Inherited from BaseExtractor.)
Public method Dispose
Releases the unmanaged resources used by the instance and optionally releases the managed resources.
(Inherited from BaseExtractor.)
Public method DisposePage
Disposes the page object. Use this method carefully to destroy a page object that will not be used further. Useful for freeing allocated memory when processing large PDF documents.
Public method Equals (Inherited from Object.)
Protected method Finalize (Inherited from Object.)
Protected method FireParsingError (Inherited from BaseExtractor.)
Public method GetHashCode (Inherited from Object.)
Public method GetHTML
Extracts HTML from the entire document.
Public method GetHTML(IList<Int32>)
Extracts HTML from the specified pages.
Public method GetHTML(String)
Extracts HTML from the specified page ranges.
Public method GetHTML(Int32, Int32)
Extracts HTML from the specified page range.
Public method GetHTMLPage
Extracts HTML from the specified document page.
Public method GetOutputHTMLPageHeight
Gets the height of the output page rendered in HTML format.
Public method GetPageCount (Inherited from BaseExtractor.)
Public method GetPageHeight
Height of the PDF page (in PDF units).
Public method GetPageRect_Height (Inherited from BaseExtractor.)
Public method GetPageRect_Left (Inherited from BaseExtractor.)
Public method GetPageRect_Top (Inherited from BaseExtractor.)
Public method GetPageRect_Width (Inherited from BaseExtractor.)
Public method GetPageRectangle(Int32) (Inherited from BaseExtractor.)
Public method GetPageRectangle(Int32, Boolean) (Inherited from BaseExtractor.)
Public method GetPageWidth
Width of the PDF page (in PDF units).
Public method GetType (Inherited from Object.)
Public method LoadAndApplyProfiles
Loads profiles from a JSON string and automatically applies them. Note that profiles containing detection keywords will be deferred until the extraction.
(Inherited from BaseExtractor.)
Public method LoadDocumentFromFile (Inherited from BaseExtractor.)
Public method LoadDocumentFromStream (Inherited from BaseExtractor.)
Public method LoadProfiles
Loads profiles from a JSON file.
(Inherited from BaseExtractor.)
Public method LoadProfilesFromString
Loads profiles from a JSON string.
(Inherited from BaseExtractor.)
Protected method MemberwiseClone (Inherited from Object.)
Public method Reset
Resets the instance, disposes internal resources, and releases the file. Use this method before loading another PDF file.
(Overrides BaseExtractor.Reset.)
Public method ResetExtractionArea (Inherited from BaseExtractor.)
Public method SaveHtmlPageToFile
Extracts HTML from the specified page to a file.
Public method SaveHtmlPageToStream
Extracts HTML from the specified page to a stream.
Public method SaveHtmlToFile(String)
Extracts HTML from the entire document to a file.
Public method SaveHtmlToFile(IList<Int32>, String)
Extracts HTML from the specified pages to a file.
Public method SaveHtmlToFile(String, String)
Extracts HTML from the specified page ranges to a file.
Public method SaveHtmlToFile(Int32, Int32, String)
Extracts HTML from the specified page range to a file.
Public method SaveHtmlToStream(Stream)
Extracts HTML from the entire document to a stream.
Public method SaveHtmlToStream(IList<Int32>, Stream)
Extracts HTML from the specified pages to a stream.
Public method SaveHtmlToStream(String, Stream)
Extracts HTML from the specified page ranges to a stream.
Public method SaveHtmlToStream(Int32, Int32, Stream)
Extracts HTML from the specified page range to a stream.
Public method SetExtractionArea(RectangleF) (Inherited from BaseExtractor.)
Public method SetExtractionArea(Double, Double, Double, Double) (Inherited from BaseExtractor.)
Public method SetExtractionArea(Single, Single, Single, Single) (Inherited from BaseExtractor.)
Public method ToString (Inherited from Object.)
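The page-range overloads can be used to split a large document into several HTML files. The sketch below is one possible way to do that with GetPageCount and SaveHtmlToFile(Int32, Int32, String), followed by Reset before another file is loaded. It assumes that page indexes are zero-based and that the end index is inclusive, and that GetPageCount returns an Int32; none of this is stated on this page, so check it against the individual method documentation. The chunk size, file names, and "demo" registration values are placeholders.

```csharp
using System;
using Bytescout.PDF2HTML;

class ChunkedConversion
{
    static void Main()
    {
        HTMLExtractor extractor = new HTMLExtractor();
        // Placeholder registration values.
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";
        try
        {
            extractor.LoadDocumentFromFile("large.pdf");

            int pageCount = extractor.GetPageCount();
            const int chunkSize = 10; // hypothetical number of pages per file

            // Assumption: page indexes are zero-based and the end index
            // of the range is inclusive.
            for (int start = 0; start < pageCount; start += chunkSize)
            {
                int end = Math.Min(start + chunkSize, pageCount) - 1;
                string fileName = "pages_" + start + "_" + end + ".html";
                extractor.SaveHtmlToFile(start, end, fileName);
            }

            // Release the current file before loading another document.
            extractor.Reset();
            extractor.LoadDocumentFromFile("another.pdf");
            extractor.SaveHtmlToFile("another.html");
        }
        finally
        {
            extractor.Dispose();
        }
    }
}
```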
Events
Name | Description
Public event ParsingError (Inherited from BaseExtractor.)
Public event PasswordRequired (Inherited from BaseExtractor.)
Fields
Name | Description
Protected field ExtractionAreaInternal (Inherited from BaseExtractor.)