HTMLExtractor Class

Extracts text and images from a PDF document and creates a formatted HTML page from the extracted data.

Inheritance Hierarchy

System.Object
  Bytescout.PDF2HTML.BaseExtractor
    Bytescout.PDF2HTML.HTMLExtractor

Namespace: Bytescout.PDF2HTML
Assembly: Bytescout.PDF2HTML (in Bytescout.PDF2HTML.dll) Version: 13.3.1.4759-master

Syntax

C#

public class HTMLExtractor : BaseExtractor,
	IHTMLExtractor

VB

Public Class HTMLExtractor
	Inherits BaseExtractor
	Implements IHTMLExtractor

C++

public ref class HTMLExtractor : public BaseExtractor,
	IHTMLExtractor

F#

type HTMLExtractor =
    class
        inherit BaseExtractor
        interface IHTMLExtractor
    end

The HTMLExtractor type exposes the following members.

Constructors

	Name	Description
	HTMLExtractor	Initializes a new instance of the HTMLExtractor class.
	HTMLExtractor(String, String)	Initializes a new instance of the HTMLExtractor class.

Properties

	Name	Description
	AddFontStyleHTMLTagsToText	Controls if HTML output adds font style information to text objects. Default is true. Set to false to output text objects as plain text objects without font size and style defined.
	AdditionalCssStyles	Sets additional CSS styles. Works only with HTMLWithCSS (see ExtractionMode). Example: "#canvas { zoom: 50%; }" - scale the div that contains all generated HTML pages by 50%.
	CheckPermissions	(Inherited from BaseExtractor.)
	ColumnDetectionMode	Column detection mode.
	ControlsAsText	Sets whether to render form controls to plain text objects. Default is false. Set to true to display controls as text.
	DetectHyperLinks	Sets whether to detect URLs and present them as clickable links. Default is true.
	DetectLinesInsteadOfParagraphs	Obsolete. Gets or sets a value indicating whether to detect single lines or multiple lines of text.
	DetectNewColumnBySpacesRatio	Table columns detection option: defines space between columns to detect text as a new column.
	DetectStrikeoutTextStyle	Gets or sets whether to detect the "strikeout" text style. Default is false.
	DetectUnderlineTextStyle	Gets or sets whether to detect the "underline" text style. Default is false.
	ExtractAnnotations	Gets or sets a value indicating whether to extract text from annotation objects. Default is true.
	ExtractColumnByColumn	Gets or sets a value indicating whether to extract text column by column or use the visual layout of the text while extracting. False by default. If you are processing PDF newspapers with text columns, set this property to True so that text is extracted column by column instead of line by line.
	ExtractInvisibleText	Gets or sets a value indicating whether to extract invisible text from the PDF document.
	ExtractionArea	(Inherited from BaseExtractor.)
	ExtractionAreaRect	(Inherited from BaseExtractor.)
	ExtractionMode	Extraction mode: plain HTML or formatted HTML with CSS.
	ExtractShadowLikeText	Gets or sets a value indicating whether to include characters used to create a "shadow" effect (when the same character appears again with some offset) when extracting from the PDF document. True by default (includes all encoded characters regardless of their real appearance).
	FontSubstitutionMap	Map to substitute fonts. You can add new mappings to match a font to another font in output HTML code.
	HighPrecisionTextPositioning	Sets whether to use high-precision text positioning. Every symbol is positioned individually, which gives a better look but makes the resulting HTML less convenient to parse.
	IsDocumentLoaded	(Inherited from BaseExtractor.)
	KeepOriginalFontNames	By default HTMLExtractor replaces the names of embedded fonts with standard (or "descendant") fonts that are similar in metrics and typeface. This is because embedded fonts may differ from the fonts installed on your system or be absent from it altogether. Set this property to true if you want to keep the original font names.
	LicenseInfo	(Inherited from BaseExtractor.)
	LineGroupingMode	Sets how lines are grouped into paragraphs. Default: no lines grouping is performed.
	OptimizeImages	Gets or sets optimization of images. Default is true.
	OutputImageFormat	Defines format for output images. Default is PNG (with transparency). If you do NOT need the transparency support and want to have smaller image sizes (so the page will load faster) then set this property to OutputImageFormat.JPEG.
	OutputPageWidth	Gets or sets the width (in pixels) of the output pages rendered into HTML. Default output width is 1024; the height is calculated from the aspect ratio of the original PDF page.
	PageDataCaching	(Inherited from BaseExtractor.)
	Password	(Inherited from BaseExtractor.)
	PreserveFormattingOnTextExtraction	Gets or sets a value indicating whether to preserve the text formatting on the extraction.
	Profiles	Comma-separated list of profiles to apply to the extractor. Profiles must be previously loaded. (Inherited from BaseExtractor.)
	RegistrationKey	(Inherited from BaseExtractor.)
	RegistrationName	(Inherited from BaseExtractor.)
	RemoveHyphenation	Gets or sets a value indicating whether to automatically remove hyphenation at the ends of lines (works when Unwrap is true).
	SaveImages	Gets or sets the image handling (skip, embed, or save to an external file).
	TrimSpaces	Gets or sets a value indicating whether to remove leading and trailing spaces from table cell values.
	Unwrap	Gets or sets a value indicating whether to unwrap lines into single lines (especially useful in column layout mode; see the ExtractColumnByColumn property). Default is False.
	Version	(Inherited from BaseExtractor.)
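
The OutputPageWidth entry above says the output height follows the aspect ratio of the original PDF page. The sketch below shows that arithmetic only, as a standalone helper that does not call the SDK. The function name EstimateOutputPageHeight is hypothetical, the proportional formula is an assumption about how the ratio is applied, and the SDK may round differently; GetOutputHTMLPageHeight reports the value the extractor actually uses. It is written with top-level statements, so it needs C# 9 or later.

```csharp
using System;

// Hypothetical helper (not part of the SDK): estimates the output page height
// in pixels for a given output width, keeping the PDF page's aspect ratio.
// Assumption: plain proportional scaling; the SDK's own rounding may differ.
static double EstimateOutputPageHeight(double pdfPageWidth, double pdfPageHeight, int outputPageWidth)
{
    if (pdfPageWidth <= 0 || pdfPageHeight <= 0)
        throw new ArgumentException("Page dimensions must be positive.");

    return outputPageWidth * pdfPageHeight / pdfPageWidth;
}

// Example: a 612 x 792 page (US Letter, assuming 1 PDF unit = 1/72 inch)
// rendered at the default output width of 1024 pixels.
double estimated = EstimateOutputPageHeight(612, 792, 1024);
Console.WriteLine($"Estimated output height: {estimated:F0} px");
```

In real code the two page dimensions would come from GetPageWidth and GetPageHeight, and the width from OutputPageWidth.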

Methods

	Name	Description
	CreateProfile(String, Boolean, Boolean, Boolean)	Creates a JSON profile with all extractor properties and their current values. (Inherited from BaseExtractor.)
	CreateProfile(String, String, Boolean, Boolean, Boolean)	Creates a JSON profile with all extractor properties and their current values. (Inherited from BaseExtractor.)
	Dispose	Releases the unmanaged resources used by the instance and optionally releases the managed resources. (Inherited from BaseExtractor.)
	DisposePage	Disposes the page object. Use this method carefully, and only to destroy a page object that will not be used again. Useful to free allocated memory when processing large PDF documents.
	Equals	(Inherited from Object.)
	Finalize	(Inherited from Object.)
	FireParsingError	(Inherited from BaseExtractor.)
	GetHashCode	(Inherited from Object.)
	GetHTML	Extracts HTML from the entire document.
	GetHTML(IList<Int32>)	Extracts HTML from specified pages.
	GetHTML(String)	Extracts HTML from specified page ranges.
	GetHTML(Int32, Int32)	Extracts HTML from specified page range.
	GetHTMLPage	Extracts HTML from specified document page.
	GetOutputHTMLPageHeight	Gets the height of the output page rendered in HTML format.
	GetPageCount	(Inherited from BaseExtractor.)
	GetPageHeight	Height of the PDF page (in pdf units).
	GetPageRect_Height	(Inherited from BaseExtractor.)
	GetPageRect_Left	(Inherited from BaseExtractor.)
	GetPageRect_Top	(Inherited from BaseExtractor.)
	GetPageRect_Width	(Inherited from BaseExtractor.)
	GetPageRectangle(Int32)	(Inherited from BaseExtractor.)
	GetPageRectangle(Int32, Boolean)	(Inherited from BaseExtractor.)
	GetPageWidth	Width of the PDF page (in pdf units).
	GetType	(Inherited from Object.)
	LoadAndApplyProfiles	Loads profiles from JSON string and automatically applies them. Note that profiles containing detection keywords will be deferred until the extraction. (Inherited from BaseExtractor.)
	LoadDocumentFromFile	(Inherited from BaseExtractor.)
	LoadDocumentFromStream	(Inherited from BaseExtractor.)
	LoadProfiles	Loads profiles from JSON file. (Inherited from BaseExtractor.)
	LoadProfilesFromString	Loads profiles from JSON string. (Inherited from BaseExtractor.)
	MemberwiseClone	(Inherited from Object.)
	Reset	Resets the instance, disposes internal resources and releases the file. Use this method before loading another PDF file. (Overrides BaseExtractor.Reset().)
	ResetExtractionArea	(Inherited from BaseExtractor.)
	SaveHtmlPageToFile	Extracts HTML from specified page to file.
	SaveHtmlPageToStream	Extracts HTML from specified page to stream.
	SaveHtmlToFile(String)	Extracts HTML from the entire document to file.
	SaveHtmlToFile(IList<Int32>, String)	Extracts HTML from specified pages to file.
	SaveHtmlToFile(String, String)	Extracts HTML from specified page ranges to file.
	SaveHtmlToFile(Int32, Int32, String)	Extracts HTML from specified page range to file.
	SaveHtmlToStream(Stream)	Extracts HTML from the entire document to stream.
	SaveHtmlToStream(IList<Int32>, Stream)	Extracts HTML from specified pages to stream.
	SaveHtmlToStream(String, Stream)	Extracts HTML from specified page ranges to stream.
	SaveHtmlToStream(Int32, Int32, Stream)	Extracts HTML from specified page range to stream.
	SetExtractionArea(RectangleF)	(Inherited from BaseExtractor.)
	SetExtractionArea(Double, Double, Double, Double)	(Inherited from BaseExtractor.)
	SetExtractionArea(Single, Single, Single, Single)	(Inherited from BaseExtractor.)
	ToString	(Inherited from Object.)
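
A typical call sequence with the members listed above might look like the following sketch. It uses only names that appear in the tables (RegistrationName, RegistrationKey, OutputPageWidth, DetectHyperLinks, LoadDocumentFromFile, SaveHtmlToFile(String), Dispose), but several details are assumptions: the file names and the "demo" registration values are placeholders, LoadDocumentFromFile is assumed to take a file path string, and Dispose is assumed to be callable without arguments. It requires a reference to Bytescout.PDF2HTML.dll and C# 9 or later for top-level statements.

```csharp
using Bytescout.PDF2HTML;

// Hypothetical input and output paths; replace with your own files.
const string inputPath = "sample.pdf";
const string outputPath = "output.html";

HTMLExtractor extractor = new HTMLExtractor();
try
{
    // Placeholder registration data (assumption); use your own values.
    extractor.RegistrationName = "demo";
    extractor.RegistrationKey = "demo";

    // Optional settings from the Properties table.
    extractor.OutputPageWidth = 1024;   // documented default, set here for clarity
    extractor.DetectHyperLinks = true;  // documented default, set here for clarity

    // Assumption: LoadDocumentFromFile accepts the path of the PDF file.
    extractor.LoadDocumentFromFile(inputPath);

    // SaveHtmlToFile(String): extracts HTML from the entire document to a file.
    extractor.SaveHtmlToFile(outputPath);
}
finally
{
    // Release the resources held by the extractor.
    extractor.Dispose();
}
```

To process several documents with one instance, the Reset description above suggests calling Reset before loading the next file.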

Events

	Name	Description
	ParsingError	(Inherited from BaseExtractor.)
	PasswordRequired	(Inherited from BaseExtractor.)

Fields

	Name	Description
	ExtractionAreaInternal	(Inherited from BaseExtractor.)

Reference

Bytescout.PDF2HTML Namespace