Document Parser: Template Creation Guide

What Is Document Parser and How Does It Work?

Document Parser is a versatile document parsing engine for accurate, easy-to-maintain data extraction from PDF invoices, statements, reports, paystubs, tables, and receipts. No programming is required to create and maintain data extraction templates. It supports native and scanned PDF files; PNG, JPG, and TIFF images; and DOC, DOCX, and PPT files (Web API version only), in English, German, French, Spanish, and many other languages. It is available as a Web API, a Zapier integration, an on-premise API server, or a direct integration module.

Template specification version: 3.

Templates can be written in YAML or JSON. A template defines one or more keywords that match the template to a document, plus expressions for the fields and tables to extract. A single template file can contain multiple templates: in a YAML file, separate the templates with a --- line; in a JSON file, arrange them as an array [].
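As a minimal sketch of the multi-template layout, the following YAML file holds two templates separated by a --- line. The second template (its sourceId, keywords, and statementNumber field) is hypothetical and only illustrates the file structure.

```yaml
---
# First template in the file
templateVersion: 3
sourceId: ACME Inc. Invoice
detectionRules:
  keywords:
    - ACME Inc\.
    - Invoice No
fields:
  invoiceNumber:
    expression: 'Invoice No.: ({{LettersOrDigitsOrSymbols}})'
---
# Second template in the same file (hypothetical document design)
templateVersion: 3
sourceId: ACME Inc. Statement
detectionRules:
  keywords:
    - ACME Inc\.
    - Statement No
fields:
  statementNumber:
    expression: 'Statement No.: ({{LettersOrDigitsOrSymbols}})'
```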

Sample YAML template showing the main features:

---
templateVersion: 3
templatePriority: 1
sourceId: ACME Inc. Invoice
culture: en-US
detectionRules:
  keywords:
    - ACME Inc\.
    - Invoice No
    - ABN 01 234 567 890
fields:
  companyName:
    type: static
    expression: ACME Inc.
  invoiceNumber:
    type: regex
    expression: 'Invoice No.: ({{LettersOrDigitsOrSymbols}})'
    pageIndex: 0
  invoiceDate:
    type: regex
    expression: 'Invoice Date: ({{SmartDate}})'
    dataType: date
    dateFormat: MM/dd/yyyy
  billTo:
    type: rectangle
    rectangle:
      - 32.5
      - 64.5
      - 200
      - 100
    pageIndex: 0
  total:
    type: regex
    expression: 'TOTAL{{Spaces}}({{Number}})'
    dataType: decimal
tables:
  - name: table1
    start:
      expression: 'Item{{Spaces}}Quantity{{Spaces}}Price{{Spaces}}Total'
    end:
      expression: TOTAL
    row:
      expression: '{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}})(?<quantity>{{Digits}}){{Spaces}}(?<unitPrice>{{Number}}){{Spaces}}(?<itemTotal>{{Number}})'
    columns:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: unitPrice
        type: decimal
      - name: itemTotal
        type: decimal
    multipage: true

Template Parameters

TemplatePriority

Templates are sorted and tried by templatePriority, then alphabetically. 0 - the highest priority, 999999 - the lowest.
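To illustrate the ordering, here is a sketch of two templates for the same vendor. The document designs, keywords, field names, and priority values are hypothetical; the only point is that the more specific template has the lower templatePriority number and is therefore tried before the generic one.

```yaml
---
templateVersion: 3
templatePriority: 0        # highest priority: tried first
sourceId: ACME Inc. Credit Note
detectionRules:
  keywords:
    - ACME Inc\.
    - Credit Note
fields:
  creditNoteNumber:
    expression: 'Credit Note No.: ({{LettersOrDigitsOrSymbols}})'
---
templateVersion: 3
templatePriority: 100      # lower priority: tried after the template above
sourceId: ACME Inc. Generic Document
detectionRules:
  keywords:
    - ACME Inc\.
fields:
  companyName:
    type: static
    expression: ACME Inc.
```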

SourceId

A name that identifies the design of the document. It is passed to the result unchanged.

Culture

Template culture that affects the detection of dates and decimal numbers. For example, if en-US culture is set, the parser will expect dates in month-day-year sequence, and decimal numbers with the dot as the decimal symbol and the comma as the digit grouping symbol. For fr-FR culture, the parser will expect dates in day-month-year sequence, and decimal numbers with the comma as the decimal symbol and the space as the digit grouping symbol. You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.

Example:

culture: fr-FR
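As a sketch of what the culture setting changes, the fragment below combines culture: fr-FR with one decimal field and one date field. The field names and the label texts are hypothetical, and the comments restate the behavior described above rather than verified parser output.

```yaml
culture: fr-FR
fields:
  total:
    # Under fr-FR, text such as "1 234,56" is expected to be read as the number 1234.56
    # (comma as the decimal symbol, space as the digit grouping symbol).
    expression: 'TOTAL{{Spaces}}({{Number}})'
    dataType: decimal
  invoiceDate:
    # Under fr-FR, "04/01/2018" is expected to be read as 4 January 2018
    # (day-month-year sequence).
    expression: 'Date{{Colon}}{{Spaces}}({{SmartDate}})'
    dataType: date
```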

DetectionRules

A few expressions that uniquely identify the document design. An expression can be an exact phrase or contain macros and regular expressions (Regex).

Example:

detectionRules:
  keywords:
    - ACME Inc\.
    - 'Invoice No:{{Spaces}}{{6Digits}}'
    - \[CONFIDENTIAL\]

DocumentStart

If your PDF file contains multiple documents to parse, the documentStart expression should mark the beginning of each new document in the file.

Example:

documentStart: TAX INVOICE

Fields

Standalone fields to extract. For example, invoice number, invoice date, etc.

Field parameters:

  • type - [optional] Type of the field.

    Valid values:

    • macros - (default) A field that contains macros or regular expression (Regex).
    • rectangle - Rectangular area of the document page to extract text from. The rectangle coordinates are defined in rectangle parameter. If used without the expression parameter, it will simply return the text extracted from the rectangle. If used with the expression parameter, the expression will only search within the text extracted from the rectangle.
    • static - Static text that will be passed to the result without changes.
    • structure - Virtual table structure field. The parser tries to reconstruct a tabular structure of the document page and allows you to specify coordinates of desired field in this structure. Use pageIndex and structureCoordinates parameters to specify the coordinates. Use Template Editor to select structure coordinates visually.
    • direction - Directional field. Finds a fixed keyword phrase and takes the Nth phrase from it as the value. Use the expression parameter to specify the keyword phrase, and the keywordOrdinalNumber and valueOrdinalNumber parameters to specify the criteria.

    Examples of fields of different types:

    fields:
      # macros field
      total:
        type: macros
        expression: 'TOTAL:{{Spaces}}({{Number}})'
        dataType: decimal
      # rectangle field
      billTo:
        type: rectangle
        rectangle:
          - 32.5
          - 64.5
          - 200
          - 100
        expression: 'Bill to:({{Anything}}){{LineEnd}}'
        pageIndex: 0
      # static field
      companyName:
        type: static
        expression: ACME Inc.
      # structure field
      structureField1:
        type: structure
        pageIndex: 0
        structureCoordinates:
          x: 2
          y: 4
      # directional field
      userSsn:
        type: direction
        expression: SSN
        keywordOrdinalNumber: 2
        valueDirection: right
        valueOrdinalNumber: 1
  • expression - Contains macros or a regular expression (Regex) defining the data to be searched and retrieved from the document.

    By default, the entire matched expression goes to the result. If you need only part of the matched expression in the result, enclose that part in parentheses.

    The part of the expression in parentheses is called a "group". An expression can contain multiple groups. To get only a specific group in the result, mark it with ?<value>. See the examples below.

    You can also use | symbol to construct logical "OR" expressions. See examples below.

    Field expression can also contain the name of a special function. Currently available special functions:

    • $$funcFindCompany - searches the document for the company name from a predefined list of known companies.
    • $$funcFindCompanyNext - searches for the next company name from a predefined list of known companies.
    • $$funcFindMaxDate - searches the document for the maximum date.
    • $$funcFindMinDate - searches the document for the minimum date.
    • $$funcFindMaxNumber - searches the document for the maximum number.

    Examples of expression parameter:

    fields:
      billTo:
        # The entire match will go to the result.
        expression: 'Bill to: {{SentenceWithSingleSpaces}}'
      invoiceNo:
        # Only the part in parentheses will go to the result.
        expression: 'Invoice No.: ({{DigitsOrSymbols}})'
      price:
        # Logical operation. The expression will match both $12.00 and €12.00.
        expression: '(?:{{Dollar}}|{{Euro}}){{Number}}'
      total:
        # Named group. Only the value of the <value> named group will go to the result.
        expression: 'Total:{{Spaces}}({{Dollar}}|{{Euro}})(?<value>{{Number}})'
      company:
        # Special function.
        expression: $$funcFindCompany
  • rectangle - [optional] coordinates of the extraction area for fields of the 'rectangle' type. The coordinates are specified as top, left, width, and height in PDF units Points (1 Point = 1/72").

    Example:

    fields:
      billTo:
        type: rectangle
        rectangle:
          - 10
          - 10
          - 200
          - 100
  • pageIndex - [optional] Zero-based page index to search the field in. Default is -1 (any page).

  • dataType - [optional] The expected datatype of the parsed value.

    Possible values:

    • string - used by default if the type is not specified; the matched Regex value will be passed to the result unchanged.
    • integer - the parser will try to convert the retrieved text to an integer number according to the template culture.
    • decimal - the parser will try to convert the retrieved text to a decimal number according to the template culture. See Note 1 below.
    • date - the retrieved text will be parsed as a date according to specified dateFormat or the template culture. See Note 2 below.
    • table - the special type used in conjunction with 'rectangle' field type. The data from the rectangle area will be extracted preserving the table structure.
  • dateFormat - [optional] The format string to parse the date. See Note 2 below.

  • outputDateFormat - [optional] Output date format. By default, successfully parsed date will be passed to the result in ISO 8601 format, e.g. 2018-01-04T00:00:00, but you can specify your own output format, e.g. yyyy-MM-dd.

  • rowMergingRule - [optional] defines a rule to merge multiline data in table cells. Used with 'rectangle' field type and 'table' data type. See rowMergingRule description in tables section.

  • coalesceWith - name of another field to coalesce with. If the specified field is not parsed, the current field will replace it. This is useful if you need to create two parsing criteria for some varying data and get them as a single field in the result. If the first field fails, the second will be used.

    Example: if field1 is not successfully parsed, field1a is used to replace field1 in the result:

    fields:
      field1:
        type: rectangle
        rectangle:
          - 10
          - 10
          - 100
          - 25
      field1a:
        type: rectangle
        rectangle:
          - 10
          - 50
          - 100
          - 25
        coalesceWith: field1
  • structureCoordinates - X and Y coordinates in the virtual table structure. You can use Template Editor to select structure coordinates visually.

    Example:

    # structure field
    structureField1:
      type: structure
      pageIndex: 0
      structureCoordinates:
        x: 2
        y: 4
  • keywordOrdinalNumber - For direction type fields. Ordinal number of the keyword phrase occurrence.

  • valueOrdinalNumber - For direction type fields. Ordinal number of the sentence to return as result. Sentence is a sequence of words separated by a single space.
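The optional parameters above can be combined on a single field. The sketch below uses only parameters from this list; the field names, label text, and rectangle coordinates are hypothetical values chosen for illustration.

```yaml
fields:
  # Searched on the first page only. The parsed date is written to the result
  # as yyyy-MM-dd instead of the default ISO 8601 form.
  invoiceDate:
    type: macros
    expression: 'Invoice Date: ({{SmartDate}})'
    pageIndex: 0
    dataType: date
    dateFormat: auto-MDY
    outputDateFormat: yyyy-MM-dd
  # Rectangle field extracted as a table, preserving the table structure.
  # Rows that contain only a single cell are joined to the previous row.
  lineItems:
    type: rectangle
    rectangle:
      - 120
      - 30
      - 550
      - 300
    pageIndex: 0
    dataType: table
    rowMergingRule: hangingRows
```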

Note 1: If you come across different number representations in the same document, you can override the template culture by appending new culture to the data type name. This single field will be parsed according to the specified culture.

Example:

dataType: decimal[fr-FR]

Note 2: The dateFormat and outputDateFormat can contain a format string defining the exact date format. Find the format string description here: https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.

Example:

dataType: date
dateFormat: MM-dd-yyyy

The dateFormat can also contain auto-format strings:

  • auto-MDY - the parser will try to detect the date format automatically, assuming the date is in month-day-year sequence.
  • auto-DMY - the parser will try to detect the date format automatically, assuming the date is in day-month-year sequence.
  • auto-YMD - the parser will try to detect the date format automatically, assuming the date is in year-month-day sequence.
  • auto - the parser will try to detect the format automatically, taking the date parts sequence from the template culture.

Example:

dataType: date
dateFormat: auto-DMY

Tables

This section defines the tabular data you need to extract. Tables can be defined by coordinates or by expressions that find the table start, the end, and rows. The tables section can contain multiple table definitions arranged as an array.

Table parameters:

  • name - table name to distinguish different tables in the result.

  • start - group of parameters that define the start of the table:

    • expression - macro expression to find the start of the table, or
    • y - the top coordinate of the table.
    • pageIndex - index of the page containing the y coordinate.
  • end - group of parameters that define the end of the table:

    • expression - macro expression to find the end of the table, or
    • y - the bottom coordinate of the table.
  • subItemStart - [optional] parameters that define the start of the table sub-item. Sub-items are used for tables with complex multiline rows:

    • expression - macro expression to find the start of the sub-item.
  • subItemEnd - [optional] parameters that define the end of the table sub-item:

    • expression - macro expression to find the end of the sub-item.
  • introduction - Parameters to parse values from sub-headers. Values parsed from the introduction expression will be repeated in the beginning of every row.

    • expression - macro expression to parse introduction items.
  • row - [optional] group of parameters that define table rows:

    • expression - the main macro expression to find a row. Named groups in this expression will go to the result table as columns. See example below.
    • subExpression1, subExpression2, subExpression3, subExpression4, subExpression5 - additional expressions to parse some remaining parts of row data which the main expression cannot parse in one pass. Sub-expressions are executed after the main expression for the text chunks between matches of the main expression. Can be used to parse hanging rows (wrapped multiline cells).
  • columns - [optional] array that defines column properties. Names of columns should correspond to the names of the capturing groups of the row expression.

    Column properties:

    • name - column name.
    • x - [optional] X coordinate of the left column edge in PDF Points.
    • type - [optional] Column data type: 'string', 'integer', 'date', or 'decimal' (see also the type descriptions in fields section).
    • dateFormat - [optional] See dateFormat description in fields section.
    • outputDateFormat - [optional] See outputDateFormat description in fields section.
    • coalesceWith - [optional] Name of column to merge the parsed value with.

    Example:

    columns:
      - name: exam
        x: 0
        type: string
      - name: examDate
        x: 100
        type: date
        dateFormat: auto-MDY
  • rowMergingRule - [optional] For the fields of rectangle type and table data type. Defines the rule to merge multiline data in table cells.

    Valid values:

    • none - default, no rule.
    • byBorders - combine lines within a table cell framed by border lines.
    • hangingRows - join a table row that contains only a single cell to the previous row if there is no separating line between them. Useful for tables without borders between rows.

    Example:

    rowMergingRule: byBorders
  • multipage - [optional] defines whether the table may continue on further pages.

    Example:

    multipage: true

Example of table parsing:

Description    Interval          Quantity    Amount ($)
Basic Plan     Jan 1 - Jan 31    1           25.00
Basic Plan     Feb 1 - Feb 28    1           25.00
Total in USD:                                50.00

The table above can be parsed with macro expressions or with explicitly defined column coordinates.

Macros approach:

tables:
  - name: table1
    start:
      # The table will start after the text "Amount ($)"
      expression: 'Amount{{Space}}{{OpeningParenthesis}}{{Dollar}}{{ClosingParenthesis}}'
    end:
      # The table will end before the text "Total in USD"
      expression: Total in USD
    row:
      # Groups <description>, <interval>, <quantity>, and <amount> will become columns in the result table.
      expression: '{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}})(?<interval>{{3Letters}}{{Space}}{{Digits}}{{Space}}{{Minus}}{{Space}}{{3Letters}}{{Space}}{{Digits}}){{Spaces}}(?<quantity>{{Digits}}){{Spaces}}(?<amount>{{Number}})'
    columns:
      # Suggest data types for table columns:
      - name: description
        type: string
      - name: interval
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: decimal

If the macros approach is not feasible for a complicated table, you can specify column coordinates explicitly. To determine the coordinates of a column visually, use the included Template Editor application, which shows the cursor coordinates in the toolbar.

Explicit column coordinates approach:

tables:
  - name: table1
    start:
      # The table will start below the text "Description Interval"
      expression: 'Description{{Spaces}}Interval'
    end:
      # The table will end above the text "Total in USD"
      expression: 'Total in USD'
    columns:
      # Suggest coordinates and data types for table columns
      - name: description
        x: 0
        type: string
      - name: interval
        x: 100
        type: string
      - name: quantity
        x: 150
        type: integer
      - name: amount
        x: 200
        type: decimal
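The start and end of a table can also be given as y coordinates instead of expressions, as listed in the table parameters. The sketch below assumes the sample table occupies the vertical range 250 to 400 on the first page; those numbers, like the column x positions, are made-up values that you would read from Template Editor for a real document.

```yaml
tables:
  - name: table1
    start:
      # Top coordinate of the table and the page that contains it (assumed values)
      y: 250
      pageIndex: 0
    end:
      # Bottom coordinate of the table (assumed value)
      y: 400
    columns:
      # Left edge and data type of each column (assumed x positions)
      - name: description
        x: 0
        type: string
      - name: interval
        x: 100
        type: string
      - name: quantity
        x: 150
        type: integer
      - name: amount
        x: 200
        type: decimal
```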

Options

Template options.

  • ocrLanguage - The language for Optical Character Recognition (OCR). Five base languages are supported out of the box. Dozens of other languages are also supported, as are mixed-language documents (two or more languages on the same page or in the same document); please contact us for help with setting up multiple languages.

    Valid values:

    • eng - English (default)
    • deu - German
    • fra - French
    • spa - Spanish
    • nld - Dutch

    Example:

    ocrLanguage: nld
  • ocrMode - The mode of the Optical Character Recognition (OCR):

    • auto - OCR is used only if a PDF document page contains no text, only raster images.
    • forced - Forces OCR to extract text from both images and fonts. Useful for PDF documents with mixed content (when a portion of the document text is drawn as an image).
    • repairFonts - Some PDF documents use embedded fonts with a customized charset, which makes text extraction impossible. This mode renders the entire document and extracts the text using OCR.

    Example:

    ocrMode: forced
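Put together, the two options might look like the fragment below. The parameter names and values come from the lists above; the enclosing options key is an assumption based on the section name, following the pattern of the fields and tables sections.

```yaml
# Assumed top-level key, named after this section
options:
  # Recognize German text
  ocrLanguage: deu
  # Run OCR even on pages that already contain text
  ocrMode: forced
```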

Template-level macros

Template-level (TL) macros let you define reusable blocks for use in the expression parameters of fields and tables.

A TL macro can contain built-in macros and regular expressions (Regex).

A TL macro can reuse macros defined above it in the template code.

In an expression, TL macros should be enclosed in double angle brackets: << >>.

Example:

templateMacros:
  # Detects "Yes" or "No" in the text
  yesOrNo: '(Yes|No)'
  # Detects "12/01/2019 - 12/31/2019" date range in the text
  dateRange: '{{DateMM/DD/YYYY}} {{Minus}} {{DateMM/DD/YYYY}}'
  # Detects 24h time "13:00" in the text
  time: '{{2Digits}}{{Colon}}{{2Digits}}'
  # Detects "08:00-17:00" time range in the text. Example of reusing the macro defined above.
  timeRange: '<<time>>{{Minus}}<<time>>'

# Example of use of template-level macros defined in templateMacros section
fields:
  answer:
    expression: 'Answer: (<<yesOrNo>>)'
  period:
    expression: '<<dateRange>>'
  workHours:
    expression: '<<timeRange>>'
  workHoursStart:
    expression: '(?<value><<time>>){{Minus}}<<time>>'
  workHoursEnd:
    expression: '<<time>>{{Minus}}(?<value><<time>>)'

APPENDIX 1: Macros.

Built-in macros:

Macro - Description
{{SmartDate}} - Tries to detect the date in the most common formats.
{{Number}} - Decimal number like the following: "12.34", "-123,456.78", "123.456". Decimal separator and thousands separator are automatically taken from the template culture.
{{Money}} - Decimal number with currency symbol like the following: "USD 12.34", "$123,456.78", "123.45 €". Decimal separator and thousands separator are automatically taken from the template culture.
{{Space}} - Single space.
{{Spaces}} - One or more spaces.
{{2Spaces}} - Two spaces.
{{3Spaces}} - Three spaces.
{{4Spaces}} - Four spaces.
{{5Spaces}} - Five spaces.
{{6Spaces}} - Six spaces.
{{7Spaces}} - Seven spaces.
{{8Spaces}} - Eight spaces.
{{9Spaces}} - Nine spaces.
{{10Spaces}} - Ten spaces.
{{Digit}} - One digit.
{{Digits}} - One or more digits.
{{2Digits}} - Two digits.
{{3Digits}} - Three digits.
{{4Digits}} - Four digits.
{{5Digits}} - Five digits.
{{6Digits}} - Six digits.
{{7Digits}} - Seven digits.
{{8Digits}} - Eight digits.
{{9Digits}} - Nine digits.
{{10Digits}} - Ten digits.
{{DigitOrSymbol}} - One digit or symbol ("_-+=/").
{{DigitsOrSymbols}} - One or more digits or symbols ("_-+=/").
{{2DigitsOrSymbols}} - Two digits or symbols ("_-+=/").
{{3DigitsOrSymbols}} - Three digits or symbols ("_-+=/").
{{4DigitsOrSymbols}} - Four digits or symbols ("_-+=/").
{{5DigitsOrSymbols}} - Five digits or symbols ("_-+=/").
{{6DigitsOrSymbols}} - Six digits or symbols ("_-+=/").
{{7DigitsOrSymbols}} - Seven digits or symbols ("_-+=/").
{{8DigitsOrSymbols}} - Eight digits or symbols ("_-+=/").
{{9DigitsOrSymbols}} - Nine digits or symbols ("_-+=/").
{{10DigitsOrSymbols}} - Ten digits or symbols ("_-+=/").
{{Letter}} - One letter from any language.
{{Letters}} - One or more letters from any language.
{{2Letters}} - Two letters from any language.
{{3Letters}} - Three letters from any language.
{{4Letters}} - Four letters from any language.
{{5Letters}} - Five letters from any language.
{{6Letters}} - Six letters from any language.
{{7Letters}} - Seven letters from any language.
{{8Letters}} - Eight letters from any language.
{{9Letters}} - Nine letters from any language.
{{10Letters}} - Ten letters from any language.
{{UppercaseLetter}} - One uppercase letter from any language.
{{UppercaseLetters}} - One or more uppercase letters from any language.
{{2UppercaseLetter}} - Two uppercase letters from any language.
{{3UppercaseLetter}} - Three uppercase letters from any language.
{{4UppercaseLetter}} - Four uppercase letters from any language.
{{5UppercaseLetter}} - Five uppercase letters from any language.
{{6UppercaseLetter}} - Six uppercase letters from any language.
{{7UppercaseLetter}} - Seven uppercase letters from any language.
{{8UppercaseLetter}} - Eight uppercase letters from any language.
{{9UppercaseLetter}} - Nine uppercase letters from any language.
{{10UppercaseLetter}} - Ten uppercase letters from any language.
{{LetterOrDigit}} - One letter or digit.
{{LettersOrDigits}} - One or more letters or digits.
{{2LettersOrDigits}} - Two letters or digits.
{{3LettersOrDigits}} - Three letters or digits.
{{4LettersOrDigits}} - Four letters or digits.
{{5LettersOrDigits}} - Five letters or digits.
{{6LettersOrDigits}} - Six letters or digits.
{{7LettersOrDigits}} - Seven letters or digits.
{{8LettersOrDigits}} - Eight letters or digits.
{{9LettersOrDigits}} - Nine letters or digits.
{{10LettersOrDigits}} - Ten letters or digits.
{{UppercaseLetterOrDigit}} - One uppercase letter or digit.
{{UppercaseLettersOrDigits}} - One or more uppercase letters or digits.
{{2UppercaseLettersOrDigits}} - Two uppercase letters or digits.
{{3UppercaseLettersOrDigits}} - Three uppercase letters or digits.
{{4UppercaseLettersOrDigits}} - Four uppercase letters or digits.
{{5UppercaseLettersOrDigits}} - Five uppercase letters or digits.
{{6UppercaseLettersOrDigits}} - Six uppercase letters or digits.
{{7UppercaseLettersOrDigits}} - Seven uppercase letters or digits.
{{8UppercaseLettersOrDigits}} - Eight uppercase letters or digits.
{{9UppercaseLettersOrDigits}} - Nine uppercase letters or digits.
{{10UppercaseLettersOrDigits}} - Ten uppercase letters or digits.
{{LetterOrDigitOrSymbol}} - One letter, or digit, or symbol ("_-+=/").
{{LettersOrDigitsOrSymbols}} - One or more letters, or digits, or symbols ("_-+=/").
{{2LettersOrDigitsOrSymbols}} - Two letters, or digits, or symbols ("_-+=/").
{{3LettersOrDigitsOrSymbols}} - Three letters, or digits, or symbols ("_-+=/").
{{4LettersOrDigitsOrSymbols}} - Four letters, or digits, or symbols ("_-+=/").
{{5LettersOrDigitsOrSymbols}} - Five letters, or digits, or symbols ("_-+=/").
{{6LettersOrDigitsOrSymbols}} - Six letters, or digits, or symbols ("_-+=/").
{{7LettersOrDigitsOrSymbols}} - Seven letters, or digits, or symbols ("_-+=/").
{{8LettersOrDigitsOrSymbols}} - Eight letters, or digits, or symbols ("_-+=/").
{{9LettersOrDigitsOrSymbols}} - Nine letters, or digits, or symbols ("_-+=/").
{{10LettersOrDigitsOrSymbols}} - Ten letters, or digits, or symbols ("_-+=/").
{{UppercaseLetterOrDigitOrSymbol}} - One uppercase letter, or digit, or symbol ("_-+=/").
{{UppercaseLettersOrDigitsOrSymbols}} - One or more uppercase letters, or digits, or symbols ("_-+=/").
{{2UppercaseLettersOrDigitsOrSymbols}} - Two uppercase letters, or digits, or symbols ("_-+=/").
{{3UppercaseLettersOrDigitsOrSymbols}} - Three uppercase letters, or digits, or symbols ("_-+=/").
{{4UppercaseLettersOrDigitsOrSymbols}} - Four uppercase letters, or digits, or symbols ("_-+=/").
{{5UppercaseLettersOrDigitsOrSymbols}} - Five uppercase letters, or digits, or symbols ("_-+=/").
{{6UppercaseLettersOrDigitsOrSymbols}} - Six uppercase letters, or digits, or symbols ("_-+=/").
{{7UppercaseLettersOrDigitsOrSymbols}} - Seven uppercase letters, or digits, or symbols ("_-+=/").
{{8UppercaseLettersOrDigitsOrSymbols}} - Eight uppercase letters, or digits, or symbols ("_-+=/").
{{9UppercaseLettersOrDigitsOrSymbols}} - Nine uppercase letters, or digits, or symbols ("_-+=/").
{{10UppercaseLettersOrDigitsOrSymbols}} - Ten uppercase letters, or digits, or symbols ("_-+=/").
{{Dollar}} - Dollar sign ($).
{{Euro}} - Euro sign (€).
{{Pound}} - Pound sign (£).
{{Yen}} - Yen sign (¥).
{{Yuan}} - Yuan sign (¥).
{{CurrencySymbol}} - Any currency symbol ($, €, £, ¥, etc.).
{{Dot}} - Single dot symbol (".").
{{Comma}} - Single comma symbol (",").
{{Colon}} - Single colon symbol (":").
{{Semicolon}} - Single semicolon symbol (";").
{{Minus}} - Single minus (dash, hyphen) symbol ("-").
{{Slash}} - Slash symbol ("/").
{{Backslash}} - Backslash symbol ("\").
{{Percent}} - Percent symbol ("%").
{{LineStart}} - Start of line (virtual symbol).
{{LineEnd}} - End of line (virtual symbol).
{{SentenceWithSingleSpaces}} - Single-space-separated sequence of words and symbols. Breaks on double space.
{{SentenceWithDoubleSpaces}} - Extended {{SentenceWithSingleSpaces}} macro allowing two spaces between words. Breaks on triple space.
{{EndOfPage}} - End of page or end of document.
{{WordBoundary}} - Start or end of word (virtual symbol).
{{OpeningCurlyBrace}} - Opening curly brace symbol ("{").
{{ClosingCurlyBrace}} - Closing curly brace symbol ("}").
{{OpeningParenthesis}} - Opening parenthesis symbol ("(").
{{ClosingParenthesis}} - Closing parenthesis symbol (")").
{{OpeningSquareBracket}} - Opening square bracket symbol ("[").
{{ClosingSquareBracket}} - Closing square bracket symbol ("]").
{{OpeningAngleBracket}} - Opening angle bracket symbol ("<").
{{ClosingAngleBracket}} - Closing angle bracket symbol (">").
{{DateMM/DD/YY}} - Date in format "01/01/19" (with leading zero).
{{DateM/D/YY}} - Date in format "1/1/19" (without leading zero).
{{DateMM/DD/YYYY}} - Date in format "01/01/2019" (with leading zero).
{{DateM/D/YYYY}} - Date in format "1/1/2019" (without leading zero).
{{DateMM-DD-YY}} - Date in format "01-01-19" (with leading zero).
{{DateM-D-YY}} - Date in format "1-1-19" (without leading zero).
{{DateMM-DD-YYYY}} - Date in format "01-01-2019" (with leading zero).
{{DateM-D-YYYY}} - Date in format "1-1-2019" (without leading zero).
{{DateMM.DD.YY}} - Date in format "01.01.19" (with leading zero).
{{DateM.D.YY}} - Date in format "1.1.19" (without leading zero).
{{DateMM.DD.YYYY}} - Date in format "01.01.2019" (with leading zero).
{{DateM.D.YYYY}} - Date in format "1.1.2019" (without leading zero).
{{DateDD/MM/YY}} - Date in format "01/01/19" (with leading zero).
{{DateD/M/YY}} - Date in format "1/1/19" (without leading zero).
{{DateDD/MM/YYYY}} - Date in format "01/01/2019" (with leading zero).
{{DateD/M/YYYY}} - Date in format "1/1/2019" (without leading zero).
{{DateDD-MM-YY}} - Date in format "01-01-19" (with leading zero).
{{DateD-M-YY}} - Date in format "1-1-19" (without leading zero).
{{DateDD-MM-YYYY}} - Date in format "01-01-2019" (with leading zero).
{{DateD-M-YYYY}} - Date in format "1-1-2019" (without leading zero).
{{DateDD.MM.YY}} - Date in format "01.01.19" (with leading zero).
{{DateD.M.YY}} - Date in format "1.1.19" (without leading zero).
{{DateDD.MM.YYYY}} - Date in format "01.01.2019" (with leading zero).
{{DateD.M.YYYY}} - Date in format "1.1.2019" (without leading zero).
{{DateYYYYMMDD}} - Date in format "20190101".
{{DateYYYY/MM/DD}} - Date in format "2019/01/01" (with leading zero).
{{DateYYYY/M/D}} - Date in format "2019/1/1" (without leading zero).
{{DateYYYY-MM-DD}} - Date in format "2019-01-01" (with leading zero).
{{DateYYYY-M-D}} - Date in format "2019-1-1" (without leading zero).
{{Anything}} - Any characters up to the next macro in the expression.
{{AnythingGreedy}} - Any characters up to the next macro in the expression or to the end of line. Greedy version.
{{ToggleSingleLineMode}} - Enables or disables single-line mode. In single-line mode, {{Anything}} and {{AnythingGreedy}} macros do not stop at the end of the line and proceed to the next line of text.
{{ToggleCaseInsensitiveMode}} - Enables or disables case-insensitive mode.
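
As a short illustration of how the built-in macros combine, the fields below use only macros from the table above. The field names and the label texts they search for (Tax, Amount Due, Order Ref) are hypothetical, and the comments describe the intended matches rather than verified parser output.

```yaml
fields:
  taxRate:
    # Intended to match text such as "Tax 8.5%" and capture the number before the percent sign
    expression: 'Tax{{Spaces}}({{Number}}){{Percent}}'
    dataType: decimal
  amountDue:
    # Intended to match "Amount Due: $1,250.00" (or another currency symbol) and capture the number only
    expression: 'Amount Due{{Colon}}{{Spaces}}{{CurrencySymbol}}({{Number}})'
    dataType: decimal
  orderRef:
    # Intended to match a reference such as "AB-2019-001234":
    # uppercase letters, a four-digit year, and six digits, separated by hyphens
    expression: 'Order Ref{{Colon}}{{Spaces}}({{UppercaseLetters}}{{Minus}}{{4Digits}}{{Minus}}{{6Digits}})'
```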

APPENDIX 2: Sample templates.

Sample 1.

Sample document text:

DigitalOcean
101 Avenue of the Americas, 10th Floor
New York, NY 10013

Date Issued: February 1, 2016
Period: January 1 - 31, 2016
Invoice Number: 1234567

Description            Hours    Start          End            USD
Website-Dev (1GB)      744      01-01 00:00    01-31 23:59    $10.00
Website-Live (1GB)     744      01-01 00:00    01-31 23:59    $10.00
Database-Live (2GB)    744      01-01 00:00    01-31 23:59    $20.00
Tasks-Dev (1GB)        744      01-01 00:00    01-31 23:59    $10.00
Total: $50.00

Bill To:
Samee Sikka <admin@meee.org>
meee.org
Gouran

If you have a credit card on file it will be automatically charged within 24 hours.

Sample template (YAML):

---
templateVersion: 3
templatePriority: 0
sourceId: DigitalOcean Invoice
detectionRules:
  keywords:
    # Template will match documents containing the following phrases:
    - 'DigitalOcean'
    - '101 Avenue of the Americas'
    - 'Invoice Number'
fields:
  # Static field that will output "DigitalOcean" to the result
  companyName:
    type: static
    expression: DigitalOcean
  # Macro field that will find the text "Invoice Number: 1234567" and return "1234567" to the result
  invoiceId:
    type: macros
    expression: 'Invoice Number: ({{Digits}})'
  # Macro field that will find the text "Date Issued: February 1, 2016" and return the date "February 1, 2016" in ISO format to the result
  dateIssued:
    type: macros
    expression: 'Date Issued: ({{SmartDate}})'
    dataType: date
    dateFormat: auto-MDY
  # Macro field that will find the text "Total: $50.00" and return "50.00" to the result
  total:
    type: macros
    expression: 'Total: {{Dollar}}({{Number}})'
    dataType: decimal
  # Static field that will output "USD" to the result
  currency:
    type: static
    expression: USD
tables:
  - name: table1
    # The table will start after the text "Description Hours"
    start:
      expression: 'Description{{Spaces}}Hours'
    # The table will end before the text "Total:"
    end:
      expression: 'Total:'
    # Macro expression that will find table rows "Website-Dev (1GB) 744 01-01 00:00 01-31 23:59 $10.00", etc.
    row:
      # Groups <description>, <hours>, <start>, <end> and <unitPrice> will become columns in the result table.
      expression: '{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}}){{Spaces}}(?<hours>{{Digits}}){{Spaces}}(?<start>{{2Digits}}{{Minus}}{{2Digits}}{{Space}}{{2Digits}}{{Colon}}{{2Digits}}){{Spaces}}(?<end>{{2Digits}}{{Minus}}{{2Digits}}{{Space}}{{2Digits}}{{Colon}}{{2Digits}}){{Spaces}}{{Dollar}}(?<unitPrice>{{Number}})'
    # Suggest data types for table columns (missing columns will have the default "string" type):
    columns:
      - name: hours
        type: integer
      - name: unitPrice
        type: decimal

Result (JSON):

{
  "templateId": "DigitalOcean.yml",
  "templateVersion": "3",
  "sourceId": "DigitalOcean Invoice",
  "fields": {
    "companyName": { "value": "DigitalOcean" },
    "invoiceId": { "value": "1234567", "pageIndex": 0 },
    "dateIssued": { "value": "2016-02-01T00:00:00", "pageIndex": 0 },
    "total": { "value": 50.00, "pageIndex": 0 },
    "currency": { "value": "USD" }
  },
  "tables": [
    {
      "name": "table1",
      "rows": [
        {
          "description": { "value": "Website-Dev (1GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 10.00, "pageIndex": 0 }
        },
        {
          "description": { "value": "Website-Live (1GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 10.00, "pageIndex": 0 }
        },
        {
          "description": { "value": "Database-Live (2GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 20.00, "pageIndex": 0 }
        },
        {
          "description": { "value": "Tasks-Dev (1GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 10.00, "pageIndex": 0 }
        }
      ]
    }
  ]
}

Copyright © 2016 - 2023 ByteScout