Class THtmlTemplateParser
Unit
Declaration
type THtmlTemplateParser = class(TObject)
Description
This is the pattern matching processor class, which can apply a pattern to one or more HTML documents.
You can use it by calling the methods parseTemplate and parseHTML. parseTemplate loads a certain pattern and parseHTML matches the pattern to an HTML/XML file.
A pattern file is just like an HTML file with special commands (it used to be called template file). The parser then matches every text and tag of the pattern to text/tag in the HTML file, while ignoring every additional data in the latter file. If no match is possible, an exception is raised.
The pattern can extract certain values from the HTML file into variables, and you can access these variables with the properties variables and variableChangeLog. The former only contains the final value of the variables, the latter records every assignment during the matching of the pattern.
Getting started
Creating a template to analyze an XML-file/webpage:
If you want to read several elements like table rows, you need to surround the matching tags with template:loop, e.g. <template:loop><tr>..</tr></template:loop>, and the things between the loop-tags are repeated as long as possible. You can also use the short notation by adding a star like <tr>..</tr>*.
Using the templates from Pascal:
First, create a new THtmlTemplateParser: parser := THtmlTemplateParser.create()
Load the template with parser.parseTemplate('..template..') or parser.parseTemplateFile('template-file')
Process the webpage with parser.parseHTML('..html..') or parser.parseHTMLFile('html-file')
Read the result of variable yourVariableName through parser.variables.values['yourVariableName']
If you used loops, only the last value of the variable is available in the variables property; the previous values can be enumerated through variableChangeLog.
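The steps above can be put together in a small program. This is only a minimal sketch: the unit name extendedhtmlparser is inferred from the test file name mentioned in this page, and the .toString call assumes that variables.values[...] returns an IXQValue (if it returns a plain string in your version, drop that call).

```pascal
program TemplateDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, inferred from tests/extendedhtmlparser_tests.pas

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create();
  try
    // Load the pattern: read the text of the first <b> element into $test.
    parser.parseTemplate('<b>{$test}</b>');
    // Match the pattern against a document.
    parser.parseHTML('<b>Hello World!</b>');
    // Read the final value of the variable.
    // Assumption: values[...] yields an IXQValue that offers toString.
    writeln(parser.variables.values['test'].toString);
  finally
    parser.free;
  end;
end.
```

The try/finally block only makes sure that the parser is freed even if the matching raises an exception.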
Template examples
- Example, how to read the first <b>-tag:
Html-File:
<b>Hello World!</b>
Template:
<b>{.}</b>
This will set the default variable _result to "Hello World!".
- Example, how to read the first <b>-tag in an explicitly named variable:
Html-File:
<b>Hello World!</b>
Template:
<b>{$test}</b>
This will set the variable test to "Hello World!".
Some alternative forms are <b>{$test := .}</b>, <b><t:s>test := .</t:s></b>, <b><template:s>test := text()</template:s></b> or <b><t:read var="test" source="text()"></b>.
- Example, how to read all <b>-tags:
Html-File:
<b>Hello </b><b>World!</b>
Template:
<b>{.}</b>*
This will change the value of the variable _result twice, to "Hello " and "World!". Both values are available in the variable changelog.
Some alternative forms are: <t:loop><b>{.}</b></t:loop>, <template:loop><b>{.}</b></template:loop>, <template:loop><b>{_result := text()}</b></template:loop>, ...
- Example, how to read the first field of every row of a table:
HTML-File:
<table> <tr> <td> row-cell 1 </td> </tr> <tr> <td> row-cell 2 </td> </tr> ... <tr> <td> row-cell n </td> </tr> </table>
Template:
<table> <template:loop> <tr> <td> {$field} </td> </tr> </template:loop> </table>
This will read row after row, and will write each first field to the change log of the variable field.
- Example, how to read several fields of every row of a table:
HTML-File:
<table> <tr> <td> a </td> <td> b </td> <td> c </td> </tr> ... </tr> </table>
Template:
<table> <template:loop> <tr> <td> {$field1} </td> <td> {$field2} </td> <td> {$field3} </td> ... </tr> </template:loop> </table>
This will read $field1=a, $field2=b, $field3=c ...
If you now want to process multiple pages which have a similar, but slightly different table/data layout, you can create a template for each of them, and the Pascal side of the application is independent of the source pages. Then it is even possible for the user of the application to add new pages.
- Example, how to read all elements between two elements:
HTML-File:
<h1>Start</h1> <b>Text 1</b> <b>Text 2</b> <h1>End</h1>
Template:
<h1>Start</h1> <b>{.}</b>* <h1>End</h1>
This will read all b elements between the two headers.
- Example, how to read the first list item starting with a unary prime number:
HTML-File:
... <li>1111: this is 4</li><li>1:1 is no prime</li><li>1111111: here is 7</li><li>11111111: 8</li> ...
Template:
<li template:condition="filter(text(), '1*:') != filter(text(), '^1?:|^(11+?)\1+:')">{$prime}</li>
This will return "1111111: here is 7", because 1111111 is the first prime in that list.
See the unit tests in tests/extendedhtmlparser_tests.pas for more examples.
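The loop examples above write several values to the change log. The following sketch shows one way to enumerate them from Pascal, using the table template from the examples. The methods count, getName and get on variableChangeLog, and toString on the returned value, are assumptions about the TXQVariableChangeLog class that this page does not document; check them against your version of the xquery unit.

```pascal
program ChangeLogDemo;
{$mode objfpc}{$H+}

uses
  extendedhtmlparser; // assumed unit name, inferred from tests/extendedhtmlparser_tests.pas

var
  parser: THtmlTemplateParser;
  i: integer;
begin
  parser := THtmlTemplateParser.create();
  try
    // Read the first cell of every table row into $field.
    parser.parseTemplate(
      '<table><template:loop><tr><td>{$field}</td></tr></template:loop></table>');
    parser.parseHTML(
      '<table><tr><td>row-cell 1</td></tr><tr><td>row-cell 2</td></tr></table>');
    // The change log records every assignment, not only the last one.
    // Assumption: count/getName/get(i) exist with these signatures.
    for i := 0 to parser.variableChangeLog.count - 1 do
      writeln(parser.variableChangeLog.getName(i), ' = ',
              parser.variableChangeLog.get(i).toString);
  finally
    parser.free;
  end;
end.
```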
Why not XPath/CSS-Selectors?
You might wonder why you should use templates if you already know XPath or CSS Selectors.
The answer is that, although XPath/CSS works fine for single values, it is not powerful enough to read multiple values or data from multiple sources, because:
XPath/CSS expressions are not able to return multiple values.
Each expression can only return a single node set, so if you need to read m different values from n different pages, you need O(m * n) expressions, while you only need O(n) templates.
For example, if you need to read a table listing objects and 2 values for each of them, like in this table:
<table><tr><td>name</td><td>value 1</td><td>value 2</td></tr></table>
you can use this template:
<table><tr><td>{$name}</td><td>{$value1}</td><td>{$value2}</td></tr>*</table>
and get three arrays with the needed values.
With XPath you would need three expressions:
names := ... //table/tr/td[1] ...;
values1 := ... //table/tr/td[2] ...;
values2 := ... //table/tr/td[3] ...;
Or CSS:
names := ... table tr td:nth-child(1) ...;
values1 := ... table tr td:nth-child(2) ...;
values2 := ... table tr td:nth-child(3) ...;
XPath is not suited to process html.
XPath was made to process xml, not html, so some important functions are missing.
One of the most common actions of web scraping is to select (e.g. div) elements based on their classes. Novices think this can be written as //div[@class = "foobar"], but this is wrong, because the class attribute can list multiple classes. And the correct XPath expression //div[contains(concat(" ", @class, " "), " foobar ")] is very ugly.
Templates know the semantic meaning of attributes, so you can just use <div class="foobar"></div>.
Normal XPath is also case-sensitive, while html is case-insensitive, so whether the expression works at all depends on the parser changing the case of all tags.
You might see this as a reason to use CSS selectors, but:
CSS Selectors are not able to process the data
CSS only selects the elements and cannot change their values.
E.g. if you need to parse numbers from two pages, one of them using the American format 123,456.00 and the other one the European format 123.456,00, you cannot use CSS selectors to parse them both without changing something in the host language.
With templates you can use {.} and {translate(., ".,", ",.")} and are done.
Templates can be written much faster.
Because you do not need to write them at all and instead just copy them from the input page.
E.g. in the example above, to create a template for the webpage <table><tr><td>name</td><td>value 1</td><td>value 2</td></tr></table> you just need to insert some {}* and get the complete template <table><tr><td>{$name}</td><td>{$value1}</td><td>{$value2}</td></tr>*</table>.
To get the XPath-expressions /table/tr/td[1,2,3] you actually need to look at the structure of the page.
Of course the table example is trivial; only on more complex examples can you see how powerful the templates actually are:
Let us assume the data is not nicely packed in a table, but contained in a formatted text, like:
<b>name a</b>: value-a1, value-a2<br>
<b>name b</b>: value-b1, value-b2<br> ...
The template is a little bit more complex, since you need to split the values:
<t:loop><b>{$name}</b>: <t:s>value1 := extract(text(), ":(.+),", 1), value2 := extract(text(), ":(.+),(.+)", 2)</t:s><br/></t:loop>
However, if you want to solve this task with XPath 1.0 or CSS, you will discover that it is impossible. CSS cannot select the text nodes at all, and XPath 1 cannot split them.
The best you can manage is to select the values with XPath and then split them in the host language, but then you cannot parse multiple different sources by swapping the expressions.
And although XPath 2 or 3 can split the values, it becomes rather ugly:
names := //b,
values1 := //b/substring-after(following-sibling::text()[1], ":")
values2 := //b/substring-after(following-sibling::text()[1], ",")
Another example is if you just need the data from a part of the page, e.g. between two headers like here:
not needed ... <h1>Header 1</h1> <b>name a</b>: value-a1, value-a2<br> <b>name b</b>: value-b1, value-b2<br> <h1>Header 2</h1> ... not needed
The template change is trivial, you just add both headers to the template:
<h1>Header 1</h1>
<t:loop><b>{$name}</b>: <t:s>value1 := extract(matched-text(), ":(.+),", 1), value2 := extract(matched-text(), ":(.+),(.+)", 2)</t:s><br></t:loop>
<h1>Header 2</h1>
How to do it in XPath? (That is, in XPath 2; it is of course still impossible with XPath 1.)
Well, it gets just crazy:
names := //h1[. = "Header 1"]/following-sibling::b[following-sibling::h1[1] = "Header 2"],
values1 := //h1[. = "Header 1"]/following-sibling::b[following-sibling::h1[1] = "Header 2"]/substring-after(following-sibling::text()[1], ":")
values2 := //h1[. = "Header 1"]/following-sibling::b[following-sibling::h1[1] = "Header 2"]/substring-after(following-sibling::text()[1], ",")
Multiple XPath/CSS expressions are not adaptable to changes
If the page layout changes, you need to rewrite all the expressions. With templates, you just need to apply the local change.
E.g. if you want to get multiple data from the last div on this page:
<div id="foobar"> ... <div class="abc">...</div> <div> <b> .. data 1 .. </b> <i> .. data 2 .. </i> </div> </div>
The template would be
<div id="foobar"> <div class="abc"/> <div> <b>{$data1}</b> <i>{$data2}</i> </div> </div>
If you do it with XPath, you have two expressions:
data1 := ... //div[@id="foobar"]/div[@class = "abc"]/following-sibling::div/b ...
data2 := ... //div[@id="foobar"]/div[@class = "abc"]/following-sibling::div/i ...
Now, if the page layout is changed to e.g.
<div id="foobar"> ... <div class="def">...</div> <div> ... </div> </div>
you get a diff
- <div class="abc">...</div>
+ <div class="def">...</div>
which can basically be applied directly to the template and leads to:
<div id="foobar"> <div class="def"/> <div> <b>{$data1}</b> <i>{$data2}</i> </div> </div>
But using XPath expressions, you need to change multiple expressions and you have to look at each expression to find the correct div class to change:
data1 := ... //div[@id="foobar"]/div[@class = "def"]/following-sibling::div/b ...
data2 := ... //div[@id="foobar"]/div[@class = "def"]/following-sibling::div/i ...
XPath/CSS cannot handle errors
XPath/CSS do not provide any information in case the query fails.
E.g. if you use
//table[@id="foobar"]/tr
to get all rows of a table, and it returns 0 rows, you do not know whether the table was actually empty, or the page layout changed and the table does not exist anymore, or you use a new html parser which (correctly) inserts a tbody element between the table and tr.
But if you use a template
<table id="foobar"><tr>{.}*</tr></table>
and it returns anything, it is guaranteed that the table exists, since it raises an exception in case it is missing.
Metaphor: XPath/CSS are like string functions, templates are like regular expressions
If you write XPath/CSS expressions, you give an explicit list of instructions, i.e. you write /foo to get all foo-children, you write [bar] to filter all elements that have a bar child, you write .. to get the parent, you write [position() <= 10] to take the first ten elements...
This is exactly the same concept as if you write e.g.
copy(s, pos('foo', s) + 3, 10)
to find the 'foo' substring and then take the next 10 characters.
But you would never do that nowadays, if you can use a regular expression like
'foo(.{10})'
Such a regular expression now implicitly selects the characters after foo, just like a template
<foo/>{matched-text()}
selects the text after a foo-element.
That said, it is obviously also possible to use XPath or CSS with the templates:
<html>{//your/xpath/expression}</html>
or <html>{css("your.css#expression")}</html>
In fact there exists no other modern XPath/CSS interpreter for FreePascal.
Template reference
Basically the template file is an HTML file, and the parser tries to match the structure of the template html file to the html file.
A tag of the html file is considered as equal to a tag of the template file, if the tag names are equal, all attributes are the same (regardless of their order) and every child node of the tag in the template is also equal to a child node of the tag in the html file (in the same order and nesting).
Text nodes are considered as equal, if the text in the html file starts with the whitespace trimmed text of the template file. All comparisons are performed case insensitive.
The matching occurs with backtracking, so it will always find the first and longest match.
The following template commands can be used:
<template:s>var:=source</template:s>
Short form of template:read. The expression in source is evaluated and assigned to the variable var.
You can also set several variables like a:=1,b:=2,c:=3 (Remark: The := is actually part of the expression syntax, so you can use much more complex expressions.)
<template:if test="??"> .. </template:if>
Everything inside this tag is only used if the XPath-expression in test evaluates to true.
<template:else [test="??"]> .. </template:else>
Everything inside this tag is only used if the immediately preceding if/else block was not executed.
You can chain several else blocks that have test attributes together after a starting if, to create an if-else chain, in which only one if or else block is used.
E.g.: <template:if test="$condition">..</template:if><template:else test="$condition2">..</template:else><template:else>..</template:else>
<template:loop [min="?"] [max="?"]> .. </template:loop>
Everything inside this tag is repeated between [min,max] times. (default min=0, max=infinity)
E.g. if you write <template:loop> X </template:loop>, it has the same effect as XXXXX with the largest possible count of X <= max for a given html file.
If min=0 and there is no possible match for the loop interior, the loop is completely ignored.
If there are more possible matches than max, they are ignored.
<template:switch [value="??"]> ... </template:switch>
This command can be used to match only one of several possibilities. It has two different forms:
Case 1: All direct child elements are template commands:
Then the switch statement will choose the first child command whose attribute test evaluates to true.
Additionally, if one of the child elements has an attribute value, the expressions of the switch and the child value attribute are evaluated, and the command is only chosen if both expressions are equal.
An element that has neither a value nor a test attribute is always chosen (if no element before it is chosen).
If no child can be chosen at the current position in the html file, the complete switch statement will be skipped.
Case 2: All direct child elements are normal html tags:
This tag is matched to an html tag, iff one of its direct children can be matched to that html tag.
For example <template:switch><a>..</a> <b>..</b></template:switch> will match either <a>..</a> or <b>..</b>, but not both. If there is an <a> and a <b> tag in the html file, only the first one will be matched (if there is no loop around the switch tag). These switch-constructs are mainly used within a loop to collect the values of different tags, or to combine two different templates.
If no child can be matched at the current position in the html file, the matching will be tried again at the next position (different to case 1).
<template:switch prioritized="true"> ... </template:switch>
Another version of a case 2 switch statement that may only contain normal html tags.
The switch-prioritized prefers earlier child elements to later child elements, while the normal switch matches all child elements equally. So a normal switch containing <a> and <b> will match <a> or <b>, whichever appears first in the html file. The switch-prioritized, in contrast, would match <a> if there is any <a>, and <b> only if there is no <a> in the html file.
Therefore <template:switch-prioritized [value="??"]> <a>..</a> <b>..</b> .. </template:switch-prioritized> is identical to <a template:optional="true">..<t:s>found:=true()</t:s></a> <b template:optional="true" template:test="not($found)">..<t:s>found:=true()</t:s></b> ...
(this used to be called <template:switch-prioritized>, which is still supported, but will be removed in future versions)
<template:match-text [matches=".."] [starts-with=".."] [ends-with=".."] [contains=".."] [eq=".."] [case-sensitive=".."] [list-contains=".."]/>
Matches a text node and is more versatile than just including the text in the template.
matches: matches an arbitrary regular expression against the text node.
starts-with/ends-with/contains/eq: check the text verbatim against the text node, in the obvious way.
list-contains: treats the text of the node as a comma separated list and tests if that list contains the attribute value.
case-sensitive: enables case-sensitive comparisons.
(older versions used regex/is instead of matches/eq, which are now deprecated and will be removed in future versions)
<template:element> .. </template:element>
Matches any element.
It is handled like an element without t: prefix, but skips the name test. E.g. if either <a> or <b> should be allowed, you can use <t:element t:condition="name() = ('a', 'b')"> rather than listing both.
<template:siblings-header [id=".."]> .. </template:siblings-header> <template:siblings [id=".."]>..</template:siblings>
These two commands connect elements in different parts of the matched document, and reorder the elements in the siblings command to the same order matched by the header command. The children of these two commands are associated by their order, i.e. the first child of the header command is associated with the first child of the sibling (replay) command. For example, the columns of a table can be associated with the columns in the table header, such that the pattern can match any order of those columns:
<table><thead><t:siblings-header><th>myheader1</th><th>myheader2</th>..</t:siblings-header></thead> <tr><t:siblings><td>..</td><td>..</td></t:siblings></tr></table>
Here you can swap the two columns in the document without affecting the matching (except for the order of the output variables).
The command allows loop counters and optional elements in the header. If a header element can match multiple headers in the document, the corresponding element in the replay command is duplicated accordingly.
Sibling elements in the header that cannot be matched by the pattern are ignored. Similarly, additional elements are ignored during the replay, but the elements of the document ignored are not necessarily in the same order in both cases, since the sibling command only reorders its children. E.g. if you use the above example to match table columns and apply it to a document with an additional column between the two headers, the sibling command will still match the first two columns of the table (unless the second <td> cannot match the new column, but in this case you might want to use t:switch). You can put an additional <th/>* and <td/> at the end of the header/replay command to prevent this. The th will match any unexpected column and the td will skip it during replay.
It is currently not backtracked and the header command will swallow all the siblings it can. If it has swallowed too many, the matching will fail, even if it could have succeeded had the header skipped some optional elements.
<template:meta [text-matching="??"] [case-sensitive="??"] [attribute-text-matching="??"] [attribute-case-sensitive="??"]/>
Specifies meta information to change the template semantic:
text-matching: specifies how text nodes in the template are matched against html text nodes. You can set it to the allowed attributes of match-text. (default is "starts-with")
text-case-sensitive: specifies if text nodes are matched case-sensitively.
attribute-matching: like text-matching for the values of attribute nodes (note that this currently affects all attributes in the template; future versions will only change it for following elements)
attribute-case-sensitive: like text-case-sensitive for the values of attribute nodes (note that this currently affects all attributes in the template; future versions will only change it for following elements)
<template:meta-attribute [name="??"] [text-matching="??"] [case-sensitive="??"]/>
Like meta for all attributes with a certain name.
<template:read var="??" source="??" [regex="??" [submatch="??"]]/>
This is deprecated.
The xquery.TXQueryEngine expression in source is evaluated and stored in the variable named by var.
If a regex is given, only the matching part is saved. If submatch is given, only the submatch-th match of the regex is returned. (e.g. b will be the 2nd match of "(a)(b)(c)") (However, you should use the xq-function extract instead of the regex/submatch attributes, because the former is more elegant)
These template attributes can be used on any template element:
template:test="xpath condition"
The element (and its children) is ignored if the condition does not evaluate to true (so <template:tag test="{condition}">..</template:tag> is a shorthand for <template:if test="{condition}"><template:tag>..</template:tag></template:if>).
template:ignore-self-test="xpath condition"
The element (but NOT its children) is ignored if the condition does not evaluate to true.
On HTML/matching tags also these matching modifying attributes can be used:
template:optional="true"
If this is set, the file is read successfully even if the tag doesn't exist.
You should never have an optional element as a direct child of a loop, because the loop has lower priority than the optional element, so the parser will skip loop iterations if it can find a later match for the optional element. But it is fine to use optional tags that have a non-optional parent tag within the loop.
template:condition="xpath"
If this is given, a tag is only accepted as matching if the given xpath-expression returns true (powerful, but slow)
(condition is not the same as test: if test evaluates to false, the template tag is ignored; if condition evaluates to false, the html tag is not found)
The default prefixes for template commands are "template:" and "t:", you can change that with the templateNamespace-property or by defining a new namespace in the template like xmlns:yournamespace="http://www.benibela.de/2011/templateparser". (only the xmlns:prefix form is supported, not xmlns without prefix)
Short notation
Commonly used commands can be abbreviated as textual symbols instead of xml tags. To avoid conflicts with text node matching, this short notation is only allowed at the beginning of template text nodes.
The short read tag <t:s>foo:=..</t:s> to read something in variable foo can be abbreviated as {foo:=..}. Similarly {} can be written within attributes to read the attribute, e.g. <a href="{$dest := .}"/>.
Also the trailing := . can be omitted if only one variable assignment occurs, e.g. {$foo} is equivalent to {foo := .} and {$foo := .}.
Optional and repeated elements can be marked with ?, *, +, {min, max}; like <a>?...</a> or, equivalent, <a>..</a>?.
An element marked with ? becomes optional, which has the same effect as adding the template:optional="true" attribute.
An element marked with * can be repeated any number of times, which has the same effect as surrounding it with a template:loop element.
An element marked with + has to be repeated at least once, which has the same effect as surrounding it with a template:loop element with attribute min=1.
An element marked with {min,max} has to be repeated at least min-times and at most max-times (just like in a t:loop) (remember that additional data/elements are always ignored).
An element marked with {count} has to be repeated exactly count-times (just like in a t:loop) (remember that additional data/elements are always ignored).
Breaking changes from previous versions:
As was announced in planned changes, the meaning of {$x} and {6} was changed
As was announced in planned changes, the meaning of <x value="{$x}"/> was changed
Adding the short notation breaks all templates that match text nodes starting with *, +, ? or {
The default template prefix was changed to template: (from htmlparser:). You can add the old prefix to the templateNamespace-property, if you want to continue to use it
All changes mentioned in pseudoxpath.
Also text() doesn't match the next text element anymore, but the next text element of the current node. Use .//text() for the old behaviour
All variable names in the pxp are now case-sensitive in the default mode. You can set variableChangeLog.caseSensitive to change it to the old behaviour (however, variables defined within the expression by for/some/every (but not by :=) remain case sensitive)
There was always some confusion whether the old variable changelog should be deleted or merged with the new one if you process several html documents. Therefore the old merging option was removed and replaced by the KeepPreviousVariables property.
Planned breaking changes:
text() will be replaced by matched-text(). Then text() will always return the same as ./text().
Avoid unmatched parentheses and pipes within text nodes:
Currently there is no short notation to read alternatives with the template:switch command, like <template:switch><a>..</a><b>..</b><c>..</c></template:switch>.
In future this might be the same as (<a>..</a>|<b>..</b>|<c>..</c>).
Hierarchy
- TObject
- THtmlTemplateParser
Overview
Nested Types
TDebugMatchingPrintNode = function (node: TTreeNode): string of object;
Methods
procedure parseHTMLSimple(html, uri, contenttype: string);
function matchLastTrees: Boolean;
constructor create;
destructor destroy; override;
procedure parseTemplate(template: string; templateName: string = '<unknown>');
procedure parseTemplateFile(templatefilename: string);
function parseHTML(html: string; htmlFileName: string = ''; contentType: string = ''):boolean;
function parseHTMLFile(htmlfilename: string):boolean;
function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;
function replaceEnclosedExpressions(str:string):string;
function debugMatchings(const width: integer; htmlToString: TDebugMatchingPrintNode = nil ): string;
function parseQuery(const expression: string): IXQuery;
Properties
property variables: TXQVariableChangeLog read GetVariables;
property variableChangeLog: TXQVariableChangeLog read FVariableLog;
property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;
property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;
property templateNamespaces: TNamespaceList read GetTemplateNamespace;
property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;
property OutputEncoding: TSystemCodePage read FOutputEncoding write FOutputEncoding;
property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;
property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;
property UnnamedVariableName: string read FUnnamedVariableName write FUnnamedVariableName;
property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;
property SingleQueryModule: boolean read FSingleQueryModule write FSingleQueryModule;
property hasRealVariableDefinitions: boolean read GetTemplateHasRealVariableDefinitions;
property TemplateTree: TTreeNode read getTemplateTree;
property HTMLTree: TTreeNode read getHTMLTree;
property TemplateParser: TTreeParser read FTemplate;
property HTMLParser: TTreeParser read FHTML;
property QueryEngine: TXQueryEngine read FQueryEngine;
property QueryContext: TXQEvaluationContext read FQueryContext write FQueryContext;
Description
Nested Types
TDebugMatchingPrintNode = function (node: TTreeNode): string of object;
Methods
procedure parseHTMLSimple(html, uri, contenttype: string);
Parses an HTML file without performing matching. For internal use.
function matchLastTrees: Boolean;
constructor create;
destructor destroy; override;
procedure parseTemplate(template: string; templateName: string = '<unknown>');
Loads the given template and stores templateName for debugging purposes.
procedure parseTemplateFile(templatefilename: string);
Loads a template from a file.
function parseHTMLFile(htmlfilename: string):boolean;
Parses the given file by applying a previously loaded template.
function replaceVarsOld(s:string;customReplace: TReplaceFunction=nil):string; deprecated;
Warning: this symbol is deprecated. This replaces every $variable; in s with variables.values['variable'] or the value returned by customReplace (should not be used anymore)
function replaceEnclosedExpressions(str:string):string;
This treats str as extended string and evaluates the pxquery expression x"str"
function debugMatchings(const width: integer; htmlToString: TDebugMatchingPrintNode = nil ): string;
|
Properties
property variables: TXQVariableChangeLog read GetVariables;
List of all variables (variableChangeLog is usually faster)
property oldVariableChangeLog: TXQVariableChangeLog read FOldVariableLog;
All assignments to a variable during the matching of previous templates. (see TKeepPreviousVariables)
property VariableChangeLogCondensed: TXQVariableChangeLog read GetVariableLogCondensed;
VariableChangeLog with duplicated objects removed (i.e. if you have obj := object(), obj.a := 1, obj.b := 2, obj := object(); the normal change log will contain 4 objects (like {}, {a:1}, {a:1,b:2}, {}), but the condensed log only two {a:1,b:2}, {})
property templateNamespaces: TNamespaceList read GetTemplateNamespace;
Global namespaces to set the commands that will be recognized as template commands. Default prefixes are template: and t:
property ParsingExceptions: boolean read FParsingExceptions write FParsingExceptions;
If this is true (default), it will raise an exception if the matching fails.
property OutputEncoding: TSystemCodePage read FOutputEncoding write FOutputEncoding;
Output encoding, i.e. the encoding of the read variables. Html document and template are automatically converted to it
property KeepPreviousVariables: TKeepPreviousVariables read FKeepOldVariables write FKeepOldVariables;
Controls if old variables are deleted when processing a new document (see TKeepPreviousVariables)
property trimTextNodes: TTrimTextNodes read FTrimTextNodes write FTrimTextNodes;
How to trim text nodes (default ttnAfterReading). There is also pseudoxpath.XQGlobalTrimNodes, which controls how the values are returned.
property AllowVeryShortNotation: boolean read FVeryShortNotation write FVeryShortNotation;
Enables the very short notation (e.g. {a:=text()}, <a>*) (default: true)
property TemplateTree: TTreeNode read getTemplateTree;
A tree representation of the current template
property HTMLTree: TTreeNode read getHTMLTree;
A tree representation of the processed html file
property TemplateParser: TTreeParser read FTemplate;
X/HTML parser used to read the templates (public so you can change the parsing behaviour, if you really need it)
property HTMLParser: TTreeParser read FHTML;
X/HTML parser used to read the pages (public so you can change the parsing behaviour, if you really need it)
property QueryEngine: TXQueryEngine read FQueryEngine;
XQuery engine used for evaluating query expressions contained in the template
property QueryContext: TXQEvaluationContext read FQueryContext write FQueryContext;
Context used to evaluate XQuery expressions. For internal use.
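As an illustration of the ParsingExceptions property described above, the following sketch handles a failed match in two ways. It is written under assumptions: the unit name extendedhtmlparser is inferred from the test file name, the generic Exception class is caught because this page does not name the exception type, and it is assumed (not stated on this page) that parseHTML returns false on a failed match when ParsingExceptions is disabled.

```pascal
program MatchFailureDemo;
{$mode objfpc}{$H+}

uses
  SysUtils, extendedhtmlparser; // unit name is an assumption

var
  parser: THtmlTemplateParser;
begin
  parser := THtmlTemplateParser.create();
  try
    parser.parseTemplate('<table id="foobar"><tr>{.}</tr>*</table>');

    // Default behaviour: a missing table raises an exception.
    try
      parser.parseHTML('<p>The table is gone.</p>');
      writeln('matched');
    except
      on e: Exception do
        writeln('matching failed: ', e.Message);
    end;

    // Alternative: disable exceptions and inspect the boolean result.
    // Assumption: the result is false when the pattern cannot be matched.
    parser.ParsingExceptions := false;
    if not parser.parseHTML('<p>The table is gone.</p>') then
      writeln('matching failed (no exception raised)');
  finally
    parser.free;
  end;
end.
```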
Generated by PasDoc 0.16.0.