Internet-Tools Overview

Installation

Since the Internet Tools and their XQuery engine consist of a set of plain Pascal units, they do not require installation. You just need to extract all files from the .zip package and ensure that FPC can find the .pas files.

You can do one of the following to set the paths:

On Linux you also need to install Synapse and the OpenSSL development package. For Android you need the standard Android SDK/NDK and must initialize the JVM reference in bbjniutils. By default, FLRE is used as the regular expression library; you can set the define USE_SOROKINS_REGEX to use Sorokin/FPC's regexpr unit instead.

The Internet Tools are structured modularly, and you activate different features depending on which units are included in the uses clause of your project. You need one of synapseinternetaccess, w32internetaccess or androidinternetaccess, so that the Internet Tools know which underlying library to use for internet access. You do not need to call anything from these units; just including them in the uses clause already activates them.

Getting Started

The most important unit is xquery, which contains the XPath/XQuery interpreter from which you can call most of the other units.

For example, this will write the title of a webpage:

uses xquery;
begin
  writeln(query('doc("http://example.org")//title').toString);
end.

This will write the destination of all the links there:

uses xquery;
var v: IXQValue;
begin
  for v in query('doc("http://example.org")//a/@href') do
    writeln(v.toString);      
end.

And this will write the destination and name of all links:

uses xquery;
var v: IXQValue;
begin
  for v in query('doc("http://example.org")//a/concat(., ": ", @href)') do
    writeln(v.toString);      
end.

The examples use the XQuery function doc to download the HTML page and hard-code the URL in the call of query. Often you need to query a webpage whose URL is only known at runtime. For that you can call query with additional variables in an open array, which will be put in $_1, $_2, ...:

uses xquery;
var v: IXQValue;
begin
  for v in query('doc($_1)//a/concat(., ": ", @href)', ['http://example.org']) do
    writeln(v.toString);      
end.

Or, alternatively, do not use query at all, but store the URL in an IXQValue, then download and transform that value:

uses xquery;
var v: IXQValue;
begin
  for v in xqvalue('http://example.org').retrieve().map('//a/concat(., ": ", @href)') do
    writeln(v.toString);      
end.

There is also the unit simpleinternet, which provides a function simpleinternet.process to directly run a query on a webpage, i.e. process('http://example.org', '//a/concat(., ": ", @href)'). Although this is simpler for trivial queries, it does not generalize well and thus is now deprecated.

The xquery.IXQValue transformation methods generalize extremely well to multiple pages. For example you can start on a search page, fill in a search term on the form field "q", download all links and get the title of every page referenced by a link:

uses xquery;
var v: IXQValue;
begin
  for v in xqvalue('http://www.google.com').retrieve()
           .map('form(//form, {"q": $_1})', ['search term']).retrieve()
           .map('//a').retrieve()
           .map('//title') do
    writeln(v.toString);      
end.

The two most important units are internetaccess and xquery. Internetaccess provides the functions to download things and xquery the functions to understand them.

Accessing the internet/downloading things

The primary unit for accessing the internet is the unit internetaccess. It contains the functions httpRequest for GET/POST HTTP requests as well as the abstract base class TInternetAccess that implements them. Other internet services like FTP/SMTP might be added in future versions.

httpRequest can be used as a simple function call and returns the webpage as a string:

uses internetaccess;
begin
  writeln(httpRequest('http://www.example.org')); 
end.

The 1-parameter function sends a GET request, the 2-parameter function a POST-request. Multiple calls to the functions keep a session active (i.e. cookies, referrers), and are thread-safe.
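
The difference between the two variants can be sketched as follows. This is a minimal example, assuming that the second parameter is the raw request body; the URL and the form data q=search+term are placeholders to be replaced by a real endpoint, and synapseinternetaccess is listed only to select a backend, as described in the installation section:

```pascal
program postexample;
{$mode objfpc}{$H+}
uses internetaccess,
     synapseinternetaccess; // selects the backend; use w32internetaccess on Windows
begin
  // 1 parameter: GET request
  writeln(httpRequest('http://www.example.org'));
  // 2 parameters: POST request (placeholder URL and body, replace with a real endpoint)
  writeln(httpRequest('http://www.example.org/search', 'q=search+term'));
end.
```

Because both calls share one session, cookies set by the first response are available to the second request.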

If TInternetAccess is used, every object has its own session. However, you cannot instantiate TInternetAccess itself; you need to use one of its implementing classes TSynapseInternetAccess, TW32InternetAccess or TAndroidInternetAccess. In the unit internetaccess there is a global variable defaultInternetAccessClass, which should be set to one of the implementing classes. You can then create the object from this class variable, so you can easily switch between the implementations and choose the most appropriate one (WinINet integrates better with Windows systems, because it does not require OpenSSL, but Synapse is platform independent).
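
Creating an object through the class variable might look like the following sketch. It assumes that the implementing classes have a parameterless constructor create and that TInternetAccess offers a method get returning the response body as a string; check both against the unit reference of your version:

```pascal
program sessionexample;
{$mode objfpc}{$H+}
uses internetaccess,
     synapseinternetaccess; // provides TSynapseInternetAccess
var
  internet: TInternetAccess;
begin
  // Choose the implementation once; swap in TW32InternetAccess on Windows.
  defaultInternetAccessClass := TSynapseInternetAccess;
  // Assumption: a parameterless constructor is available.
  internet := defaultInternetAccessClass.create();
  try
    // Assumption: get performs a GET request and returns the body as a string.
    writeln(internet.get('http://www.example.org'));
  finally
    internet.free; // each object owns its session, so release it when done
  end;
end.
```

Only the assignment to defaultInternetAccessClass names a concrete implementation, so switching backends does not affect the rest of the program.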

Data processing

There are several ways to actually process a downloaded HTML/XML document.
They are implemented as a hierarchy of data processing classes, from high-level languages down to a low-level tokenizer, where each level uses the functionality of the level below it. Only the high-level interface is stable. The lower-level classes are used internally to implement the higher-level ones.

Transcendent-Level: Multi-page Template

At the very highest level there is the multi-page template, but as a standalone programming language it floats so far above usual libraries that you might want to skip to the next section.

The template is an XML file containing a set of actions. Each action is like a procedure that can contain control structures like if and loop as XML elements and can also call other actions. Thus the XML elements of the template are Turing-complete (although actual calculations need to be done with XPath/XQuery expressions included in the template). The actions also contain lists of webpages and HTML patterns. All the webpages are downloaded and then matched against the HTML patterns, leading to a list of matched HTML elements, or data read from those elements, stored in a list/changelog of variables. These variables are the output of the template and are sent to the Pascal side. If used as intended, the entire program logic is moved to the template, and the Pascal program is only used to display its output.

The entire template can be loaded with the TMultipageTemplate class and evaluated by the TMultipageTemplateReader class.

High-Level: XQuery/XPath-Expression processing and pattern matching

The class TXQueryEngine in the unit xquery implements an XQuery/XPath 3 interpreter, i.e. an interpreter for Turing-complete programming languages. The function query can be used to evaluate an expression; it returns the results of the evaluation as an xquery.IXQValue, which can be chained through a LINQ-like pipeline.

All the other functions and methods can be accessed from within the interpreted language. Not all features are enabled by default. Add the units xquery_json and xquery_utf8 to a uses clause of any unit to activate JSONiq and UTF-8 support. Their presence in the uses clause is enough to activate them. Further modules are available in the units xquery_module_math (XPath/XQuery 3.0 math functions) and xquery_module_file (EXPath file module). They are activated by calling the register function of the corresponding unit.
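
Explicit registration of a module might look like the following sketch. The name registerModuleMath for the register function and the availability of the math: prefix after registration are assumptions, not taken from this overview; check the reference of the unit xquery_module_math for the exact names:

```pascal
program mathmodule;
{$mode objfpc}{$H+}
uses xquery,
     xquery_module_math; // listing the unit is not enough, see below
begin
  // Assumption: the register function of this unit is named registerModuleMath.
  registerModuleMath;
  // Assumption: the prefix math: is bound to the math module once it is registered.
  writeln(query('math:sqrt(16)').toString);
end.
```

In contrast to xquery_json and xquery_utf8, the module stays unavailable to queries without the explicit register call.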

The unit extendedhtmlparser contains an XML/HTML pattern matcher. A pattern is a stripped-down, annotated version of the data that should be parsed. To create a pattern, you take the document you want to process, remove all unimportant things and mark the things of interest. The class THtmlTemplateParser is used to match the pattern against the data and return all matches in a variable changelog. Including the unit in a uses clause also registers the pattern matcher with the XQuery interpreter, so you can call pattern matching functionality from evaluated XQuery expressions. Conversely, the pattern can also contain arbitrary XQuery expressions.

Mid-Level: Tree/DOM-like processing

To process an HTML file directly in FPC, without using another, interpreted language, you can use the class TTreeParser in the unit simplehtmltreeparser.

It creates a tree of TTreeElement-s from the HTML document text (you know such trees from DOM, but this tree has nothing to do with do(o)m).

You can also use the class TTreeParserDOM in the unit simplexmltreeparserfpdom to import a DOM document read by the standard FPC XML parser.

Low-Level: Lexer/SAX-like interface

At the lowest level you find the parseHTML function of the unit simplehtmlparser.

It just splits an HTML document into tags and text elements and calls a callback function for each of them.

A similar function, parseXML, exists for XML data in the unit simplexmlparser (it treats the XML file as an HTML file, but checks for things like XML processing instructions).

Other things


Generated by PasDoc 0.16.0.