The following markup languages are currently handled by the document
component.
RsStructured Text (RST) is a simple text based markup language, intended
to be easy to read and write by humans. Examples can be found in the
documentation of RST.
The transformation of a simple RST document to docbook can be done just like
this:
- <?php
-
- require 'tutorial_autoload.php';
-
- $document = new ezcDocumentRst();
- $document->loadFile( '../tutorial.txt' );
-
- $docbook = $document->getAsDocbook();
- echo $docbook->save();
- ?>
In line 3 the document is actually loaded and parsed into an internal abstract
syntax tree. In line 5 the internal structure is then transformed back to a
docbook document. In the last line the resulting document is returned as a
string, so that you can echo or store it.
Error handling
By default each parsing or compiling error will be transformed into an
exception, so that you are noticed about those errors. The error reporting
settings can be modified like for all other document handlers:
- <?php
- $document = new ezcDocumentRst();
- $document->options->errorReporting = E_PARSE | E_ERROR | E_WARNING;
- $document->loadFile( '../tutorial.txt' );
-
- $docbook = $document->getAsDocbook();
- echo $docbook->save();
- ?>
Where the setting in line 3 causes, that only warnings, errors and fatal errors
are transformed to exceptions now, while the notices are only collected, but
ignored. This setting affects both, the parsing of the source document and the
compiling into the destination language.
Directives
RST directives are elements in the RST documents with parameters, optional
named options and optional content. The document component implements a well
known subset of the directives implemented in the docutils RST parser. You
may register custom directive handlers, or overwrite existing directive
handlers using your own implementation. A directive in RST markup with
parameters, options and content could look like:
My document
===========
The custom directive:
.. my_directive:: parameters
:option: value
Some indented text...
For such a directive you should register a handler on the RST document, like:
- <?php
- $document = new ezcDocumentRst();
- $document->registerDirective( 'my_directive', 'myCustomDirective' );
- $document->loadFile( $from );
-
- $docbook = $document->getAsDocbook();
- $xml = $docbook->save();
- ?>
The class myCustomDirective must extend the class ezcDocumentRstDirective, and
implement the method toDocbook(). For rendering you get access to the full AST,
the contents of the current directive and the base path, where the document
resist in the file system - which is necessary for accessing external files.
Directive example
A full example for a custom directive, where we want to embed real world
addresses into our RST document and maintain the semantics in the resulting
docbook, could look like:
Address example
===============
.. address:: John Doe
:street: Some Lane 42
We would possibly add more information, like the ZIP code, city and state, but
skip this to keep the code short. The implemented directive then would just
need to take these information and transform it into valid docbook XML using
the DOM extension.
- <?php
- class myAddressDirective extends ezcDocumentRstDirective
- {
- public function toDocbook( DOMDocument $document, DOMElement $root )
- {
- $address = $document->createElement( 'address' );
- $root->appendChild( $address );
-
- if ( !empty( $this->node->parameters ) )
- {
- $name = $document->createElement( 'personname', htmlentities( $this->node->parameters ) );
- $address->appendChild( $name );
- }
-
- if ( isset( $this->node->options['street'] ) )
- {
- $street = $document->createElement( 'street', htmlentities( $this->node->options['street'] ) );
- $address->appendChild( $street );
- }
- }
- }
- ?>
The AST node, which should be rendered, is passed to the constructor of the
custom directive visitor and available in the class property $node. The
complete DOMDocument and the current DOMNode are passed to the method. In this
case we just create a address node with the optional child nodes street and
personname, depending on the existence of the respective values.
You can now render the RST document after you registered you custom directive
handler as shown above:
- <?php
-
- require 'tutorial_autoload.php';
-
- // Load custom directive
- require '00_01_address_directive.php';
-
- $document = new ezcDocumentRst();
- $document->registerDirective( 'address', 'myAddressDirective' );
- $document->loadString( <<<EORST
- Address example
- ===============
-
- .. address:: John Doe
- :street: Some Lane 42
- EORST
- );
-
- $docbook = $document->getAsDocbook();
- echo $docbook->save();
- ?>
The output will then look like:
<?xml version="1.0"?>
<article xmlns="http://docbook.org/ns/docbook">
<section id="address_example">
<sectioninfo/>
<title>Address example</title>
<address>
<personname> John Doe</personname>
<street> Some Lane 42</street>
</address>
</section>
</article>
XHTML rendering
For RST a conversion shortcut has been implemented, so that you don't need to
convert the RST to docbook and the docbook to XHTML. This saves conversion time
and enables you to prevent from information loss during multiple conversions:
- <?php
- $document = new ezcDocumentRst();
- $document->loadFile( $from );
-
- $xhtml = $document->getAsXhtml();
- $xml = $xhtml->save();
- ?>
The default XHTML compiler generates complete XHTML documents, including header
and meta-data in the header. If you want to in-line the result, you may specify
another XHTML compiler, which just creates a XHTML block level element, which
can be embedded in your source code:
- <?php
- $document = new ezcDocumentRst();
- $document->options->xhtmlVisitor = 'ezcDocumentRstXhtmlBodyVisitor';
- $document->loadFile( $from );
-
- $xhtml = $document->getAsXhtml();
- $xml = $xhtml->save();
- ?>
You can of course also use the predefined and custom directives for XHTML
rendering. The directives used during XHTML generation also need to implement
the interface ezcDocumentRstXhtmlDirective.
Modification of XHTML rendering
You can modify the generated output of the XHTML visitor by creating a custom
visitor for the RST AST. The easiest way probably is to extend from one of the
existing XHTML visitors and reusing it. For example you may want to fill the
type attribute in bullet lists, like known from HTML, which isn't valid XHTML,
though:
class myDocumentRstXhtmlVisitor extends ezcDocumentRstXhtmlVisitor
{
protected function visitBulletList( DOMNode $root, ezcDocumentRstNode $node )
{
$list = $this->document->createElement( 'ul' );
$root->appendChild( $list );
$listTypes = array(
'*' => 'circle',
'+' => 'disc',
'-' => 'square',
"\xe2\x80\xa2" => 'disc',
"\xe2\x80\xa3" => 'circle',
"\xe2\x81\x83" => 'square',
);
// Not allowed in XHTML strict
$list->setAttribute( 'type', $listTypes[$node->token->content] );
// Decoratre blockquote contents
foreach ( $node->nodes as $child )
{
$this->visitNode( $list, $child );
}
}
}
The structure, which is not enforced for visitors, but used in the docbook and
XHTML visitors, is to call special methods for each node type in the AST to
decorate the AST recursively. This method will be called for all bullet list
nodes in the AST which contain the actual list items. As the first parameter
the current position in the XHTML DOM tree is also provided to the method.
To create the XHTML we can now just create a new list node (<ul>) in the
current DOMNode, set the new attribute, and recursively decorate all
descendants using the general visitor dispatching method visitNode() for all
children in the AST. For the AST children being also rendered as children in
the XML tree, we pass the just created DOMNode (<ul>) as the new root node to
the visitNode() method.
After defining such a class, you could use the custom visitor like shown
above:
- <?php
- $document = new ezcDocumentRst();
- $document->options->xhtmlVisitor = 'myDocumentRstXhtmlVisitor';
- $document->loadFile( $from );
-
- $xhtml = $document->getAsXhtml();
- $xml = $xhtml->save();
- ?>
Now the lists in the generated XHTML will also the type attribute set.
Writing RST
Writing a RST document from an existing docbook document, or a
ezcDocumentDocbook object generated from some other source, is trivial:
- <?php
-
- require 'tutorial_autoload.php';
-
- $docbook = new ezcDocumentDocbook();
- $docbook->loadFile( 'docbook.xml' );
-
- $rst = new ezcDocumentRst();
- $rst->createFromDocbook( $docbook );
-
- echo $rst->save();
-
- ?>
For the conversion internally the ezcDocumentDocbookToRstConverter class is
used, which can also be called directly, like:
$converter = new ezcDocumentDocbookToRstConverter();
$rst = $converter->convert( $docbook );
Using this you can configure the converter to your wishes, or extend the
convert to handle yet unhandled docbook elements. The converter is, as usaul
configured using its option property, and the options are defined in the
ezcDocumentDocbookToRstConverterOptions class. There you may configure the
header underlines used, the bullet types or the line wrapping.
Extending RST writing
As said before, not all existing docbook elements might already be handled by
the converter. But its handler based mechanism makes it easy to extend or
overwrite existing behaviour.
Similar to the example above we can convert the <address> docbook element back
to the address RST directive.
- <?php
-
- require 'tutorial_autoload.php';
-
- $docbook = new ezcDocumentDocbook();
- $docbook->loadFile( 'address.xml' );
-
- class myAddressElementHandler extends ezcDocumentDocbookToRstBaseHandler
- {
- public function handle( ezcDocumentElementVisitorConverter $converter, DOMElement $node, $root )
- {
- $root .= $this->renderDirective( 'address', $node->textContent, array() );
- return $root;
- }
- }
-
- $converter = new ezcDocumentDocbookToRstConverter();
- $converter->setElementHandler( 'docbook', 'address', new myAddressElementHandler() );
-
- $rst = $converter->convert( $docbook );
- echo $rst->save();
-
- ?>
The handler classes are assigned to XML elements in some namespace, "docbook"
in this case. It is registered in line 18 for the element "address". The class
itself has to extend from the ezcDocumentElementVisitorHandler class, which is
in this case already extended by ezcDocumentDocbookToRstBaseHandler, which
provides some convenience methods for RST creation, like renderDirective() used
in this example.
The handler is called, whenever the element, it has been registered for, occurs
in the docbook XML tree. In this case it has to append the generated RST part
for this element to the RST document - and may call the general conversion
handler again for its child elements. This example converts the above shown
docbook XML back to:
.. _address_example:
===============
Address example
===============
.. address::
John Doe
Some Lane 42
Which ignores any special address sub elements for the simplicity of the
example. For more examples on element handlers check the existing
implementations.
Converting XHTML or HTML to a document markup language is a non trivial task,
because XHTML elements are often used for layout, ignoring the actual semantics
of the element. Therefore the document component allows to stack a set of
filters, which each performs a specific conversion task. The default filter
stack may work fine, but you may want to also implement custom filters
depending on the contents of the filtered website, or to cover additional
sources of meta data information, like RDF, Microformats or similar.
The available filters are:
ezcDocumentXhtmlElementFilter
This filter just maintains the common semantics of XHTML elements by
converting them to their docbook equivalents. It ignores common class names.
This filter is the most basic and you probably want to always add this one to
the filter stack.
ezcDocumentXhtmlXpathFilter
The XPath filter takes a XPath expression to locate the root of the document
contents. It makes no sense to use this one together with the content locator
filter. This is a more static, but also more precise way to tell the
converter where to find the actual contents.
ezcDocumentXhtmlMetadataFilter
This filter extracts common meta data from the XHTML head, and converts it
into docbook section info elements.
ezcDocumentXhtmlTablesFilter
HTML tables are especially often used for layout markup. This filter takes a
threshold, and if the table text factor drops below this threshold the table
is ignored. The same is true for stacked tables.
ezcDocumentXhtmlContentLocatorFilter
The content locator filter tries to find the actual article in the markup of
a website, ignoring the surrounding layout markup. This seems to work well
for example for common news sites.
By default just the element and meta data filters are used. So the conversion
of a common website, like the introduction article from ezcomponents.org,
results in a docbook document containing all lists for the navigation, etc..
- <?php
-
- require 'tutorial_autoload.php';
-
- $xhtml = new ezcDocumentXhtml();
- $xhtml->loadFile( 'ez_components_introduction.html' );
-
- $docbook = $xhtml->getAsDocbook();
- echo $docbook->save();
- ?>
So let's additionally use the XPath filter to pass the location of the actual
content to the conversion:
- <?php
-
- require 'tutorial_autoload.php';
-
- $xhtml = new ezcDocumentXhtml();
- $xhtml->setFilters( array(
- new ezcDocumentXhtmlElementFilter(),
- new ezcDocumentXhtmlMetadataFilter(),
- new ezcDocumentXhtmlXpathFilter( '//div[@class="document"]' ),
- ) );
- $xhtml->loadFile( 'ez_components_introduction.html' );
-
- $docbook = $xhtml->getAsDocbook();
- echo $docbook->save();
- ?>
With this additional filter, the contents are correctly found and converted
properly.
Writing XHTML
Writing XHTML from docbook is very similar to the approach used for writing
RST: It the same handler based mechanism, so you may want to check that chapter
to learn how to extend it for unhandled docbook elements.
- <?php
-
- require 'tutorial_autoload.php';
-
- $docbook = new ezcDocumentDocbook();
- $docbook->loadFile( 'docbook.xml' );
-
- $html = new ezcDocumentXhtml();
- $html->createFromDocbook( $docbook );
-
- echo $html->save();
-
- ?>
As you can see, it happens the same way, as for other conversion from Docbook
to any other format.
HTML styles
By default inline CSS is embedded in all generated HTML, to create a more
appealing default experience. This may of course be deactivated and you may
also reference custom style sheets to be included in the generated HTML.
- <?php
-
- require 'tutorial_autoload.php';
-
- $docbook = new ezcDocumentDocbook();
- $docbook->loadFile( 'docbook.xml' );
-
- $converter = new ezcDocumentDocbookToHtmlConverter();
-
- // Remove the inline CSS
- $converter->options->styleSheet = null;
-
- // Add custom CSS style sheets
- $converter->options->styleSheets = array(
- '/styles/screen.css',
- );
-
- $html = $converter->convert( $docbook );
-
- echo $html->save();
-
- ?>
For this we again use the converted directly to be able to configure it as we
like.
eZ XML describes the markup format used internally by eZ Publish for
storing markup in content objects. The format is roughly specified in the eZ
Publish documentation.
Modules are often register custom elements, which are not specified anywhere,
so there might be several elements not handled by default.
Reading eZ XML
Reading eZ XML is basically the same as for all other formats:
- <?php
-
- require 'tutorial_autoload.php';
-
- $document = new ezcDocumentEzXml();
- $document->loadString( '<?xml version="1.0"?>
- <section xmlns="http://ez.no/namespaces/ezpublish3">
- <header>Paragraph</header>
- <paragraph>Some content...</paragraph>
- </section>' );
-
- $docbook = $document->getAsDocbook();
- echo $docbook->save();
- ?>
As always the document object is either constructed from an input string or
file. To convert into docbook you may just use the method getAsDocbook().
Link handling
Inside eZ XML documents link URIs are replaced with IDs, which reference the
links inside the eZ Publish database, to ensure that a changed link is update
globally. The replacing of such links is handled by a class extending from
ezcDocumentEzXmlLinkProvider. By default dummy URLs are added to the documents.
URLs are either referenced directly by their ID, a node ID, or an object ID.
Those parameters are passed to the link provide, which then should return an
URL for that.
- <?php
-
- require 'tutorial_autoload.php';
-
- class myLinkProvider extends ezcDocumentEzXmlLinkProvider
- {
- public function fetchUrlById( $id, $view, $show_path )
- {
- return 'http://host/path/' . $id;
- }
-
- public function fetchUrlByNodeId( $id, $view, $show_path ) {}
- public function fetchUrlByObjectId( $id, $view, $show_path ) {}
- }
-
- $document = new ezcDocumentEzXml();
- $document->loadString( '<?xml version="1.0"?>
- <section xmlns="http://ez.no/namespaces/ezpublish3">
- <header>Paragraph</header>
- <paragraph>Some content, with a <link url_id="1">link</link>.</paragraph>
- </section>' );
-
- // Set link provider
- $converter = new ezcDocumentEzXmlToDocbookConverter();
- $converter->options->linkProvider = new myLinkProvider();
-
- $docbook = $converter->convert( $document );
- echo $docbook->save();
- ?>
The link provider is only implemented as a trivial stub, but you can establish
a database connection there and actually fetch the required data. I this case
the generated docbook document look like:
<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<article xmlns="http://docbook.org/ns/docbook">
<section>
<title>Paragraph</title>
<para>Some content, with a <ulink url="http://host/path/1">link</ulink>.</para>
</section>
</article>
The link provider is set again as a option of the converter. Like shown for the
docbook conversions of the other handlers, you can register element handlers
for yet unhandled eZ XML elements on the converter, too.
Wrting eZ XML
Writing eZ XML works nearly the same as reading. It again uses a XML based
element handled, like shown in the Docbook to RST conversion in more detail.
For the link conversion an object extending from ezcDocumentEzXmlLinkConverter
is used, which returns an array with the attributes of the link in the eZ XML
document.
Wiki markup has no central standard, but is used as a term to describe some
common subset with lots of different extensions. Most wiki markup languages
only support a quite trivial markup with severe limitations on the recursion of
markup blocks. For example no markup really tables containing lists, or
especially not tables containing other tables.
The document component implements a generic parser to support multiple wiki
markup languages. For each different markup syntax a tokenizer has to be
implemented, which converts the implemented markup into a unified token stream,
which can then be handled by the generic parser.
The document component currently supports reading three wiki markup languages,
but new ones are added easily by implementing another tokenizer. Supported are:
Creole, developed by a initiative with the intention to create a unified
wiki markup standard. This is the default wiki language, and currently the
only one which can be written.
Creole currently only supports a very limited set of markup, all further
markup additions are still up to discussion.
Dokuwiki is a popular wiki system, for example used on wiki.php.net
with a quite different syntax, and the most complete markup support, even
including something like footnotes.
Confluence is a common Java based wiki with an entirely different and most
uncommon syntax, which has mainly been implemented to prove the generic
nature of the parser.
All markup languages are tested against all examples from the respective
markup language documentation, there might still be cases where the parsers of
the default implementation behaves slightly different from the implementation
in the document component.
Reading wiki markup
Reading wiki texts basically works like for any other markup language:
- <?php
-
- require 'tutorial_autoload.php';
-
- $document = new ezcDocumentWiki();
- $document->loadString( '
- = Example text =
-
- Just some exaple paragraph with a heading, some **emphasis** markup and a
- [[http://ezcomponents.org|link]].' );
-
- $docbook = $document->getAsDocbook();
- echo $docbook->save();
- ?>
As said, by default the Creoletokenizer is used. The same result can be
produced with dokuwiki markup and switching the tokenizer:
- <?php
-
- require 'tutorial_autoload.php';
-
- $document = new ezcDocumentWiki();
- $document->options->tokenizer = new ezcDocumentWikiConfluenceTokenizer();
- $document->loadString( '
- h1. Example text
-
- Just some exaple paragraph with a heading, some *emphasis* markup and a
- [link|http://ezcomponents.org].' );
-
- $docbook = $document->getAsDocbook();
- echo $docbook->save();
- ?>
Writing wiki markup
Until now only writing of creole wiki markup is supported. Since creole does
not support a lot of the markup available in docbook, not all documents might
get converted properly. Because it does not even support explicit internal
references, we cannot even simulate footnotes like in HTML.
If you want to add support for such conversions, it works exactly like the
docbook RST conversion and can be extended the same way.
- <?php
-
- require 'tutorial_autoload.php';
-
- $docbook = new ezcDocumentDocbook();
- $docbook->loadFile( 'docbook.xml' );
-
- $document = new ezcDocumentWiki();
- $document->createFromDocbook( $docbook );
- echo $document->save();
- ?>