DocBook XML to PDF
The next stage in our documentation changes was the creation of PDF documents from the DocBook XML files. [See previous posts for more on our documentation changes.]
An XSLT transform is needed to convert the XML document into an XSL-FO (Formatting Objects) file, which can then be processed to create the PDF output.
The DocBook distribution provides a set of XSL transforms for converting the XML files into a variety of formats, so the first step was to download these. This is not strictly necessary, since the XSL transforms can be accessed over the web, but having a local copy speeds up the transformation process.
Now for the XSL transform engine. We investigated using Saxon for the XSL transformation but ran into a number of problems. These started with errors about being unable to remove the specified namespace and, once we had removed it manually, Saxon started complaining about the XML link attribute parameters. In the end we abandoned this option. [A lot of the documentation on the web about Saxon and XSL is very old, and we went down several blind alleys only to discover that the descriptions were totally obsolete.]
We had already obtained the latest version of Apache FOP to process the anticipated FO files when we discovered that the FOP processor can be given the XSL transform directly. We must have missed this in our earlier reading. All we needed was a simple command line to perform the transformation in one step:
fop -xml xmldoc.xml -xsl docbook.xsl -pdf output.pdf
Here we supply the appropriate paths to the fop command, the XML document to be converted (i.e. xmldoc.xml), the XSL transform (i.e. docbook.xsl) and the PDF output file (i.e. output.pdf). The file names shown are only examples; substitute your own.
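If the conversion is going to be run for many documents, the one-step command can be wrapped in a small script. The following is only a sketch: the function name `build_fop_command` is our own invention, the file names are the placeholders used above, and it assumes that `fop` is on the PATH.

```python
import shlex

def build_fop_command(xml_doc, stylesheet, pdf_out, fop="fop"):
    """Return the argument list for a one-step DocBook XML to PDF run."""
    return [fop, "-xml", str(xml_doc), "-xsl", str(stylesheet), "-pdf", str(pdf_out)]

cmd = build_fop_command("xmldoc.xml", "docbook.xsl", "output.pdf")
print(shlex.join(cmd))

# To actually run the conversion (needs a working FOP installation):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the arguments as a list, rather than one long string, avoids quoting problems when a path contains spaces.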
Running the command generated a number of messages, which we have not yet completely analysed, but the output file was created successfully.
Now the fun begins. The PDF generated has several characteristics that are undesirable.
Page Size: By default the output page size is US Letter, which is fine for the US but for the EU we are probably better off with this set to A4.
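One route we have not yet tried ourselves is to pass a stylesheet parameter on the command line. The sketch below assumes two things: that the DocBook FO stylesheet in use honours the `paper.type` parameter, and that the installed FOP accepts `-param name value`. The helper name `with_paper_type` is hypothetical.

```python
def with_paper_type(fop_args, paper_type="A4"):
    """Return a copy of fop_args with a paper.type stylesheet parameter added."""
    args = list(fop_args)
    # Place the parameter straight after the "-xsl <stylesheet>" pair.
    i = args.index("-xsl") + 2
    return args[:i] + ["-param", "paper.type", paper_type] + args[i:]

base = ["fop", "-xml", "xmldoc.xml", "-xsl", "docbook.xsl", "-pdf", "output.pdf"]
print(" ".join(with_paper_type(base)))
# prints: fop -xml xmldoc.xml -xsl docbook.xsl -param paper.type A4 -pdf output.pdf
```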
Image Size: The images look fine on the web page, but in the PDF output a lot of them exceed the page width. It is necessary to change the attributes on the imagedata tag in the XML file and ensure that we specify scalefit="1" and width="14cm". We tried different width settings and you can get away with a width of 16cm, but anything larger is too wide. It is best to remove any depth attributes on the imagedata tag as well. You may also need to set the valign attribute, as appropriate, on some images.
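Making these attribute changes by hand gets tedious when a document has many images, so they could be scripted. This is a minimal sketch using Python's standard ElementTree module; the function name `fit_images` and the sample markup are ours, and the 14cm default is simply the value that worked in our tests.

```python
import xml.etree.ElementTree as ET

def fit_images(root, width="14cm"):
    """Apply the image settings described above to every imagedata element."""
    changed = 0
    for el in root.iter():
        # Skip comments and processing instructions, whose tag is not a string.
        if not isinstance(el.tag, str):
            continue
        # Compare the local name so that namespaced documents are handled too.
        if el.tag.rsplit("}", 1)[-1] != "imagedata":
            continue
        el.set("scalefit", "1")
        el.set("width", width)
        el.attrib.pop("depth", None)  # remove any depth attribute
        changed += 1
    return changed

sample = ET.fromstring(
    "<figure><mediaobject><imageobject>"
    '<imagedata fileref="screen.png" width="800px" depth="600px"/>'
    "</imageobject></mediaobject></figure>"
)
print(fit_images(sample), "imagedata element(s) updated")
print(ET.tostring(sample, encoding="unicode"))
```

Be aware that ElementTree discards comments and the DOCTYPE declaration when it writes a file back out, so check the result before overwriting a real document.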
Table width: Tables where the total width in pixels (the sum of all the individual column widths) is greater than about 440 pixels run over the right hand side of the page. It is necessary to adjust the table column widths in the XML file to ensure that they do not exceed this value. [The font size will also affect the best width, but our tests indicate that 440 is adequate. Of course this may vary depending upon the specific resolution of the generated PDF output. Generally PDF output allows for 72dpi, whereas screen resolution will be much higher.]
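As a sanity check on the 440 figure, the arithmetic can be sketched as below. It assumes that one pixel ends up as one point (1/72 inch) in the PDF, which is our reading of the 72dpi remark rather than something we have verified; the function names are ours.

```python
MAX_TABLE_WIDTH_PX = 440  # limit found in our tests
PX_PER_INCH = 72          # assumption: one pixel maps to one point in the PDF
CM_PER_INCH = 2.54

def px_to_cm(px, px_per_inch=PX_PER_INCH):
    """Convert a width in pixels to centimetres on the printed page."""
    return px / px_per_inch * CM_PER_INCH

def table_overflow(col_widths_px, limit=MAX_TABLE_WIDTH_PX):
    """Return by how many pixels the column widths exceed the limit (0 if they fit)."""
    return max(0, sum(col_widths_px) - limit)

print(round(px_to_cm(MAX_TABLE_WIDTH_PX), 1))  # 15.5
print(table_overflow([120, 200, 150]))         # 30
print(table_overflow([100, 200, 140]))         # 0
```

Under that assumption 440 pixels is about 15.5cm, which sits between the 14cm and 16cm image widths that worked for us.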
Note: Some PDF editors [Adobe Acrobat or Nitro PDF] do allow some editing of PDF documents, but ideally one wants to minimise the number of different steps required to generate the documentation in the various formats. Our tests indicate that resizing images is not one of the features they offer, nor is changing the page layout on some pages.
It is possible to customise Apache FOP so that we can specify the page size and so on, so this is the next step we will investigate. Hopefully we can also introduce a custom front page for our documents to achieve an in-house style.
The changes needed to produce a suitable PDF were made in the XML file, so we also have to look again at the HTML output generated from the modified file. After all, we only want one XML file per document!