Wiki: Howto create pdf to whatever conversion

This is an old revision of HowtoCreatePdfToWhatever from 2006-12-10 16:29:03.

Howto create pdf to whatever conversion

The PDF format is meant neither for editing nor for simple text extraction. For some pdf files it can be impossible to create a word/line/column representation. Despite these limitations, most pdf files are "sane" enough that we can extract text and build words from letters, lines from words and finally columns from lines. The text output design in pdfedit makes it very easy to add arbitrary output formats.

Class design

Code flow description

This is the code flow diagram of pdftoxml.

Page has a convert function that creates a PageTextSource, which actually does the transformation. It takes three template parameters and one function parameter. The three template parameters (WordEngine, LineEngine, ColumnEngine) are responsible for the transformations from content stream operators (PdfOperator) to letters (PageSimpleFragment), to words (PageFragment), to lines (PageLine) and finally to columns (PageColumn).
Currently there are simple implementations of these classes (SimpleWordEngine, SimpleLineEngine, SimpleColumnEngine), which are sufficient for many pdf files.
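
The chain of three engines can be sketched as follows. This is only a minimal illustration, not the pdfedit code: the structs Letter, Word, Line and Column, the three engine classes and the function lettersToColumns are all hypothetical stand-ins for the entities named above, and the grouping rules (a fixed horizontal gap, an exact baseline match, a single column) are assumptions chosen to keep the example short.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the pdfedit entities named above.
struct Letter { double x; double y; double width; char ch; };  // cf. PageSimpleFragment
struct Word   { double x; double y; std::string text; };       // cf. PageFragment
struct Line   { double y; std::vector<Word> words; };          // cf. PageLine
struct Column { std::vector<Line> lines; };                    // cf. PageColumn

// Assumed rule: letters on the same baseline belong to one word
// while the horizontal gap between them stays within maxGap.
struct GapWordEngine {
    double maxGap = 1.0;
    std::vector<Word> operator()(const std::vector<Letter>& letters) const {
        std::vector<Word> words;
        double prevEnd = 0.0;
        for (const Letter& l : letters) {
            // Exact baseline comparison is a simplification for this sketch.
            bool startNew = words.empty() || l.y != words.back().y
                            || l.x - prevEnd > maxGap;
            if (startNew) words.push_back({l.x, l.y, std::string()});
            words.back().text += l.ch;
            prevEnd = l.x + l.width;
        }
        return words;
    }
};

// Assumed rule: consecutive words with the same baseline form one line.
struct BaselineLineEngine {
    std::vector<Line> operator()(const std::vector<Word>& words) const {
        std::vector<Line> lines;
        for (const Word& w : words) {
            if (lines.empty() || lines.back().y != w.y) lines.push_back({w.y, {}});
            lines.back().words.push_back(w);
        }
        return lines;
    }
};

// Assumed rule: every line goes into a single column.
struct SingleColumnEngine {
    std::vector<Column> operator()(const std::vector<Line>& lines) const {
        return { Column{lines} };
    }
};

// The engines are template parameters, so each stage can be swapped independently.
template <typename WordEngine, typename LineEngine, typename ColumnEngine>
std::vector<Column> lettersToColumns(const std::vector<Letter>& letters) {
    std::vector<Word> words = WordEngine()(letters);
    std::vector<Line> lines = LineEngine()(words);
    return ColumnEngine()(lines);
}
```

A call such as lettersToColumns<GapWordEngine, BaselineLineEngine, SingleColumnEngine>(letters) runs all three stages; replacing one template argument changes only that stage.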

Function convert flow description

1: First, it creates a PageTextSource object with the template parameters and uses it as a functor for the StateUpdater::updatePdfOperators function. This means that the functor is called after each operator update. It stores the formatting operators into a PageSimpleFragment and, when a text operator is encountered, creates a new PageSimpleFragment.
2: Then it calls the PageTextSource format method, which does the transformation from letters to columns.
3: Finally, when the page has been parsed into reasonable structures, the output method is called, which tries to build the output format from
   - all words
   - all columns (which contain lines, lines contain words, ...).

The output structure can decide whether to build the output from one or both possibilities. XmlOutputBuilder builds xml from columns, iterating through their lines, then words, and then letters.
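
The three steps of convert can be sketched roughly as below. Everything here is a simplified assumption rather than the real pdfedit interface: Operator, Fragment, TextSource, JoinBuilder and updateOperators are hypothetical names, operands are reduced to a single string, and format only splits text on whitespace instead of running the word, line and column engines. Tf (set font) and Tj (show text) are standard PDF content stream operators.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for PdfOperator: an operator name plus one operand.
struct Operator { std::string name; std::string operand; };

// Hypothetical stand-in for PageSimpleFragment: text plus the formatting in effect.
struct Fragment { std::string font; std::string text; };

// Stand-in for StateUpdater::updatePdfOperators: the functor runs after each operator.
template <typename Functor>
void updateOperators(const std::vector<Operator>& ops, Functor& functor) {
    for (const Operator& op : ops) functor(op);
}

class TextSource {
public:
    // Step 1: remember formatting operators, create a fragment on each text operator.
    void operator()(const Operator& op) {
        if (op.name == "Tf") {
            font_ = op.operand;
        } else if (op.name == "Tj") {
            fragments_.push_back({font_, op.operand});
        }
        // All other operators are ignored in this sketch.
    }

    // Step 2: turn fragments into words (here: naive whitespace splitting).
    void format() {
        words_.clear();
        for (const Fragment& f : fragments_) {
            std::istringstream in(f.text);
            std::string w;
            while (in >> w) words_.push_back(w);
        }
        formatted_ = true;
    }

    // Step 3: hand the structured result to an output builder.
    template <typename Builder>
    void output(Builder& builder) const {
        assert(formatted_ && "format() must run before output()");
        builder.build(words_);
    }

private:
    std::string font_;
    std::vector<Fragment> fragments_;
    std::vector<std::string> words_;
    bool formatted_ = false;
};

// Hypothetical builder that joins all words with single spaces.
struct JoinBuilder {
    std::string result;
    void build(const std::vector<std::string>& words) {
        for (const std::string& w : words) {
            if (!result.empty()) result += ' ';
            result += w;
        }
    }
};

// The whole flow: collect, format, output.
std::string convert(const std::vector<Operator>& ops) {
    TextSource source;
    updateOperators(ops, source);
    source.format();
    JoinBuilder builder;
    source.output(builder);
    return builder.result;
}
```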

New formats

All that is needed for a new format is to implement a class derived from OutputBuilder. This means implementing one or both building functions. Declaration of the XmlOutputBuilder class:

//
// Building interface
//
/** Build output from fragments. */
/** Get result without xml header and footer. */
//
// Static functions
//
/** Get xml output. */

The implementation of build is in textoutputbuilder.cc.
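
As an illustration of adding a new format, here is a minimal sketch of a plain-text builder. It does not reproduce the real OutputBuilder interface from textoutputbuilder.h: the base class OutputBuilderSketch, its two member functions buildFromWords and buildFromColumns, the class PlainTextBuilder and the entity structs are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified entities (cf. PageFragment, PageLine, PageColumn).
struct Word   { std::string text; };
struct Line   { std::vector<Word> words; };
struct Column { std::vector<Line> lines; };

// Hypothetical base class: a derived builder overrides one or both functions.
class OutputBuilderSketch {
public:
    virtual ~OutputBuilderSketch() = default;
    virtual void buildFromWords(const std::vector<Word>&) {}      // default: do nothing
    virtual void buildFromColumns(const std::vector<Column>&) {}  // default: do nothing
};

// A new output format: plain text, one line of text per PageLine-like entry.
class PlainTextBuilder : public OutputBuilderSketch {
public:
    void buildFromColumns(const std::vector<Column>& columns) override {
        for (const Column& c : columns) {
            for (const Line& l : c.lines) {
                for (std::size_t i = 0; i < l.words.size(); ++i) {
                    if (i > 0) text_ += ' ';
                    text_ += l.words[i].text;
                }
                text_ += '\n';
            }
            text_ += '\n';  // blank line after each column
        }
    }

    const std::string& text() const { return text_; }

private:
    std::string text_;
};
```

This builder overrides only the column-based function, in the same spirit as XmlOutputBuilder building its output from columns; a format that does not care about layout could override only the word-based one.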

These are the files in src/kernel where you will find the conversion system:

textoutput.h
textoutputengines.cc
textoutputentities.h
textoutputbuilder.cc
textoutputengines.h
textoutputbuilder.h
textoutputentities.cc

Implementation notes

Note: The transformations (letters to words, words to lines, lines to columns) are not heavily tested and are rather simple (some sorting functions are missing).
Note 2: The biggest limitation is fonts. The font specification is embedded in the pdf file, but neither pdfedit nor any other tool I am aware of can extract these fonts. This is also because not every character has to be present in the specification, so the embedded font may be incomplete and therefore not usable.

Example - pdftoxml

See example pdftoxml: conversion from pdf to xml
See documentation: PDFedit design documentation

External links

http://pdfedit.petricek.net/pdfedit.design_doc