PDFedit
PDF editor for UNIX

Wiki : HowtoCreatePdfToWhatever

Most recent edit on 2010-02-17 02:30:40 by AdminX [remove spam]

No differences.


Edited on 2008-03-06 01:14:15 by TimC [English grammar change]

Additions:
The Pdf format is not meant for editing or simple text extraction etc. With some pdf files it is impossible to extract a word/line/column representation. Despite these limitations, most pdf files are "sane", so we can extract text and build words from letters, lines from words and finally columns from lines. Text output design in pdfedit allows adding arbitrary output formats very easily.

Deletions:
Pdf format is not meant for neither editing nor simple text extraction etc. It can be impossible to create word/line/column representation from some pdf files. Despite these limitations, most pdf files are enough "sane" so we can manage to extract text and build words from letters, lines from words and finally columns from lines. Text output design in pdfedit allows adding arbitrary output formats very easily.



Edited on 2006-12-10 18:27:41 by JozefMisutka

Additions:

Howto create pdf to whatever conversion


The PDF format is not meant for editing or for simple text extraction. For some PDF files it is impossible to build a word/line/column representation. Despite these limitations, most PDF files are "sane" enough that we can extract text and build words from letters, lines from words, and finally columns from lines. The text output design in pdfedit makes it easy to add arbitrary output formats.

Class design

image

Code flow description


This is the code flow diagram of pdftoxml:
image

Page has a convert function that creates a PageTextSource, which does the actual transformation. It takes three template parameters and one function parameter. The three template parameters (WordEngine, LineEngine, ColumnEngine) are responsible for the transformations from content stream operators (PdfOperator) to letters (PageSimpleFragment), to words (PageFragment), to lines (PageLine), and finally to columns (PageColumn).
Currently there are simple implementations of these classes (SimpleWordEngine, SimpleLineEngine, SimpleColumnEngine), which are sufficient for many PDF files.
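
The layering of the three template parameters can be pictured with a small, self-contained sketch. Everything in it is hypothetical: Letter, Word, Line, Column, the Toy*Engine classes and toy_convert are stand-ins invented for illustration, not pdfedit's real PageSimpleFragment, PageFragment, PageLine or PageColumn, and the real engines very likely have different signatures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for letters, words, lines and columns.
struct Letter { char ch; double x; double y; };
struct Word   { std::string text; double x; double y; };
struct Line   { std::vector<Word> words; double y; };
struct Column { std::vector<Line> lines; };

// Joins letters into words; a space or a change of baseline (y) ends a word.
struct ToyWordEngine {
    std::vector<Word> operator()(const std::vector<Letter>& letters) const {
        std::vector<Word> words;
        bool open = false;  // true while the last word can still grow
        for (const Letter& l : letters) {
            if (l.ch == ' ') { open = false; continue; }
            if (!open || words.back().y != l.y) {
                words.push_back(Word{std::string(), l.x, l.y});
                open = true;
            }
            words.back().text += l.ch;
        }
        return words;
    }
};

// Groups consecutive words that share a baseline into lines.
struct ToyLineEngine {
    std::vector<Line> operator()(const std::vector<Word>& words) const {
        std::vector<Line> lines;
        for (const Word& w : words) {
            if (lines.empty() || lines.back().y != w.y) {
                Line fresh;
                fresh.y = w.y;
                lines.push_back(fresh);
            }
            lines.back().words.push_back(w);
        }
        return lines;
    }
};

// Puts all lines into one column (no real column detection).
struct ToyColumnEngine {
    std::vector<Column> operator()(const std::vector<Line>& lines) const {
        std::vector<Column> columns;
        if (!lines.empty()) {
            Column only;
            only.lines = lines;
            columns.push_back(only);
        }
        return columns;
    }
};

// The engines are template parameters, so each stage can be swapped on its own.
template <typename WordEngine, typename LineEngine, typename ColumnEngine>
std::vector<Column> toy_convert(const std::vector<Letter>& letters) {
    std::vector<Word> words = WordEngine()(letters);
    assert(words.size() <= letters.size());  // every word holds at least one letter
    std::vector<Line> lines = LineEngine()(words);
    return ColumnEngine()(lines);
}
```

A call such as toy_convert<ToyWordEngine, ToyLineEngine, ToyColumnEngine>(letters) runs all three stages; replacing one engine type changes only that stage.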

Function convert flow description
  1. First, it creates a PageTextSource object with the template parameters and passes it as a functor to the StateUpdater::updatePdfOperators function. This means that the functor is called after each operator update. It stores the formatting operators in a PageSimpleFragment and, when a text operator is encountered, creates a new PageSimpleFragment.
  2. Then it calls the PageTextSource format method, which does the transformation from letters to columns.
  3. Finally, once the page has been parsed into reasonable structures, the output method is called, which tries to build the output format from
     • all words
     • all columns (which contain lines, lines contain words, ...).

The output structure can decide whether to build the output from one or both of these possibilities. XmlOutputBuilder builds XML from columns, iterating through their lines, then words, and then letters.
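
The functor step of the flow above can be sketched roughly as follows. This is only an illustration under stated assumptions: ToyOperator, ToyFragment, ToyTextSource and toy_update_operators are hypothetical names, and the real StateUpdater::updatePdfOperators and PageTextSource interfaces are not reproduced here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical operator: either changes formatting or shows text.
struct ToyOperator {
    enum Kind { SetFont, ShowText } kind;
    std::string value;
};

// Hypothetical fragment: a piece of text plus the formatting active when it was shown.
struct ToyFragment { std::string font; std::string text; };

// Functor playing the role of the text source: keeps state between calls.
class ToyTextSource {
public:
    void operator()(const ToyOperator& op) {
        if (op.kind == ToyOperator::SetFont) {
            current_font_ = op.value;  // formatting operator: remember it
        } else {
            // text operator: start a new fragment with the current formatting
            fragments_.push_back(ToyFragment{current_font_, op.value});
        }
    }
    const std::vector<ToyFragment>& fragments() const { return fragments_; }
private:
    std::string current_font_;
    std::vector<ToyFragment> fragments_;
};

// Stand-in for the updater: walks the operators and calls the functor after each one.
template <typename Functor>
void toy_update_operators(const std::vector<ToyOperator>& ops, Functor& f) {
    for (const ToyOperator& op : ops) f(op);
}

// Usage: two text operators under different fonts give two fragments.
inline void toy_functor_example() {
    std::vector<ToyOperator> ops;
    ops.push_back(ToyOperator{ToyOperator::SetFont, "F1"});
    ops.push_back(ToyOperator{ToyOperator::ShowText, "Hello"});
    ops.push_back(ToyOperator{ToyOperator::SetFont, "F2"});
    ops.push_back(ToyOperator{ToyOperator::ShowText, "world"});
    ToyTextSource source;
    toy_update_operators(ops, source);
    assert(source.fragments().size() == 2);
    assert(source.fragments()[1].font == "F2");
}
```

Passing the functor by reference lets the caller read the collected fragments afterwards, which is the point of using a stateful function object rather than a plain function.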

New formats


Two things need to be done to enable a new format.

1) Implement a class derived from OutputBuilder, which means implementing one or both build functions. For example, the declaration of the XmlOutputBuilder class is

//
// Building interface
//
/** Build output from fragments. */
/** Get result without xml header and footer. */
//
// Static functions
//
/** Get xml output. */

The implementation of build is in textoutputbuilder.cc.

2) Then call the convert function with your builder class.
2.1) You will probably also want to add a new button or menu item, as is done for the pdftoxml feature. Look in the files below.
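
As a rough picture of the two steps, here is a minimal sketch of a builder for a plain-text format. It rests on loud assumptions: SketchOutputBuilder, its two build overloads, the Sketch* data types and sketch_convert are all hypothetical simplifications; the real OutputBuilder declaration in the pdfedit sources should be consulted for the actual signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified data model.
struct SketchWord   { std::string text; };
struct SketchLine   { std::vector<SketchWord> words; };
struct SketchColumn { std::vector<SketchLine> lines; };

// Hypothetical base class: a builder may override one or both build functions.
class SketchOutputBuilder {
public:
    virtual ~SketchOutputBuilder() {}
    virtual void build(const std::vector<SketchWord>&) {}    // from all words
    virtual void build(const std::vector<SketchColumn>&) {}  // from all columns
};

// Step 1: a new format. Plain text, one output line per page line.
class PlainTextBuilder : public SketchOutputBuilder {
public:
    using SketchOutputBuilder::build;  // keep the word overload visible
    void build(const std::vector<SketchColumn>& columns) override {
        for (const SketchColumn& column : columns) {
            for (const SketchLine& line : column.lines) {
                for (std::size_t i = 0; i < line.words.size(); ++i) {
                    if (i != 0) out_ += ' ';
                    out_ += line.words[i].text;
                }
                out_ += '\n';
            }
        }
    }
    const std::string& result() const { return out_; }
private:
    std::string out_;
};

// Step 2: stand-in for the convert call, which hands the parsed page to the builder.
inline void sketch_convert(const std::vector<SketchColumn>& page,
                           SketchOutputBuilder& builder) {
    builder.build(page);
}

// Usage: a page with a single one-word line.
inline void sketch_builder_example() {
    SketchLine line;
    line.words.push_back(SketchWord{"pdfedit"});
    SketchColumn column;
    column.lines.push_back(line);
    PlainTextBuilder builder;
    sketch_convert(std::vector<SketchColumn>(1, column), builder);
    assert(builder.result() == "pdfedit\n");
}
```

Because the sketch builder only overrides the column overload, the word overload keeps its empty default, which mirrors the "one or both build functions" choice described above.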


Where to look


The conversion system lives in src/kernel.

The menu item implementation in the GUI (item Tools->Pdf to xml) lives in src/gui.


Implementation notes


Note: The transformations (letters to words, words to lines, lines to columns) are not heavily tested and are rather simple (some sorting functions are missing).
Note 2: The biggest limitation is fonts. The font specification is embedded in the PDF file, but neither pdfedit nor any other tool I am aware of can extract these fonts. This is partly because not every character has to be present in the specification, so the embedded font may be incomplete and therefore not usable.


Example - pdftoxml


See example pdftoxml: conversion from pdf to xml
See documentation: PDFedit design documentation


External links


http://pdfedit.petricek.net/pdfedit.design_doc
http://en.wikipedia.org/wiki/Function_object

Categories
Howto






Edited on 2006-12-10 18:26:42 by JozefMisutka

Additions:
image

Deletions:
image



Edited on 2006-12-10 18:24:39 by JozefMisutka

Additions:
image

Deletions:
image



Edited on 2006-12-10 17:51:33 by JozefMisutka

Additions:
image
Page has convert function creates PageTextSource which actually does the transformation. It takes three template parameters and one function parameter. The three template parameters (WordEngine, LineEngine, ColumnEngine) are responsible for transformations from content stream operators (PdfOperator) to letters(PageSimpleFragment) to words(PageFragment) to lines(PageLine) and to finally to columns (PageColumn).
  1. Firstly it creates PageTextSource class with the template parameters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.

    Deletions:
    image
Page has convert function creates PageTextSource which actually does the transformation. It takes three template paramters and one function parameter. The three template parameters (WordEngine, LineEngine, ColumnEngine) are responsible for transformations from content stream operators (PdfOperator) to letters(PageSimpleFragment) to words(PageFragment) to lines(PageLine) and to finally to columns (PageColumn).
  1. Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.



    Edited on 2006-12-10 17:22:40 by JozefMisutka

    No differences.


    Edited on 2006-12-10 17:22:14 by JozefMisutka

    No differences.


    Edited on 2006-12-10 17:21:24 by JozefMisutka

    Additions:
    Categories
Howto




Edited on 2006-12-10 16:57:51 by JozefMisutka

Additions:
~1) Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.
http://en.wikipedia.org/wiki/Function_object


Deletions:
~1) Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.



Edited on 2006-12-10 16:56:12 by JozefMisutka

Additions:
1) Implement derived class from OutputBuilder which means implementing one or both build functions. For example declaration of XmlOutputBuilder class is
2) Then call convert function with your builder class.
2.1) Probably implement new button, menu item .... as it is done with pdftoxml feature. Look in files below.


Deletions:
~2)Implement derived class from OutputBuilder which means implementing one or both build functions. For example declaration of XmlOutputBuilder class is
  1. Then call convert function with your builder class.
  2. .1: Probably implement new button, menu item .... as it is done with pdftoxml feature. Look in files below.



    Edited on 2006-12-10 16:50:48 by JozefMisutka

    Additions:
    ~1) Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.
  3. Then it calls PageTextSource format method which does the transformation from letters to columns.
  4. Finally when the page is parsed into reasonable structures, output method is called which tries to build output format from
  5. Implement derived class from OutputBuilder which means implementing one or both build functions. For example declaration of XmlOutputBuilder class is
  6. Then call convert function with your builder class.
  7. .1: Probably implement new button, menu item .... as it is done with pdftoxml feature. Look in files below.

    Deletions:
    ~1: Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment. 1: Then it calls PageTextSource format method which does the transformation from letters to columns.
    1: Finally when the page is parsed into reasonable structures, output method is called which tries to build output format from
    2:Implement derived class from OutputBuilder which means implementing one or both build functions. For example declaration of XmlOutputBuilder class is
    2:Then call convert function with your builder class.
    2 .1: Probably implement new button, menu item .... as it is done with PdfToXml feature. Look in files below.




    Edited on 2006-12-10 16:47:00 by JozefMisutka

    Additions:
    See example pdftoxml: conversion from pdf to xml

    Deletions:
    See example pdftoxml: conversion from pdf to xml



    Edited on 2006-12-10 16:46:27 by JozefMisutka

    Additions:
    ~1: Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the functor is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.

Example - pdftoxml



Deletions:
~1: Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the [[ProgrammingFunctor functor] is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.

Examle - pdftoxml





Edited on 2006-12-10 16:45:52 by JozefMisutka

Additions:
This is code flow diagram of pdftoxml

Deletions:
This is code flow diagram of pdftoxml



Edited on 2006-12-10 16:44:34 by JozefMisutka

Additions:
There are two things to be done to enable your new format.
2:Implement derived class from OutputBuilder which means implementing one or both build functions. For example declaration of XmlOutputBuilder class is
2:Then call convert function with your builder class.
2 .1: Probably implement new button, menu item .... as it is done with PdfToXml feature. Look in files below.

Files where to look

These are files where you find menu item implementation in gui (item Tools->Pdf to xml). In src/gui

Class design

image

Code flow descrption

This is code flow diagram of pdftoxml
image
Page has convert function creates PageTextSource which actually does the transformation. It takes three template paramters and one function parameter. The three template parameters (WordEngine, LineEngine, ColumnEngine) are responsible for transformations from content stream operators (PdfOperator) to letters(PageSimpleFragment) to words(PageFragment) to lines(PageLine) and to finally to columns (PageColumn).
Currently there are simple implemenatations of these classes which are enough for many pdf files (SimpleWordEngine, SimpleLineEngine, SimpleColumnEngine).
Funcion convert flow description
1: Firstly it creates PageTextSource class with the template paramters and uses it as functor to StateUpdater::updatePdfOperators function. It means that after each operator update the [[ProgrammingFunctor functor] is called. It stores the formatting operators into PageSimpleFragment and when a text operator is encounters creates new PageSimpleFragment.
1: Then it calls PageTextSource format method which does the transformation from letters to columns.
1: Finally when the page is parsed into reasonable structures, output method is called which tries to build output format from
  • all words
  • from all columns (which contain lines, lines contain words, ...).
  • Output structure can decide whether to build the output from one or both possibilities. XmlOutputBuilder build xml from columns iterating through its lines, then words, and letters.

    New formats

    All is needed for new format is implementing derived class from OutputBuilder. It means implementing one or both building functions. Declaration of XmlOutputBuilder class
    span class="co1">//
    103     // Building interface
    104     //
    /** Build output from fragments. *//** Get result without xml header and footer. *///
    116     // Static functions
    117     //
    /** Get xml output. */

    The implementation of build is in textoutputbuilder.cc.
    These are files where you find the conversion system. In src/kernel

    Implementation notes

    Note: Transformations (letters to words, words to lines, lines to columns) are not heavily tested and are rather simple (some sorting functions are missing).
    Note 2: The biggest limitations are fonts. The font specification is embedded in pdf file, but pdfedit (nor any other tool i am aware of) can extract these fonts. It is also due to the fact, that not every character must be present in the specification and the font is therefore not complete and usable.

    Examle - pdftoxml

    See example pdftoxml: conversion from pdf to xml
    See documentation: PDFedit design documentation

    External links

    http://pdfedit.petricek.net/pdfedit.design_doc


    Deletions:
    Pdf format is not meant for neither editing nor simple text extraction etc. It can be impossible to create word/line/column representation from som pdf files. Despite these limitations, most pdf files are enough "sane" that we can managet to extract text and to build words from letters, lines from words and columns from lines.
    Text output design in pdfedit allows adding arbitrary output formats very easily.
    See example pdftoxml: conversion from pdf to xml
    See howto pdftoxml: howto convert from pdf to xml




    Edited on 2006-12-10 04:58:16 by JozefMisutka

    Additions:
    Pdf format is not meant for neither editing nor simple text extraction etc. It can be impossible to create word/line/column representation from som pdf files. Despite these limitations, most pdf files are enough "sane" that we can managet to extract text and to build words from letters, lines from words and columns from lines.
    Text output design in pdfedit allows adding arbitrary output formats very easily.
    See example pdftoxml: conversion from pdf to xml
    See howto pdftoxml: howto convert from pdf to xml


    Deletions:
    See example pdftoxml: conversion from pdf to xml
    See howto pdftoxml: howto convert from pdf to xml




    Edited on 2006-12-09 01:08:47 by JozefMisutka

    Additions:
    See example pdftoxml: conversion from pdf to xml
    See howto pdftoxml: howto convert from pdf to xml


    Deletions:
    See real example: conversion from pdf to xml



    Oldest known version of this page was edited on 2006-12-09 01:04:21 by JozefMisutka []
    Page view:

    Howto create pdf to whatever conversion


    See real example: conversion from pdf to xml