PDFedit Bugtracker
  

Viewing Issue Simple Details
ID Category Severity Reproducibility Date Submitted Last Update
0000253 [PDFedit] =Other (GUI)= major always 08-06-08 07:51 07-22-09 12:01
Reporter robbak View Status public  
Assigned To hockm0bm
Priority normal Resolution open  
Status assigned   Product Version 0.3.2
Summary 0000253: Text on some PDF documents is scrambled.
Description When you attempt to edit text in many PDF documents, the text displayed in the dynamic toolbar and in the parameter tree is scrambled. This happens because PDF creators remap the font's character codes when they create the file.
This makes the simple task of editing a string in a file an interesting game of cryptograms.
The text is correctly displayed by such tools as "Extract Text From Page".
In some documents, a different character mapping is used for each page.
Additional Information Suggested ways to deal with this:
1. Provide a tool to remap the fonts and descramble the text. It would give all the glyphs in the font their correct Unicode values and decode the text accordingly. (PDFs from OpenOffice are like this: their fonts contain only the glyphs used, but these glyphs have their correct Unicode values.)
2. Show an unscrambled version of the string in the dynamic toolbar for the user to edit, and re-scramble the text when the user is finished. (This solution offends me(TM), although it is perhaps the cleanest from a user's perspective.)

This also raises the question of what to do when the user enters a character that is not available in the font. The full solution is to pull the glyph in from a system font, or perhaps from another page that has it. This sounds like a tricky programming problem, even to untutored me.
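
Suggestion 2 and the missing-character question above can be illustrated with a small C++17 sketch. It is not PDFedit code: `CodeMap`, `decodeText` and `encodeText` are hypothetical names, and the map stands in for whatever per-font (or per-page) code-to-Unicode table the document actually provides. It also assumes one-byte character codes.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical per-font mapping: raw character code -> Unicode code point.
// In a real PDF this information would come from the font's encoding data.
using CodeMap = std::map<unsigned char, char32_t>;

// Decode the raw bytes of a text operand into readable text.
// Codes that are missing from the map become U+FFFD (replacement character).
std::u32string decodeText(const std::string& raw, const CodeMap& map)
{
    std::u32string out;
    for (unsigned char code : raw) {
        auto it = map.find(code);
        out.push_back(it != map.end() ? it->second : U'\uFFFD');
    }
    return out;
}

// Re-encode edited text with the same mapping ("re-scramble").
// Returns std::nullopt when the text needs a character the subset font lacks,
// which is exactly the case where a glyph would have to be pulled in.
std::optional<std::string> encodeText(const std::u32string& text, const CodeMap& map)
{
    // Build the inverse table: Unicode code point -> raw character code.
    std::map<char32_t, unsigned char> inverse;
    for (const auto& [code, unicode] : map)
        inverse.emplace(unicode, code);

    std::string out;
    for (char32_t ch : text) {
        auto it = inverse.find(ch);
        if (it == inverse.end())
            return std::nullopt;
        out.push_back(static_cast<char>(it->second));
    }
    return out;
}
```

With a map such as `{1: H, 2: e, 3: l, 4: o}`, the raw bytes 1 2 3 3 4 decode to "Hello", an edit to "Hell" re-encodes cleanly, and an edit to "Help" is rejected because the font has no code for "p".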
Attached Files  pdftest.pdf (6,527 bytes) 08-06-08 07:51

- Relationships
has duplicate 0000264 closed hockm0bm Arabic fonts might not display probably 

- Notes
(0000476)
robbak
08-07-08 01:30

I'd like to add the complication that many documents use 09 (tab), 10 (lf) and 13 (cr) as the codes for the ninth, tenth and thirteenth unique letters used. This makes entering them impossible!

Secondly, I am aware that this is more difficult than it appears, and may not be fixable in the short term. But it would greatly improve the usability of pdfedit if it could be.
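
As long as the raw codes are what the user edits, one possible workaround for the control-code problem is to show unprintable codes as three-digit octal escapes (the same `\ddd` form that PDF literal strings accept) and convert them back afterwards. The sketch below only illustrates that idea; `escapeRaw` and `unescapeRaw` are hypothetical helpers, not existing PDFedit functions.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Show raw operand bytes in a form that can be typed into an edit box:
// printable ASCII stays as-is, a backslash is doubled, and everything else
// becomes a three-digit octal escape such as \011 for code 9.
std::string escapeRaw(const std::string& raw)
{
    std::string out;
    for (unsigned char c : raw) {
        if (c == '\\') {
            out += "\\\\";
        } else if (c < 0x20 || c > 0x7E) {
            char buf[5];  // backslash + 3 octal digits + terminator
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
            out += buf;
        } else {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}

// Convert the edited, escaped form back into raw bytes.
std::string unescapeRaw(const std::string& shown)
{
    std::string out;
    for (std::size_t i = 0; i < shown.size(); ++i) {
        if (shown[i] != '\\' || i + 1 >= shown.size()) {
            out.push_back(shown[i]);
            continue;
        }
        if (shown[i + 1] == '\\') {  // doubled backslash
            out.push_back('\\');
            ++i;
            continue;
        }
        // Read up to three octal digits after the backslash.
        unsigned value = 0;
        std::size_t n = 0;
        while (n < 3 && i + 1 + n < shown.size()
               && shown[i + 1 + n] >= '0' && shown[i + 1 + n] <= '7') {
            value = value * 8 + static_cast<unsigned>(shown[i + 1 + n] - '0');
            ++n;
        }
        if (n == 0) {  // unknown escape: keep the backslash literally
            out.push_back('\\');
            continue;
        }
        out.push_back(static_cast<char>(value));
        i += n;
    }
    return out;
}
```

If the toolbar showed decoded text instead of raw codes, this escaping would no longer be needed for ordinary letters.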
 
(0000477)
hockm0bm
08-08-08 04:19

The main problem is that the text selection tool extracts text directly from selected text operators:
see src/gui/pdfoperator.qs: getTextFromTextOperator

On the other hand, when we search for or extract text, we use xpdf code:

see src/kernel/cpagecontents.cc: the CPageContents::getText function uses a TextOutputDev object, which is responsible for extracting the text properly and translating it into human-readable form (src/xpdf/xpdf/TextOutputDev.cc: TextPage::dumpFragment).

So we need to change the current implementation to ask the page object for the text, passing the coordinates of the text operator's bounding box (if this is possible, of course; I have not tried to implement it yet).

Jozo, what do I need to do with the value returned by op.getBBox so that I can pass it to Page.getText() as a rectangle?

What do you think about reimplementing getTextFromTextOperator to simply get the bounding box from the given text operator and ask its page for the text?

I can implement it, but I would like to know your opinion before I start.
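
The bounding-box idea can be sketched in isolation, without the real kernel or xpdf classes. Everything below is an assumption made for illustration: `Rect`, `TextFragment`, `normalized` and `textInRect` are hypothetical, the page text is taken to be already decoded into positioned fragments, and the corner normalization is only a guess at what a bounding box might need before it can serve as a query rectangle.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical rectangle and decoded text fragment, in page coordinates.
struct Rect { double xMin, yMin, xMax, yMax; };
struct TextFragment { Rect box; std::string text; };

// A bounding box may arrive with its corners in any order, so sort them
// before using it as a query rectangle.
Rect normalized(Rect r)
{
    if (r.xMin > r.xMax) std::swap(r.xMin, r.xMax);
    if (r.yMin > r.yMax) std::swap(r.yMin, r.yMax);
    return r;
}

// True when `inner` lies within `outer`, with a small tolerance for rounding.
bool contains(const Rect& outer, const Rect& inner, double eps = 0.5)
{
    return inner.xMin >= outer.xMin - eps && inner.xMax <= outer.xMax + eps
        && inner.yMin >= outer.yMin - eps && inner.yMax <= outer.yMax + eps;
}

// Collect, in page order, the decoded text of all fragments inside the query box.
std::string textInRect(const std::vector<TextFragment>& page, Rect query)
{
    const Rect q = normalized(query);
    std::string out;
    for (const TextFragment& f : page)
        if (contains(q, normalized(f.box)))
            out += f.text;
    return out;
}
```

A text operator whose box covers the fragments "Hello" and " world" on one line would then yield "Hello world", whichever corner order the box uses.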
 
(0000480)
hockm0bm
08-08-08 22:41

After some discussion, reassigning to Jozo to give background information for implementation (real text should be a part of each text operator initialized during content stream parsing).

I can prepare a patch afterwards.
 
(0000529)
misuj1am
08-26-08 14:54

Look into Gfx.cc and find all the operators where Gfx::doShowText(GString *s) is called (they should more or less correspond to the places where printTextUpdate() is called). Copy the string extraction into the same operators in stateupdater.cc, e.g.:

    // "Tj"
    GfxState *
    opTjUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator>, const PdfOperator::Operands& args, BBox* rc)
    {
        assert (1 <= args.size ());

        // The font can be missing in really damaged PDFs
        if (state->getFont())
        {
            // TODO: store the text from args[0] here, decoded using the current font
            StateUpdater::printTextUpdate (state, getStringFromIProperty (args[0]), rc);
        }

        // return changed state
        return state;
    }
 
(0000538)
hockm0bm
09-12-08 18:06

Just for the record, the first patches are already on the devel mailing list:
http://sourceforge.net/mailarchive/forum.php?thread_name=20080908220716.057622337%40gmail.com&forum_name=pdfedit-devel

These patches, however, solve only one direction (displaying text according to the font encoding). They do not handle the opposite direction (modifying or setting text according to the encoding).
 
(0000910)
hockm0bm
07-22-09 12:01

The first part has been in CVS since October 2008.
 

- Issue History
Date Modified Username Field Change
08-06-08 07:51 robbak New Issue
08-06-08 07:51 robbak File Added: pdftest.pdf
08-07-08 01:30 robbak Note Added: 0000476
08-08-08 04:19 hockm0bm Note Added: 0000477
08-08-08 04:21 hockm0bm Status new => assigned
08-08-08 04:21 hockm0bm Assigned To  => hockm0bm
08-08-08 22:41 hockm0bm Note Added: 0000480
08-08-08 22:41 hockm0bm Assigned To hockm0bm => misuj1am
08-13-08 00:49 hockm0bm Severity minor => major
08-19-08 04:38 hockm0bm Issue Monitored: hockm0bm
08-26-08 14:54 misuj1am Note Added: 0000529
09-12-08 18:06 hockm0bm Note Added: 0000538
09-12-08 18:11 hockm0bm Assigned To misuj1am => hockm0bm
09-19-08 07:01 hockm0bm Relationship added has duplicate 0000264
07-22-09 11:58 hockm0bm Issue End Monitor: hockm0bm
07-22-09 12:01 hockm0bm Note Added: 0000910