|ID||Category||Severity||Reproducibility||Date Submitted||Last Update|
|0000253||[PDFedit] =Other (GUI)=||major||always||08-06-08 07:51||07-22-09 12:01|
|Summary||0000253: Text on some PDF documents is scrambled.|
When you attempt to edit text in many PDF documents, the text displayed in the dynamic toolbar and in the parameter tree is scrambled. This is caused by PDF creators remapping the font as they create the file.
This makes the simple task of editing a string in a file an interesting game of cryptograms.
The text is correctly displayed by such tools as "Extract Text From Page".
In some documents, a different character mapping is used for each page.
Suggested ways to deal with this:
1. Provide a tool to remap the fonts and descramble the text. It would give all the glyphs in the font their correct Unicode values, and decrypt the text. (PDFs from OpenOffice are like this: their fonts contain only the glyphs that are used, but these glyphs have their correct Unicode values.)
2. Provide an unscrambled version of the string in the dynamic toolbar for the user to edit, and re-scramble the text when the user is finished. (This solution offends me(TM), although it is perhaps the cleanest from a user's perspective.)
This also raises the question of what to do when the user enters a character that is not available. The full solution is to pull the glyph in from a system font, or maybe from another page that has it. This sounds like a tricky programming problem even to untutored me.
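As a rough illustration of suggestion 2, the sketch below models a per-font table that descrambles character codes for display and re-scrambles edited text, refusing the edit when a character has no glyph in the font. `GlyphMap` and its methods are hypothetical names invented for this example and are not part of PDFedit or xpdf; it also assumes single-byte character codes, which will not hold for every font.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-font table between the 1-byte character codes stored in
// the content stream and the Unicode code points they stand for.
class GlyphMap {
public:
    void add(std::uint8_t code, char32_t unicode) {
        toUnicode_[code] = unicode;
        toCode_[unicode] = code;
    }

    // Descramble: raw string operand -> readable text.
    // Codes without an entry become U+FFFD (replacement character).
    std::u32string decode(const std::string& raw) const {
        std::u32string out;
        for (unsigned char c : raw) {
            std::map<std::uint8_t, char32_t>::const_iterator it = toUnicode_.find(c);
            out.push_back(it != toUnicode_.end() ? it->second
                                                 : static_cast<char32_t>(0xFFFD));
        }
        return out;
    }

    // Re-scramble: readable text -> raw string operand.
    // Returns false and leaves `raw` untouched if any character has no glyph
    // in this font, which is the "character not available" case.
    bool encode(const std::u32string& text, std::string& raw) const {
        std::string out;
        for (char32_t ch : text) {
            std::map<char32_t, std::uint8_t>::const_iterator it = toCode_.find(ch);
            if (it == toCode_.end()) return false;
            out.push_back(static_cast<char>(it->second));
        }
        raw = out;
        return true;
    }

private:
    std::map<std::uint8_t, char32_t> toUnicode_;
    std::map<char32_t, std::uint8_t> toCode_;
};
```

A caller would build one `GlyphMap` per font, show `decode()` output in the edit field, and only write the operand back when `encode()` succeeds. What to do on failure (importing a glyph from elsewhere) is left open here, as it is in the report.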
|Attached Files||pdftest.pdf [^] (6,527 bytes) 08-06-08 07:51|
I'd like to add the complication that many documents use 09 (tab), 10 (LF) and 13 (CR) for the ninth, tenth and thirteenth unique letters used. This makes entering them impossible!
Secondly, I am aware that this is more difficult than it appears, and may not be able to be fixed in the short term. But it would improve the usability of pdfedit greatly if it could be.
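One possible way around the control-code complication is to show and accept such codes in escaped form. PDF literal strings already allow `\ddd` octal escapes, so the sketch below writes every non-printable byte that way. `toPdfLiteral` is a hypothetical helper written for this note, not an existing PDFedit function.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render raw string bytes as a PDF literal string,
// escaping delimiters and writing non-printable bytes as \ddd octal escapes.
std::string toPdfLiteral(const std::string& raw) {
    std::string out = "(";
    for (unsigned char c : raw) {
        if (c == '(' || c == ')' || c == '\\') {
            // Parentheses and backslash must be escaped inside a literal string.
            out.push_back('\\');
            out.push_back(static_cast<char>(c));
        } else if (c < 0x20 || c > 0x7E) {
            // Tab (011), LF (012), CR (015) and other non-printables.
            char buf[5];
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
            out += buf;
        } else {
            out.push_back(static_cast<char>(c));
        }
    }
    out.push_back(')');
    return out;
}
```

With this, a string holding codes 9, 10 and 13 stays visible and typeable as `\011`, `\012` and `\015` instead of turning into whitespace in the edit field.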
The main problem is that the text selection tool extracts text directly from selected text operators:
see src/gui/pdfoperator.qs: getTextFromTextOperator
On the other hand, when we are searching for or extracting text, we are using xpdf code:
see src/kernel/cpagecontents.cc: the CPageContents::getText function, which uses a TextOutputDev object that is responsible for extracting text properly and translating it into human-readable form (src/xpdf/xpdf/TextOutputDev.cc: TextPage::dumpFragment).
So we need to change the current implementation to ask the page object for the text, giving it the coordinates of the text operator's bounding box (if this is possible, of course; I have not tried to implement it yet).
Jozo, what do I need to do with the value returned by op.getBBox so that I can pass it to Page.getText() as a rectangle?
What do you think about reimplementing getTextFromTextOperator to simply get the bounding box from the given text operator and ask its page for the text?
I can implement it, but I would like to know your opinion before I start.
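The bounding-box idea can be sketched independently of the real classes. The example below is only a model: `Rect`, `Fragment` and `textInRect` are hypothetical stand-ins, not the actual types behind op.getBBox or Page.getText(), and it assumes the bounding box may arrive with its corners in either order, so it normalizes them before matching.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical rectangle in page coordinates.
struct Rect { double xMin, yMin, xMax, yMax; };

// Hypothetical piece of already-decoded page text with its position.
struct Fragment { Rect box; std::string text; };

// Make sure (xMin, yMin) is the lower-left corner and (xMax, yMax) the upper-right.
Rect normalized(Rect r) {
    if (r.xMin > r.xMax) std::swap(r.xMin, r.xMax);
    if (r.yMin > r.yMax) std::swap(r.yMin, r.yMax);
    return r;
}

// True if `inner` lies within `outer`, allowing a small tolerance.
bool contains(const Rect& outer, const Rect& inner, double eps) {
    return inner.xMin >= outer.xMin - eps && inner.yMin >= outer.yMin - eps &&
           inner.xMax <= outer.xMax + eps && inner.yMax <= outer.yMax + eps;
}

// Concatenate, in page order, the text of every fragment inside the query box.
std::string textInRect(const std::vector<Fragment>& page, Rect query,
                       double eps = 0.5) {
    query = normalized(query);
    std::string out;
    for (const Fragment& f : page) {
        if (contains(query, normalized(f.box), eps)) out += f.text;
    }
    return out;
}
```

The tolerance is there because an operator's bounding box and the extracted fragment boxes are computed by different code paths and may not agree exactly; the value 0.5 is an arbitrary choice for the sketch.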
After some discussion, reassigning to Jozo to give background information for implementation (real text should be a part of each text operator initialized during content stream parsing).
I can prepare a patch afterwards.
Look into Gfx.cc and find all the operators where Gfx::doShowText(GString *s) is called (they should more or less correspond to the ones that call printTextUpdate()). Copy the string extraction into the same operators in stateupdater.cc, e.g.:
opTjUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator>, const PdfOperator::Operands& args, BBox* rc)
    assert (1 <= args.size ());
    // This can happen in really damaged pdfs
    // ----> store the text from args using the current font here
    StateUpdater::printTextUpdate (state, getStringFromIProperty (args), rc);
    // return changed state
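To make the "store the text using the current font" step concrete, here is a small self-contained model of such a pass. It is a sketch under simplifying assumptions and does not use the real GfxState, PdfOperator or StateUpdater classes: `Op`, `CodeTable`, `FontTables` and `annotateText` are hypothetical, fonts are reduced to single-byte lookup tables, and only the Tf and Tj operators are handled.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a parsed content-stream operator.
struct Op {
    std::string name;     // operator name, e.g. "Tf" or "Tj"
    std::string operand;  // font resource name for Tf, raw string bytes for Tj
    std::string text;     // decoded text, filled in for Tj during the pass
};

// Hypothetical per-font tables: font resource name -> (character code -> character).
typedef std::map<unsigned char, char> CodeTable;
typedef std::map<std::string, CodeTable> FontTables;

// Walk the operators in order, tracking the current font the way a state
// updater would, and store the readable text on each Tj operator.
void annotateText(std::vector<Op>& ops, const FontTables& fonts) {
    const CodeTable* current = nullptr;  // no font selected yet
    for (Op& op : ops) {
        if (op.name == "Tf") {
            FontTables::const_iterator it = fonts.find(op.operand);
            current = (it != fonts.end()) ? &it->second : nullptr;
        } else if (op.name == "Tj") {
            op.text.clear();
            for (unsigned char code : op.operand) {
                char ch = '?';  // placeholder for codes that cannot be decoded
                if (current) {
                    CodeTable::const_iterator c = current->find(code);
                    if (c != current->end()) ch = c->second;
                }
                op.text.push_back(ch);
            }
        }
    }
}
```

Because the decoded text is attached while the font state is known, the same raw bytes can decode differently under different fonts, which matches the report that some documents use a different mapping on each page.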
Just for the record, the first patches have already been posted to the devel mailing list.
These patches, however, cover only one direction, displaying text according to the font encoding. They do not handle the opposite direction, modifying or setting text according to the encoding.
The first part has been in CVS since October 2008.
|08-06-08 07:51||robbak||New Issue|
|08-06-08 07:51||robbak||File Added: pdftest.pdf|
|08-07-08 01:30||robbak||Note Added: 0000476|
|08-08-08 04:19||hockm0bm||Note Added: 0000477|
|08-08-08 04:21||hockm0bm||Status||new => assigned|
|08-08-08 04:21||hockm0bm||Assigned To||=> hockm0bm|
|08-08-08 22:41||hockm0bm||Note Added: 0000480|
|08-08-08 22:41||hockm0bm||Assigned To||hockm0bm => misuj1am|
|08-13-08 00:49||hockm0bm||Severity||minor => major|
|08-19-08 04:38||hockm0bm||Issue Monitored: hockm0bm|
|08-26-08 14:54||misuj1am||Note Added: 0000529|
|09-12-08 18:06||hockm0bm||Note Added: 0000538|
|09-12-08 18:11||hockm0bm||Assigned To||misuj1am => hockm0bm|
|09-19-08 07:01||hockm0bm||Relationship added||has duplicate 0000264|
|07-22-09 11:58||hockm0bm||Issue End Monitor: hockm0bm|
|07-22-09 12:01||hockm0bm||Note Added: 0000910|