| PDFedit | Bugtracker |
| ID | Category | Severity | Reproducibility | Date Submitted | Last Update |
| 0000348 | [PDFedit] Page view | major | random | 03-13-10 15:06 | 05-03-10 21:05 |
| Reporter | siwira | View Status | public |
| Assigned To | misuj1am |
| Priority | normal | Resolution | fixed |
| Status | resolved | Product Version | 0.4.3 |
| Summary | 0000348: One character substituted for another in some PDF viewers after modifying a file with PDFedit. |
| Description |
abc0.pdf is a minimal test PDF file created with OpenOffice 3.0.0. abc1.pdf was created with PDFedit by adding a single character 'a' to abc0.pdf. In openSuSE 11.1, abc1.pdf is displayed correctly by Okular, but Adobe Reader 9 displays 'i' in place of 'l'. In Win XP, abc1.pdf is displayed correctly by Foxit Reader, but Adobe Reader 8.2 and GSView32 again display 'i' in place of 'l'. In another file every 'n' was replaced by a space. |
Notes |
(0001008) siwira 03-14-10 14:46 |
The 'i-l' substitution is reproducible with PDF files created by OpenOffice 3.x. Any modification (adding or deleting a character or a graphical object) to a small test file seems to produce the same effect. |
|
(0001034) hockm0bm 04-29-10 11:56 |
Sorry for the long time until someone got to this issue.

OK, I can confirm that abc0.pdf has the sequence "...g h i j k l m n..." while the updated document abc1.pdf has "...g h i j k i m n..." in Acrobat Reader 9.2.10, and that I can reproduce the same behavior with the current CVS snapshot. PDFedit and xpdf display the same sequence in both documents.

---
[*] I cannot test in Windows |
|
(0001035) hockm0bm 04-29-10 12:01 |
Jozo, I am wondering why the updated document also contains [2 0], the original content stream, as a changed object. I thought that we add changes only as new streams, so the only changed objects should be the page dictionary ([1 0] here), which has a changed Contents entry, and a new object for the new content stream. Here we have [1 0] and [2 0] changed, and [10 0] (font) and [14 0] (content stream) created.

I have printed [2 0] in the original document and in the updated one and they look pretty similar:

abc0.pdf [2 0]:
    0.1 w q 0 0.1 595.3 841.9 re W* n q 0 0 0 rg BT 56.8 772.2 Td /F1 14 Tf[<0102030204>-7<020502060207>4<02080209020A>-2<020B>-2<020C020D>-2<020E>-2<020F0210021102120213>-3<0214>3<0215>-2<021602170218>1<0219021A021B02>]TJ ET Q q 0 0 0 rg BT 56.8 740 Td /F1 14 Tf[<1C>51<021D>2<021E>2<021F>-7<0220>3<0221>-1<0222>1<0223>1<0224>-3<0225>3<0226>1<0227>31<0228>3<0229>1<022A>1<022B>35<022C>1<022D>2<022E>-1<02>14<2F>17<0230>1<02>21<31>15<02>21<32>14<0233>1<02>36<34>36<0235>]TJ ET Q q 0 0 0 rg BT 56.8 707.8 Td /F1 14 Tf<36023702380239023A023B023C023D023E023F>Tj ET Q Q

abc1.pdf [2 0]:
    0.1 w q 0 0.1 595.3 841.9 re W* n q 0 0 0 rg BT 56.8 772.2 Td /F1 14 Tf[ () -7 () 4 ( ) -2 ( ) -2 ( ) -2 () -2 () -3 () 3 () -2 (▒) 1 (▒]TJ ET Q q 0 0 0 rg BT 56.8 740 Td /F1 14 Tf [ () 51 () 2 () 2 () -7 ( ) 3 (!) -1 (") 1 (#) 1 ($) -3 (%) 3 (&) 1 (') 31 (\() 3 (\)) 1 (*) 1 (+) 35 (,) 1 (-) 2 (.) -1 () 14 (/) 17 (0) 1 () 21 (1) 15 () 21 (2) 14 (3) 1 () 36 (4) 36 (5) ] TJ ET Q q 0 0 0 rg BT 56.8 707.8 Td /F1 14 Tf (6789:;<=>?) Tj ET Q Q

How did we end up with different parameters to the Tf operator? The rest seems to be the same at first glance. Can this be a problem? |
|
(0001036) misuj1am 04-29-10 21:21 |
i tried add_text tool and it worked just fine in Adobe Reader 8.1 and foxit (my edited pdf file is different to uploaded one)! i suppose there is a problem with adding external font but cannot repro. please michal, either assign this to yourself when you can repro or close this bug :( |
|
(0001037) hockm0bm 04-30-10 11:22 |
You are right. The issue doesn't happen with the add_text tool, which uses only the core functionality without the GUI. The first thing that comes to mind is that this might be related to http://pdfedit.petricek.net/bt/view.php?id=328.

There are still 2 questions, though:

1) What happened to the stream data that made it differ? Do we change the stream buffer in any way? Just to be clear, I am not interested in why or from where somebody changed it, but in how those numbers <0102030204> changed to () with non-printable characters.

2) When I compare the added content streams from the GUI and from add_text, they are a little bit different:

* add_text tool [14 0]:
    /PdfEdit << /Time (1272618244) >> DP q BT /PDFEDIT_F1 15 Tf 1 0 0 1 1 1 Tm (a) Tj ET Q
* GUI [14 0]:
    /PdfEdit << /Time (1272534322) >> DP q 1 0 0 1 0 0 cm BT /PDFEDIT_F1 10 Tf 66 571 Td 0 0 0 rg (a) Tj ET Q

Both are supposed to use CPage::addText, aren't they? Then how did we end up with `Tm (a)' vs. `Td 0 0 0 rg (a)'? `1 0 0 1 0 0 cm' is not interesting because it just applies an identity transformation matrix, AFAIU.

> please michal, either assign this to yourself when you can repro or close
> this bug :(

We should address the 2 questions before we close this, I think. |
|
(0001038) misuj1am 04-30-10 14:49 |
i updated add_text to match gui more (now system fonts are used) but it works just fine. problem must be in gui usage of kernel. please michal, you have everything to debug it, can you give it a try? |
|
(0001039) hockm0bm 04-30-10 15:11 |
> i updated add_text to match gui more (now system fonts are used) but it works
> just fine. problem must be in gui usage of kernel.

You have just added the font, haven't you? But this doesn't answer why the stream is different (question 2). Does the GUI use the same implementation? I cannot find any page->addText in there...

> please michal, you have everything to debug it, can you give it a try?

I certainly can, but I don't know the code, so some hints would be really helpful. E.g. what could cause <0102030204> to change to () with non-printable characters? |
|
(0001040) hockm0bm 04-30-10 15:42 |
I have removed the changed object [2 0] (the original content stream) and updated the xref accordingly in abc0_fixed_manually.pdf, and this file is displayed correctly in Acrobat Reader as well. So the real cause of the issue is a bogus change of the original stream, which makes this the very same issue as 0000312. |
|
(0001041) hockm0bm 04-30-10 15:45 |
> So the real cause of the issue is a bogus change of the original stream. So
> this is very same issue as 0000312.

and # 328 |
|
(0001042) misuj1am 04-30-10 15:48 |
the code is in dialogs.qs

    function addText (_x1,_y1,_x2,_y2, _glob_left,_glob_top) {
        if (!isPageAvaliable()) {
            warn(tr("No page selected!"));
            return;
        }
        if (undefined == _x1 || undefined == _y1 || undefined == _x2 || undefined == _y2
            || undefined == _glob_left || undefined == _glob_top) {
            return;
        }
        //
        // Convert x,y to real x,y
        //
        global_addText_x = Math.min (convertPixmapPosToPdfPos_x(_x1,_y1),convertPixmapPosToPdfPos_x(_x2,_y2));
        global_addText_y = Math.min (convertPixmapPosToPdfPos_y(_x1,_y1),convertPixmapPosToPdfPos_y(_x2,_y2));
        if (_x1 < _x2) _glob_left = _glob_left - _x2 + _x1;
        if (_y1 > _y2) _glob_top = _glob_top + _y1 - _y2;
        var lineEdit = PageSpace.getTextLine( _glob_left, _glob_top, getNumber( "fontsize" ), getEditText( "fontface" ) );
        lineEdit.resize( Math.max( 50, Math.abs (_x2 - _x1)), lineEdit.height );
        connect( lineEdit, "returnPressed(const QString&)", _AddTextSlot );
        connect( lineEdit, "lostFocus(const QString&)", _AddTextSlot );
    }

    function _AddTextSlot ( text ) {
        if ((undefined == text) || (text.isEmpty())) {
            return;
        }
        var thepage = page();
        var fname = getEditText( "fontface" );
        var fid=thepage.getFontId( fname );
        if (fid.isEmpty()) {
            thepage.addSystemType1Font( fname );
            fid = thepage.getFontId( fname );
        }
        var fs=getNumber( "fontsize" );
        var ctm = getDetransformationMatrix( thepage );
        operatorAddTextLine ( text, global_addText_x, global_addText_y, fid, fs,
                              createOperator_transformationMatrix( ctm ), getColor("fg"));
        // Update
        go();
    }

which calls

    function operatorAddTextLine (text,x,y,fname,fsize,opToPutBefore,col) {
        //
        // q
        // BT
        // rg col
        // fname fsize Tf
        // x y Td
        // text Tj
        // ET
        // Q
        //
        var q = createCompositeOperator("q","Q");
        var BT = createCompositeOperator("BT","ET");
        if ((undefined != opToPutBefore) && (opToPutBefore.type() == "PdfOperator"))
            q.pushBack( opToPutBefore, q );
        q.pushBack (BT,q);
        putfont(BT,fname,fsize);
        puttextrelpos (BT,x,y);
        if (undefined != col)
            putnscolor (BT,col.red,col.green,col.blue);
        puttext (BT,text);
        putendtext (BT);
        putendq(q);
        var ops = createPdfOperatorStack();
        ops.append (q);
        page().prependContentStream(ops);
    }
|
(0001043) hockm0bm 04-30-10 15:56 |
You are right. I grepped only in *.cc, so I didn't catch it. |
|
(0001044) hockm0bm 04-30-10 16:43 |
Hmm. I have checked all calls to CXref::changeObject and we are getting the following changes when a simple text is added (rough backtraces):

* [10 0] - from QSPage::addSystemType1Font -> CPage -> CPageFonts -> CDict -> CPdf -> XRefWriter -> CXref
* [14 0] - from QSPage::appendContentStream -> CPage -> CPageContents -> CPdf::addIndirectProperty -> XRefWriter -> CXref
* [1 0] - QSPage::appendContentStream -> CPage -> CPageContents::addToBack -> cc_add -> CPdf::changeIndirectObject -> CXref
* [2 0] - QLineEdit::returnPressed -> [...] -> QSContentStream::saveChange -> CContentStream::save -> CStream::setBuffer -> CDict::delProperty -> IProperty::dispatchChange -> CPdf -> CXref

The first three are perfectly OK. [10 0] is the added font dictionary, [14 0] is the created stream and [1 0] is the page dictionary, which has a new entry in Contents.

The last one, [2 0], is interesting, though. First of all, why has the change been triggered by the edit box key press? This sounds plainly wrong. The other interesting thing is that we go through CStream::setBuffer:

    [...]
    if (filters.size())
    {
        // TODO when can this happen? Our changes are
        // always made in the separate streams!!!
        kernelPrintDbg(debug::DBG_DBG, "Removing Filter entry from the stream");
        dictionary.delProperty ("Filter");
    }

so this sounds like an unexpected code path. |
|
(0001046) hockm0bm 04-30-10 17:37 |
I have to leave now and will continue with this on Monday. Why QSContentStream::saveChange is called is a question for Martin, and I have asked him to explain that in bt#328. Nevertheless, even if we fix that, we should still look at how it is possible that the changed content stream gets mangled. Jozo, I think that you are able to reproduce that by modifying that object directly or simply by calling the change() method on it. |
|
(0001047) misuj1am 04-30-10 19:40 |
1.

    void setBuffer (const Container& buf)
    ...
    getFilters(filters);
    if (filters.size())
    {
        // TODO when can this happen? Our changes are
        // always made in the separate streams!!!
        kernelPrintDbg(debug::DBG_DBG, "Removing Filter entry from the stream");
        dictionary.delProperty ("Filter");

is completely ok, because we do NOT encode streams on CStream level! we correctly save decoded buffer at this point; however, the problem can be the contents of the buffer or later incorrect encoding... |
|
(0001048) misuj1am 04-30-10 20:06 |
logically, the characters are written correctly

    020A -> i
    020B -> j
    020C -> k
    020D -> i   <-- strange is that this is displayed the same as 020A = i
|
|
(0001049) misuj1am 04-30-10 20:45 |
it seems to be an Adobe bug: foxit, our gui, google docs and zohoviewer display it correctly. it would be nice to find out exactly where the problem is by not encoding, flattening, and shortening the string... |
|
(0001050) siwira 05-02-10 12:21 edited on: 05-02-10 20:45 |
This bug follows a simple rule: the 13th distinct character in a file is displayed as the 10th distinct character.

I've produced two minimal example files with OpenOffice 3.2. In Ex1-Updated "C" is displayed as " ". In Ex2-Updated "C" is displayed as "@", showing that it is distinct characters that must be counted.

The file Appendix B in bt#312 follows the same rule: "n" is the 13th and "y" the 10th distinct character in "Waitakere City Council". Interestingly, this file was produced with a different application, FREE PDFill.

Adobe is not the only viewer that displays the files incorrectly. In Linux (openSUSE 11.2), GNU gv 3.6.9 and Ghostview 1.5 also display updated files incorrectly. In Win XP, PDF-XChange and gsview32 do the same. Foxit and Sumatra display them correctly. |
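The counting rule above can be checked mechanically. The following is only a minimal sketch, not PDFedit code: `nthDistinctChar` is a hypothetical helper, and it assumes that the producing application numbers characters in order of first appearance (spaces included), which is what the content stream dumps in this thread suggest.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not part of PDFedit).
// Returns the n-th distinct character of `text` (1-based), counted in
// order of first appearance, or '\0' if there are fewer than n distinct
// characters. Spaces count as characters.
char nthDistinctChar(const std::string& text, std::size_t n)
{
    std::string seen; // distinct characters in first-seen order
    for (char c : text) {
        if (seen.find(c) == std::string::npos) {
            seen.push_back(c);
            if (seen.size() == n)
                return c;
        }
    }
    return '\0';
}
```

For "Waitakere City Council" this yields 'n' for position 13 and 'y' for position 10, matching the substitution reported for Appendix B in bt#312.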
|
(0001052) hockm0bm 05-03-10 15:46 |
Let's get back to the output of the "changed" content stream [2 0]. I was digging around that a bit, and the change is that the original stream contains hexadecimal strings while the changed one has them converted to literal strings. I have tried changing simpleValueToString<pString> to export all strings which contain non-printable characters as hexadecimal strings, and Acrobat Reader doesn't complain anymore. So either we are doing something wrong or those applications have a bug. Never mind, the patch will follow. |
|
(0001053) hockm0bm 05-03-10 17:04 |
OK, let's see what we get from pdfobjects::CContentStream::_objectChanged -> getStringRepresentation:

    "0.1 w q 0 0.1 595.3 841.9 re W* n q 0 0 0 rg BT 56.8 772.2 Td /F1 14 Tf [ (\001\002\003\002\004) -7 (\002\005\002\006\002\a) 4 (\002\b\002\t\002\n) -2 (\002\v) -2 (\002\f\002\r) -2 (\002\016) -2 (\002\017\002\020\002\021\002\022\002\023) -3 (\002\024) 3 (\002\025) -2 (\002\026\002\027\002\030) 1 (\002\031\002\032\002\033\002) ] TJ ET Q q "

The broken characters (i j k l) map to:

    orig:    ...020A>-2<020B>-2<020C020D>
    changed: ...\002\n) -2 (\002\v) -2 (\002\f\002\r)

and they are pretty much the same. I am just not sure whether \r could cause some problems during document parsing. Maybe it should be escaped? |
|
(0001054) hockm0bm 05-03-10 17:19 |
And here is the same getStringRepresentation from Ex2.pdf (here we have C->@). I don't know how to map Tf parameters into characters, though:

    0.1 w /Artifact BMC q 0 0 595.4 842 re W* n EMC /Standard <<\n/MCID 0\n>> BDC q 0 0 0 rg BT 56.8 762.9 Td /F1 24 Tf [ (\001\001\001\002\002\003) 4 (\004\004\004\005\005\006\006\005\005\004\004\004) -3 (\a) 4 (\b\b\t\t\t\n\n\n\v) -2 (\v) 1 (\f\f\f\r\r\r) -4 (\016) ] TJ ET Q

while the original one was:

    0.1 w /Artifact BMC q 0 0 595.4 842 re W* n EMC /Standard<</MCID 0>>BDC q 0 0 0 rg BT 56.8 762.9 Td /F1 24 Tf[<010101020203>4<040404050506060505040404>-3<07>4<08080909090A0A0A0B>-2<0B>1<0C0C0C0D0D0D>-4<0E>]TJ ET Q
|
|
(0001055) misuj1am 05-03-10 17:21 |
i did not understand it, can you describe the problem in more detail? |
|
(0001056) hockm0bm 05-03-10 17:23 |
> i did not understand it, can you describe the problem in more detail?

What exactly didn't you understand? |
|
(0001057) misuj1am 05-03-10 17:43 |
The characters i j k l:

    orig:    ...020A>-2<020B>-2<020C020D>
    changed: ...\002\n) -2 (\002\v) -2 (\002\f\002\r)

=========

what exactly is \v? \f? that is what we produce? according to pdf spec only

    \n   Line feed (LF)
    \r   Carriage return (CR)
    \t   Horizontal tab (HT)
    \b   Backspace (BS)
    \f   Form feed (FF)
    \(   Left parenthesis
    \)   Right parenthesis
    \\   Backslash

are known, if there is \v the \ is ignored |
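One way to answer the "should it be escaped?" question is to escape every byte that is not safe inside a literal string. The following is only a sketch under that assumption: `escapePdfLiteral` is a hypothetical helper, not PDFedit's simpleValueToString. It uses the escapes listed above and falls back to the three-digit octal form \ddd, which the PDF specification also allows in literal strings, for all other non-printable bytes.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper for illustration only (not PDFedit's serializer).
// Escapes raw bytes so they can be placed between "(" and ")" of a PDF
// literal string without being reinterpreted by the parser.
std::string escapePdfLiteral(const std::string& raw)
{
    std::string out;
    for (unsigned char c : raw) {
        switch (c) {
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break; // a raw CR would be read back as LF
        case '\t': out += "\\t";  break;
        case '\b': out += "\\b";  break;
        case '\f': out += "\\f";  break;
        case '(':  out += "\\(";  break;
        case ')':  out += "\\)";  break;
        case '\\': out += "\\\\"; break;
        default:
            if (c < 0x20 || c > 0x7E) {
                // Any other non-printable byte: three-digit octal escape.
                char buf[5];
                std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
                out += buf;
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}
```

With this scheme the bytes 02 0D would be written as `\002\r`, and byte 0B (which a debugger may show as \v) as `\013`.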
|
(0001058) hockm0bm 05-03-10 17:52 |
> what exactly is \v? \f? that is what we produce? according to pdf spec only

This is what GDB printed when I displayed the generated string. \v and \f are just printed instead of \013 and \014 (0B and 0C in hex), as GDB recognizes them in the ASCII table, I guess. What we write to the file are really the bytes ASCII 11 and ASCII 12. |
|
(0001059) hockm0bm 05-03-10 20:31 |
I have played a bit with the save_binary_strings_as_hexa.patch to find out which binary data might be harmful, and it looks like CR (ASCII 13) is the culprit:

    +bool isBinaryString(const std::string&val)
    +{
    +	for(std::string::const_iterator i = val.begin(); i != val.end(); ++i)
    +		if(*i == 13)
    +			return true;
    +	return false;
    +}

With this in place I am getting a document which is displayed correctly. Just for background: with this change, each string which contains CR will be stored as a hex string. |
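To illustrate what "stored as a hex string" means for the output, here is a minimal sketch. It is not the committed patch: `containsCarriageReturn`, `toPdfHexString` and `serializePdfString` are hypothetical names, and the literal branch only escapes parentheses and backslashes.

```cpp
#include <string>

// Hypothetical helpers for illustration only (not the committed patch).

// True if the string contains a raw carriage return (byte 13).
bool containsCarriageReturn(const std::string& val)
{
    return val.find('\r') != std::string::npos;
}

// Writes every byte as two uppercase hex digits between '<' and '>'.
std::string toPdfHexString(const std::string& raw)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out = "<";
    for (unsigned char c : raw) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    out += '>';
    return out;
}

// Chooses the hex form when a CR is present, the literal form otherwise.
std::string serializePdfString(const std::string& raw)
{
    if (containsCarriageReturn(raw))
        return toPdfHexString(raw);
    std::string out = "(";
    for (char c : raw) {
        if (c == '(' || c == ')' || c == '\\')
            out += '\\'; // keep delimiters and backslashes literal
        out += c;
    }
    out += ')';
    return out;
}
```

Under these assumptions the bytes 02 0C 02 0D come out as `<020C020D>`, the same form the original OpenOffice stream used, while plain text such as `6789` stays a literal string.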
|
(0001060) hockm0bm 05-03-10 20:55 |
OK, I think I've finally got it. The specification says:

"If an end-of-line marker appears within a literal string without a preceding backslash, the result is equivalent to \n (regardless of whether the end-of-line marker was a carriage return, a line feed, or both)."

This means that a raw \r (0D) in our literal string is evaluated as \n (0A), so the code 020D is read back as 020A and 'l' is displayed as 'i'. This also explains why the save_binary_strings_as_hexa.patch fixed the issue: there is no raw \r in a hexadecimal string.

So there is no bug in Acrobat Reader and in fact the "bug" is on our end. |
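The quoted rule can be demonstrated with a small reader for the inside of a literal string. This is only a sketch under simplifying assumptions: `decodeLiteralBody` is a hypothetical function, not code from PDFedit or any viewer, and it expects just the text between the outer parentheses, without nested unescaped parentheses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical function for illustration only.
// Decodes the body of a PDF literal string (the text between the outer
// parentheses). An unescaped end-of-line marker (CR, LF or CR LF) is
// read as a single LF, as the specification requires.
std::string decodeLiteralBody(const std::string& body)
{
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        char c = body[i];
        if (c == '\\' && i + 1 < body.size()) {
            char e = body[++i];
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits.
                int v = e - '0';
                for (int n = 0; n < 2 && i + 1 < body.size()
                         && body[i + 1] >= '0' && body[i + 1] <= '7'; ++n)
                    v = v * 8 + (body[++i] - '0');
                out += static_cast<char>(v & 0xFF);
                continue;
            }
            switch (e) {
            case 'n': out += '\n'; break;
            case 'r': out += '\r'; break; // escaped CR survives as 0D
            case 't': out += '\t'; break;
            case 'b': out += '\b'; break;
            case 'f': out += '\f'; break;
            case '\r': // backslash + end-of-line: line continuation
                if (i + 1 < body.size() && body[i + 1] == '\n') ++i;
                break;
            case '\n': break; // line continuation
            default: out += e; // \( \) \\ and unknown escapes: drop the backslash
            }
        } else if (c == '\r') {
            out += '\n'; // raw CR is an end-of-line marker, read as LF
            if (i + 1 < body.size() && body[i + 1] == '\n') ++i; // CR LF is one marker
        } else {
            out += c;
        }
    }
    return out;
}
```

Fed the raw bytes 02 0C 02 0D, this returns 02 0C 02 0A, which is exactly the 'l' to 'i' substitution from this report. The escaped forms `\r` and `\015` both come back as 0D.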
|
(0001061) hockm0bm 05-03-10 21:05 |
save_binary_strings_as_hexa.patch has been committed to the CVS. |