how to decode pdf 'stream'

Jean-Yves_Pochez · October 25, 2016, 3:04pm

Hi everyone,

I’m looking for a way to decode the ‘stream’ objects in a PDF file.
Do you have any document that describes the process,
or any link or example showing how to decode such a chunk of bytes?

I’ve searched quite a lot on this and did not find anything useful.

The idea is to open and display a PDF at a desired page
(and no, Christian, I don’t want to use a plugin! Sorry, no offense).

thanks.

DaveS · October 25, 2016, 3:14pm

PDF Streams can contain all kinds of information…

Text
Images
Drawing commands (lines, rectangles, etc.)

They can also contain other information such as input fields, tables, pointers, and pointers to other pointers.

And they can be in “plain text” (human-readable) format, or they can be compressed streams (usually Flate, the same deflate method that zip uses).

So the answer is that there is no “easy” way to do it without reading and understanding the Adobe PDF Specification (and trust me, that is no easy task).

I’ve spent the last couple of years, on and off, working with the PDF specs… Download the demo of gPDF and look at the output it produces (and this is the “easy” stuff).

Jean-Yves_Pochez · October 25, 2016, 3:28pm

How do you decode the compressed (zip format) streams?
For now that’s the only thing I’m searching for
(and this is my original question!).

Christian_Schmitz · October 25, 2016, 3:30pm

What do you want to do?

DynaPDF can decode a lot for you, and you can get the decoded content stream.

You can read the PDF spec from Adobe.

Bernardo_Monsalve · October 25, 2016, 7:15pm

With zlib. Look at line 276 of the contentPage class of the DBReportPDF component; it works like the Graphics class of Xojo.

PDF files use zlib to compress the data, so you use zlib to uncompress it. As Dave says, the result could be text, an image, or other data; the object tells you what the “stream” is.
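
For readers who want to see the zlib step in isolation, here is a minimal sketch in Python (not Xojo, and not the DBReportPDF code). It scans raw PDF bytes for `stream` ... `endstream` pairs and inflates the ones whose preceding dictionary names `/FlateDecode`. The function name `decode_streams` is made up for this example, and the approach is deliberately naive: it ignores `/Length`, filter chains, `/DecodeParms` predictors, encryption, and object streams, all of which a real reader has to handle.

```python
import re
import zlib

# "stream" + end-of-line, then the raw bytes, then "endstream".
# Naive on purpose: a real reader trusts the /Length entry instead of searching.
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def decode_streams(pdf_bytes):
    """Return the payload of every stream found, inflating Flate-compressed ones."""
    results = []
    previous_end = 0
    for match in _STREAM.finditer(pdf_bytes):
        # Everything between the previous stream and this one includes the
        # stream dictionary (<< ... >>), which names the filter.
        header = pdf_bytes[previous_end:match.start()]
        previous_end = match.end()
        raw = match.group(1)
        if b"/FlateDecode" in header:
            # decompressobj() stops at the end of the zlib data, so the
            # end-of-line marker before "endstream" is simply left unused.
            raw = zlib.decompressobj().decompress(raw)
        else:
            # Unfiltered stream: only drop the end-of-line before "endstream".
            raw = raw.rstrip(b"\r\n")
        results.append(raw)
    return results
```

Reading a file in binary mode and passing its bytes to `decode_streams` should return a list of byte strings, one per stream. Streams with other filters (for example `/DCTDecode` for JPEG images) come back untouched.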

DaveS · October 25, 2016, 10:44pm

A stream is NOT one thing (image OR text); it could be any number or combination of things.
Here is a “simple” example:

stream
0. 0. 0. rg BT /F5 30 Tf 100 670 Td (This ) Tj ET 1. 0. 0. rg BT /F6 30 Tf 165 670 Td (is) Tj ET 0. 0. 0. rg BT /F5 16 Tf 190 670 Td ( the ) Tj ET BT 221 670 Td (text) Tj 1.7 w
221 664.9 m 246 664.9 l S ET BT 246 670 Td ( ) Tj ET BT /F5 30 Tf 251 670 Td (that) Tj ET BT 100 640 Td (we are going to save into our ) Tj ET BT 100 610 Td (file from the TextArea.) Tj ET 1. 0. 0. rg BT /F6 18 Tf 100 592 Td (Isn’t that \231 interesting?) Tj ET 0. 0. 0. rg BT /F5 30 Tf 100 562 Td (|\241|\231|\243|\242|?|\247|\266|\225|\252|\272|\226|\267|\252|) Tj ET BT 100 532 Td (Man, I sure do love using this ) Tj ET BT 100 502 Td (to make PDF Files!.) Tj ET q BT 1. 0. 0. RG 1. 0. 0. RG 1 0 0 1 78.0545 101.0117 cm 0.612 0.792 -0.792 0.612 0 0 cm /F5 100 Tf 0 0 Td 1 Tr 1 w (\105\166\141\154\165\141\164\151\157\156\040\115\157\144\145) Tj ET Q 1. 0. 0. rg 1 0 0 1 0 0 Tm 0 Tr /F5 14 Tf q BT 10 14 Td (\105\166\141\154\165\141\164\151\157\156\040\115\157\144\145) Tj ET BT 10 778 Td (\105\166\141\154\165\141\164\151\157\156\040\115\157\144\145) Tj ET BT 497.7112 14 Td (\105\166\141\154\165\141\164\151\157\156\040\115\157\144\145) Tj ET BT 497.7112 778 Td (\105\166\141\154\165\141\164\151\157\156\040\115\157\144\145) Tj ET Q

endstream

That happens to be 6 or 8 style runs from an Xojo TextArea translated to PDF using my gPDF class… This one doesn’t happen to have any “drawing” commands, but it just as well could.
So having the “stream” in human-readable format, as this is, may not get you anywhere unless you also know how to decipher the contents… (and don’t get me started on an image stream).
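
To give a feel for what “deciphering” involves, here is a small Python sketch that pulls the operands of the `Tj` (show text) operator out of a content stream like the one above. The helper names `read_literal_string` and `shown_text` are invented for this example. It only handles literal `( ... )` strings, so hex strings, `TJ` arrays, and positioning are ignored, and it returns raw bytes: mapping those bytes to characters depends on each font’s encoding, which this sketch does not attempt.

```python
import re

# Single-letter escapes allowed inside a PDF literal string.
_ESCAPES = {ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8, ord("f"): 12}
_TJ = re.compile(rb"\s*Tj\b")


def read_literal_string(data, start):
    """Parse the literal string whose '(' is at data[start].

    Returns (decoded_bytes, index_just_past_the_closing_parenthesis).
    """
    out = bytearray()
    depth = 1
    i = start + 1
    while i < len(data):
        c = data[i]
        if c == 0x5C and i + 1 < len(data):  # backslash escape
            i += 1
            nxt = data[i]
            if 0x30 <= nxt <= 0x37:  # up to three octal digits, e.g. \105
                j = i
                while j < len(data) and j < i + 3 and 0x30 <= data[j] <= 0x37:
                    j += 1
                out.append(int(data[i:j].decode("ascii"), 8) & 0xFF)
                i = j
                continue
            if nxt in (0x0A, 0x0D):  # backslash + end-of-line: line continuation
                if nxt == 0x0D and data[i + 1:i + 2] == b"\n":
                    i += 1
            else:
                out.append(_ESCAPES.get(nxt, nxt))
        elif c == 0x28:  # nested '(' must be balanced
            depth += 1
            out.append(c)
        elif c == 0x29:  # ')'
            depth -= 1
            if depth == 0:
                return bytes(out), i + 1
            out.append(c)
        else:
            out.append(c)
        i += 1
    raise ValueError("unterminated literal string")


def shown_text(content):
    """Collect the operand of every Tj operator, in stream order."""
    pieces = []
    i = 0
    while i < len(content):
        if content[i] == 0x28:
            text, i = read_literal_string(content, i)
            if _TJ.match(content, i):
                pieces.append(text)
        else:
            i += 1
    return pieces
```

Run on the example stream, the long octal runs such as `(\105\166\141\154\165\141\164\151\157\156\040\115\157\144\145)` come out as the bytes for `Evaluation Mode`, which is the demo watermark.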

Christian_Schmitz · October 26, 2016, 5:09am

Well, if you use DynaPDF Lite and you import that page, you can just query the images on a page and DynaPDF provides them decoded. JPEG images are passed through to a JPEG file; images in any other format can be converted to TIFF, PNG or JPEG.

see
http://www.monkeybreadsoftware.net/example-dynapdf-extractimageobjects.shtml

And when using DynaPDF Pro the ParseInterface can give you all those draw commands as events when processing them. This way we can decode all the commands just fine.
see
http://www.monkeybreadsoftware.net/class-dynapdfparseinterfacembs.shtml

DaveS · October 26, 2016, 5:20am

Christian… not sure how that applies to the question the OP asked… but a nice plug nonetheless.

Christian_Schmitz · October 26, 2016, 5:41am

Well, the OP can read the PDF specs from Adobe:
https://www.adobe.com/devnet/pdf/pdf_reference.html

Jens did, and built DynaPDF using those specs.
And it may be easier for the OP and others to just use what’s there instead of spending hours reinventing the wheel.
(Others may read the thread later and prefer the plugin.)

Jean-Yves_Pochez · October 26, 2016, 7:19am

Well, the answer I was seeking is “zlib”!
Thank you, Bernardo.
I will have to try it in my apps, but it seems to be it.

Christian and Dave, no offense, but I have been stuck with plugins that stopped being updated years ago, and I don’t want to use them anymore.
DynaPDF is a huge piece of work, and I cannot afford it (and it’s a plugin, see the line above…).
And yes, maybe people reading this thread will want to use DynaPDF.

I read (quickly) the PDF reference, but did not find any example of stream decompression.

DaveS · October 26, 2016, 1:56pm

Jean-Yves… by no means am I attempting to encourage you to use plug-ins (Christian’s or anyone else’s). All I am attempting to do is educate you on the fact that decoding the guts of a PDF file is by no means a trivial matter.

If you want an easy way to CREATE a PDF… and don’t want an expensive plug-in… look at my gPDF class (it is source code in Xojo itself)… rdS.com/gpdf

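And no, the PDF spec does not walk you through decompression: it names the stream filters (such as FlateDecode) but leaves the algorithm itself to the zlib/deflate specifications. As a matter of fact, for as verbose as the spec is, it leaves out about 90% of the important information.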