PDF

Base definition (PDF 1.7) and notes

"ISO 32000 specifies a digital form for representing documents called the Portable Document Format or usually referred to as PDF. PDF was developed and specified by Adobe Systems Incorporated beginning in 1993 and continuing until 2007 when this ISO standard was prepared. The Adobe Systems version PDF 1.7 is the basis for the ISO 32000 edition. The specifications for PDF are backward inclusive, meaning that PDF 1.7 includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. It should be noted that where Adobe removed certain features of PDF from their standard, they too are not included in ISO 32000.

The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independently of the environment in which they were created or the environment in which they are viewed or printed. At the core of PDF is an advanced imaging model derived from the PostScript® page description language. This PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner. To improve performance for interactive viewing, PDF defines a more structured format than that used by most PostScript language programs. Unlike Postscript, which is a programming language, PDF is based on a structured binary file format that is optimized for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange.

PDF files may be created natively in PDF form, converted from other electronic formats or digitized from paper, microform, or other hard copy format. Businesses, governments, libraries, archives and other institutions and individuals around the world use PDF to represent considerable bodies of important information."
- PDF specification

The preceding paragraphs are quoted from the specification and give a brief history of the format. But what is PDF in practice, and what should you know about it if you don't want to study the entire spec (756 pages plus supplementary material such as font formats) and implement your own component? Setting the binary file structure aside, PDF is a vector-based format for representing documents. Each page contains a sequence of drawing commands, for example draw a line, fill a polygon, change the stroking color, or draw an image, together with text-drawing operators. Several well-known font formats are supported and, since all drawing operations in PDF are vector-based (except for raster images), PDF pages usually scale well and are resolution-independent. This documentation treats PDF manipulation at a high level of abstraction, based on the API provided by Apitron PDF Kit and Apitron PDF Rasterizer, and dives into binary details only when absolutely necessary.
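To make the idea of "drawing operations in the form of commands" more concrete, here is a minimal Python sketch that assembles the text of a page content stream from a few standard PDF operators (`m`, `l`, `S`, `re`, `f`, `RG`, `BT`/`Tf`/`Td`/`Tj`/`ET`). It is only an illustration and does not use the Apitron API: the helper names are hypothetical, the font resource name `F1` is an assumption that would have to be defined in the page's resources, and the result is just the content stream, not a complete PDF file.

```python
def stroke_color(r, g, b):
    """Set the stroking color in DeviceRGB (components in 0..1)."""
    return f"{r} {g} {b} RG"


def line(x1, y1, x2, y2):
    """Start a path at (x1, y1), add a line to (x2, y2), then stroke it."""
    return f"{x1} {y1} m {x2} {y2} l S"


def filled_rect(x, y, w, h):
    """Append a rectangle to the path and fill it."""
    return f"{x} {y} {w} {h} re f"


def text(font_name, size, x, y, s):
    """Show a string at (x, y) using a font resource name such as F1."""
    # Backslashes and parentheses must be escaped inside a literal string.
    escaped = s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"BT /{font_name} {size} Tf {x} {y} Td ({escaped}) Tj ET"


def build_content_stream():
    """Join a few operations into the bytes of a page content stream."""
    ops = [
        stroke_color(1, 0, 0),                   # red stroking color
        line(72, 72, 300, 72),                   # horizontal line
        filled_rect(72, 100, 100, 50),           # filled rectangle
        text("F1", 12, 72, 200, "Hello (PDF)"),  # F1 is a hypothetical font name
    ]
    return "\n".join(ops).encode("ascii")


content = build_content_stream()
print(content.decode("ascii"))
```

Coordinates are in default user space units (1/72 inch), which is why the same commands render equally well at any output resolution.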

PDF/A

PDF/A (ISO 19005) is the standard for archiving digital documents. In short, documents conforming to this standard must not depend on external resources: for example, all fonts used must be embedded and specific color management rules must be followed, among other requirements. There are several versions of it, namely:

  • PDF/A-1, based on PDF 1.4, specifies two levels of conformance: PDF/A-1b (basic) and PDF/A-1a (accessible)
  • PDF/A-2, based on changes introduced with PDF 1.5-1.7, specifies three levels of conformance: PDF/A-2b (basic), PDF/A-2a (accessible), and PDF/A-2u (same as level B, but requires all text in the document to have a Unicode mapping)
  • PDF/A-3, the same as PDF/A-2 but additionally allows files in various formats to be embedded
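The part and conformance-level combinations listed above can be captured in a small lookup table. The following Python sketch parses a label such as "PDF/A-2u" and rejects combinations that the list does not mention. The function name and the table are illustrative assumptions, not part of any library, and the table covers only the three parts described here.

```python
import re

# Conformance levels per PDF/A part, as listed above.
# PDF/A-3 is assumed to share the levels of PDF/A-2.
ALLOWED_LEVELS = {
    1: {"a", "b"},
    2: {"a", "b", "u"},
    3: {"a", "b", "u"},
}

_LABEL = re.compile(r"^PDF/A-(\d)([a-z])$", re.IGNORECASE)


def parse_pdfa(label):
    """Split a label like 'PDF/A-2u' into (part, level) or raise ValueError."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a PDF/A label: {label!r}")
    part = int(match.group(1))
    level = match.group(2).lower()
    if level not in ALLOWED_LEVELS.get(part, set()):
        raise ValueError(f"PDF/A-{part} has no conformance level {level!r}")
    return part, level


print(parse_pdfa("PDF/A-2u"))
print(parse_pdfa("PDF/A-1b"))
```

A label such as "PDF/A-1u" raises `ValueError`, because level U was only introduced with PDF/A-2.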

PDF 2.0 overview and changes

The PDF 2.0 specification, published in 2017 as ISO 32000-2, introduced many long-awaited features and deprecated some existing ones. The lists below give a brief overview of what changed in version 2.0. Apitron's dev team is working on adding support for all new features; the major ones are already implemented, but this is a work in progress and some features may still be under development.
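Before relying on any PDF 2.0 feature, a tool typically needs to know which version a file declares. The Python sketch below reads the version from the `%PDF-M.N` header comment. The function name is hypothetical, and searching the first 1024 bytes is an assumption modeled on how tolerant readers behave. Note also that a Version entry in the document catalog may override the header, which this sketch does not check.

```python
import re

# The header comment looks like %PDF-1.7 or %PDF-2.0.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data):
    """Return (major, minor) from the PDF header, or None if it is absent."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def declares_pdf2(data):
    """True if the header declares version 2.0 or later."""
    version = header_version(data)
    return version is not None and version >= (2, 0)


print(header_version(b"%PDF-2.0\n%\xe2\xe3\xcf\xd3\n"))
print(declares_pdf2(b"%PDF-1.7\n"))
```

To check a file on disk, pass in its leading bytes, for example `open(path, "rb").read(1024)`.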

Features:

  • "Unencrypted wrapper document" – a way to use custom encryption algorithms for encrypting strings and streams
  • "Use of black point compensation" – related to rendering intents and color conversion using ICC
  • "Projection annotations" – a way to save 3D and other specialized measurements and comments as markup annotations, so that they persist in the document
  • "CAdES signatures as used in PDF" – adds new subfilter to PDF signatures
  • "Long term validation of signatures" along with the "Document Security Store (DSS)" and "Document timestamp (DTS) dictionary" specifies the way to store the information needed to validate signatures on a long-term basis (CRL, OCSP and Timestamp data)
  • "Geospatial features" – new way to store geospatial data
  • "Rich media" annotations – common framework for video, audio and animated content
  • "Namespaces" for tagged PDF – new way for preserving logical structure of docs converted from other formats
  • "Pronunciation hints" – a way to aid text-to-speech processors with correct pronunciation
  • "Document parts" – specifies a way to divide the document into logical parts with different purposes or, simply speaking, to define subdocuments
  • "Associated files" – a way to indicate the relationship between the PDF objects in the document and content in other formats
  • Support for ISO 14739-1:2014 Product Representation Compact (PRC) file format
  • Support for UTF-8

Existing PDF features were updated by adding the following capabilities:

  • Transparency and blend mode attributes for the annotations
  • Stamp annotation intent
  • Polygon and Polyline real paths
  • 256-bit AES encryption
  • ECC-based certificates
  • Unicode based passwords
  • Document requirement extensions
  • New value for tab order of fields and annotations
  • Page-level OutputIntents
  • Referenced (external) OutputIntents
  • Thumbnails for embedded files
  • Halftone Origin (HTO)
  • Measurement and Point Data for image and form XObjects
  • L (length) key for inline image data
  • Viewer preferences enforcement (for print scaling)
  • 3D measurements
  • GoToDp action
  • RichMediaExecute action
  • Extension for GoTo and GoToR supporting linking to specific structure elements
  • Extension for Signature Field Locks and Signature Seed Values
  • Extensions for 3D viewing conditions, including transparency
  • Ref (reference) structure elements
  • PageNum and Bates artifact types
  • New list types for structured lists
  • “Short” (short name) attribute for table header cells
  • Extensions to OutputIntents (MixingHints and SpectralData)

Deprecated:

  • XFA (finally!) including NeedsRendering
  • Movie, Sound and TrapNet annotations
  • Movie and Sound actions
  • Info dictionary
  • Assistive technology restrictions via DRM
  • ProcSet
  • Operating-system-specific file specifications
  • Operating-system-specific additions to Launch actions
  • Names for XObjects
  • Names for Fonts
  • Arrays of Blend Modes
  • Alternate Presentations
  • Open prepress interface
  • CharSet (for Type 1 fonts)
  • CIDSet (for CID fonts)
  • Prepress viewer preferences (ViewArea, ViewClip and so on)
  • NeedAppearances
  • adbe.pkcs7.sha1 subfilter, as SHA-1 is considered weak
  • adbe.x509.rsa_sha1 subfilter, for the same reason
  • Encryption of FDF files
  • Suspects flag in MarkInfo dictionary
  • UR signatures
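As a rough illustration of how the deprecation list might be put to use, the Python sketch below scans raw PDF bytes for a few of the names listed above. It is a deliberately crude heuristic with a hypothetical function name: it uses plain substring matching, so it may report false positives, and it cannot see anything stored inside compressed object streams. A real check would parse the document structure.

```python
# A few deprecated items from the list above, keyed by the byte pattern
# that typically appears in an uncompressed PDF object.
DEPRECATED_MARKERS = {
    b"/XFA": "XFA",
    b"/NeedsRendering": "NeedsRendering",
    b"/ProcSet": "ProcSet",
    b"/NeedAppearances": "NeedAppearances",
    b"/CharSet": "CharSet",
    b"/CIDSet": "CIDSet",
    b"adbe.pkcs7.sha1": "adbe.pkcs7.sha1 subfilter",
    b"adbe.x509.rsa_sha1": "adbe.x509.rsa_sha1 subfilter",
}


def find_deprecated(data):
    """Return a sorted list of deprecated items whose marker occurs in data."""
    return sorted(
        label for marker, label in DEPRECATED_MARKERS.items() if marker in data
    )


sample = b"<< /ProcSet [/PDF /Text] /SubFilter /adbe.pkcs7.sha1 >>"
print(find_deprecated(sample))
```

Deprecated does not mean invalid: a PDF 2.0 reader may still encounter these entries in older files, so a report like this is a hint for cleanup rather than an error list.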