Document Interchange


The PDF specification defines several mechanisms for including higher-level information about the content structure in a document. This information simplifies custom document processing and improves accessibility without affecting the appearance of the PDF document. These mechanisms are described in the corresponding sections below.

Metadata

A PDF document may include general information such as the title, author, and creation and modification dates. Such information about the document is called the document's metadata. Starting with PDF 1.4, it became possible to include metadata for individual objects in a document. This object-specific metadata is called object-level metadata.

Metadata can be stored in a PDF document using two approaches:

  • For both document-level and object-level metadata: in a metadata stream associated with the document or with a component of the document (see section 14.3.2, "Metadata streams", of the specification). Metadata streams are the preferred method in PDF 2.0. The document's metadata can be set using the following code:
using (FixedDocument document = new FixedDocument(documentStream))
{
    document.SetMetadata([metadata stream]);
    document.Save();
}


  • For document metadata only: in the document information dictionary associated with the document (see section 14.3.3, "Document information dictionary"). Using the document information dictionary for document metadata was deprecated in PDF 2.0, except for the CreationDate and ModDate entries. Apitron PDF Kit sets this information automatically.
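
A metadata stream carries an XMP packet, which is XML built on RDF. The sketch below shows roughly what such a packet looks like and how it could be produced with the standard .NET XML classes. The class and method names (XmpPacketSketch, Build, ToStream) are hypothetical and are not part of Apitron PDF Kit, and the exact argument type expected by SetMetadata is not stated on this page, so wrapping the packet in a stream is only an assumption.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of Apitron PDF Kit: builds a minimal XMP packet.
public static class XmpPacketSketch
{
    // Namespace URIs defined by the XMP, RDF and Dublin Core specifications.
    private static readonly XNamespace X = "adobe:ns:meta/";
    private static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    private static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";
    private static readonly XNamespace Xmp = "http://ns.adobe.com/xap/1.0/";

    public static string Build(string title, string author, DateTimeOffset created)
    {
        XElement description = new XElement(Rdf + "Description",
            new XAttribute(Rdf + "about", ""),
            new XAttribute(XNamespace.Xmlns + "dc", Dc.NamespaceName),
            new XAttribute(XNamespace.Xmlns + "xmp", Xmp.NamespaceName),
            // dc:title is a language alternative, dc:creator is an ordered list.
            new XElement(Dc + "title",
                new XElement(Rdf + "Alt",
                    new XElement(Rdf + "li",
                        new XAttribute(XNamespace.Xml + "lang", "x-default"), title))),
            new XElement(Dc + "creator",
                new XElement(Rdf + "Seq", new XElement(Rdf + "li", author))),
            // XMP dates use an ISO 8601 profile with an explicit UTC offset.
            new XElement(Xmp + "CreateDate",
                created.ToString("yyyy-MM-dd'T'HH:mm:sszzz", CultureInfo.InvariantCulture)));

        XElement meta = new XElement(X + "xmpmeta",
            new XAttribute(XNamespace.Xmlns + "x", X.NamespaceName),
            new XElement(Rdf + "RDF",
                new XAttribute(XNamespace.Xmlns + "rdf", Rdf.NamespaceName),
                description));

        // The xpacket processing instructions delimit the packet inside the file.
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
             + meta.ToString()
             + "\n<?xpacket end=\"w\"?>";
    }

    // Assumption: the consumer accepts UTF-8 encoded bytes wrapped in a stream.
    public static Stream ToStream(string packet)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(packet));
    }
}
```

For example, XmpPacketSketch.Build("Sample", "Jane Doe", DateTimeOffset.Now) returns the packet text, and ToStream turns it into a stream that a PDF library may be able to consume.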

Marked content

Marked content is a mechanism for incorporating markup that serves the needs of a particular PDF processor. It uses tags to attach additional information to portions of the document's content. With Apitron PDF Kit and its Fixed layout API, you can use a special type called MarkedContent, a descendant of the ClippedContent class that implements support for tagging and logical structure elements.
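
At the content stream level, a marked-content sequence is a run of operators bracketed by BMC or BDC and closed by EMC, where BDC additionally carries a property list such as a marked-content identifier (MCID). The sketch below only illustrates that bracketing on raw operator text. MarkedContentSketch and Wrap are hypothetical names and do not reflect how the MarkedContent class is implemented.

```csharp
using System.Text;

// Hypothetical helper, not part of Apitron PDF Kit: shows the operator
// bracketing that a marked-content sequence uses in a page content stream.
public static class MarkedContentSketch
{
    // tag:       the marked-content tag, e.g. "P" or "Span"
    // mcid:      marked-content identifier, or null for a plain BMC sequence
    // operators: raw content-stream operators to enclose
    public static string Wrap(string tag, int? mcid, string operators)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append('/').Append(tag);
        if (mcid.HasValue)
        {
            // BDC begins a sequence that has an associated property list.
            sb.Append(" <</MCID ").Append(mcid.Value).Append(">> BDC\n");
        }
        else
        {
            // BMC begins a sequence that has a tag only.
            sb.Append(" BMC\n");
        }
        sb.Append(operators).Append('\n');
        // EMC ends the sequence opened by BMC or BDC.
        sb.Append("EMC");
        return sb.ToString();
    }
}
```

Calling MarkedContentSketch.Wrap("P", 0, "BT /F1 12 Tf 10 800 Td (Hello) Tj ET") yields the text operators enclosed between "/P <</MCID 0>> BDC" and "EMC".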

Logical structure

Logical structure is a framework that defines standard structure types for describing the logical organization of a document, e.g. chapters, articles, etc. It employs namespaces so that structure definitions from other formats can be reused and mapped to the standard types through custom mapping schemes. This makes logical structure a powerful feature that simplifies structural processing of the document.
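
To illustrate the idea of mapping custom structure types onto standard ones, here is a minimal sketch of a role map lookup. It assumes the mapping is a simple name-to-name dictionary that may chain through several custom types before reaching a standard one. RoleMapSketch, Resolve and the abbreviated list of standard types are illustrative only and are not Apitron PDF Kit API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Apitron PDF Kit: resolves a custom
// structure type to a standard one through a role map.
public static class RoleMapSketch
{
    // A small subset of the standard structure types, for illustration only.
    private static readonly HashSet<string> StandardTypes = new HashSet<string>
    {
        "Document", "Part", "Sect", "Div", "H1", "H2", "P", "Span", "Table", "Figure"
    };

    // Returns the standard type that the given type maps to,
    // or null when the type is unmapped or the map contains a cycle.
    public static string Resolve(string type, IDictionary<string, string> roleMap)
    {
        HashSet<string> visited = new HashSet<string>();
        string current = type;
        while (!StandardTypes.Contains(current))
        {
            // Stop if this name was already seen (cycle) or has no mapping.
            if (!visited.Add(current) || !roleMap.TryGetValue(current, out string next))
            {
                return null;
            }
            current = next;
        }
        return current;
    }
}
```

With a map containing "Chapter" -> "Section" and "Section" -> "Sect", resolving "Chapter" follows both entries and ends at the standard type "Sect".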

Tagged PDF

Tagged PDF is built on the logical structure framework and can be considered a special case of it. It defines a set of standard structure types and attributes that allow page content (text, graphics, images, as well as annotations and form fields) to be extracted and reused for other purposes.

Tagged PDF is intended for use by tools that perform operations such as:

  • Extraction of text and graphics for pasting into other applications
  • Reflow of page contents (text as well as associated graphics, images, annotations, and form fields) to fit a display area of a different size than was assumed when the document was created
  • Processing of content for purposes such as searching, indexing, and spellchecking
  • Conversion to other common file formats (such as HTML, XML, RTF, DOC, DOCX) preserving the document's structure
  • Making content accessible to users with disabilities
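
As a rough illustration of structure-preserving conversion, the sketch below renders a small in-memory structure tree as HTML. StructNode and TaggedToHtmlSketch are hypothetical types invented for this example; they do not read a PDF file and are not part of Apitron PDF Kit, and the tag mapping shown is only one possible choice.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical in-memory model of a structure element: a type name,
// optional text content, and nested child elements.
public sealed class StructNode
{
    public string Type { get; }
    public string Text { get; }
    public List<StructNode> Children { get; } = new List<StructNode>();

    public StructNode(string type, string text = "", params StructNode[] children)
    {
        Type = type;
        Text = text;
        Children.AddRange(children);
    }
}

public static class TaggedToHtmlSketch
{
    // One possible mapping from standard structure types to HTML elements.
    private static readonly Dictionary<string, string> HtmlTags = new Dictionary<string, string>
    {
        { "Document", "article" }, { "Sect", "section" },
        { "H1", "h1" }, { "P", "p" }, { "Span", "span" }
    };

    // Recursively renders the node; unknown types fall back to a div.
    public static string Render(StructNode node)
    {
        string tag = HtmlTags.TryGetValue(node.Type, out string mapped) ? mapped : "div";
        StringBuilder sb = new StringBuilder();
        sb.Append('<').Append(tag).Append('>');
        sb.Append(WebUtility.HtmlEncode(node.Text));
        foreach (StructNode child in node.Children)
        {
            sb.Append(Render(child));
        }
        sb.Append("</").Append(tag).Append('>');
        return sb.ToString();
    }
}
```

Because the nesting of the output follows the nesting of the structure elements, headings, paragraphs and inline spans keep their relationships regardless of where the content was drawn on the page.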

The sample below demonstrates the creation of a tagged PDF document using the Fixed layout API (when you use the Flow layout API, the logical structure necessary for tagged PDF is generated automatically):

/// <summary>
/// Creates the tagged pdf file.
/// </summary>
/// <param name="fileName">Name of the file.</param>
public static void CreateFile(string fileName)
{
    using (FixedDocument document = new FixedDocument())
    {
        // Create structure elements. A path segment with an index, e.g. P[0],
        // refers to the existing element at that index or creates a new one;
        // a path segment without an index always adds a new element.
        // So e.g. [.../P/Span] would produce a new paragraph with a nested span,
        // while [.../P[0]/Span] would insert the span into the existing paragraph.
        StructureElement h1 = document.LogicalStructure["Document[0]/H1"];
        StructureElement docLevelP = document.LogicalStructure["Document[0]/P"];
        StructureElement span = document.LogicalStructure["Document[0]/P[0]/Span"];
        
        StructureElement section = document.LogicalStructure["Document[0]/Sect"];
        StructureElement p1 = document.LogicalStructure["Document[0]/Sect[0]/P"];
        StructureElement p2 = document.LogicalStructure["Document[0]/Sect[0]/P"];
        
        // set custom attribute for the section element
        section.Attributes.Add(new Apitron.PDF.Kit.Interchange.LogicalStructure.Attribute("producer", 
            Encoding.UTF8.GetBytes("Apitron PDF Kit")));
        
        // create content for the heading element
        MarkedContent h1Content = new MarkedContent(h1);
        h1Content.SetDeviceNonStrokingColor(1, 0, 0); // Red
        TextObject h1Text = new TextObject(StandardFonts.HelveticaBold, 20);
        h1Text.AppendText("This is a level 1 heading");
        h1Content.AppendText(h1Text);
        
        // create content for doc level paragraph with nested inline span
        MarkedContent spanContent = new MarkedContent(span);
        TextObject headingDescription = new TextObject(StandardFonts.HelveticaBold, 16); 
        headingDescription.SetTextLeading(16);
        headingDescription.AppendText("A heading element briefly describes the topic of the section it");
        headingDescription.AppendTextLine("introduces. Heading information may be used, for example,"); 
        headingDescription.AppendTextLine("to construct a table of contents for a document.");
        headingDescription.AppendTextLine("And this text is a span element.");
        spanContent.AppendText(headingDescription);
        
        // add span to paragraph
        MarkedContent docLevelPContent = new MarkedContent(docLevelP);
        docLevelPContent.AppendContent(spanContent);
        
        // create first paragraph
        MarkedContent p1Content = new MarkedContent(p1);
        p1Content.SetDeviceNonStrokingColor(0, 0, 1); // blue
        
        TextObject p1Text = new TextObject(StandardFonts.Helvetica, 14);
        p1Text.SetTextLeading(14);
        p1Text.AppendText("This is a paragraph, it can contain as many lines as you need. So here goes");
        p1Text.AppendTextLine("the second line, and it can use different ");
        p1Text.SetFont(StandardFonts.TimesBoldItalic, 14);
        p1Text.AppendText("font settings.");
        p1Content.AppendText(p1Text);
        
        // create second paragraph
        MarkedContent p2Content = new MarkedContent(p2);
        p2Content.SetDeviceNonStrokingColor(1, 0, 1); // magenta
        
        TextObject p2Text = new TextObject(StandardFonts.Helvetica, 14);
        p2Text.SetTextLeading(14);
        p2Text.AppendText("This is another paragraph, and you can add as many as you need into one section. Almost");
        p2Text.AppendTextLine("like in");
        p2Text.SetFont(StandardFonts.TimesBold, 14);
        p2Text.AppendText(" HTML ");
        p2Text.SetFont(StandardFonts.Helvetica, 14);
        p2Text.AppendText("so you know how to work with it already.");
        p2Content.AppendText(p2Text);
        
        // add both paragraphs into the section
        ClippedContent sectionContent = new MarkedContent(section);
        sectionContent.AppendContent(p1Content);
        sectionContent.SetTranslation(0, -35);
        sectionContent.AppendContent(p2Content);
        
        Page page = new Page();
        // add header
        page.Content.SaveGraphicsState();
        page.Content.SetTranslation(10, 800);
        page.Content.AppendContent(h1Content);
        page.Content.RestoreGraphicsState();
        
        // add paragraph with nested span
        page.Content.SaveGraphicsState();
        page.Content.SetTranslation(10, 780);
        page.Content.AppendContent(docLevelPContent);
        page.Content.RestoreGraphicsState();
      
        // add section
        page.Content.SaveGraphicsState();
        page.Content.SetTranslation(10, 700);
        page.Content.AppendContent(sectionContent);
        page.Content.RestoreGraphicsState();
        
        document.Pages.Add(page);
        
        using (Stream stream = File.Create(fileName))
        {
            document.Save(stream);
        }
    }
}
 

The resulting document produced by this code is shown below:

Tagged PDF document