Support models
Support models are abstractions over “raw” objects within a Pdf. For example, a page
in a PDF is a Dictionary with /Type set to /Page. The Dictionary in
that case is the “raw” object. Upon establishing what type of object it is, we
can wrap it with a support model that adds features to ensure consistency with
the PDF specification.
pikepdf does not currently apply support models to “raw” objects automatically, but might do so in a future release (this would break backward compatibility).
- class pikepdf.Page
Support model wrapper around a raw page dictionary object.
To initialize a Page support model:
    from pikepdf import Pdf, Page
    pdf = Pdf.open(...)
    page_support_model = Page(pdf.pages[0])
- add_content_token_filter(self: pikepdf.Page, tf: pikepdf.TokenFilter) → None
Attach a pikepdf.TokenFilter to a page’s content stream.
This function applies token filters lazily, if/when the page’s content stream is read for any reason, such as when the PDF is saved. If the content stream is never accessed, the token filter is not applied.
Multiple token filters may be added to a page/content stream.
Token filters may not be removed after being attached to a Pdf. Close and reopen the Pdf to remove token filters.
If the page’s contents is an array of streams, it is coalesced.
- add_overlay(other, rect=None)
Overlay another object on this page.
Overlays will be drawn after all previous content, potentially drawing on top of existing content.
- Parameters
other (Union[pikepdf.objects.Object, pikepdf._qpdf.Page]) – A Page or Form XObject to render as an overlay on top of this page.
rect (Optional[pikepdf._qpdf.Rectangle]) – The PDF rectangle (in PDF units) in which to draw the overlay. If omitted, this page’s trimbox, cropbox or mediabox will be used.
New in version 2.14.
- add_resource(res, res_type, name=None, *, prefix='', replace_existing=True)
Adds a new resource to the page’s Resources dictionary.
If the Resources dictionaries do not exist, they will be created.
- Parameters
self – The object to add to the resources dictionary.
res (pikepdf.objects.Object) – The dictionary object to insert into the resources dictionary.
res_type (pikepdf.objects.Name) – Should be one of the following Resource dictionary types: ExtGState, ColorSpace, Pattern, Shading, XObject, Font, Properties.
name (Optional[pikepdf.objects.Name]) – The name of the object. If omitted, a random name will be generated with enough randomness to be globally unique.
prefix (str) – A prefix for the name of the object. Allows conveniently namespacing when using random names, e.g. prefix=”Im” for images. Mutually exclusive with name parameter.
replace_existing (bool) – If the name already exists in one of the resource dictionaries, remove it.
- Returns
The name of the object.
- Return type
Example
>>> resource_name = Page(pdf.pages[0]).add_resource(formxobj, Name.XObject)
New in version 2.3.
Changed in version 2.14: If res does not belong to the same Pdf that owns this page, a copy of res is automatically created and added instead. In previous versions, it was necessary to handle this case manually.
- add_underlay(other, rect=None)
Underlay another object beneath this page.
Underlays will be drawn before all other content, so they may be overdrawn partially or completely.
- Parameters
other (Union[pikepdf.objects.Object, pikepdf._qpdf.Page]) – A Page or Form XObject to render as an underlay underneath this page.
rect (Optional[pikepdf._qpdf.Rectangle]) – The PDF rectangle (in PDF units) in which to draw the underlay. If omitted, this page’s MediaBox will be used.
New in version 2.14.
- as_form_xobject(self: pikepdf.Page, handle_transformations: bool = True) → pikepdf.Object
Return a form XObject that draws this page.
This is useful for n-up operations, underlay, overlay, thumbnail generation, or any other case in which it is useful to replicate the contents of a page in some other context. The dictionaries are shallow copies of the original page dictionary, and the contents are coalesced from the page’s contents. The resulting object handle is not referenced anywhere.
- Parameters
handle_transformations (bool) – If True, the resulting form XObject’s /Matrix will be set to replicate rotation (/Rotate) and scaling (/UserUnit) in the page’s dictionary. In this way, the page’s transformations will be preserved when placing this object on another page.
- calc_form_xobject_placement(self: pikepdf.Page, formx: pikepdf.Object, name: pikepdf.Object, rect: pikepdf.Rectangle, *, invert_transformations: bool = True, allow_shrink: bool = True, allow_expand: bool = False) → bytes
Generate content stream segment to place a Form XObject on this page.
The content stream segment must then be added to the page’s content stream.
The default keyword parameters will preserve the aspect ratio.
- Parameters
formx – The Form XObject to place.
name – The name of the Form XObject in this page’s /Resources dictionary.
rect – Rectangle describing the desired placement of the Form XObject.
invert_transformations – Apply /Rotate and /UserUnit scaling when determining Form XObject placement.
allow_shrink – Allow the Form XObject to take less than the full dimensions of rect.
allow_expand – Expand the Form XObject to occupy all of rect.
New in version 2.14.
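The interaction of allow_shrink and allow_expand can be illustrated with a small pure-Python sketch. It does not call pikepdf. The helper name placement_scale is hypothetical, and the rule it implements (one uniform scale factor, clamped to 1 when shrinking or expanding is not allowed) is an assumption about how the aspect ratio is preserved, not the library's actual code.

```python
def placement_scale(src_w, src_h, dst_w, dst_h, *, allow_shrink=True, allow_expand=False):
    """Return one uniform scale factor for fitting a source box into a target box.

    Hypothetical helper: a single factor for both axes keeps the aspect ratio.
    """
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source box must have positive width and height")
    # The tighter of the two axes decides how far the box can be scaled.
    scale = min(dst_w / src_w, dst_h / src_h)
    if scale > 1 and not allow_expand:
        return 1.0  # would need to grow, but expanding is not allowed
    if scale < 1 and not allow_shrink:
        return 1.0  # would need to shrink, but shrinking is not allowed
    return scale
```

With the default keyword values, a 200 x 100 box placed into a 100 x 100 rectangle is scaled by 0.5, while a 50 x 50 box is left at its natural size.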
- contents_add(*args, **kwargs)
Overloaded function.
contents_add(self: pikepdf.Page, contents: pikepdf.Object, prepend: bool = False) -> None
Append or prepend to an existing page’s content stream using an existing stream object.
New in version 2.14.
contents_add(self: pikepdf.Page, contents: bytes, *, prepend: bool = False) -> None
Append or prepend to an existing page’s content stream from bytes.
New in version 2.14.
- contents_coalesce(self: pikepdf.Page) → None
Coalesce a page’s content streams.
A page’s content may be a stream or an array of streams. If this page’s content is an array, concatenate the streams into a single stream. This can be useful when working with files that split content streams in arbitrary spots, such as in the middle of a token, as that can confuse some software.
- property cropbox
This page’s effective /CropBox, in PDF units.
If the /CropBox is not defined, the /MediaBox is returned.
- externalize_inline_images(self: pikepdf.Page, min_size: int = 0) → None
Convert inline images to normal (external) images.
- Parameters
min_size (int) – minimum size in bytes
- get_filtered_contents(self: pikepdf.Page, tf: TokenFilter) → bytes
Apply a pikepdf.TokenFilter to a content stream, without modifying it.
This may be used when the results of a token filter do not need to be applied, such as when filtering is being used to retrieve information rather than edit the content stream.
Note that it is possible to create a subclassed TokenFilter that saves information of interest to its object attributes; it is not necessary to return data in the content stream.
To modify the content stream, use pikepdf.Page.add_content_token_filter().
- Returns
the modified content stream
- Return type
- property index
Returns the zero-based index of this page in the pages list.
That is, returns n such that pdf.pages[n] == this_page. A ValueError exception is thrown if the page is not attached to a Pdf.
Requires O(n) search.
New in version 2.2.
- property label
Returns the page label for this page, accounting for section numbers.
For example, if the PDF defines a preface with lower case Roman numerals (i, ii, iii…), followed by standard numbers, followed by an appendix (A-1, A-2, …), this function returns the appropriate label as a string.
It is possible for a PDF to define page labels such that multiple pages have the same labels. Labels are not guaranteed to be unique.
Note that this requires an O(n) search over all pages, to look up the page’s index.
New in version 2.2.
Changed in version 2.9: Returns the ordinary page number if no special rules for page numbers are defined.
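The labeling scheme described above can be sketched in pure Python. This does not use pikepdf. The functions to_roman and page_label and the tuple layout of ranges are hypothetical, chosen only for illustration. The style codes "D" (decimal), "r" (lower case Roman) and "R" (upper case Roman) follow the PDF page label dictionary; letter styles are left out to keep the sketch short.

```python
def to_roman(n):
    """Convert a positive integer to an upper case Roman numeral."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def page_label(index, ranges):
    """Return the label for a zero-based page index.

    ranges is a list of (first_index, style, prefix, start) tuples
    (a hypothetical layout); the range with the largest first_index
    that is not past the page applies.
    """
    first, style, prefix, start = max(
        (r for r in ranges if r[0] <= index), key=lambda r: r[0]
    )
    number = start + (index - first)
    if style == "D":
        text = str(number)
    elif style == "r":
        text = to_roman(number).lower()
    elif style == "R":
        text = to_roman(number)
    else:
        text = ""  # no numeric part, the label is the prefix only
    return prefix + text


# Preface in lower case Roman numerals, then ordinary numbers, then an appendix.
ranges = [(0, "r", "", 1), (3, "D", "", 1), (10, "D", "A-", 1)]
```

Because every range restarts its own numbering, two ranges with the same style, prefix and start value produce identical labels, which is why labels are not guaranteed to be unique.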
- property mediabox
This page’s /MediaBox, in PDF units.
- property obj
Get the underlying pikepdf.Object.
- parse_contents(self: pikepdf.Page, arg0: pikepdf.StreamParser) → None
Parse a page’s content streams using a pikepdf.StreamParser.
The content stream may be interpreted by the StreamParser but is not altered.
If the page’s contents is an array of streams, it is coalesced.
- remove_unreferenced_resources(self: pikepdf.Page) → None
Removes from the resources dictionary any object not referenced in the content stream.
A page’s resources dictionary maps names to objects elsewhere in the file. This method walks through a page’s contents and keeps track of which resources are referenced somewhere in the contents. Then it removes from the resources dictionary any object that is not referenced in the contents. This method is used by page splitting code to avoid copying unused objects in files that use shared resource dictionaries across multiple pages.
- property resources
Return this page’s resources dictionary.
- rotate(self: pikepdf.Page, angle: int, relative: bool) → None
Rotate a page.
If relative is False, set the rotation of the page to angle. Otherwise, add angle to the rotation of the page. angle must be a multiple of 90. Adding 90 to the rotation rotates clockwise by 90 degrees.
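The difference between absolute and relative rotation can be sketched without pikepdf. The helper new_rotation is hypothetical. Wrapping the result into the range 0 to 270 is an assumption made here for readability; the text above does not say whether the stored value is normalized.

```python
def new_rotation(current, angle, relative):
    """Return the rotation a page would have after rotate(angle, relative).

    Hypothetical helper that mirrors the rule described above.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    result = current + angle if relative else angle
    # Assumption: wrap into 0, 90, 180 or 270.
    return result % 360
```

For a page already rotated by 90 degrees, `new_rotation(90, 90, True)` gives 180, while `new_rotation(90, 90, False)` leaves the page at 90.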
- property trimbox
This page’s effective /TrimBox, in PDF units.
If the /TrimBox is not defined, the /CropBox is returned (and if /CropBox is not defined, /MediaBox is returned).
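The fallback chain for the effective boxes can be sketched with a plain dict standing in for the page dictionary. The helper effective_box is hypothetical and does not call pikepdf; it also ignores attributes inherited from parent page tree nodes.

```python
def effective_box(page_dict, name):
    """Resolve a page box, following /TrimBox -> /CropBox -> /MediaBox."""
    fallback = {"/TrimBox": "/CropBox", "/CropBox": "/MediaBox"}
    # Walk down the chain until a box that is actually defined is found.
    while name not in page_dict and name in fallback:
        name = fallback[name]
    return page_dict[name]  # raises KeyError if even /MediaBox is missing


letter_page = {"/MediaBox": [0, 0, 612, 792]}
cropped_page = {"/MediaBox": [0, 0, 612, 792], "/CropBox": [36, 36, 576, 756]}
```

Here `effective_box(letter_page, "/TrimBox")` falls through to the media box, while `effective_box(cropped_page, "/TrimBox")` stops at the crop box.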
- class pikepdf.PdfMatrix(*args)
Support class for PDF content stream matrices
PDF content stream matrices are 3x3 matrices summarized by a shorthand (a, b, c, d, e, f) which corresponds to the first two column vectors. The final column vector is always (0, 0, 1) since this is using homogeneous coordinates.
PDF uses row vectors. That is, vr @ A' gives the effect of transforming a row vector vr=(x, y, 1) by the matrix A'. Most textbook treatments use A @ vc where the column vector vc=(x, y, 1)'.
(@ is the Python matrix multiplication operator.)
Addition and other operations are not implemented because they’re not that meaningful in a PDF context (they can be defined and are mathematically meaningful in general).
PdfMatrix objects are immutable. All transformations on them produce a new matrix.
- a
- b
- c
- d
- e
- f
Return one of the six “active values” of the affine matrix. e and f correspond to x- and y-axis translation respectively. The other four letters are a 2×2 matrix that can express rotation, scaling and skewing; a=1 b=0 c=0 d=1 is the identity matrix.
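The row-vector convention can be made concrete with a pure-Python sketch that works directly on the 6-tuple shorthand. It does not use PdfMatrix; the function names transform, concat and rotation are hypothetical and exist only for this illustration.

```python
import math


def transform(m, x, y):
    """Apply shorthand matrix m = (a, b, c, d, e, f) to the point (x, y).

    This is the row vector (x, y, 1) multiplied by the full 3x3 matrix.
    """
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def concat(m, n):
    """Shorthand of the product M @ N: transform by m first, then by n."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def rotation(angle_degrees_ccw):
    """Shorthand for a counterclockwise rotation about the origin."""
    t = math.radians(angle_degrees_ccw)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t), 0.0, 0.0)


IDENTITY = (1, 0, 0, 1, 0, 0)
scale = (2, 0, 0, 2, 0, 0)
shift = (1, 0, 0, 1, 10, 5)
```

With row vectors, `concat(scale, shift)` scales first and then translates, so the translation is not itself scaled. Swapping the arguments doubles the translation as well.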
- encode()
Encode this matrix in binary suitable for including in a PDF
- static identity()
Constructs and returns an identity matrix
- rotated(angle_degrees_ccw)
Concatenates a rotation matrix on this matrix
- scaled(x, y)
Concatenates a scaling matrix on this matrix
- property shorthand
Return the 6-tuple (a,b,c,d,e,f) that describes this matrix
- translated(x, y)
Translates this matrix
- class pikepdf.PdfImage(obj)
Support class to provide a consistent API for manipulating PDF images
The data structure for images inside PDFs is irregular and flexible, making it difficult to work with without introducing errors for less typical cases. This class addresses these difficulties by providing a regular, Pythonic API similar in spirit to (and convertible to) the Python Pillow imaging library.
- as_pil_image()
Extract the image as a Pillow Image, using decompression as necessary
- Returns
PIL.Image.Image
- extract_to(*, stream=None, fileprefix='')
Attempt to extract the image directly to a usable image file
If possible, the compressed data is extracted and inserted into a compressed image file format without transcoding the compressed content. If this is not possible, the data will be decompressed and extracted to an appropriate format.
Because it is not known until attempted what image format will be extracted, users should not assume what format they are getting back. When saving the image to a file, use a temporary filename, and then rename the file to its final name based on the returned file extension.
Examples
>>> im.extract_to(stream=bytes_io)
'.png'
>>> im.extract_to(fileprefix='/tmp/image00')
'/tmp/image00.jpg'
- Parameters
stream – Writable stream to write data to.
fileprefix (str or Path) – The path to write the extracted image to, without the file extension.
- Returns
If fileprefix was provided, then the fileprefix with the appropriate extension. If no fileprefix, then an extension indicating the file type.
- Return type:
str
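The advice to write under a temporary name and rename afterwards can be sketched as follows. The helper extract_with_final_name is hypothetical. It accepts any callable that takes a fileprefix keyword and returns the written path including its extension, which is the shape of extract_to described above; the sketch itself does not import pikepdf.

```python
import tempfile
from pathlib import Path


def extract_with_final_name(extract_to, final_stem):
    """Extract under a temporary name, then rename using the reported extension.

    extract_to: callable accepting fileprefix=... and returning the full path
    that was written (hypothetical stand-in for PdfImage.extract_to).
    final_stem: desired output path without a file extension.
    """
    final_stem = Path(final_stem)
    # Create the temporary directory next to the target so that the final
    # move stays on the same filesystem.
    with tempfile.TemporaryDirectory(dir=final_stem.parent) as tmpdir:
        written = Path(extract_to(fileprefix=str(Path(tmpdir) / "image")))
        final_path = final_stem.with_name(final_stem.name + written.suffix)
        written.replace(final_path)
    return final_path
```

Assuming im is a PdfImage and the directory out already exists, `extract_with_final_name(im.extract_to, "out/image00")` would return the final path with whatever extension the extraction produced.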
- get_stream_buffer(decode_level=<StreamDecodeLevel.specialized: 2>)
Access this image with the buffer protocol
- property icc
If an ICC profile is attached, return a Pillow object that describes it.
Most of the information may be found in icc.profile.
- Returns
PIL.ImageCms.ImageCmsProfile
- property is_inline
False for image XObject
- read_bytes(decode_level=<StreamDecodeLevel.specialized: 2>)
Decompress this image and return it as unencoded bytes
- show()
Show the image however PIL wants to
- class pikepdf.PdfInlineImage(*, image_data, image_object)
Support class for PDF inline images
- Parameters
image_object (tuple) –
- class pikepdf.models.PdfMetadata(pdf, pikepdf_mark=True, sync_docinfo=True, overwrite_invalid_xml=True)
Read and edit the metadata associated with a PDF
The PDF specification contains two types of metadata, the newer XMP (Extensible Metadata Platform, XML-based) and older DocumentInformation dictionary. The PDF 2.0 specification removes the DocumentInformation dictionary.
This primarily works with XMP metadata, but includes methods to generate XMP from DocumentInformation and will also coordinate updates to DocumentInformation so that the two are kept consistent.
XMP metadata fields may be accessed using the full XML namespace URI or the short name. For example metadata['dc:description'] and metadata['{http://purl.org/dc/elements/1.1/}description'] both refer to the same field. Several common XML namespaces are registered automatically.
See the XMP specification for details of allowable fields.
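The equivalence between short names and full namespace URIs can be sketched in pure Python. The table NAMESPACES and the function expand_name are hypothetical; they show the idea of prefix expansion and are not the registry or lookup code used by PdfMetadata.

```python
# Illustrative prefix table; the real set of registered namespaces may differ.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}


def expand_name(key, namespaces=NAMESPACES):
    """Turn 'prefix:local' into '{uri}local'; pass full names through unchanged."""
    if key.startswith("{"):
        return key  # already written with the full namespace URI
    prefix, sep, local = key.partition(":")
    if not sep:
        raise KeyError(f"{key!r} has no namespace prefix")
    if prefix not in namespaces:
        raise KeyError(f"namespace prefix {prefix!r} is not registered")
    return "{" + namespaces[prefix] + "}" + local
```

Both spellings of the description field from the paragraph above expand to the same key, so either can be used to address the field.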
To update metadata, use a with block.
Example
>>> with pdf.open_metadata() as records:
...     records['dc:title'] = 'New Title'
See also
- load_from_docinfo(docinfo, delete_missing=False, raise_failure=False)
Populate the XMP metadata object with DocumentInfo
- Parameters
- Return type
A few entries in the deprecated DocumentInfo dictionary are considered approximately equivalent to certain XMP records. This method copies those entries into the XMP metadata.
- property pdfa_status: str
Returns the PDF/A conformance level claimed by this PDF, or an empty string
A PDF may claim to be PDF/A compliant without this being true. Use an independent verifier such as veraPDF to test if a PDF is truly conformant.
- Returns
The conformance level of the PDF/A, or an empty string if the PDF does not claim PDF/A conformance. Possible valid values are: 1A, 1B, 2A, 2B, 2U, 3A, 3B, 3U.
- Return type
- property pdfx_status: str
Returns the PDF/X conformance level claimed by this PDF, or an empty string
A PDF may claim to be PDF/X compliant without this being true. Use an independent verifier such as veraPDF to test if a PDF is truly conformant.
- Returns
The conformance level of the PDF/X, or an empty string if the PDF does not claim PDF/X conformance.
- Return type
- class pikepdf.models.Encryption(*, owner, user, R=6, allow=Permissions(accessibility=True, extract=True, modify_annotation=True, modify_assembly=False, modify_form=True, modify_other=True, print_lowres=True, print_highres=True), aes=True, metadata=True)
Specify the encryption settings to apply when a PDF is saved.
- Parameters
owner (str) – The owner password to use. This allows full control of the file. If blank, the PDF will be encrypted and present as “(SECURED)” in PDF viewers. If the owner password is blank, the user password should be as well.
user (str) – The user password to use. With this password, some restrictions will be imposed by a typical PDF reader. If blank, the PDF can be opened by anyone, but only modified as allowed by the permissions in allow.
R (int) – Select the security handler algorithm to use. Choose from: 2, 3, 4 or 6. By default, the highest version is selected (6). 5 is a deprecated algorithm that should not be used.
allow (pikepdf.models.encryption.Permissions) – The permissions to set. If omitted, all permissions are granted to the user.
aes (bool) – If True, request the AES algorithm. If False, use RC4. If omitted, AES is selected whenever possible (R >= 4).
metadata (bool) – If True, also encrypt the PDF metadata. If False, metadata is not encrypted. Reading document metadata without decryption may be desirable in some cases. Requires aes=True. If omitted, metadata is encrypted whenever possible.
- class pikepdf.models.Outline(pdf, max_depth=15, strict=False)
Maintains an intuitive interface for creating and editing PDF document outlines, according to the PDF 1.7 Reference Manual section 12.3.
- Parameters
pdf (pikepdf._qpdf.Pdf) – PDF document object.
max_depth (int) – Maximum recursion depth to consider when reading the outline.
strict (bool) – If set to False (default) silently ignores structural errors. Setting it to True raises a pikepdf.OutlineStructureError if any object references re-occur while the outline is being read or written.
See also
- class pikepdf.models.OutlineItem(title, destination=None, page_location=None, action=None, obj=None, *, left=None, top=None, right=None, bottom=None, zoom=None)
Manages a single item in a PDF document outlines structure, including nested items.
- Parameters
title (str) – Title of the outlines item.
destination (Optional[Tuple[int, str, pikepdf.objects.Object]]) – Page number, destination name, or any other PDF object to be used as a reference when clicking on the outlines entry. Note this should be None if an action is used instead. If set to a page number, it will be resolved to a reference at the time of writing the outlines back to the document.
page_location (Optional[Union[pikepdf.models.outlines.PageLocation, str]]) – Supplemental page location for a page number in destination, e.g. PageLocation.Fit. May also be a simple string such as 'FitH'.
action (Optional[pikepdf.objects.Dictionary]) – Action to perform when clicking on this item. Will be ignored during writing if destination is also set.
obj (Optional[pikepdf.objects.Dictionary]) – Dictionary object representing this outlines item in a Pdf. May be None for creating a new object. If present, an existing object is modified in-place during writing and original attributes are retained.
left (Optional[float]) – Describes the viewport position associated with a destination.
top (Optional[float]) – Describes the viewport position associated with a destination.
bottom (Optional[float]) – Describes the viewport position associated with a destination.
right (Optional[float]) – Describes the viewport position associated with a destination.
zoom (Optional[float]) – Describes the viewport position associated with a destination.
This object does not contain any information about higher-level or neighboring elements.
- classmethod from_dictionary_object(obj)
Creates an OutlineItem from a PDF document’s Dictionary object. Does not process nested items.
- Parameters
obj (pikepdf.objects.Dictionary) – Dictionary object representing a single outline node.
- to_dictionary_object(pdf, create_new=False)
Creates a Dictionary object from this outline node’s data, or updates the existing object. Page numbers are resolved to a page reference on the input Pdf object.
- Parameters
pdf (pikepdf._qpdf.Pdf) – PDF document object.
create_new (bool) – If set to True, creates a new object instead of modifying an existing one in-place.
- Return type
- class pikepdf.Permissions(accessibility=True, extract=True, modify_annotation=True, modify_assembly=False, modify_form=True, modify_other=True, print_lowres=True, print_highres=True)
Stores the user-level permissions for an encrypted PDF.
A compliant PDF reader/writer should enforce these restrictions on people who have the user password and not the owner password. In practice, either password is sufficient to decrypt all document contents. A person who has the owner password should be allowed to modify the document in any way. pikepdf does not enforce the restrictions in any way; it is up to application developers to enforce them as they see fit.
Unencrypted PDFs implicitly have all permissions allowed. Permissions can only be changed when a PDF is saved.
- Parameters
- class pikepdf.models.EncryptionMethod
Describes which encryption method was used on a particular part of a PDF. These values are returned by pikepdf.EncryptionInfo but are not currently used to specify how encryption is requested.
- none
Data was not encrypted.
- unknown
An unknown algorithm was used.
- rc4
The RC4 encryption algorithm was used (obsolete).
- aes
The AES-based algorithm was used as described in the PDF 1.7 Reference Manual.
- aesv3
An improved version of the AES-based algorithm was used as described in the Adobe Supplement to the ISO 32000, requiring PDF 1.7 extension level 3. This algorithm still uses AES, but allows both AES-128 and AES-256, and improves how the key is derived from the password.
- class pikepdf.models.EncryptionInfo(encdict)
Reports encryption information for an encrypted PDF.
This information may not be changed, except when a PDF is saved. This object is not used to specify the encryption settings to save a PDF, due to non-overlapping information requirements.
- Parameters
encdict (Dict[str, Any]) –
- property P: int
Encoded permission bits.
See Pdf.allow() instead.
- property R: int
Revision number of the security handler.
- property V: int
Version of PDF password algorithm.
- property bits: int
The number of encryption bits.
- property encryption_key: bytes
The RC4 or AES encryption key used for this file.
- property file_method: str
Encryption method used to encode the whole file.
- property stream_method: str
Encryption method used to encode streams.
- property string_method: str
Encryption method used to encode strings.
- property user_password: bytes
If possible, return the user password.
The user password can only be retrieved when a PDF is opened with the owner password and when older versions of the encryption algorithm are used.
The password is always returned as bytes even if it has a clear Unicode representation.
- class pikepdf.Annotation
Describes an annotation in a PDF, such as a comment, underline, copy editing marks, interactive widgets, redactions, 3D objects, sound and video clips.
See the PDF 1.7 Reference Manual section 12.5.6 for the full list of annotation types and definition of terminology.
New in version 2.12.
- property appearance_dict
Returns the annotation’s appearance dictionary.
- property appearance_state
Returns the annotation’s appearance state (or None).
For a checkbox or radio button, the appearance state may be pikepdf.Name.On or pikepdf.Name.Off.
- property flags
Returns the annotation’s flags.
- get_appearance_stream(*args, **kwargs)
Overloaded function.
get_appearance_stream(self: pikepdf.Annotation, which: pikepdf.Object) -> pikepdf.Object
Returns one of the appearance streams associated with an annotation.
- Args:
- which: Usually one of pikepdf.Name.N, pikepdf.Name.R or pikepdf.Name.D, indicating the normal, rollover or down appearance stream, respectively. If any other name is passed, an appearance stream with that name is returned.
get_appearance_stream(self: pikepdf.Annotation, which: pikepdf.Object, state: pikepdf.Object) -> pikepdf.Object
Returns one of the appearance streams associated with an annotation.
- Args:
- which: Usually one of pikepdf.Name.N, pikepdf.Name.R or pikepdf.Name.D, indicating the normal, rollover or down appearance stream, respectively. If any other name is passed, an appearance stream with that name is returned.
- state: The appearance state. For checkboxes or radio buttons, the appearance state is usually whether the button is on or off.
- get_page_content_for_appearance(self: pikepdf.Annotation, name: pikepdf.Object, rotate: int, required_flags: int = 0, forbidden_flags: int = 3) → bytes
Generate content stream text that draws this annotation as a Form XObject.
- Parameters
name (pikepdf.Name) – What to call the object we create.
rotate – Should be set to the page’s /Rotate value or 0.
Note
This method is done mainly with QPDF. Its behavior may change when different QPDF versions are used.
- property obj
Returns the underlying object for this annotation.
- property subtype
Returns the subtype of this annotation.