pdf2docx.shape.Path module

Objects representing PDF paths (strokes and fills) extracted from PDF drawings and annotations.

Data structure based on results of page.get_drawings():

{
    'color': (x,x,x) or None,  # stroke color
    'fill' : (x,x,x) or None,  # fill color
    'width': float,            # line width
    'closePath': bool,         # whether to connect last and first point
    'rect' : rect,             # page area covered by this path
    'items': [                 # list of draw commands: lines, rectangles, quads or curves
        ("l", p1, p2),         # a line from p1 to p2
        ("c", p1, p2, p3, p4), # cubic Bézier curve from p1 to p4, p2 and p3
                               # are the control points
        ("re", rect),          # a rect represented with two diagonal points
        ("qu", quad)           # a quad represented with four corner points
    ],
    ...
}
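
As a rough illustration of the structure above, the following sketch builds a hand-written path dict and inspects it. Plain tuples stand in for the PyMuPDF point and rect objects, and `raw_path` and `summarize` are hypothetical names used only for this example.

```python
# Hypothetical path dict shaped like the structure above. Plain tuples stand in
# for the PyMuPDF point/rect objects found in real get_drawings() results.
raw_path = {
    "color": (0.0, 0.0, 0.0),      # stroke color: black
    "fill": None,                  # no fill color
    "width": 1.0,                  # line width
    "closePath": False,
    "rect": (10.0, 10.0, 110.0, 60.0),
    "items": [
        ("l", (10.0, 10.0), (110.0, 10.0)),   # a line from p1 to p2
        ("re", (10.0, 20.0, 110.0, 60.0)),    # a rect given by two diagonal points
    ],
}


def summarize(path):
    """Report whether a path is stroked/filled and count its draw commands."""
    counts = {}
    for item in path["items"]:
        command = item[0]  # first element is the command code: "l", "c", "re" or "qu"
        counts[command] = counts.get(command, 0) + 1
    return {
        "stroked": path["color"] is not None,
        "filled": path["fill"] is not None,
        "commands": counts,
    }


summary = summarize(raw_path)
```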

Note

The coordinates extracted by page.get_drawings() are based on the real page coordinate system (CS), i.e. with rotation taken into account. This differs from page.get_text('rawdict').

class pdf2docx.shape.Path.C(item)

Bases: Segment

Bézier curve path with source ("c", p1, p2, p3, p4).

class pdf2docx.shape.Path.L(item)

Bases: Segment

Line path with source ("l", p1, p2).

property length

Length of line.
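
A minimal sketch of such a length computation, assuming the end points are available as (x, y) pairs. `line_length` is a hypothetical helper, not the library's own code.

```python
import math


def line_length(p1, p2):
    """Euclidean distance between the two end points of a line item."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])
```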

to_strokes(width: float, color: list)

Convert the line to a stroke dict.

Args:

width (float): Specify width for the stroke.
color (list): Specify color for the stroke.

Returns:

list: A list of Stroke dicts.

Note

A line corresponds to exactly one stroke, but for consistency with the other segment types the stroke dict is returned inside a list. The length of that list is therefore always 1.
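
The list-wrapping convention can be sketched as follows. The Stroke dict layout is not given in this section, so the keys `start`, `end`, `width` and `color` are assumptions made for illustration, and `line_to_strokes` is a hypothetical stand-in for the method.

```python
def line_to_strokes(p1, p2, width, color):
    """Wrap the single stroke of a line in a list, matching the other segment types."""
    stroke = {
        "start": p1,     # assumed key: first end point
        "end": p2,       # assumed key: second end point
        "width": width,
        "color": color,
    }
    return [stroke]      # always a one-element list


strokes = line_to_strokes((0.0, 0.0), (50.0, 0.0), width=1.0, color=[0.0, 0.0, 0.0])
```

Returning a list even for a single stroke lets callers handle every segment type with the same loop.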

class pdf2docx.shape.Path.Path(raw: dict)

Bases: object

Path extracted from PDF, consisting of one or more Segments.

property is_fill

property is_iso_oriented

A path is iso-oriented when all of its segments are iso-oriented.

property is_stroke

plot(canvas)

Plot the path for debugging purposes.

Args:

canvas: PyMuPDF drawing canvas created by page.new_shape().

to_shapes()

Convert the path to raw Shape dicts.

Returns:

list: A list of Shape dicts.

class pdf2docx.shape.Path.Q(item)

Bases: R

Quad path with source ("qu", quad).

class pdf2docx.shape.Path.R(item)

Bases: Segment

Rect path with source ("re", rect).

to_strokes(width: float, color: list)

Convert each edge to a stroke dict.

Args:

width (float): Specify width for the stroke.
color (list): Specify color for the stroke.

Returns:

list: A list of Stroke dicts.

Note

One Rect path is converted to a list of 4 stroke dicts.
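
A sketch of the edge-by-edge conversion, assuming the rect is given as (x0, y0, x1, y1). As above, the Stroke dict keys and the name `rect_to_strokes` are hypothetical and only illustrate the idea.

```python
def rect_to_strokes(rect, width, color):
    """Turn the four edges of an axis-aligned rect into four stroke dicts."""
    x0, y0, x1, y1 = rect
    # Corners in drawing order; each consecutive pair forms one edge.
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    strokes = []
    for i in range(4):
        strokes.append({
            "start": corners[i],            # assumed key
            "end": corners[(i + 1) % 4],    # assumed key; wraps back to the first corner
            "width": width,
            "color": color,
        })
    return strokes


edges = rect_to_strokes((0.0, 0.0, 4.0, 3.0), width=0.5, color=[0.0, 0.0, 0.0])
```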

class pdf2docx.shape.Path.Segment(item)

Bases: object

A segment of a path, e.g. a line, a rectangle or a curve.

to_strokes(width: float, color: list)

class pdf2docx.shape.Path.Segments(items: list, close_path=False)

Bases: object

A sub-path composed of one or more segments.

property area

Calculate the area enclosed by the segments using Green's formula. Note that the boundary of a Bézier curve is approximated by its control points.
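
For a closed polygon, Green's formula reduces to the shoelace sum A = ½ |Σ (xᵢ·yᵢ₊₁ − xᵢ₊₁·yᵢ)|. The sketch below applies it to a list of (x, y) points. `polygon_area` is a hypothetical helper, and treating Bézier control points as ordinary polygon vertices is the simplification described above, not an exact curve area.

```python
def polygon_area(points):
    """Area of a closed polygon via the shoelace form of Green's formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]   # wrap around to close the polygon
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0            # abs() makes the result orientation-independent
```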

property bbox

Calculate the bounding box (bbox) of the segments.

property is_iso_oriented

Iso-oriented criterion: the ratio of the real area to the bbox area exceeds 0.9.
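
The criterion can be sketched on a list of (x, y) points as below. The helper names are hypothetical, and returning False for a zero-area bbox is a choice made for this sketch only, since this section does not describe how degenerate cases are handled.

```python
def polygon_area(points):
    """Shoelace area of the closed polygon through the given points."""
    n = len(points)
    total = sum(
        points[i][0] * points[(i + 1) % n][1] - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(total) / 2.0


def bbox_area(points):
    """Area of the axis-aligned bounding box of the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def is_iso_oriented(points, threshold=0.9):
    """True when the enclosed area covers more than `threshold` of the bbox area."""
    box = bbox_area(points)
    if box == 0:
        return False  # sketch-only choice for degenerate input
    return polygon_area(points) / box > threshold
```

An axis-aligned rectangle fills its bbox completely and passes, while a right triangle covers only half of its bbox and fails.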

property points

Connected points of segments.

to_fill(color: list)

Convert the closed area of the segments to a Fill dict.

Args:

color (list): Specify fill color.

Returns:

dict: Fill dict.

to_strokes(width: float, color: list)

Convert each segment to a Stroke dict.

Args:

width (float): Specify stroke width.
color (list): Specify stroke color.

Returns:

list: A list of Stroke dicts.