pdf2docx.shape.Paths module
Objects representing PDF paths (strokes and fills) extracted by page.get_drawings().
This method is available since PyMuPDF 1.18.0 and covers both raw PDF paths and annotations such as Line,
Square and Highlight.
https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_drawings
https://pymupdf.readthedocs.io/en/latest/faq.html#extracting-drawings
- class pdf2docx.shape.Paths.Paths(instances: list = None, parent=None)
Bases: Collection
A collection of paths.
- bbox
Calculate only once and cache property value.
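The compute-once-and-cache behaviour described for bbox can be illustrated with a minimal sketch. It assumes a hypothetical `PathCollection` class holding plain `(x0, y0, x1, y1)` tuples and uses Python's `functools.cached_property`; it is not the pdf2docx implementation, which may cache the value differently.

```python
from functools import cached_property


class PathCollection:
    """Hypothetical stand-in for a collection of paths (not the pdf2docx class)."""

    def __init__(self, boxes):
        # Each box is an (x0, y0, x1, y1) tuple; an assumption for this sketch.
        self._boxes = list(boxes)
        self.bbox_computations = 0  # counts how often the union is actually computed

    @cached_property
    def bbox(self):
        """Union of all contained boxes, computed on first access only."""
        self.bbox_computations += 1
        if not self._boxes:
            return (0.0, 0.0, 0.0, 0.0)
        x0 = min(b[0] for b in self._boxes)
        y0 = min(b[1] for b in self._boxes)
        x1 = max(b[2] for b in self._boxes)
        y1 = max(b[3] for b in self._boxes)
        return (x0, y0, x1, y1)
```

A cached value of this kind goes stale if the collection changes afterwards, so a real collection has to reset the cache whenever paths are added or removed.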
- property is_iso_oriented
It is iso-oriented when all contained segments are iso-oriented.
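As an illustration of the iso-oriented test, the sketch below treats a segment as iso-oriented when it is horizontal or vertical within a small tolerance. The function names, the plain `(x, y)` tuples and the `tol` parameter are assumptions made for this example and are not part of the pdf2docx API.

```python
def is_iso_oriented_segment(start, end, tol=1e-3):
    """Return True if the segment is horizontal or vertical within `tol`."""
    (x0, y0), (x1, y1) = start, end
    return abs(x0 - x1) <= tol or abs(y0 - y1) <= tol


def all_iso_oriented(segments, tol=1e-3):
    """A set of segments is iso-oriented only if every segment is."""
    return all(is_iso_oriented_segment(s, e, tol) for s, e in segments)


# Four edges of an axis-aligned rectangle.
rectangle = [
    ((0, 0), (10, 0)),
    ((10, 0), (10, 5)),
    ((10, 5), (0, 5)),
    ((0, 5), (0, 0)),
]
# The same edges plus one diagonal.
with_diagonal = rectangle + [((0, 0), (10, 5))]
```

Here `all_iso_oriented(rectangle)` holds, while the single diagonal in `with_diagonal` is enough to make the whole set non-iso-oriented.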
- plot(page)
Plot paths for debugging purposes.
- Args:
page (fitz.Page): PyMuPDF page.
- restore(raws: list)
Initialize paths from the raw data returned by page.get_drawings().
- to_shapes()
Convert contained paths to iso-oriented strokes or rectangular fills.
- Returns:
list: A list of Shape raw dicts.
- to_shapes_and_images(min_svg_gap_dx: float = 15, min_svg_gap_dy: float = 15, min_w: float = 2, min_h: float = 2, clip_image_res_ratio: float = 3.0)
Convert paths to iso-oriented shapes or images. The semantic type of a path is either table/text style or vector graphic. This method:
* detects svg regions, i.e. regions containing at least one non-iso-oriented path
* converts each svg region to a bitmap by clipping the page
* converts the remaining paths to iso-oriented shapes for further table/text style parsing
- Args:
min_svg_gap_dx (float): Merge svg if the horizontal gap is less than this value.
min_svg_gap_dy (float): Merge svg if the vertical gap is less than this value.
min_w (float): Ignore contours if the bbox width is less than this value.
min_h (float): Ignore contours if the bbox height is less than this value.
clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.
- Returns:
tuple: (list of shape raw dict, list of image raw dict).
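The role of min_svg_gap_dx and min_svg_gap_dy can be sketched as a merge of nearby bounding boxes. The code below is only an illustration under stated assumptions: boxes are plain `(x0, y0, x1, y1)` tuples, and `gaps`, `union` and `merge_regions` are hypothetical helpers, not pdf2docx functions. The library's actual region detection may work differently.

```python
def gaps(a, b):
    """Horizontal and vertical gap between two boxes (0 if they overlap on that axis)."""
    h = max(b[0] - a[2], a[0] - b[2], 0)
    v = max(b[1] - a[3], a[1] - b[3], 0)
    return h, v


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_regions(boxes, min_gap_dx=15, min_gap_dy=15):
    """Merge boxes whose horizontal and vertical gaps are both below the thresholds."""
    merged = []
    for box in boxes:
        box = tuple(box)
        absorbed = True
        # Keep absorbing neighbours: a grown box may reach regions it missed before.
        while absorbed:
            absorbed = False
            for other in merged:
                h, v = gaps(box, other)
                if h < min_gap_dx and v < min_gap_dy:
                    box = union(box, other)
                    merged.remove(other)
                    absorbed = True
                    break
        merged.append(box)
    return merged


regions = merge_regions([(0, 0, 10, 10), (20, 0, 30, 10), (100, 100, 110, 110)])
```

In this example the first two boxes are 10 units apart horizontally, which is below the default threshold of 15, so they merge into one region, while the third box stays separate.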