pdf2docx.image.ImagesExtractor module
Extract images from PDF.
Both raster images and vector graphics are considered:
- Normal images such as jpeg or png can be extracted with the methods Page.get_text('rawdict') and Page.get_images(). Note that png images with an alpha channel need special processing.
- Vector graphics are composed of a group of paths, represented by operators like re, m, l and c. They're detected by finding the contours with opencv.
- class pdf2docx.image.ImagesExtractor.ImagesExtractor(page: Page)
Bases: object
Extract images from PDF.
- clip_page_to_dict(bbox: Rect = None, rm_image: bool = False, clip_image_res_ratio: float = 3.0)
Clip page pixmap (without text) according to bbox and convert to source image.
- Args:
- bbox (fitz.Rect, optional): Target area to clip. Defaults to None, i.e. entire page.
- rm_image (bool, optional): Whether to remove images. Defaults to False.
- clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.
- Returns:
list: A list of image raw dict.
- clip_page_to_pixmap(bbox: Rect = None, rm_image: bool = False, zoom: float = 3.0)
Clip page pixmap according to bbox.
- Args:
- bbox (fitz.Rect, optional): Target area to clip. Defaults to None, i.e. entire page. Note that bbox is expressed in the un-rotated page coordinate system, while clipping is based on the final page.
- rm_image (bool, optional): Whether to remove images. Defaults to False.
- zoom (float, optional): Increase resolution by this factor. Defaults to 3.0.
- Returns:
fitz.Pixmap: The extracted pixmap.
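As a rough illustration of the zoom argument, the sketch below estimates the pixel size of a clipped region and shows one possible call sequence. The helper names expected_pixel_size and clip_first_page are hypothetical. The simple scale-and-round rule is an assumption, since the rendering library may round the scaled rectangle slightly differently. The usage function assumes PyMuPDF (fitz) and pdf2docx are installed.

```python
from typing import Optional, Tuple


def expected_pixel_size(width_pt: float, height_pt: float, zoom: float = 3.0) -> Tuple[int, int]:
    """Estimate the pixmap size for a region given in points.

    Assumption: each dimension is scaled by `zoom` and rounded.
    """
    return round(width_pt * zoom), round(height_pt * zoom)


def clip_first_page(pdf_path: str, bbox: Optional[tuple] = None, zoom: float = 3.0) -> Tuple[int, int]:
    """Hypothetical usage: clip the first page and return the pixmap size."""
    # Imported lazily so the helper above works without these packages.
    import fitz  # PyMuPDF
    from pdf2docx.image.ImagesExtractor import ImagesExtractor

    with fitz.open(pdf_path) as doc:
        extractor = ImagesExtractor(doc[0])
        clip = fitz.Rect(*bbox) if bbox is not None else None
        pix = extractor.clip_page_to_pixmap(bbox=clip, zoom=zoom)
        return pix.width, pix.height


print(expected_pixel_size(100, 50, zoom=3.0))  # (300, 150)
```

A larger zoom gives a sharper bitmap at the cost of memory, since the pixel count grows with the square of the factor.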
- detect_svg_contours(min_svg_gap_dx: float, min_svg_gap_dy: float, min_w: float, min_h: float)
Find contours of potential vector graphics.
- Args:
- min_svg_gap_dx (float): Merge svg if the horizontal gap is less than this value.
- min_svg_gap_dy (float): Merge svg if the vertical gap is less than this value.
- min_w (float): Ignore contours if the bbox width is less than this value.
- min_h (float): Ignore contours if the bbox height is less than this value.
- Returns:
list: A list of potential svg regions, each a tuple (external_bbox, inner_bboxes: list).
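To make the four thresholds concrete, here is a minimal pure-Python sketch that groups plain bounding boxes. It is a simplified stand-in, not the library's opencv-based detection. The names union, gaps and group_regions are hypothetical, and it assumes two regions merge only when both the horizontal and the vertical gap fall below their thresholds.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def union(boxes: List[BBox]) -> BBox:
    """Smallest box enclosing all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def gaps(a: BBox, b: BBox) -> Tuple[float, float]:
    """Horizontal and vertical gap between two boxes (0 where they overlap)."""
    dx = max(a[0], b[0]) - min(a[2], b[2])
    dy = max(a[1], b[1]) - min(a[3], b[3])
    return max(dx, 0.0), max(dy, 0.0)


def group_regions(boxes: List[BBox], min_gap_dx: float, min_gap_dy: float,
                  min_w: float, min_h: float) -> List[Tuple[BBox, List[BBox]]]:
    """Group nearby boxes and return (external_bbox, inner_bboxes) pairs."""
    groups = [[b] for b in boxes]
    merged = True
    while merged:  # repeat until no pair of groups is close enough to merge
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                dx, dy = gaps(union(groups[i]), union(groups[j]))
                if dx < min_gap_dx and dy < min_gap_dy:
                    groups[i].extend(groups.pop(j))
                    merged = True
                    break
            if merged:
                break
    regions = []
    for group in groups:
        # Drop inner boxes that are narrower than min_w or shorter than min_h.
        inner = [b for b in group
                 if b[2] - b[0] >= min_w and b[3] - b[1] >= min_h]
        regions.append((union(group), inner))
    return regions


boxes = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 102, 102)]
regions = group_regions(boxes, min_gap_dx=5, min_gap_dy=5, min_w=5, min_h=5)
```

In this example the first two boxes are 2 units apart horizontally and overlap vertically, so they form one region. The third box stays separate, and its 2 x 2 inner box is filtered out by min_w and min_h.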
- extract_images(clip_image_res_ratio: float = 3.0)
Extract normal images with Page.get_images().
- Args:
- clip_image_res_ratio (float, optional): Resolution ratio of clipped bitmap. Defaults to 3.0.
- Returns:
list: A list of extracted and recovered image raw dict.
Note
Page.get_images() lists each image only once, so the result may contain fewer entries than the real number of images on a page.
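The difference between unique images and image occurrences can be illustrated with made-up data. In this sketch, placements is a hypothetical list of (xref, bbox) pairs, assuming each image object is identified by an integer xref. It does not call the library.

```python
from collections import Counter

# Hypothetical placements on one page: (xref, bbox) for every drawn image.
placements = [
    (7, (50, 50, 150, 150)),
    (7, (50, 300, 150, 400)),   # the same image object drawn a second time
    (12, (200, 50, 400, 250)),
]

# A "once per image" listing corresponds to the distinct xrefs.
unique_xrefs = sorted({xref for xref, _ in placements})

# The real count on the page is the number of placements.
occurrences = Counter(xref for xref, _ in placements)

print(unique_xrefs)      # [7, 12]
print(len(placements))   # 3
print(occurrences[7])    # 2
```

Two distinct images are listed here, yet three images appear on the page, because the image with xref 7 is drawn twice.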