pdf2docx.common.Collection module¶
A group of instances, e.g. Blocks, Lines, Spans, Shapes.
- class pdf2docx.common.Collection.BaseCollection(instances: list = None, parent=None)¶
Bases: object
Base collection representing a list of instances.
- append(instance)¶
- property bbox¶
bbox of combined collection.
- extend(instances: list)¶
- property parent¶
- reset(instances: list = None)¶
Reset instances list.
- Args:
instances (list, optional): reset to target instances. Defaults to None.
- Returns:
BaseCollection: self
- restore(*args, **kwargs)¶
Construct Collection from a list of dict.
- store()¶
Store attributes in json format.
- class pdf2docx.common.Collection.Collection(instances: list = None, parent=None)¶
Bases: BaseCollection, IText
Collection of instances, focusing on grouping and sorting elements.
- group(fun)¶
Group instances according to user defined criterion.
- Args:
fun (function): with 2 arguments representing 2 instances (Element) and return bool.
- Returns:
list: a list of grouped Collection instances.
Example 1:
    # group instances intersected with each other
    fun = lambda a,b: a.bbox & b.bbox
Example 2:
    # group instances aligned horizontally
    fun = lambda a,b: a.horizontally_aligned_with(b)
Note
This is equivalent to a graph search problem: build an adjacency list, then search the graph to find all connected components.
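The note above can be illustrated with a minimal, library-independent sketch. It assumes plain tuples `(x0, y0, x1, y1)` in place of Element bboxes, and the helper names `group_by` and `intersects` are hypothetical. It is not pdf2docx's own implementation, only one way to find connected components from a pairwise criterion.

```python
from collections import defaultdict

def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap with positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])

def group_by(items, fun):
    """Group items into connected components under the pairwise criterion fun(a, b)."""
    # Build the adjacency list by testing every pair once.
    adjacent = defaultdict(set)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if fun(items[i], items[j]):
                adjacent[i].add(j)
                adjacent[j].add(i)

    # Depth-first search from each unvisited index collects one component.
    seen, groups = set(), []
    for start in range(len(items)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            k = stack.pop()
            component.append(k)
            for n in adjacent[k]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        groups.append([items[k] for k in sorted(component)])
    return groups

boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12), (2.5, 2.5, 4, 4)]
groups = group_by(boxes, intersects)
```

The first and last boxes do not overlap each other directly, yet they land in the same group because both overlap the second box. That transitive behaviour is what makes this a connected-components problem rather than a simple pairwise filter.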
- group_by_columns(factor: float = 0.0, sorted: bool = True, text_direction: bool = False)¶
Group elements into columns based on the bbox.
- group_by_connectivity(dx: float, dy: float)¶
Collect connected instances into same group.
- Args:
dx (float): x-tolerance used to define connectivity.
dy (float): y-tolerance used to define connectivity.
- Returns:
list: a list of grouped Collection instances.
Note
This is equivalent to a graph traversal problem, in which the critical step is building the adjacency list, especially for a large number of vertices (paths).
Checking intersections between paths is a rectangle-intersection problem, which is already well studied in the literature.
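As a rough illustration of the note above, the sketch below groups rectangles with a union-find structure and sorts by left edge so the pairwise check can stop early. It makes several assumptions: rectangles are plain `(x0, y0, x1, y1)` tuples, two rectangles count as connected when the gap between them is at most `dx` horizontally and `dy` vertically, and the name `connected_groups` is hypothetical. The library's actual algorithm may differ.

```python
def connected_groups(rects, dx, dy):
    """Group (x0, y0, x1, y1) rectangles whose gaps are within dx and dy."""
    parent = list(range(len(rects)))

    def find(i):
        # Find the root of i, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit rectangles in order of left edge so the inner loop can stop early.
    order = sorted(range(len(rects)), key=lambda i: rects[i][0])
    for pos, i in enumerate(order):
        ax0, ay0, ax1, ay1 = rects[i]
        for j in order[pos + 1:]:
            bx0, by0, bx1, by1 = rects[j]
            if bx0 > ax1 + dx:
                break  # every later rectangle starts even further right
            # Horizontal closeness is already guaranteed; check the vertical gap.
            if by0 <= ay1 + dy and ay0 <= by1 + dy:
                parent[find(i)] = find(j)

    # Collect rectangles by their root, preserving input order within a group.
    groups = {}
    for i in range(len(rects)):
        groups.setdefault(find(i), []).append(rects[i])
    return list(groups.values())

rects = [(0, 0, 10, 10), (12, 0, 20, 10), (40, 0, 50, 10), (0, 13, 10, 20)]
tight = connected_groups(rects, dx=3, dy=1)
loose = connected_groups(rects, dx=3, dy=3)
```

With `dy=1` the rectangle below the first one stays separate because the vertical gap is 3; raising `dy` to 3 merges it into the same group.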
- group_by_physical_rows(sorted: bool = False, text_direction: bool = False)¶
Group lines into physical rows.
- group_by_rows(factor: float = 0.0, sorted: bool = True, text_direction: bool = False)¶
Group elements into rows based on the bbox.
- sort_in_line_order()¶
Sort collection instances within a physical line, taking text direction into account, e.g. for normal reading direction: from left to right.
- sort_in_reading_order()¶
Sort collection instances in reading order (considering text direction), e.g. for normal reading direction: from top to bottom, from left to right.
- sort_in_reading_order_plus()¶
Sort instances in reading order, with special handling for instances in the same row. Taking the natural reading direction as an example: rows follow reading order, and instances within a row go from left to right. In the following example, A comes before B:

                 +-----------+
    +---------+  |           |
    |    A    |  |     B     |
    +---------+  +-----------+

Steps:
Sort elements in reading order, i.e. from top to bottom, from left to right.
Group elements in row.
Sort elements in row: from left to right.
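The three steps above can be sketched without the library, using plain `(x0, y0, x1, y1)` tuples. The row-grouping rule used here (a box joins the current row when its top lies above the row's running bottom edge) is an assumption made for illustration, and the function name `sort_reading_order_plus` is hypothetical.

```python
def sort_reading_order_plus(boxes):
    """Sort (x0, y0, x1, y1) boxes row by row, then left to right within each row."""
    # Step 1: top to bottom, then left to right.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))

    # Step 2: group into rows. A box joins the current row when it
    # overlaps the row's vertical extent (an assumed criterion).
    rows = []
    for box in ordered:
        if rows and box[1] < rows[-1]["bottom"]:
            rows[-1]["boxes"].append(box)
            rows[-1]["bottom"] = max(rows[-1]["bottom"], box[3])
        else:
            rows.append({"bottom": box[3], "boxes": [box]})

    # Step 3: sort each row from left to right and flatten.
    return [b for row in rows for b in sorted(row["boxes"], key=lambda b: b[0])]

A = (0, 10, 10, 20)   # shorter box on the left
B = (12, 0, 24, 20)   # taller box on the right, starting higher up
C = (0, 30, 10, 40)   # a box on the next row

plain = sorted([B, C, A], key=lambda b: (b[1], b[0]))
plus = sort_reading_order_plus([B, C, A])
```

A plain top-to-bottom sort puts B first because its top edge is higher, while the row-aware version places A before B, matching the diagram.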
- property text_direction¶
Get text direction. All instances must have same text direction.
- class pdf2docx.common.Collection.ElementCollection(instances: list = None, parent=None)¶
Bases:
Bases: Collection
Collection of Element instances.
- append(e: Element)¶
Append an instance, update parent’s bbox accordingly and set the parent of the added instance.
- Args:
e (Element): instance to append.
- contained_in_bbox(bbox)¶
Filter instances contained in target bbox.
- Args:
bbox (fitz.Rect): target boundary box.
- insert(nth: int, e: Element)¶
Insert an Element and update the parent’s bbox accordingly.
- Args:
nth (int): the position to insert.
e (Element): the instance to insert.
- is_flow_layout(line_separate_threshold: float, cell_layout=False)¶
Whether contained elements are in flow layout or not.
- pop(nth: int)¶
Delete the nth instance.
- Args:
nth (int): the position to remove.
- Returns:
Element: the removed instance.
- split_with_intersection(bbox: Rect, threshold: float = 0.001)¶
Split instances into two groups: those that intersect with bbox and those that do not.
- Args:
bbox (fitz.Rect): target rect box.
threshold (float): an instance is treated as intersecting when the overlap rate exceeds this threshold. Defaults to 0.001.
- Returns:
tuple: two groups, each of the original class type.
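To make the threshold concrete, here is a minimal sketch of an overlap-rate split on plain `(x0, y0, x1, y1)` tuples. The page does not define the overlap rate, so the sketch assumes it is the intersection area divided by the smaller of the two rectangle areas. The helper `area` is hypothetical, and the function returns plain lists rather than collections of the original class.

```python
def area(r):
    """Area of an (x0, y0, x1, y1) rectangle; zero if it is empty or inverted."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def split_with_intersection(rects, bbox, threshold=1e-3):
    """Return (intersecting, others) for rects measured against bbox."""
    hit, miss = [], []
    for r in rects:
        # Intersection rectangle; it is inverted when the two do not overlap.
        overlap = (max(r[0], bbox[0]), max(r[1], bbox[1]),
                   min(r[2], bbox[2]), min(r[3], bbox[3]))
        smaller = min(area(r), area(bbox))
        # Assumed definition of overlap rate: intersection area / smaller area.
        rate = area(overlap) / smaller if smaller > 0 else 0.0
        (hit if rate > threshold else miss).append(r)
    return hit, miss

bbox = (0, 0, 10, 10)
rects = [(5, 5, 15, 15), (20, 20, 30, 30), (10, 0, 20, 10)]
inside, outside = split_with_intersection(rects, bbox)
```

Under this assumed definition, a rectangle that only touches the edge of `bbox` has zero overlap area and falls into the second group.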