pdf2docx.table.TableStructure module
Parsing table structure based on strokes and fills.
- class pdf2docx.table.TableStructure.CellStructure(bbox: list)
Bases: object
Cell structure with properties bbox, borders, shading, etc.
- property is_merged
- property is_merging
- parse_borders(h_strokes: dict, v_strokes: dict)
Parse cell borders from strokes.
- Args:
- h_strokes (dict): A dict mapping each y-coordinate to the horizontal strokes at that position, e.g.
{y0: [h1,h2,..], y1: [h3,h4,...]}
- v_strokes (dict): A dict mapping each x-coordinate to the vertical strokes at that position, e.g.
{x0: [v1,v2,..], x1: [v3,v4,...]}
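The dict layout expected by parse_borders can be illustrated with a small standalone sketch. It does not use pdf2docx; the function name `group_strokes` and the plain `(x0, y0, x1, y1)` tuples standing in for stroke objects are assumptions made for illustration only.

```python
from collections import defaultdict

def group_strokes(strokes):
    """Group axis-aligned strokes by their fixed coordinate.

    Hypothetical helper: each stroke is a plain (x0, y0, x1, y1) tuple,
    not a pdf2docx stroke object.
    """
    h_strokes = defaultdict(list)  # y-coordinate -> horizontal strokes
    v_strokes = defaultdict(list)  # x-coordinate -> vertical strokes
    for x0, y0, x1, y1 in strokes:
        if y0 == y1:
            h_strokes[y0].append((x0, y0, x1, y1))
        elif x0 == x1:
            v_strokes[x0].append((x0, y0, x1, y1))
    return dict(h_strokes), dict(v_strokes)

strokes = [
    (0, 0, 10, 0),   # horizontal stroke at y=0
    (10, 0, 20, 0),  # horizontal stroke at y=0
    (0, 0, 0, 5),    # vertical stroke at x=0
    (10, 0, 10, 5),  # vertical stroke at x=10
]
h_strokes, v_strokes = group_strokes(strokes)
```

Here `h_strokes` has a single key, `0`, holding both horizontal strokes, while `v_strokes` has the keys `0` and `10`.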
- class pdf2docx.table.TableStructure.TableStructure(strokes: Shapes, **settings)
Bases: object
Parsing table structure based on strokes/fills.
Steps to parse table structure:

        x0        x1       x2        x3
    y0  +----h1---+---h2---+----h3---+
        |         |        |         |
        v1        v2       v3        v4
        |         |        |         |
    y1  +----h4------------+----h5---+
        |                  |         |
        v5                 v6        v7
        |                  |         |
    y2  +--------h6--------+----h7---+

1. Group horizontal and vertical strokes:

    self.h_strokes = {
        y0 : [h1, h2, h3],
        y1 : [h4, h5],
        y2 : [h6, h7]
    }

The coordinates [x0, x1, x2, x3] x [y0, y1, y2] form the table lattice, i.e. 2 rows x 3 cols.

2. Check merged cells in the row and column directions.

Let the horizontal line y=(y0+y1)/2 cross the table: it intersects v1, v2 and v3, indicating that no cells are merged in the first row.

For y=(y1+y2)/2, there is no intersection with a vertical stroke at x=x1, i.e. the merging status is [1, 0, 1], indicating that Cell(2,2) is merged into Cell(2,1).

So the final merging status in this case is:

    [
        [(1,1), (1,1), (1,1)],
        [(1,2), (0,0), (1,1)]
    ]
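The merging check described above can be reproduced with a standalone sketch. This is not the library's implementation: the helper names (`crossing_flags`, `span`, `merging_status`), the `(x0, y0, x1, y1)` stroke tuples, the concrete coordinates, the reading of each tuple as (rows spanned, columns spanned), and the normalisation of merged-away cells to `(0, 0)` are all assumptions chosen so that the result matches the example.

```python
def crossing_flags(ref, grouped, keys, axis):
    """For each lattice line except the last, flag (1/0) whether a stroke on it spans `ref`."""
    lo, hi = (1, 3) if axis == "row" else (0, 2)  # compare y-range for rows, x-range for columns
    flags = []
    for key in keys[:-1]:
        hit = any(s[lo] < ref < s[hi] for s in grouped.get(key, []))
        flags.append(1 if hit else 0)
    return flags

def span(flags):
    """Number of cells covered starting at flags[0]; 0 if that cell is merged away."""
    if flags[0] == 0:
        return 0
    n = 1
    for f in flags[1:]:
        if f:
            break
        n += 1
    return n

def merging_status(h_strokes, v_strokes):
    xs, ys = sorted(v_strokes), sorted(h_strokes)
    # one flag list per row (vertical strokes) and per column (horizontal strokes)
    rows = [crossing_flags((ys[i] + ys[i + 1]) / 2, v_strokes, xs, "row")
            for i in range(len(ys) - 1)]
    cols = [crossing_flags((xs[j] + xs[j + 1]) / 2, h_strokes, ys, "col")
            for j in range(len(xs) - 1)]
    status = []
    for i, row in enumerate(rows):
        out = []
        for j in range(len(cols)):
            n_col, n_row = span(row[j:]), span(cols[j][i:])
            # assumption: a cell merged away in either direction is reported as (0, 0)
            out.append((n_row, n_col) if n_row and n_col else (0, 0))
        status.append(out)
    return status

# The lattice from the diagram, with x0..x3 = 0, 10, 20, 30 and y0..y2 = 0, 10, 20.
v_strokes = {
    0:  [(0, 0, 0, 10), (0, 10, 0, 20)],      # v1, v5
    10: [(10, 0, 10, 10)],                    # v2
    20: [(20, 0, 20, 10), (20, 10, 20, 20)],  # v3, v6
    30: [(30, 0, 30, 10), (30, 10, 30, 20)],  # v4, v7
}
h_strokes = {
    0:  [(0, 0, 10, 0), (10, 0, 20, 0), (20, 0, 30, 0)],  # h1, h2, h3
    10: [(0, 10, 20, 10), (20, 10, 30, 10)],              # h4, h5
    20: [(0, 20, 20, 20), (20, 20, 30, 20)],              # h6, h7
}
status = merging_status(h_strokes, v_strokes)
```

With these inputs, `status` equals the merging status listed above: the first cell of the second row spans one row and two columns, and its right-hand neighbour is marked as merged away.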
- property bbox
Table boundary bbox.
- Returns:
fitz.Rect: bbox of table.
- property num_cols
- property num_rows
- parse(fills: Shapes)
Parse table structure.
- Args:
fills (Shapes): Fill shapes representing cell shading.
- to_table_block()
Convert parsed table structure to TableBlock instance.
- Returns:
TableBlock: Parsed table block instance.
- property x_cols
Left x-coordinate x0 of each column.
- Returns:
list: x-coordinates of each column.
- property y_rows
Top y-coordinate y0 of each row.
- Returns:
list: y-coordinates of each row.