pdf2docx.text.Line module

Text Line objects based on PDF raw dict extracted with PyMuPDF.

Data structure of a line in a text block, as found in the PyMuPDF raw dict:

{
    'bbox': (x0,y0,x1,y1),
    'wmode': m,
    'dir': [x,y],
    'spans': [ spans ]
}
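As a rough illustration, a raw line dict of this shape might look like the following. All values are invented for the example, and the per-span `chars` layout is an assumption based on PyMuPDF's raw dict output, where each span carries a list of characters rather than a single text string. The helper `line_text` is hypothetical and not part of pdf2docx.

```python
# Hypothetical raw line dict; every value here is made up for illustration.
raw_line = {
    "bbox": (72.0, 100.0, 200.0, 112.0),  # (x0, y0, x1, y1)
    "wmode": 0,                            # writing mode: 0 = horizontal, 1 = vertical
    "dir": (1.0, 0.0),                     # writing direction as a unit vector
    "spans": [
        {
            "bbox": (72.0, 100.0, 130.0, 112.0),
            "chars": [{"c": ch} for ch in "Hello "],
        },
        {
            "bbox": (130.0, 100.0, 200.0, 112.0),
            "chars": [{"c": ch} for ch in "world"],
        },
    ],
}


def line_text(line: dict) -> str:
    """Concatenate the characters of every span in order (hypothetical helper)."""
    return "".join(ch["c"] for span in line["spans"] for ch in span["chars"])


print(line_text(raw_line))  # prints "Hello world"
```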
class pdf2docx.text.Line.Line(raw: dict = None)

Bases: Element

Object representing a line in text block.

add(span_or_list)

Add a span or a list of spans to the current Line.

Args:

span_or_list (Span, Iterable): TextSpan or TextSpan list to add.

add_span(span: Element)

Add a single span to the current Line.

property image_spans

Get image spans in this Line.

intersects(rect)

Create a new Line object holding the spans contained in the given bbox.

Args:

rect (fitz.Rect): Target bbox.

Returns:

Line: The created Line instance.
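The idea of selecting spans by bounding box can be sketched without pdf2docx or PyMuPDF. This is a simplified stand-in, not the library's implementation: bboxes are plain `(x0, y0, x1, y1)` tuples, spans are hypothetical dicts, and a span is kept only when it lies fully inside the target rectangle. The real method may treat partially covered spans differently.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: BBox, inner: BBox) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def spans_within(spans: List[Dict], rect: BBox) -> List[Dict]:
    """Keep only the spans whose bbox is fully inside `rect` (assumed rule)."""
    return [span for span in spans if contains(rect, span["bbox"])]


# Hypothetical spans: "A" is inside the target area, "B" sticks out to the right.
spans = [
    {"text": "A", "bbox": (10.0, 10.0, 40.0, 20.0)},
    {"text": "B", "bbox": (90.0, 10.0, 130.0, 20.0)},
]
selected = spans_within(spans, (0.0, 0.0, 100.0, 50.0))
print([span["text"] for span in selected])  # prints ['A']
```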

make_docx(p)

Create docx line, i.e. a run in python-docx.

property raw_text

Join the text of all spans, ignoring images.

store()

Store properties in raw dict.

strip()

Remove redundant blanks at the start of the first span and at the end of the last span.
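A minimal sketch of this kind of stripping, assuming a line is reduced to a list of span strings. The function name `strip_spans` is hypothetical, and the sketch only trims the outer edges of the first and last span, leaving interior spacing untouched.

```python
from typing import List


def strip_spans(span_texts: List[str]) -> List[str]:
    """Trim leading blanks of the first span and trailing blanks of the last."""
    texts = list(span_texts)  # work on a copy so the input stays unchanged
    if texts:
        texts[0] = texts[0].lstrip()
        texts[-1] = texts[-1].rstrip()
    return texts


print(strip_spans(["  Hello ", "world  "]))  # prints ['Hello ', 'world']
```

Note that the blank between the two spans survives, since only the outer edges of the line are trimmed.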

property text

Join the text of all spans. Note that an image is translated to the placeholder <image>.
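The difference between `raw_text` and `text` can be illustrated with a small stand-alone sketch. The span dicts and the `type` key below are assumptions made for the example, not the pdf2docx data model; only the `<image>` placeholder comes from the description above.

```python
from typing import Dict, List

IMAGE_PLACEHOLDER = "<image>"

# Hypothetical spans: two text spans with an image span between them.
spans: List[Dict] = [
    {"type": "text", "text": "Figure: "},
    {"type": "image"},
    {"type": "text", "text": " caption"},
]


def raw_text(spans: List[Dict]) -> str:
    """Join text spans and skip image spans entirely."""
    return "".join(s["text"] for s in spans if s["type"] == "text")


def text(spans: List[Dict]) -> str:
    """Join text spans and substitute a placeholder for each image span."""
    return "".join(
        s["text"] if s["type"] == "text" else IMAGE_PLACEHOLDER for s in spans
    )


print(repr(raw_text(spans)))  # 'Figure:  caption'
print(repr(text(spans)))      # 'Figure: <image> caption'
```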

property text_direction

Get text direction. Consider LEFT_RIGHT and BOTTOM_TOP only.

Returns:

TextDirection: Text direction of this line.
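One way such a property could be derived from the line's `dir` vector is sketched below. The `Direction` enum is a hypothetical stand-in for `TextDirection`, and the mapping is an assumption: `(1, 0)` is read as left-to-right, `(0, -1)` as bottom-to-top, and anything else is ignored.

```python
import math
from enum import Enum
from typing import Sequence


class Direction(Enum):
    """Hypothetical stand-in for pdf2docx's TextDirection."""
    IGNORE = -1
    LEFT_RIGHT = 0
    BOTTOM_TOP = 1


def direction_from_dir(dir_vector: Sequence[float]) -> Direction:
    """Map a writing-direction unit vector to a direction (assumed mapping)."""
    x, y = dir_vector
    if math.isclose(x, 1.0):
        return Direction.LEFT_RIGHT
    if math.isclose(y, -1.0):
        return Direction.BOTTOM_TOP
    return Direction.IGNORE


print(direction_from_dir((1.0, 0.0)))   # Direction.LEFT_RIGHT
print(direction_from_dir((0.0, -1.0)))  # Direction.BOTTOM_TOP
```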

property white_space_only

Whether this line contains only white space. If True, this line is safe to remove.
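A check of this kind might be sketched as follows, assuming each span is represented either by its text or by `None` for an image span. That representation, the function name, and the rule that an image span makes the line non-blank are assumptions made for this example.

```python
from typing import List, Optional


def is_white_space_only(span_texts: List[Optional[str]]) -> bool:
    """Return True if every span is a text span made up of white space only.

    `None` stands for an image span (assumed convention), which counts as content.
    """
    for span_text in span_texts:
        if span_text is None:   # image span: the line has visible content
            return False
        if span_text.strip():   # non-blank text
            return False
    return True


lines = [["  ", "\t"], ["  ", "text"], [" ", None]]
kept = [line for line in lines if not is_white_space_only(line)]
print(len(kept))  # prints 2
```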