pdf2docx.converter module
PDF to Docx Converter.
- exception pdf2docx.converter.ConversionException
Bases:
Exception
- class pdf2docx.converter.Converter(pdf_file: str = None, password: str = None, stream: bytes = None)
Bases:
object
The PDF to docx converter.
* Read PDF file with PyMuPDF to get raw layout data page by page, including text, images, drawings and their properties, e.g. bounding box, font, size, image width and height.
* Analyze layout at document level, e.g. page header, footer and margin.
* Parse page layout to docx structure, e.g. paragraphs and their properties like indentation, spacing and text alignment; tables and their properties like border, shading and merging.
* Finally, generate docx with python-docx.
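A minimal usage sketch of the class described above, using only the constructor, convert() and close() as documented on this page. The wrapper name convert_pdf and its parameters are hypothetical; the import is done lazily inside the function so the snippet can be loaded even where pdf2docx is not installed.

```python
def convert_pdf(pdf_path, docx_path=None, start=0, end=None, pages=None):
    """Convert a PDF file to docx with pdf2docx (hypothetical helper)."""
    # Lazy import: pdf2docx is only needed when the function is called.
    from pdf2docx.converter import Converter

    cv = Converter(pdf_path)
    try:
        # All arguments map one-to-one onto Converter.convert().
        cv.convert(docx_path, start=start, end=end, pages=pages)
    finally:
        # Always release the underlying PDF document.
        cv.close()
```

A call such as convert_pdf("input.pdf", "output.docx") would convert every page; the file names are placeholders.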
- close()
- convert(docx_filename: str | IO = None, start: int = 0, end: int = None, pages: list = None, **kwargs)
Convert specified PDF pages to docx file.
- Args:
docx_filename (str, file-like, optional): docx file to write. Defaults to None.
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes. Defaults to None.
kwargs (dict, optional): Configuration parameters. Defaults to None.
Refer to default_settings() for details of the configuration parameters.
Note
If docx_filename is None, the output file name is the PDF file name with its extension changed from pdf to docx.
Note
* start and end are counted from zero if --zero_based_index=True (the default).
* Start from the first page if start is omitted.
* End with the last page if end is omitted.
Note
pages has a higher priority than start and end. start and end work only if pages is omitted.
Note
Multi-processing works only for continuous pages specified by start and end.
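The notes above can be sketched in plain Python. This is an illustration of the documented rules, not the library's own code: the helper names default_docx_name and resolve_page_indexes are hypothetical, indexes are assumed to be zero-based (the documented default), and end is assumed to be an exclusive bound like a Python slice, which this page does not state explicitly.

```python
from pathlib import Path


def default_docx_name(pdf_path):
    """Derive the output name by swapping the extension to .docx."""
    return str(Path(pdf_path).with_suffix(".docx"))


def resolve_page_indexes(page_count, start=0, end=None, pages=None):
    """Pick zero-based page indexes following the documented priority."""
    # `pages` wins over `start` and `end` whenever it is given.
    if pages:
        return [int(i) for i in pages]
    # Assumption: `end` is exclusive; None means "up to the last page".
    stop = page_count if end is None else end
    return list(range(page_count))[start:stop]
```

With a five-page document, resolve_page_indexes(5, start=1, end=3, pages=[0, 4]) selects pages 0 and 4, because start and end are ignored once pages is supplied.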
- debug_page(i: int, docx_filename: str = None, debug_pdf: str = None, layout_file: str = None, **kwargs)
Parse, create and plot a single page for debugging purposes.
- Args:
i (int): Page index to convert.
docx_filename (str): docx filename to write to.
debug_pdf (str): New pdf file storing layout information. Defaults to the PDF file name with the prefix debug_ added.
layout_file (str): New json file storing parsed layout data. Defaults to layout.json.
- property default_settings
Default parsing parameters.
- deserialize(filename: str)
Load parsed pages from specified JSON file.
- extract_tables(start: int = 0, end: int = None, pages: list = None, **kwargs)
Extract table contents from specified PDF pages.
- Args:
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes. Defaults to None.
kwargs (dict, optional): Configuration parameters. Defaults to None.
- Returns:
list: A list of parsed table content.
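A short sketch of how extract_tables() might be used. The helper names are hypothetical, and the shape of the returned data is an assumption: each table is treated as a list of rows and each row as a list of cell values, where a cell may be None. Check the actual return value before relying on this.

```python
import csv
import io


def extract_tables_from_pdf(pdf_path, start=0, end=None, pages=None):
    """Return the list produced by Converter.extract_tables()."""
    # Lazy import: pdf2docx is only needed when the function is called.
    from pdf2docx.converter import Converter

    cv = Converter(pdf_path)
    try:
        return cv.extract_tables(start=start, end=end, pages=pages)
    finally:
        cv.close()


def table_to_csv(table):
    """Render one table (assumed: list of rows of cells) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: empty cells may be None; write them as blanks.
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buffer.getvalue()
```

Each table returned by extract_tables_from_pdf("input.pdf") could then be passed to table_to_csv(); the file name is a placeholder.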
- property fitz_doc
- load_pages(start: int = 0, end: int = None, pages: list = None)
Step 1 of converting process: open PDF file with PyMuPDF, especially for password encrypted file.
- Args:
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes to parse. Defaults to None.
- make_docx(filename_or_stream=None, **kwargs)
Step 4 of converting process: create docx file with converted pages.
- Args:
filename_or_stream (str, file-like): docx file to write.
kwargs (dict, optional): Configuration parameters.
- property pages
- parse(start: int = 0, end: int = None, pages: list = None, **kwargs)
Parse pages in three steps:
* open PDF file with PyMuPDF
* analyze whole document, e.g. page section, header/footer and margin
* parse specified pages, e.g. paragraph, image and table
- Args:
start (int, optional): First page to process. Defaults to 0, the first page.
end (int, optional): Last page to process. Defaults to None, the last page.
pages (list, optional): Range of page indexes to parse. Defaults to None.
kwargs (dict, optional): Configuration parameters.
- parse_document(**kwargs)
Step 2 of converting process: analyze whole document, e.g. page section, header/footer and margin.
- parse_pages(**kwargs)
Step 3 of converting process: parse pages, e.g. paragraph, image and table.
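The four numbered steps documented for load_pages(), parse_document(), parse_pages() and make_docx() can be run one after another, as in the sketch below. The wrapper name convert_step_by_step is hypothetical. Passing the full default_settings to each step is an assumption made here so that every configuration key is present; convert() may handle this differently.

```python
def convert_step_by_step(pdf_path, docx_path, start=0, end=None, pages=None, **overrides):
    """Run the four documented conversion steps explicitly."""
    # Lazy import: pdf2docx is only needed when the function is called.
    from pdf2docx.converter import Converter

    cv = Converter(pdf_path)
    try:
        # Start from the documented defaults, then apply caller overrides.
        settings = dict(cv.default_settings)
        settings.update(overrides)

        cv.load_pages(start, end, pages)     # step 1: open the PDF
        cv.parse_document(**settings)        # step 2: document-level analysis
        cv.parse_pages(**settings)           # step 3: per-page parsing
        cv.make_docx(docx_path, **settings)  # step 4: write the docx file
    finally:
        cv.close()
```

Splitting the pipeline this way is mainly useful when something needs to happen between steps, for example inspecting the pages property after parsing.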
- restore(data: dict)
Restore pages from parsed results.
- serialize(filename: str)
Write parsed pages to specified JSON file.
- store()
Store parsed pages in dict format.
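As a rough sketch of how serialize() might be used to keep parsed layout data for later inspection: the helper names are hypothetical, and the structure of the JSON written by serialize() is not described on this page, so read_layout simply returns whatever the file contains.

```python
import json


def save_parsed_layout(pdf_path, json_path, **overrides):
    """Parse a PDF and write the parsed pages to a JSON file."""
    # Lazy import: pdf2docx is only needed when the function is called.
    from pdf2docx.converter import Converter

    cv = Converter(pdf_path)
    try:
        # Assumption: passing the full default settings keeps every
        # configuration key available to the parsing steps.
        settings = dict(cv.default_settings)
        settings.update(overrides)
        cv.parse(**settings)
        cv.serialize(json_path)
    finally:
        cv.close()


def read_layout(json_path):
    """Load a previously written layout file with the standard library."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)
```

The same file could later be handed to deserialize() on a new Converter instance, while store() and restore() offer the equivalent round trip through an in-memory dict.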
- exception pdf2docx.converter.MakedocxException
Bases:
ConversionException