GPT-4o OCR PDF to Markdown Tool

Author: kenwood
Version: 0.0.2
Type: Dify Plugin Tool

Description

This tool uses OpenAI's GPT-4o model to perform OCR (Optical Character Recognition) on PDF files and convert the extracted content to Markdown format. It preserves the document structure including headings, paragraphs, lists, and tables.

Inspired by mistral_ocr_pdf project.

Features

Extract text from PDF files using GPT-4o's advanced vision capabilities
Convert extracted content to clean, formatted Markdown
Preserve document structure (headings, paragraphs, lists, tables)
Support for both Azure OpenAI and OpenAI APIs
Process PDFs in batches for efficient handling of large documents
Maintain original formatting as much as possible

Requirements

Python 3.8+
Dify Plugin Framework
OpenAI API key or Azure OpenAI API credentials
PyMuPDF (for PDF processing)

Installation

Clone this repository
Install dependencies:
```
Bash
1pip install -r requirements.txt
```

Configuration

The tool supports both OpenAI and Azure OpenAI APIs. You need to provide the appropriate credentials based on which service you want to use.

OpenAI Configuration

API Type: Select "OpenAI"
API Key: Your OpenAI API key
Model Name: (Optional) The model to use (defaults to "gpt-4o")

Azure OpenAI Configuration

API Type: Select "Azure OpenAI"
API Key: Your Azure OpenAI API key
API Endpoint: Your Azure OpenAI endpoint URL
Deployment Name: Your GPT-4o model deployment name
API Version: Azure OpenAI API version (e.g., "2023-05-15")

Usage

Once configured, you can use the tool to extract text from PDF files and convert it to Markdown:

Upload a PDF file
The tool processes the PDF in batches
GPT-4o extracts text and formats it as Markdown
The final Markdown content is returned

How It Works

The PDF is loaded and each page is converted to an image
Images are encoded in base64 format
Images are sent to GPT-4o with instructions to extract text and format as Markdown
The tool processes the PDF in batches (up to 5 pages at a time) to handle large documents
The extracted Markdown content from each batch is combined into a single document

Limitations

Performance depends on the quality of the PDF
Very large PDFs may take longer to process
Complex layouts or tables may not be perfectly preserved
Images in the PDF will be described but not included in the Markdown output

License

MIT License