productivity v2.3.0

ocr-and-documents

Extract text from PDFs and scanned documents: web_extract, pymupdf, marker-pdf

Hermes Agent
MIT

PDF & Document Extraction

For DOCX: use python-docx (parses actual document structure, far better than OCR).
For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support).
This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try web_extract first:

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
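The routing rule above can be sketched as a small helper. This is a minimal sketch, not part of the skill: `choose_route` and its return labels are hypothetical names, and it only classifies the source. Calling web_extract itself stays with the agent.

```python
from urllib.parse import urlparse


def choose_route(source: str, web_extract_failed: bool = False, batch: bool = False) -> str:
    """Return 'web_extract' or 'local' for a document source (hypothetical labels)."""
    is_url = urlparse(source).scheme in ("http", "https")
    # Local files, failed remote extraction, and batch jobs all go local.
    if not is_url or web_extract_failed or batch:
        return "local"
    return "web_extract"
```

For example, `choose_route("report.pdf")` returns `"local"`, while an `https://` URL returns `"web_extract"` unless a previous attempt failed.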

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
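One way to decide whether OCR is needed is to check whether the PDF already has a text layer. The sketch below is an assumption, not part of the helper scripts: `looks_scanned`, `needs_ocr`, and the 20-character threshold are illustrative names and values, and the heuristic can misjudge documents that mix scanned and text pages.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: treat the PDF as scanned if most pages have almost no text.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def needs_ocr(path, sample_pages=5):
    """Sample the first few pages with pymupdf and apply the heuristic."""
    import pymupdf  # imported lazily so the heuristic itself has no dependencies

    with pymupdf.open(path) as doc:
        count = min(sample_pages, doc.page_count)
        texts = [doc[i].get_text() for i in range(count)]
    return looks_scanned(texts)
```

If `needs_ocr("document.pdf")` returns `False`, pymupdf is likely sufficient; if it returns `True`, marker-pdf or web_extract is the better fit.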

If the user needs marker capabilities but the system lacks ~5GB free disk:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
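The free-space figure in that message can be obtained with the standard library. This is a minimal sketch under stated assumptions: `free_gb` and `marker_space_message` are hypothetical helpers, the 5GB requirement is taken from the text above, and the actual behavior of `extract_marker.py --check` is not shown here and may differ.

```python
import shutil

REQUIRED_GB = 5.0  # figure taken from the text above


def free_gb(path="."):
    """Free disk space in GB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def marker_space_message(available_gb, required_gb=REQUIRED_GB):
    """Return None if there is enough room, otherwise a short user-facing warning."""
    if available_gb >= required_gb:
        return None
    return (
        f"marker-pdf needs about {required_gb:.0f}GB for PyTorch and models, "
        f"but only {available_gb:.1f}GB is free."
    )
```

Calling `marker_space_message(free_gb())` before `pip install marker-pdf` gives either `None` or a warning that can be folded into the message to the user.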


pymupdf (lightweight)

pip install pymupdf pymupdf4llm

Via helper script:

python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown # Markdown
python scripts/extract_pymupdf.py document.pdf --tables # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pages

Inline:

python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
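For Markdown output from Python rather than through the helper script, pymupdf4llm provides `to_markdown`. The sketch below guesses at how a `--pages 0-4` style argument could be handled: `parse_page_range` and `pdf_to_markdown` are hypothetical helpers, and treating the range as zero-based and inclusive is an assumption, not confirmed behavior of `extract_pymupdf.py`.

```python
def parse_page_range(spec):
    """Turn '0-4' into [0, 1, 2, 3, 4]; a single number like '3' into [3].

    Assumes zero-based, inclusive ranges (an assumption, see the note above).
    """
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first < 0 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(first, last + 1))


def pdf_to_markdown(path, pages=None):
    """Convert a PDF (optionally only some pages) to a Markdown string."""
    import pymupdf4llm  # lazy import: only needed when actually converting

    page_list = parse_page_range(pages) if pages else None
    return pymupdf4llm.to_markdown(path, pages=page_list)
```

For example, `pdf_to_markdown("document.pdf", pages="0-4")` would convert only the first five pages under that assumption.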


marker-pdf (high-quality OCR)

# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf

Via helper script:

python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracy

CLI (installed with marker-pdf):

marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4 # Batch
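When calling the CLI from Python, it may help to check that `marker_single` is installed before running it. This is a minimal sketch: `build_marker_command` and `run_marker` are hypothetical wrappers, and they use only the `--output_dir` flag shown above.

```python
import shutil
import subprocess


def build_marker_command(pdf_path, output_dir="./output"):
    """Build the marker_single invocation shown above as an argument list."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def run_marker(pdf_path, output_dir="./output"):
    """Run marker_single if it is on PATH; raise a clear error otherwise."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found; run `pip install marker-pdf` first")
    subprocess.run(build_marker_command(pdf_path, output_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.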


Arxiv Papers

# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
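The abstract and PDF URLs above differ only in the path segment, so both can be derived from a paper ID. This is a minimal sketch: `arxiv_urls` is a hypothetical helper, and it assumes new-style IDs such as `2402.03300` (optionally with a version suffix like `v2`).

```python
import re

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_urls(paper_id):
    """Return the abstract and PDF URLs for a new-style arXiv ID."""
    if not ARXIV_ID.match(paper_id):
        raise ValueError(f"not a new-style arXiv ID: {paper_id!r}")
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }
```

The `"abs"` URL suits a quick abstract lookup with web_extract, and the `"pdf"` URL suits full-paper extraction.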

Split, Merge & Search

pymupdf handles these natively — use execute_code or inline Python:

# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")

# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")

# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.


Notes

  • web_extract is always first choice for URLs
  • pymupdf is the safe default — instant, no models, works everywhere
  • marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
  • Both helper scripts accept --help for full usage
  • marker-pdf downloads ~2.5GB of models to ~/.cache/huggingface/ on first use
  • For Word docs: pip install python-docx (better than OCR — parses actual structure)
  • For PowerPoint: see the powerpoint skill (uses python-pptx)

Related Skills

google-workspace

Gmail, Calendar, Drive, Contacts, Sheets, and Docs integration: gws CLI + OAuth2

linear

Linear issue, project, and team management: GraphQL API, API key authentication, works with curl alone

maps

nano-pdf

Edit PDFs with natural-language commands: fix text, correct typos, update titles