pdf-custom
Files
Install
npx skills add alisonaquinas/llm-doc-skills --skill 'pdf-custom' -g -y
/plugin install doc-skills@llm-skills
npx skills when you only want this skill, or install the bundle once to make every included skill available through the plugin marketplace flow. Browse the full skill bundle repository at github.com/alisonaquinas/llm-doc-skills.Invoke
/doc-skills:pdf-custom
SKILL.md
name: pdf-custom description: Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
PDF Processing Guide
Intent Router
Load sections based on the task:
- Extract text → "Quick Start" + "pdfplumber - Text and Table Extraction" for layout-aware extraction
- Merge/split/rotate PDFs → "Command-Line Tools" for qpdf or "Python Libraries" for pypdf
- Create PDF from scratch → "reportlab - Create PDFs" section with canvas or Platypus examples
- Fill PDF forms → Read FORMS.md for detailed form-filling patterns
- Scanned PDF / OCR → "Extract Text from Scanned PDFs" for pytesseract workflow
- Advanced operations → "Command-Line Tools" for qpdf, pdftk, or "Python Libraries" for pypdfium2
Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
Quick Start
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
text += page.extract_text()
Python Libraries
pypdf - Basic Operations
Merge PDFs
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
Split PDF
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)
Extract Metadata
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
Rotate Pages
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)
pdfplumber - Text and Table Extraction
Extract Text with Layout
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
Extract Tables
with pdfplumber.open("document.pdf") as pdf:
for i, page in enumerate(pdf.pages):
tables = page.extract_tables()
for j, table in enumerate(tables):
print(f"Table {j+1} on page {i+1}:")
for row in table:
print(row)
Advanced Table Extraction
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table: # Check if table is not empty
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)
# Combine all tables
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
reportlab - Create PDFs
Basic PDF Creation
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
# Add a line
c.line(100, height - 140, 400, height - 140)
# Save
c.save()
Create PDF with Multiple Pages
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
# Build PDF
doc.build(story)
Subscripts and Superscripts
IMPORTANT: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.
Instead, use ReportLab's XML markup tags in Paragraph objects:
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet
styles = getSampleStyleSheet()
# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
For canvas-drawn text (not Paragraph objects), manually adjust font the size and position rather than using Unicode subscripts/superscripts.
Command-Line Tools
pdftotext (poppler-utils)
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5
qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
pdftk (if available)
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split
pdftk input.pdf burst
# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
Common Tasks
Extract Text from Scanned PDFs
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
# Convert PDF to images
images = convert_from_path('scanned.pdf')
# OCR each page
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
Add Watermark
from pypdf import PdfReader, PdfWriter
# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]
# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)
Extract Images
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
Password Protection
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
# Add password
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |
Next Steps
- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries such as
pdf-lib, see REFERENCE.md - If you need to fill out a PDF form, follow FORMS.md
- For troubleshooting guides, see REFERENCE.md
API Reference
Sources: pypdf docs, pdfplumber GitHub, ReportLab docs
pypdf (version 6.x)
PdfReader
from pypdf import PdfReader
reader = PdfReader("file.pdf")
reader = PdfReader("file.pdf", password="secret") # Encrypted PDF
| Property / Method | Type / Returns | Notes |
|---|---|---|
.pages | list[PageObject] | All pages |
.metadata | DocumentInformation | Title, Author, Subject, Creator, etc. |
.outline | list | Bookmarks/outline tree |
.named_destinations | dict | Named navigation targets |
.get_num_pages() | int | Total page count |
.get_page(page_number) | PageObject | 0-indexed |
.get_fields() | dict | None | Form fields |
.is_encrypted | bool | |
.decrypt(password) | int | Returns 0 (fail), 1 (user), 2 (owner) |
PageObject
page = reader.pages[0]
text = page.extract_text()
text = page.extract_text(extraction_mode="layout") # Preserve layout
| Method | Parameters | Returns |
|---|---|---|
extract_text() | extraction_mode: str = "plain", orientations: tuple = (0,90,180,270) | str |
extract_xform_text() | — | str |
merge_page(page2) | page2: PageObject | None (modifies in-place) |
merge_transformed_page(page2, ctm) | Transformation matrix | None |
rotate(angle) | angle: int (90, 180, 270) | PageObject |
scale(sx, sy) | sx, sy: float | None |
scale_by(factor) | factor: float | None |
scale_to(width, height) | pixels | None |
compress_content_streams() | — | None |
transfer_rotation_to_content() | — | None |
Properties: .mediabox, .cropbox, .bleedbox, .trimbox, .artbox, .rotation, .images, .annotations
PdfWriter
from pypdf import PdfWriter
writer = PdfWriter()
writer.add_page(reader.pages[0])
writer.clone_reader_document_root(reader) # Clone entire document
with open("output.pdf", "wb") as f:
writer.write(f)
| Method | Parameters | Notes |
|---|---|---|
add_page(page) | PageObject | Appends page |
insert_page(page, index) | PageObject, int | Insert at position |
remove_page(page_index) | int | |
add_blank_page(width, height) | pts | |
clone_reader_document_root(reader) | PdfReader | Full document clone |
append(fileobj, pages, import_outline) | Path / Reader | Merge files |
encrypt(user_password, owner_password, use_128bit) | str, str, bool=True | |
decrypt(password) | str | |
add_bookmark(title, pagenum, parent) | Add outline entry | |
add_annotation(page_number, annotation) | ||
set_page_layout(layout) | "/SinglePage" etc. | |
set_page_mode(mode) | "/UseOutlines" etc. | |
add_metadata(infos) | dict | Update metadata |
compress_identical_objects(remove_identicals, remove_orphans) | bool | Reduce file size |
Transformation
from pypdf import Transformation
op = Transformation().rotate(90).translate(tx=50, ty=100).scale(sx=0.5, sy=0.5)
page.add_transformation(op)
pdfplumber
Opening and navigating
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
page = pdf.pages[0]
text = page.extract_text()
pdfplumber.open(path, password=None, laparams=None, unicode_norm=None, strict_metadata=False)
PDF properties
| Property | Type | Notes |
|---|---|---|
.metadata | dict | CreationDate, Producer, Title, Author, … |
.pages | list[Page] | All pages |
Page properties
| Property | Type | Notes |
|---|---|---|
.page_number | int | 1-based |
.width, .height | float | Points |
.chars | list[dict] | Character objects |
.lines | list[dict] | Line objects |
.rects | list[dict] | Rectangle objects |
.curves | list[dict] | Curve objects |
.images | list[dict] | Image objects |
.annots | list[dict] | Annotations |
.hyperlinks | list[dict] | Hyperlink annotations |
.edges | list[dict] | All edges (from rects, curves, lines) |
Page methods
| Method | Parameters | Returns |
|---|---|---|
extract_text() | x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13 | str |
extract_words() | x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False | list[dict] |
extract_tables() | table_settings={} | list[list[list[str]]] |
extract_table() | table_settings={} | list[list[str]] (first table only) |
find_tables() | table_settings={} | list[Table] |
crop(bbox) | (x0,top,x1,bottom) | Page |
within_bbox(bbox) | (x0,top,x1,bottom) | Page |
outside_bbox(bbox) | (x0,top,x1,bottom) | Page |
filter(test_function) | callable | Page |
to_image(resolution=72) | int | PageImage |
close() | — | Flush cache |
Character object fields
text, fontname, size, x0, x1, y0, y1, top, bottom, width, height, upright, stroking_color, non_stroking_color, matrix
Table settings (key options)
table_settings = {
"vertical_strategy": "lines", # "lines", "lines_strict", "text", "explicit"
"horizontal_strategy": "lines", # same options
"explicit_vertical_lines": [], # x-coordinates
"explicit_horizontal_lines": [], # y-coordinates
"snap_tolerance": 3,
"join_tolerance": 3,
"edge_min_length": 3,
"min_words_vertical": 3,
"min_words_horizontal": 1,
"intersection_tolerance": 3,
"text_tolerance": 3,
"text_x_tolerance": 3,
"text_y_tolerance": 3,
}
ReportLab
Canvas (low-level drawing)
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter, A4
from reportlab.lib.units import inch, cm, mm
c = canvas.Canvas("output.pdf", pagesize=letter)
width, height = letter # 612, 792 pts
# Coordinates: origin bottom-left, y increases upward
c.drawString(1*inch, 10*inch, "Hello")
c.drawRightString(7.5*inch, 10*inch, "Right-aligned")
c.drawCentredString(4.25*inch, 10*inch, "Centered")
c.showPage() # Start new page
c.save()
| Canvas Method | Parameters | Notes |
|---|---|---|
drawString(x,y,text) | pts | Bottom-left origin |
drawRightString(x,y,text) | pts | Right-aligned at x |
drawCentredString(x,y,text) | pts | Centered at x |
setFont(name, size) | str, float | e.g., "Helvetica", 12 |
setFillColor(color) | Color | colors.red, HexColor("#FF0000") |
setStrokeColor(color) | Color | |
setLineWidth(width) | float | pts |
line(x1,y1,x2,y2) | pts | Draw line |
rect(x,y,width,height,fill,stroke) | pts | fill=0|1, stroke=0|1 |
circle(cx,cy,r) | pts | |
ellipse(x1,y1,x2,y2) | bounding box | |
drawImage(path,x,y,width,height) | pts | |
beginPath() / moveTo() / lineTo() / curveTo() / closePath() | Path drawing | |
translate(x,y) | pts | Transform origin |
rotate(angle) | degrees | |
saveState() / restoreState() | Push/pop graphics state | |
showPage() | Finalize page | |
save() | Write file |
Built-in fonts: Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique, Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic, Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique, Symbol, ZapfDingbats
Platypus (high-level layout)
from reportlab.platypus import (
SimpleDocTemplate, Paragraph, Spacer, PageBreak, KeepTogether,
Table, TableStyle, Image, HRFlowable, ListFlowable, ListItem
)
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_RIGHT, TA_JUSTIFY
doc = SimpleDocTemplate("report.pdf", pagesize=letter,
leftMargin=inch, rightMargin=inch,
topMargin=inch, bottomMargin=inch)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("Title", styles["Title"]))
story.append(Spacer(1, 0.2*inch))
story.append(PageBreak())
doc.build(story)
Built-in styles: Normal, Title, Heading1–Heading6, BodyText, Italic, Bold, BulletList, Definition, Code
ParagraphStyle options
ParagraphStyle(
name="MyStyle",
fontName="Helvetica",
fontSize=12,
leading=14, # Line height
spaceBefore=6, # pts before paragraph
spaceAfter=6,
leftIndent=0,
rightIndent=0,
firstLineIndent=0,
alignment=TA_LEFT,
textColor=colors.black,
backColor=None,
borderWidth=0,
borderColor=None,
borderPadding=0,
borderRadius=None,
)
Platypus Table
data = [["Header 1", "Header 2"], ["Row 1 Col 1", "Row 1 Col 2"]]
t = Table(data, colWidths=[3*inch, 3*inch], rowHeights=None)
t.setStyle(TableStyle([
("BACKGROUND", (0,0), (-1,0), colors.grey),
("TEXTCOLOR", (0,0), (-1,0), colors.white),
("FONTNAME", (0,0), (-1,0), "Helvetica-Bold"),
("FONTSIZE", (0,0), (-1,-1), 10),
("ALIGN", (0,0), (-1,-1), "CENTER"),
("VALIGN", (0,0), (-1,-1), "MIDDLE"),
("GRID", (0,0), (-1,-1), 0.5, colors.black),
("ROWBACKGROUNDS", (0,1), (-1,-1), [colors.white, colors.lightgrey]),
("TOPPADDING", (0,0), (-1,-1), 4),
("BOTTOMPADDING", (0,0), (-1,-1), 4),
]))
TableStyle commands use (col, row) tuples; -1 means last.
XML markup in Paragraphs
# Bold, italic, color, links, sub/superscript
Paragraph("<b>Bold</b> and <i>italic</i>", styles["Normal"])
Paragraph('<font color="red" size="14">Red text</font>', styles["Normal"])
Paragraph('x<super>2</super> + H<sub>2</sub>O', styles["Normal"])
Paragraph('<a href="https://example.com">Link</a>', styles["Normal"])
Command-Line Tools (qpdf, pdftk, pdftotext)
qpdf (recommended for merge/split/rotate)
# Merge multiple PDFs
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf
# Extract page range
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# Rotate specific pages (+90, +180, +270, -90, or absolute 0,90,180,270)
qpdf input.pdf output.pdf --rotate=+90:1 # Page 1 only
qpdf input.pdf output.pdf --rotate=90 # All pages
# Decrypt
qpdf --password=secret --decrypt encrypted.pdf out.pdf
# Linearize (optimize for web streaming)
qpdf --linearize input.pdf output.pdf
# Inspect PDF structure
qpdf --check input.pdf
qpdf --json input.pdf | jq .
# Split each page to separate file
qpdf --split-pages input.pdf page-%d.pdf
pdftk
pdftk A=file1.pdf B=file2.pdf cat A B output merged.pdf
pdftk input.pdf burst output page_%04d.pdf
pdftk input.pdf rotate 1-endeast output rotated.pdf # east=90°, west=270°, south=180°
pdftk input.pdf dump_data > metadata.txt
pdftk input.pdf update_info metadata.txt output updated.pdf
pdftotext (poppler)
pdftotext input.pdf # output to input.txt
pdftotext -layout input.pdf output.txt # Preserve layout spacing
pdftotext -f 1 -l 5 input.pdf out.txt # Pages 1-5 only
pdftotext -nopgbrk input.pdf out.txt # No page break characters
pdftotext -enc UTF-8 input.pdf out.txt # Force encoding
pdfimages (poppler)
pdfimages -j input.pdf prefix # Extract as JPEG
pdfimages -png input.pdf prefix # Extract as PNG
pdfimages -f 2 -l 4 input.pdf prefix # Pages 2-4 only
pdfimages -list input.pdf # List images without extracting
Page size reference
| Size | Width (pts) | Height (pts) |
|---|---|---|
| US Letter | 612 | 792 |
| US Legal | 612 | 1008 |
| A4 | 595 | 842 |
| A3 | 842 | 1191 |
| Tabloid | 792 | 1224 |