github.com/jsvine/pdfplumber @v0.11.10

487 symbols · 1,734 edges · 38 files · 84 documented (17%) · 14 cross-repo links · updated 18d ago · v0.11.10 · 2026-06-15 · ★ 10,505 · 76 open issues
README

pdfplumber

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.10, 3.11, 3.12, 3.13, 3.14.

Translations of this document are available in: Chinese (by @hbh112233abc).

To report a bug or request a feature, please file an issue. To ask a question or request assistance with a specific PDF, please use the discussions forum.

Installation

pip install pdfplumber

Command line interface

Basic example

curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

Options

Argument Description
--format [format] csv, json, or text. The csv and json formats return information about each object. Of those two, the json format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes. The text option returns a plain-text representation of the PDF, using Page.extract_text(layout=True).
--pages [list of pages] A space-delimited, 1-indexed list of pages or hyphenated page ranges. E.g., 1 11-15, which would return data for pages 1, 11, 12, 13, 14, and 15.
--types [list of object types to extract] Choices are char, rect, line, curve, image, annot, et cetera. Defaults to all available.
--laparams A JSON-formatted string (e.g., '{"detect_vertical": true}') to pass to pdfplumber.open(..., laparams=...).
--precision [integer] The number of decimal places to round floating-point numbers. Defaults to no rounding.
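
To make the --pages format concrete, here is a minimal sketch of how a space-delimited list of pages and hyphenated ranges could be expanded into individual page numbers. The function name expand_page_spec is hypothetical and is not taken from pdfplumber's source; the CLI's own parsing may differ in its details.

```python
def expand_page_spec(tokens):
    """Expand tokens such as ["1", "11-15"] into a flat list of 1-indexed page numbers.

    Hypothetical helper for illustration; not part of pdfplumber's API.
    """
    pages = []
    for token in tokens:
        if "-" in token:
            # A hyphenated token is an inclusive range, e.g. "11-15".
            start, end = (int(part) for part in token.split("-"))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(token))
    return pages


# Splitting on whitespace mimics how a shell passes space-delimited arguments.
print(expand_page_spec("1 11-15".split()))
```

The same expansion applies whether the tokens come from the shell or from a string split on whitespace.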

Python library

Basic example

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

Loading a PDF

To start working with a PDF, call pdfplumber.open(x), where x can be a:

  • path to your PDF file
  • file object, loaded as bytes
  • file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.

To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password = "test").

To pass layout analysis parameters to pdfminer.six's layout engine, use the laparams keyword argument, e.g., pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 }).

To pre-normalize Unicode text, pass unicode_norm=..., where ... is one of the four Unicode normalization forms: "NFC", "NFD", "NFKC", or "NFKD".

By default, invalid metadata values only produce a warning. To raise an exception instead when the metadata cannot be parsed, pass strict_metadata=True to pdfplumber.open.
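
For readers unfamiliar with the four normalization forms accepted by unicode_norm, the sketch below uses Python's standard unicodedata module to show what each form does to two sample strings. It illustrates the forms themselves; it does not claim to show how pdfplumber applies them internally. The helper name normalize_all is hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return the text normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


samples = {
    "fi ligature (U+FB01)": "\ufb01",
    "e + combining acute": "e\u0301",
}

for label, text in samples.items():
    for form, result in normalize_all(text).items():
        # Print code points so that visually identical results can be told apart.
        print(label, form, [hex(ord(ch)) for ch in result])
```

The compatibility forms (NFKC, NFKD) split the ligature into the plain letters "f" and "i", which can matter when searching extracted text.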

The pdfplumber.PDF class

The top-level pdfplumber.PDF class represents a single PDF and has two main properties:

Property Description
.metadata A dictionary of metadata key/value pairs, drawn from the PDF's Info trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera.
.pages A list containing one pdfplumber.Page instance per page loaded.

... and also has the following method:

Method Description
.close() Calling this method calls Page.close() on each page, and also closes the file stream (except in cases when the stream is external, i.e., already opened and passed directly to pdfplumber).

The pdfplumber.Page class

The pdfplumber.Page class is at the core of pdfplumber. Most things you'll do with pdfplumber will revolve around this class. It has these main properties:

Property Description
.page_number The sequential page number, starting with 1 for the first page, 2 for the second, and so on.
.width The page's width.
.height The page's height.
.objects / .chars / .lines / .rects / .curves / .images Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below.

... and these main methods:

Method Description
.crop(bounding_box, relative=False, strict=True) Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values (x0, top, x1, bottom). Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.) When strict=True (the default), the crop's bounding box must fall entirely within the page's bounding box.
.within_bbox(bounding_box, relative=False, strict=True) Similar to .crop, but only retains objects that fall entirely within the bounding box.
.outside_bbox(bounding_box, relative=False, strict=True) Similar to .crop and .within_bbox, but only retains objects that fall entirely outside the bounding box.
.filter(test_function) Returns a version of the page with only the .objects for which test_function(obj) returns True.
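
The difference between .crop, .within_bbox, and .outside_bbox is easiest to see on plain dictionaries. The following is a simplified sketch of the three retention rules and of the relative=True offset, written against the descriptions above. All function names here are hypothetical, and the real methods return page objects and handle more attributes than the four coordinates used in this sketch.

```python
def overlap(obj_bbox, crop_bbox):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None."""
    ox0, otop, ox1, obottom = obj_bbox
    cx0, ctop, cx1, cbottom = crop_bbox
    x0, top = max(ox0, cx0), max(otop, ctop)
    x1, bottom = min(ox1, cx1), min(obottom, cbottom)
    width, height = x1 - x0, bottom - top
    # Zero-width or zero-height objects (e.g. straight lines) still count.
    if width >= 0 and height >= 0 and width + height > 0:
        return (x0, top, x1, bottom)
    return None


def bbox_of(obj):
    return (obj["x0"], obj["top"], obj["x1"], obj["bottom"])


def crop_like(objects, bbox):
    """Keep objects at least partly inside bbox, sliced to fit it."""
    kept = []
    for obj in objects:
        clipped = overlap(bbox_of(obj), bbox)
        if clipped is not None:
            x0, top, x1, bottom = clipped
            kept.append({**obj, "x0": x0, "top": top, "x1": x1, "bottom": bottom})
    return kept


def within_like(objects, bbox):
    """Keep only objects that fall entirely inside bbox."""
    return [o for o in objects if overlap(bbox_of(o), bbox) == bbox_of(o)]


def outside_like(objects, bbox):
    """Keep only objects that fall entirely outside bbox."""
    return [o for o in objects if overlap(bbox_of(o), bbox) is None]


def to_absolute(bbox, page_bbox):
    """Interpret bbox as an offset from the page's top-left corner (relative=True)."""
    px0, ptop = page_bbox[0], page_bbox[1]
    x0, top, x1, bottom = bbox
    return (x0 + px0, top + ptop, x1 + px0, bottom + ptop)


objects = [
    {"name": "inside", "x0": 10, "top": 10, "x1": 20, "bottom": 20},
    {"name": "straddling", "x0": 40, "top": 40, "x1": 60, "bottom": 60},
    {"name": "outside", "x0": 80, "top": 80, "x1": 90, "bottom": 90},
]
box = (0, 0, 50, 50)
print([o["name"] for o in crop_like(objects, box)])
print([o["name"] for o in within_like(objects, box)])
print([o["name"] for o in outside_like(objects, box)])
```

Note that the straddling object is kept by the crop rule (with its right and bottom edges sliced to 50) but by neither of the other two.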

... and also has the following method:

Method Description
.close() By default, Page objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory.
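
When walking through a large PDF, one way to keep memory use flat is to release each page's cache as soon as you are done with it. The sketch below assumes only the .pages, .page_number, .chars, and .close() members described in this document; the helper name count_chars_per_page is hypothetical.

```python
def count_chars_per_page(pdf):
    """Yield (page_number, char_count) for each page, flushing its cache afterwards."""
    for page in pdf.pages:
        count = len(page.chars)  # accessing .chars populates the page's cache
        page.close()             # release the cached layout and object data
        yield page.page_number, count


# Usage with a real file (the path is a placeholder):
#
# import pdfplumber
# with pdfplumber.open("path/to/file.pdf") as pdf:
#     for number, count in count_chars_per_page(pdf):
#         print(number, count)
```

Collect whatever you need from a page before calling .close(), since the cached data has to be rebuilt if it is accessed again.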

Additional methods are described in the sections below:

Objects

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects, all derived from pdfminer.six PDF parsing. The following properties each return a Python list of the matching objects:

  • .chars, each representing a single text character.
  • .lines, each representing a single 1-dimensional line.
  • .rects, each representing a single 2-dimensional rectangle.
  • .curves, each representing any series of connected points that pdfminer.six does not recognize as a line or rectangle.
  • .images, each representing an image.
  • .annots, each representing a single PDF annotation (cf. Section 8.4 of the official PDF specification for details)
  • .hyperlinks, each representing a single PDF annotation of the subtype Link and having a URI action attribute

Each object is represented as a simple Python dict, with the following properties:

char properties

Property Description
page_number Page number on which this character was found.
text E.g., "z", or "Z" or " ".
fontname Name of the character's font face.
size Font size.
adv Equal to text width * the font size * scaling factor.
upright Whether the character is upright.
height Height of the character.
width Width of the character.
x0 Distance of left side of character from left side of page.
x1 Distance of right side of character from left side of page.
y0 Distance of bottom of character from bottom of page.
y1 Distance of top of character from bottom of page.
top Distance of top of character from top of page.
bottom Distance of bottom of the character from top of page.
doctop Distance of top of character from top of document.
matrix The "current transformation matrix" for this character. (See below for details.)
mcid The marked content section ID for this character if any (otherwise None). Experimental attribute.
tag The marked content section tag for this character if any (otherwise None). Experimental attribute.
ncs TKTK
stroking_pattern TKTK
non_stroking_pattern TKTK
stroking_color The color of the character's outline (i.e., stroke). See docs/colors.md for details.
non_stroking_color The character's interior color. See docs/colors.md for details.
object_type "char"
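
The table above mixes two vertical conventions: y0 and y1 are measured from the bottom of the page, while top, bottom, and doctop are measured from the top. The following sketch shows how the two relate, assuming an unrotated page whose coordinate origin is its bottom-left corner and assuming that doctop adds the heights of all preceding pages. The function name to_top_based is hypothetical, and the numbers are illustrative rather than drawn from a real PDF.

```python
def to_top_based(y0, y1, page_height, preceding_page_heights=()):
    """Convert bottom-based y0/y1 into top-based top/bottom/doctop.

    Assumes an unrotated page with its origin at the bottom-left corner.
    """
    top = page_height - y1      # the upper edge is nearest the top of the page
    bottom = page_height - y0   # the lower edge is farther from the top
    doctop = sum(preceding_page_heights) + top
    return {"top": top, "bottom": bottom, "doctop": doctop}


# A character on the second page of a document with 792-unit-tall pages.
print(to_top_based(y0=700, y1=712, page_height=792, preceding_page_heights=[792]))
```

Because top and bottom grow downward, sorting objects by top yields reading order from the top of the page.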

Note: A character’s matrix property represents the “current transformation matrix,” as described in Section 4.2.2 of the PDF Reference (6th Ed.). The matrix controls the character’s scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The pdfplumber.ctm submodule defines a class, CTM, that assists with these calculations. For instance:

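import pdfplumber
from pdfplumber.ctm import CTM

with pdfplumber.open("path/to/file.pdf") as pdf:
    my_char = pdf.pages[0].chars[3]
    # Unpack the six matrix values into the CTM helper class
    my_char_ctm = CTM(*my_char["matrix"])
    my_char_rotation = my_char_ctm.skew_x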
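
To show how scale, skew, and translation can be read off a six-value matrix (a, b, c, d, e, f), here is a standalone sketch of one common decomposition. The class name MatrixSketch is hypothetical, and the formulas are an assumption about a reasonable approach; pdfplumber's own CTM class may compute these values differently.

```python
import math
from typing import NamedTuple


class MatrixSketch(NamedTuple):
    """Illustrative decomposition of a PDF transformation matrix (not pdfplumber's CTM)."""

    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    @property
    def scale_x(self):
        return math.hypot(self.a, self.b)

    @property
    def scale_y(self):
        return math.hypot(self.c, self.d)

    @property
    def skew_x(self):
        # Angle of the transformed y-axis, measured relative to vertical, in degrees.
        return math.degrees(math.atan2(self.d, self.c)) - 90

    @property
    def skew_y(self):
        # Angle of the transformed x-axis, in degrees.
        return math.degrees(math.atan2(self.b, self.a))

    @property
    def translation(self):
        return (self.e, self.f)


identity = MatrixSketch(1, 0, 0, 1, 0, 0)
quarter_turn = MatrixSketch(0, 1, -1, 0, 0, 0)  # 90-degree counterclockwise rotation
print(identity.skew_x, quarter_turn.skew_x, quarter_turn.skew_y)
```

For a pure rotation both skew values agree, which is consistent with the note above that rotation can usually be read from the x-axis skew.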

line properties

Property Description
page_number Page number on which this line was found.
height Height of line.
width Width of line.
x0 Distance of left-side extremity from left side of page.
x1 Distance of right-side extremity from left side of page.
y0 Distance of bottom extremity from bottom of page.
y1 Distance of top extremity from bottom of page.
top Distance of top of line from top of page.
bottom Distance of bottom of the line from top of page.
doctop Distance of top of line from top of document.
linewidth Thickness of line.
stroking_color The color of the line. See docs/colors.md for details.
non_stroking_color The non-stroking color specified for the line’s path. See docs/colors.md for details.
mcid The marked content section ID for this line if any (otherwise None). Experimental attribute.
tag The marked content section tag for this line if any (otherwise None). Experimental attribute.
object_type "line"

rect properties

Property Description
page_number Page number on which this rectangle was found.
height Height of rectangle.
width Width of rectangle.
x0 Distance of left side of rectangle from left side of page.
x1 Distance of right side of rectangle from left side of page.
y0 Distance of bottom of rectangle from bottom of page.
y1 Distance of top of rectangle from bottom of page.
top Distance of top of rectangle from top of page.
bottom Distance of bottom of the rectangle from top of page.
doctop Distance of top of rectangle from top of document.
linewidth Thickness of line.
stroking_color The color of the rectangle's outline. See docs/colors.md for details.
non_stroking_color The rectangle’s fill color. See docs/colors.md for details.
mcid The marked content section ID for this rect if any (otherwise None). Experimental attribute.
tag The marked content section tag for this rect if any (otherwise None). Experimental attribute.
object_type "rect"

curve properties

Property Description
page_number Page number on which this curve was found.
pts A list of (x, top) tuples indicating the points on the curve.
path A list of (cmd, *(x, top)) tuples describing the full path description, including (for example) control points used in Bezier curves.
height Height of curve's bounding box.
width Width of curve's bounding box.
x0 Distance of curve's left-most point from left side of page.
x1 Distance of curve's right-most point from left side of the page.
y0 Distance of curve's lowest point from bottom of page.

Core symbols most depended-on inside this repo

  • open (pdfplumber/pdf.py): called by 114
  • extract_text (pdfplumber/page.py): called by 49
  • crop (pdfplumber/page.py): called by 37
  • extract_words (pdfplumber/page.py): called by 32
  • search (pdfplumber/page.py): called by 21
  • close (pdfplumber/pdf.py): called by 21
  • find_all (pdfplumber/structure.py): called by 18
  • decode_text (pdfplumber/utils/pdfinternals.py): called by 16

Shape

  • Method: 366
  • Function: 75
  • Class: 46

Languages

  • Python: 100%

Modules by API surface

  • pdfplumber/page.py: 59 symbols
  • tests/test_utils.py: 41 symbols
  • pdfplumber/table.py: 33 symbols
  • pdfplumber/utils/text.py: 30 symbols
  • tests/test_issues.py: 27 symbols
  • tests/test_basics.py: 26 symbols
  • pdfplumber/display.py: 25 symbols
  • tests/test_structure.py: 22 symbols
  • pdfplumber/structure.py: 22 symbols
  • pdfplumber/container.py: 22 symbols
  • pdfplumber/utils/geometry.py: 21 symbols
  • tests/test_convert.py: 20 symbols

Dependencies from manifests, versioned

  • Pillow 12.2.0 · 1×
  • black 26.5.1 · 1×
  • flake8 7.3.0 · 1×
  • isort 8.0.1 · 1×
  • jupyterlab 4.5.8 · 1×
  • mypy 2.1.0 · 1×
  • nbexec 0.2.0 · 1×
  • pandas 2.3.3 · 1×
  • pandas-stubs 2.3.3.260113 · 1×
  • pdfminer.six 20260107 · 1×
  • py 1.11.0 · 1×
  • pypdfium2 5.9.0 · 1×
