github.com/jsvine/pdfplumber @v0.11.10

487 symbols · 1,734 edges · 38 files · 84 documented · 17% · 14 cross-repo links · updated 18d ago · v0.11.10 · 2026-06-15 · ★ 10,505 · 76 open issues

README

pdfplumber

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.10, 3.11, 3.12, 3.13, 3.14.

Translations of this document are available in: Chinese (by @hbh112233abc).

To report a bug or request a feature, please file an issue. To ask a question or request assistance with a specific PDF, please use the discussions forum.

Installation
Command line interface
Python library
Visual debugging
Extracting text
Extracting tables
Extracting form values
Demonstrations
Comparison to other libraries
Acknowledgments / Contributors
Contributing

Installation

pip install pdfplumber

Command line interface

Basic example

curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

Options

Argument	Description
`--format [format]`	`csv`, `json`, or `text`. The `csv` and `json` formats return information about each object. Of those two, the `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes. The `text` option returns a plain-text representation of the PDF, using `Page.extract_text(layout=True)`.
`--pages [list of pages]`	A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.
`--types [list of object types to extract]`	Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. Defaults to all available.
`--laparams`	A JSON-formatted string (e.g., `'{"detect_vertical": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.
`--precision [integer]`	The number of decimal places to round floating-point numbers. Defaults to no rounding.
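
The options above can also be assembled from Python before handing them to a subprocess. The sketch below is only an illustration: `expand_page_spec` and `build_cli_args` are hypothetical helper names, not part of pdfplumber, and the expansion logic is an assumption based on the description of `--pages`, not the CLI's actual parser. Nothing is executed; the code only builds the argument list.

```python
import json


def expand_page_spec(tokens):
    """Expand tokens such as ["1", "11-15"] into a flat list of 1-indexed pages."""
    pages = []
    for token in tokens:
        if "-" in token:
            start, end = (int(part) for part in token.split("-"))
            pages.extend(range(start, end + 1))  # hyphenated ranges are inclusive
        else:
            pages.append(int(token))
    return pages


def build_cli_args(pdf_path, pages=None, laparams=None, fmt="csv"):
    """Build an argument list suitable for subprocess.run (not run here)."""
    args = ["pdfplumber", pdf_path, "--format", fmt]
    if pages:
        args += ["--pages", *pages]
    if laparams is not None:
        # --laparams takes a JSON-formatted string, so serialize the dict.
        args += ["--laparams", json.dumps(laparams)]
    return args


print(expand_page_spec(["1", "11-15"]))  # [1, 11, 12, 13, 14, 15]
print(build_cli_args("background-checks.pdf",
                     pages=["1", "11-15"],
                     laparams={"detect_vertical": True},
                     fmt="json"))
```

Serializing with `json.dumps` avoids hand-writing the quoting that `--laparams` requires.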

Python library

Basic example

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

Loading a PDF

To start working with a PDF, call pdfplumber.open(x), where x can be a:

path to your PDF file
file object, loaded as bytes
file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.

To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password = "test").

To set layout analysis parameters to pdfminer.six's layout engine, pass the laparams keyword argument, e.g., pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 }).

To pre-normalize Unicode text, pass unicode_norm=..., where ... is one of the four Unicode normalization forms: "NFC", "NFD", "NFKC", or "NFKD".
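
To get a feel for what the four forms do to extracted text, the standard library's `unicodedata.normalize` can be applied to two short strings. This is an illustration of the normalization forms themselves; that pdfplumber applies the same function internally is an assumption, and `normalize_all` is a hypothetical helper.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


ligature = "\ufb01le"    # "file" spelled with the single-codepoint "fi" ligature
accented = "cafe\u0301"  # "cafe" followed by a combining acute accent

for label, text in (("ligature", ligature), ("accented", accented)):
    for form, result in normalize_all(text).items():
        # ascii() makes the exact codepoints visible in the output.
        print(label, form, ascii(result), len(result))
```

The compatibility forms ("NFKC", "NFKD") expand the ligature into the two letters "f" and "i", which often matters when searching extracted text; the canonical forms ("NFC", "NFD") leave it alone and only compose or decompose accents.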

Invalid metadata values are treated as a warning by default. If that is not intended, pass strict_metadata=True to the open method and pdfplumber.open will raise an exception if it is unable to parse the metadata.

The `pdfplumber.PDF` class

The top-level pdfplumber.PDF class represents a single PDF and has two main properties:

Property	Description
`.metadata`	A dictionary of metadata key/value pairs, drawn from the PDF's `Info` trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera.
`.pages`	A list containing one `pdfplumber.Page` instance per page loaded.

... and also has the following method:

Method	Description
`.close()`	Calling this method calls `Page.close()` on each page, and also closes the file stream (except in cases when the stream is external, i.e., already opened and passed directly to `pdfplumber`).

The `pdfplumber.Page` class

The pdfplumber.Page class is at the core of pdfplumber. Most things you'll do with pdfplumber will revolve around this class. It has these main properties:

Property	Description
`.page_number`	The sequential page number, starting with `1` for the first page, `2` for the second, and so on.
`.width`	The page's width.
`.height`	The page's height.
`.objects` / `.chars` / `.lines` / `.rects` / `.curves` / `.images`	Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below.

... and these main methods:

Method	Description
`.crop(bounding_box, relative=False, strict=True)`	Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If `relative=True`, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.) When `strict=True` (the default), the crop's bounding box must fall entirely within the page's bounding box.
`.within_bbox(bounding_box, relative=False, strict=True)`	Similar to `.crop`, but only retains objects that fall entirely within the bounding box.
`.outside_bbox(bounding_box, relative=False, strict=True)`	Similar to `.crop` and `.within_bbox`, but only retains objects that fall entirely outside the bounding box.
`.filter(test_function)`	Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`.
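
A `test_function` for `.filter` is an ordinary predicate over object dicts, so it can be written and checked without a PDF. The sketch below uses hand-written stand-in dicts with made-up values; `is_large_char` and its `min_size` parameter are hypothetical names chosen for the example.

```python
def is_large_char(obj, min_size=12):
    """Keep only text characters whose font size is at least min_size."""
    return obj.get("object_type") == "char" and obj.get("size", 0) >= min_size


# Stand-ins for the dicts a page exposes (values are invented for illustration).
sample_objects = [
    {"object_type": "char", "text": "T", "size": 18.0},
    {"object_type": "char", "text": "t", "size": 9.0},
    {"object_type": "rect", "width": 100.0, "height": 20.0},
]

kept = [obj for obj in sample_objects if is_large_char(obj)]
print([obj["text"] for obj in kept])  # ['T']
```

With a real page, the same predicate could be passed directly as `page.filter(is_large_char)`.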

... and also has the following method:

Method	Description
`.close()`	By default, `Page` objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory.
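
The difference between `.crop`, `.within_bbox`, and `.outside_bbox` comes down to which objects are retained. The following sketch restates those rules over plain dicts; it is not pdfplumber's implementation, the helper names are hypothetical, and the handling of objects that exactly touch the box edge is an assumption. It also omits the slicing that `.crop` applies to partly overlapping objects.

```python
def overlaps(obj, bbox):
    """True if the object falls at least partly within bbox (the .crop rule)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] < x1 and obj["x1"] > x0
            and obj["top"] < bottom and obj["bottom"] > top)


def is_within(obj, bbox):
    """True if the object falls entirely within bbox (the .within_bbox rule)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] >= x0 and obj["x1"] <= x1
            and obj["top"] >= top and obj["bottom"] <= bottom)


def is_outside(obj, bbox):
    """True if the object falls entirely outside bbox (the .outside_bbox rule)."""
    return not overlaps(obj, bbox)


def to_absolute(bbox, page_bbox):
    """Shift a bbox given relative to the page's top-left corner (relative=True)."""
    x0, top, x1, bottom = bbox
    page_x0, page_top = page_bbox[0], page_bbox[1]
    return (x0 + page_x0, top + page_top, x1 + page_x0, bottom + page_top)


box = (0, 0, 100, 100)
inside = {"x0": 10, "x1": 20, "top": 10, "bottom": 20}
straddling = {"x0": 90, "x1": 110, "top": 10, "bottom": 20}
far_away = {"x0": 200, "x1": 210, "top": 10, "bottom": 20}

for name, obj in (("inside", inside), ("straddling", straddling), ("far_away", far_away)):
    print(name, overlaps(obj, box), is_within(obj, box), is_outside(obj, box))

print(to_absolute((10, 10, 50, 50), (100, 200, 400, 600)))  # (110, 210, 150, 250)
```

An object that straddles the edge is kept by the crop rule but rejected by both the "within" and "outside" rules, which is why the three methods can return different object counts for the same box.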

Additional methods are described in the sections below:

Visual debugging
Extracting text
Extracting tables

Objects

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects, all derived from pdfminer.six PDF parsing. The following properties each return a Python list of the matching objects:

.chars, each representing a single text character.
.lines, each representing a single 1-dimensional line.
.rects, each representing a single 2-dimensional rectangle.
.curves, each representing any series of connected points that pdfminer.six does not recognize as a line or rectangle.
.images, each representing an image.
.annots, each representing a single PDF annotation (cf. Section 8.4 of the official PDF specification for details).
.hyperlinks, each representing a single PDF annotation of the subtype Link and having a URI action attribute.

Each object is represented as a simple Python dict, with the following properties:

`char` properties

Property	Description
`page_number`	Page number on which this character was found.
`text`	E.g., "z", or "Z" or " ".
`fontname`	Name of the character's font face.
`size`	Font size.
`adv`	Equal to text width * the font size * scaling factor.
`upright`	Whether the character is upright.
`height`	Height of the character.
`width`	Width of the character.
`x0`	Distance of left side of character from left side of page.
`x1`	Distance of right side of character from left side of page.
`y0`	Distance of bottom of character from bottom of page.
`y1`	Distance of top of character from bottom of page.
`top`	Distance of top of character from top of page.
`bottom`	Distance of bottom of the character from top of page.
`doctop`	Distance of top of character from top of document.
`matrix`	The "current transformation matrix" for this character. (See below for details.)
`mcid`	The marked content section ID for this character if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this character if any (otherwise `None`). Experimental attribute.
`ncs`	TKTK
`stroking_pattern`	TKTK
`non_stroking_pattern`	TKTK
`stroking_color`	The color of the character's outline (i.e., stroke). See docs/colors.md for details.
`non_stroking_color`	The character's interior color. See docs/colors.md for details.
`object_type`	"char"

Note: A character’s matrix property represents the “current transformation matrix,” as described in Section 4.2.2 of the PDF Reference (6th Ed.). The matrix controls the character’s scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The pdfplumber.ctm submodule defines a class, CTM, that assists with these calculations. For instance:

from pdfplumber.ctm import CTM

# `pdf` is a PDF opened earlier, e.g., via pdfplumber.open(...)
my_char = pdf.pages[0].chars[3]
my_char_ctm = CTM(*my_char["matrix"])
my_char_rotation = my_char_ctm.skew_x
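
To see how scale, rotation, and translation are encoded in a six-number matrix `(a, b, c, d, e, f)`, the values can be decomposed by hand. This is the textbook decomposition of a 2D affine matrix, offered as a sketch; it is not claimed to match how `CTM` computes its properties, and `describe_matrix` is a hypothetical helper. The two sample matrices are invented.

```python
import math


def describe_matrix(matrix):
    """Decompose a PDF-style matrix (a, b, c, d, e, f) into readable parts."""
    a, b, c, d, e, f = matrix
    return {
        "scale_x": math.hypot(a, b),   # length of the transformed x unit vector
        "scale_y": math.hypot(c, d),   # length of the transformed y unit vector
        "rotation_degrees": math.degrees(math.atan2(b, a)),
        "translation": (e, f),
    }


upright = (12.0, 0.0, 0.0, 12.0, 72.0, 700.0)   # scaled by 12, no rotation
rotated = (0.0, 12.0, -12.0, 0.0, 72.0, 700.0)  # scaled by 12, turned a quarter

print(describe_matrix(upright))
print(describe_matrix(rotated))
```

For a pure rotation by an angle θ, the matrix entries are `a = cos θ`, `b = sin θ`, `c = -sin θ`, `d = cos θ`, which is why `atan2(b, a)` recovers the angle when no skew is present.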

`line` properties

Property	Description
`page_number`	Page number on which this line was found.
`height`	Height of line.
`width`	Width of line.
`x0`	Distance of left-side extremity from left side of page.
`x1`	Distance of right-side extremity from left side of page.
`y0`	Distance of bottom extremity from bottom of page.
`y1`	Distance of top extremity from bottom of page.
`top`	Distance of top of line from top of page.
`bottom`	Distance of bottom of the line from top of page.
`doctop`	Distance of top of line from top of document.
`linewidth`	Thickness of line.
`stroking_color`	The color of the line. See docs/colors.md for details.
`non_stroking_color`	The non-stroking color specified for the line’s path. See docs/colors.md for details.
`mcid`	The marked content section ID for this line if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this line if any (otherwise `None`). Experimental attribute.
`object_type`	"line"
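
Because each line is a plain dict, its `width` and `height` are enough to sort lines by orientation, which is often a first step when looking for table rules. The sketch below uses invented sample dicts; `line_orientation` and its `tolerance` parameter are hypothetical.

```python
def line_orientation(line, tolerance=0.0):
    """Classify a line dict as horizontal, vertical, or diagonal."""
    if line["height"] <= tolerance:
        return "horizontal"
    if line["width"] <= tolerance:
        return "vertical"
    return "diagonal"


# Invented stand-ins for entries of page.lines.
sample_lines = [
    {"x0": 50.0, "x1": 300.0, "top": 100.0, "bottom": 100.0, "width": 250.0, "height": 0.0},
    {"x0": 50.0, "x1": 50.0, "top": 100.0, "bottom": 400.0, "width": 0.0, "height": 300.0},
    {"x0": 50.0, "x1": 300.0, "top": 100.0, "bottom": 400.0, "width": 250.0, "height": 300.0},
]

print([line_orientation(line) for line in sample_lines])
```

A small positive `tolerance` can absorb lines that are nearly, but not exactly, axis-aligned.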

`rect` properties

Property	Description
`page_number`	Page number on which this rectangle was found.
`height`	Height of rectangle.
`width`	Width of rectangle.
`x0`	Distance of left side of rectangle from left side of page.
`x1`	Distance of right side of rectangle from left side of page.
`y0`	Distance of bottom of rectangle from bottom of page.
`y1`	Distance of top of rectangle from bottom of page.
`top`	Distance of top of rectangle from top of page.
`bottom`	Distance of bottom of the rectangle from top of page.
`doctop`	Distance of top of rectangle from top of document.
`linewidth`	Thickness of line.
`stroking_color`	The color of the rectangle's outline. See docs/colors.md for details.
`non_stroking_color`	The rectangle’s fill color. See docs/colors.md for details.
`mcid`	The marked content section ID for this rect if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this rect if any (otherwise `None`). Experimental attribute.
`object_type`	"rect"
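
The coordinate properties shared by `char`, `line`, and `rect` objects are related by simple arithmetic: `y0` and `y1` are measured from the bottom of the page, while `top` and `bottom` are measured from the top. The sketch below shows that relationship under stated assumptions: the page's origin is at (0, 0), the page is not rotated, and `doctop` is `top` plus the combined height of the preceding pages. `add_top_based_coords` and `doc_offset` are hypothetical names, and the sample values are invented.

```python
def add_top_based_coords(obj, page_height, doc_offset=0.0):
    """Derive top-based and size properties from x0/x1/y0/y1."""
    out = dict(obj)
    out["top"] = page_height - obj["y1"]      # top edge, measured from page top
    out["bottom"] = page_height - obj["y0"]   # bottom edge, measured from page top
    out["width"] = obj["x1"] - obj["x0"]
    out["height"] = obj["y1"] - obj["y0"]
    out["doctop"] = doc_offset + out["top"]   # assumed: offset of earlier pages
    return out


# A 100 x 50 rectangle on the second page of a document with 792-unit-tall pages.
raw_rect = {"x0": 72.0, "x1": 172.0, "y0": 692.0, "y1": 742.0}
rect = add_top_based_coords(raw_rect, page_height=792.0, doc_offset=792.0)

print(rect["top"], rect["bottom"], rect["width"], rect["height"], rect["doctop"])
```

Keeping both coordinate systems on each object means code that works top-down (as most table and text logic does) never has to know the page height.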

`curve` properties

Property	Description
`page_number`	Page number on which this curve was found.
`pts`	A list of `(x, top)` tuples indicating the points on the curve.
`path`	A list of `(cmd, (x, top))` tuples describing the full path description*, including (for example) control points used in Bezier curves.
`height`	Height of curve's bounding box.
`width`	Width of curve's bounding box.
`x0`	Distance of curve's left-most point from left side of page.
`x1`	Distance of curve's right-most point from left side of the page.
`y0`	Distance of curve's lowest point from bottom of page.

Core symbols most depended-on inside this repo

pdfplumber/structure.py

decode_text

called by 16

pdfplumber/utils/pdfinternals.py

Shape

Method 366

Function 75

Class 46

Languages

Python 100%

Modules by API surface

pdfplumber/page.py · 59 symbols

tests/test_utils.py · 41 symbols

pdfplumber/table.py · 33 symbols

pdfplumber/utils/text.py · 30 symbols

tests/test_issues.py · 27 symbols

tests/test_basics.py · 26 symbols

pdfplumber/display.py · 25 symbols

tests/test_structure.py · 22 symbols

pdfplumber/structure.py · 22 symbols

pdfplumber/container.py · 22 symbols

pdfplumber/utils/geometry.py · 21 symbols

tests/test_convert.py · 20 symbols

Used by 14 indexed graphs manifest dependencies, hub-wide

github.com/Alibaba-NLP/DeepResearch

github.com/PKU-YuanGroup/Helios

github.com/Yuan1z0825/nature-skills

github.com/anthropics/knowledge-work-plugins

github.com/crewAIInc/crewAI

github.com/datawhalechina/hello-agents

github.com/frappe/erpnext

github.com/infiniflow/ragflow

github.com/microsoft/markitdown

github.com/netease-youdao/QAnything

… +4 more

Dependencies from manifests, versioned

Pillow 12.2.0 · 1×

black 26.5.1 · 1×

flake8 7.3.0 · 1×

isort 8.0.1 · 1×

jupyterlab 4.5.8 · 1×

mypy 2.1.0 · 1×

nbexec 0.2.0 · 1×

pandas 2.3.3 · 1×

pandas-stubs 2.3.3.260113 · 1×

pdfminer.six 20260107 · 1×

py 1.11.0 · 1×

pypdfium2 5.9.0 · 1×

For agents

$ claude mcp add pdfplumber \
  -- python -m otcore.mcp_server <graph>
