github.com/jhy/jsoup @jsoup-1.22.2 sqlite

3,939 symbols 22,239 edges 192 files 997 documented · 25%
README

jsoup: Java HTML Parser

jsoup is a Java library that makes it easy to work with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.
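
As a minimal sketch of the API described above, the following parses a small hard-coded page and selects the same links once with a CSS selector and once with an XPath expression. The class name, the sample HTML, and the example.com base URI are assumptions made for illustration; `Jsoup.parse`, `select`, `selectXpath`, and `absUrl` are jsoup methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectSketch {
    // Hypothetical input, used only for illustration.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div id=news><a href='/a'>First</a> <a href='/b'>Second</a></div>"
        + "</body></html>";

    // The base URI lets absUrl() resolve relative links.
    static Document parse() {
        return Jsoup.parse(HTML, "https://example.com/");
    }

    static String title() {
        return parse().title();
    }

    // Number of links found with a CSS selector.
    static int countCss() {
        return parse().select("#news a").size();
    }

    // Number of links found with an XPath expression.
    static int countXpath() {
        return parse().selectXpath("//div[@id='news']/a").size();
    }

    // Absolute URL of the first link, resolved against the base URI.
    static String firstLink() {
        return parse().select("#news a").first().absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(title());
        System.out.println(countCss() + " links via CSS, " + countXpath() + " via XPath");
        System.out.println(firstLink());
    }
}
```

Both selector styles return the matching elements in document order, so either can be used depending on which query language is more familiar.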

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe-list, to prevent XSS attacks
  • output tidy HTML
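
The manipulation and cleaning items in the list above can be sketched as follows. This is an illustration under stated assumptions, not a recommended security policy: the class and method names are hypothetical, and `Safelist.basic()` is just one of the built-in safelists.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CleanSketch {
    // Hypothetical helper: drops every tag and attribute that the
    // basic safelist does not permit (for example script and onclick).
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // Hypothetical helper: parses a body fragment, edits the attributes
    // of every link, and returns the resulting body HTML.
    static String markLinks(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("a").attr("rel", "nofollow").addClass("external");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String dirty = "<p>Hi<script>alert(1)</script> "
            + "<a href='http://example.com/' onclick='x()'>link</a></p>";
        System.out.println(sanitize(dirty));
        System.out.println(markLinks("<a href='/x'>x</a>"));
    }
}
```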

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup, and will create a sensible parse tree in each case.
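
To illustrate the point about tag soup, here is a small sketch that feeds jsoup a fragment with unclosed tags and no html, head, or body elements. The class name and the input string are assumptions chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TagSoupSketch {
    // Hypothetical malformed input: no <html>/<body>, unclosed <p> and <b>.
    static final String SOUP = "<p>One<p>Two<b>bold";

    // The second <p> implicitly closes the first, so two paragraphs result.
    static int paragraphCount() {
        return Jsoup.parse(SOUP).select("p").size();
    }

    // The unclosed <b> is closed at the end of input.
    static String boldText() {
        return Jsoup.parse(SOUP).select("b").text();
    }

    // The parser supplies the missing head and body elements.
    static boolean hasHeadAndBody() {
        Document doc = Jsoup.parse(SOUP);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        // Print the repaired tree as jsoup serializes it.
        System.out.println(Jsoup.parse(SOUP).outerHtml());
    }
}
```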

See jsoup.org for downloads and the full API documentation.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:

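import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// log(...) is a small printf-style helper, not part of jsoup.
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
// Select the links inside bold text in the "In the News" box.
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s",
    headline.attr("title"), headline.absUrl("href"));
}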

Online sample, full source.

Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Getting started

  1. Download the latest jsoup jar (or add it to your Maven/Gradle build)
  2. Read the cookbook
  3. Enjoy!

Android support

When used in Android projects, core library desugaring with the NIO specification should be enabled to support Java 8+ features.

Development and support

If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via jsoup Discussions.

If you find any issues, please file a bug after checking for duplicates.

The colophon talks about the history of and tools used to build jsoup.

Status

jsoup is in general release and is stable.

Author

jsoup was created and is maintained by Jonathan Hedley, its primary author.

jsoup is an open-source project, and many contributors have helped improve it over the years. You can see their contributions and join the development on GitHub.

Citing jsoup

If you use jsoup in research or technical documentation, you can cite it as:

Jonathan Hedley & jsoup contributors. jsoup: Java HTML Parser (2009–present). Available at: https://jsoup.org

@misc{jsoup,
  author = {Jonathan Hedley and jsoup contributors},
  title = {jsoup: Java HTML Parser},
  year = {2025},
  url = {https://jsoup.org}
}

Extension points (exported contracts: how you extend this code)

RouteHandler (Interface)
Handles one routed test request on the Netty-backed origin server [8 implementers]
src/test/java/org/jsoup/integration/netty/RouteHandler.java
NodeVisitor (Interface)
Node visitor interface, used to walk the DOM and visit each node. Execute via #traverse(Node) or {@link Node#tr [9 implementers]
src/main/java/org/jsoup/select/NodeVisitor.java
Connection (Interface)
The Connection interface is a convenient HTTP client and session object to fetch content from the web, and parse them i [2 …
src/main/java/org/jsoup/Connection.java
Matcher (Interface)
(no doc) [4 implementers]
src/main/java/org/jsoup/helper/Regex.java
NodeFilter (Interface)
A controllable Node visitor interface. Execute via #traverse(Node). This interface provides two methods, { [2 implementers]
src/main/java/org/jsoup/select/NodeFilter.java
RequestAuthenticator (Interface)
A RequestAuthenticator is used in Connection to authenticate if required to proxies and web servers. Se [1 implementer]
src/main/java/org/jsoup/helper/RequestAuthenticator.java
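
As a hedged illustration of one extension point listed above, the sketch below implements NodeVisitor to record the tag name and depth of every element while walking a parsed fragment. The class name, the helper method, and the output format are assumptions; the walk is started with `traverse`, as the NodeVisitor description suggests.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

public class VisitorSketch {
    // Hypothetical helper: returns "depth:tag" for each element under <body>,
    // in the order the traversal first reaches it.
    static List<String> tagPath(String html) {
        List<String> seen = new ArrayList<>();
        NodeVisitor visitor = new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is first reached, before its children.
                if (node instanceof Element) {
                    seen.add(depth + ":" + ((Element) node).tagName());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Called after all children were visited; unused here.
            }
        };
        Jsoup.parseBodyFragment(html).body().traverse(visitor);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(tagPath("<div><p>Hi</p></div>"));
    }
}
```

Text nodes are also passed to `head`, which is why the sketch filters on `Element` before recording anything.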

Core symbols (most depended-on inside this repo)

parse
called by 1051
src/main/java/org/jsoup/Jsoup.java
get
called by 643
src/main/java/org/jsoup/Connection.java
append
called by 393
src/main/java/org/jsoup/internal/QuietAppendable.java
body
called by 317
src/main/java/org/jsoup/nodes/Document.java
transition
called by 259
src/main/java/org/jsoup/parser/Tokeniser.java
error
called by 189
src/main/java/org/jsoup/parser/Tokeniser.java
of
called by 186
src/main/java/org/jsoup/nodes/Range.java
first
called by 183
src/main/java/org/jsoup/select/Nodes.java

Shape

Method 3,613
Class 300
Interface 14
Enum 12

Languages

Java 100%

Modules by API surface

src/test/java/org/jsoup/nodes/ElementTest.java · 248 symbols
src/test/java/org/jsoup/parser/HtmlParserTest.java · 225 symbols
src/main/java/org/jsoup/select/Evaluator.java · 185 symbols
src/test/java/org/jsoup/select/SelectorTest.java · 142 symbols
src/main/java/org/jsoup/nodes/Element.java · 136 symbols
src/main/java/org/jsoup/helper/HttpConnection.java · 104 symbols
src/main/java/org/jsoup/parser/HtmlTreeBuilder.java · 85 symbols
src/main/java/org/jsoup/parser/Token.java · 84 symbols
src/main/java/org/jsoup/Connection.java · 82 symbols
src/main/java/org/jsoup/nodes/Node.java · 73 symbols
src/test/java/org/jsoup/integration/ConnectTest.java · 72 symbols
src/test/java/org/jsoup/select/ElementsTest.java · 57 symbols

Dependencies (from manifests, versioned)

io.netty:netty-bom 4.2.12.Final · 1×
org.jsoup:jsoup 1.21.2 · 1×
org.junit.jupiter:junit-jupiter 5.14.3 · 1×

For agents

$ claude mcp add jsoup \
  -- python -m otcore.mcp_server <graph>