
github.com/coder-hxl/x-crawl @v10.1.0 sqlite

159 symbols · 405 edges · 41 files · 0 documented (0%)
README

x-crawl

x-crawl is a flexible Node.js AI-assisted crawler library. Its flexible usage and powerful AI assistance make crawling more efficient, intelligent, and convenient.

It consists of two parts:

  • Crawler: A crawler API plus supporting functions that work normally even without AI.
  • AI: Integrates Ollama and OpenAI so that AI can take over many tedious operations.

If you find x-crawl helpful, or you simply like it, you can give the x-crawl repository a star on GitHub. Your support is the driving force behind our continuous improvement. Thank you!

Features

  • 🤖 AI Assistance - Integrates Ollama and OpenAI; powerful AI assistance makes crawling more efficient, intelligent, and convenient.
  • 🖋️ Flexible writing - A single crawling API accepts multiple configuration styles, each with its own advantages.
  • ⚙️ Multiple uses - Supports crawling dynamic pages, static pages, interface data, and file data.
  • ⚒️ Control page - Dynamic page crawling supports automated operations, keyboard input, event operations, etc.
  • 👀 Device Fingerprinting - Zero configuration or custom configuration to avoid being identified and tracked across locations through fingerprinting.
  • 🔥 Asynchronous/Synchronous - Switch between asynchronous and synchronous crawling modes without changing the crawling API.
  • ⏱️ Interval crawling - No interval, a fixed interval, or a random interval; you decide whether to crawl with high concurrency.
  • 🔄 Failed Retry - Customize the number of retries so that temporary problems do not cause a crawl to fail.
  • ➡️ Rotation proxy - Automatic proxy rotation combined with failed retries, with customizable error counts and HTTP status codes.
  • 🚀 Priority Queue - A crawl target with a higher priority is crawled ahead of other targets.
  • 🧾 Crawl information - Controllable crawl information, output as colored text in the terminal.
  • 🦾 TypeScript - Ships its own types and implements complete typing through generics.
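
To make the interval, retry, and priority features more concrete, here is a minimal sketch of the general techniques in plain TypeScript. It is not x-crawl's implementation: `resolveInterval`, `orderByPriority`, and `withRetry` are hypothetical helper names, and the `IntervalTime` shape is only modeled on the `intervalTime: { max, min }` option used in the example later in this README.

```typescript
// Hypothetical helpers for illustration only; these are not x-crawl internals.
// The shape mirrors the `intervalTime` option shown in this README's example.
type IntervalTime = number | { max: number; min?: number }

// Resolve a delay in milliseconds: undefined means no interval,
// a number is a fixed interval, and an object is a random interval in [min, max].
function resolveInterval(intervalTime?: IntervalTime): number {
  if (intervalTime === undefined) return 0
  if (typeof intervalTime === 'number') return intervalTime
  const min = intervalTime.min ?? 0
  return min + Math.floor(Math.random() * (intervalTime.max - min + 1))
}

// Higher priority first. Array.prototype.sort is stable, so targets with
// equal priority keep their original order.
function orderByPriority<T extends { priority?: number }>(targets: T[]): T[] {
  return [...targets].sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0))
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Run a task, retrying up to `maxRetry` additional times and pausing
// between attempts. The last error is rethrown if every attempt fails.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetry: number,
  intervalTime?: IntervalTime
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      if (attempt < maxRetry) await sleep(resolveInterval(intervalTime))
    }
  }
  throw lastError
}
```

With these helpers, `withRetry(() => fetch(url), 3, { max: 2000, min: 1000 })` would make up to four attempts with a random one-to-two-second pause between them. In x-crawl itself the equivalent behavior is configured through options rather than written by hand.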

AI assisted crawler

With the rapid development of network technology, websites are updated more frequently, and changes to class names or page structure pose considerable challenges for crawlers that rely on those elements. Against this background, crawlers combined with AI technology have become a powerful way to meet this challenge.

First, changes to class names or structure after a website update can cause traditional crawling strategies to fail, because such crawlers usually locate and extract information through fixed class names or structures. Once these elements change, the crawler may no longer find the required data, which hurts both the effectiveness and the accuracy of the crawl.

Crawlers combined with AI technology cope with such changes better. AI can understand and parse the semantic information of a web page through natural language processing and other technologies, and so extract the required data more accurately.

In summary, crawlers combined with AI technology handle class name or structure changes after website updates better than crawlers that depend on fixed selectors.
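
The failure mode can be shown with a small sketch. `extractTextByClass` is a hypothetical helper written for this illustration (it is not part of x-crawl), the first class name is taken from the sample HTML later in this README, and the "after update" class names are invented.

```typescript
// Hypothetical extractor: returns the text of the first element whose class
// attribute contains `className`. It assumes `className` contains no regex
// metacharacters and that the element holds plain text only.
function extractTextByClass(html: string, className: string): string | null {
  const pattern = new RegExp(
    `<[a-z0-9]+[^>]*class="[^"]*\\b${className}\\b[^"]*"[^>]*>([^<]*)<`,
    'i'
  )
  const match = html.match(pattern)
  return match?.[1]?.trim() ?? null
}

// Same content before and after a site update; only the class names differ.
const beforeUpdate = '<h3 class="t1jojoys n1nue62c">Cabin with hot tub</h3>'
const afterUpdate = '<h3 class="x9f2k1a b7c3d0e">Cabin with hot tub</h3>'

console.log(extractTextByClass(beforeUpdate, 't1jojoys')) // Cabin with hot tub
console.log(extractTextByClass(afterUpdate, 't1jojoys')) // null
```

The content a reader sees is unchanged, yet the selector-based extractor silently returns nothing after the update.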

Example

By combining the crawler with AI, we can instruct them to obtain pictures of highly rated vacation rentals:

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Create a crawler application
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})

// Create AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// crawlPage is used to crawl pages
crawlApp
  .crawlPage('https://www.example.cn/s/select_homes')
  .then(async (res) => {
    const { page, browser } = res.data

    // Wait for the element to appear on the page and get the HTML
    const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
    await page.waitForSelector(targetSelector)
    const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)

    // Let AI obtain image links and remove duplicates
    const srcResult = await crawlOpenAIApp.parseElements(
      highlyHTML,
      `Get the image link, don't source it inside, and de-duplicate it`
    )

    // Wait for the browser to shut down before downloading the files
    await browser.close()

    // crawlFile is used to crawl file resources
    crawlApp.crawlFile({
      targets: srcResult.elements.map((item) => item.src),
      storeDirs: './upload'
    })
  })

[!TIP] You can even send the whole HTML to the AI and let it do the work. Because the full page content is more complex, you need to describe the target location more precisely, and doing so consumes a lot of tokens.

Even if a later update to the website changes the class names or structure, the crawler can still crawl the data normally, because we no longer rely on fixed class names or structure to locate and extract the required information. Instead, the AI understands and parses the semantic information of the web page, extracting the required data more efficiently, intelligently, and conveniently.
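
One way to reduce the token cost mentioned in the tip is to strip markup that carries no meaning before handing the HTML to the AI. The sketch below is an assumption, not an x-crawl feature: `compactHtml` is a hypothetical helper, and its regular expressions are a rough heuristic rather than a real HTML parser.

```typescript
// Hypothetical pre-processing step; not part of x-crawl.
// Regex-based and intentionally rough: good enough to shrink a prompt,
// not a substitute for a real HTML parser.
function compactHtml(html: string): string {
  return html
    .replace(/<!--[\s\S]*?-->/g, '') // drop comments
    .replace(/<(script|style|svg)\b[\s\S]*?<\/\1>/gi, '') // drop non-content elements
    .replace(/\s(?:class|style)="[^"]*"/gi, '') // drop presentational attributes
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .replace(/>\s+</g, '><') // remove whitespace between tags
    .trim()
}

const sample = `<div class="a b"><svg viewBox="0 0 32 32"><path d="M1 2"></path></svg>
  <img src="https://example.com/a.jpeg" class="c" alt="" /></div>`

console.log(compactHtml(sample))
// <div><img src="https://example.com/a.jpeg" alt="" /></div>
```

Dropping `class` attributes fits the premise of this section, since the AI is not expected to rely on class names. Attributes such as `src`, `href`, and `aria-label` are kept because they carry the information the AI needs.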
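
Because the AI's output is not guaranteed to follow the prompt exactly, it can be worth validating the result before passing it to `crawlFile`. The following is a minimal sketch under stated assumptions: `collectImageUrls` is a hypothetical helper, and the `ParsedElement` shape is only inferred from the `srcResult.elements.map((item) => item.src)` call in the example above.

```typescript
// Assumed shape, inferred from `srcResult.elements.map((item) => item.src)`.
interface ParsedElement {
  src?: string
}

// Keep only absolute http(s) URLs and drop duplicates,
// preserving the order in which URLs were first seen.
function collectImageUrls(elements: ParsedElement[]): string[] {
  const seen = new Set<string>()
  for (const { src } of elements) {
    if (typeof src !== 'string') continue
    try {
      const url = new URL(src)
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        seen.add(url.href)
      }
    } catch {
      // Not an absolute URL: skip it rather than hand it to the downloader.
    }
  }
  return Array.from(seen)
}
```

In the example above, this would replace the inline mapping with `targets: collectImageUrls(srcResult.elements)`, so a duplicated or malformed link from the AI does not trigger a wasted download.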

See the HTML that the AI needs to process

It is formatted here for ease of viewing:

```html

        <h2 class="h1436ahv dir dir-ltr">威奇托的高评分度假屋</h2>



          这些房源在位置、干净卫生等方面获得房客的一致好评。

    显示 12 项中的 4 项

      <span class="a8jt5op dir dir-ltr" aria-live="polite"
        >第 1 页,共 3 页</span
      >



        <span class="a8jt5op dir dir-ltr">第 1 页,共 3 页</span>

1 / 3

          <button
            aria-label="上一张"
            type="button"
            class="l1ovpqvx c1e0qvzg dir dir-ltr"
          >
            <span class="ifnd39z dir dir-ltr"
              ><span class="_krjbj">_</span
              ><svg
                xmlns="http://www.w3.org/2000/svg"
                viewBox="0 0 32 32"
                style="display:block;fill:none;height:12px;width:12px;stroke:currentColor;stroke-width:4;overflow:visible"
                aria-hidden="true"
                role="presentation"
                focusable="false"
              >
                <path
                  fill="none"
                  d="M20 28 8.7 16.7a1 1 0 0 1 0-1.4L20 4"
                ></path></svg
            ></span></button
          ><span class="_pog3hg"></span
          ><button
            aria-label="下一张"
            type="button"
            class="l1ovpqvx c1e0qvzg dir dir-ltr"
          >
            <span class="ifnd39z dir dir-ltr"
              ><span class="_krjbj">_</span
              ><svg
                xmlns="http://www.w3.org/2000/svg"
                viewBox="0 0 32 32"
                style="display:block;fill:none;height:12px;width:12px;stroke:currentColor;stroke-width:4;overflow:visible"
                aria-hidden="true"
                role="presentation"
                focusable="false"
              >
                <path
                  fill="none"
                  d="m12 4 11.3 11.3a1 1 0 0 1 0 1.4L12 28"
                ></path></svg
            ></span>
          </button>

        <a
          aria-label="农家乐 | Mulvane"
          class="cbkobxh dir dir-ltr"
          href="/s/homes?dynamic_product_ids%5B%5D=45937791&amp;omni_page_id=36021&amp;place_id=ChIJLRh_0mrbuocRPj3TdL_VlpM"
          target="_blank"
          rel="noreferrer"
          data-nosnippet="true"
          >

            <img
              src="https://z1.muscache.cn/im/pictures/miso/Hosting-45937791/original/c67d32ed-21eb-4066-8cef-650dcd45bada.jpeg?aki_policy=large"
              loading="lazy"
              alt=""
              class="iotvkpj dir dir-ltr"
          />





            <span class="a8jt5op dir dir-ltr">房客推荐</span>



              房客推荐

            <h3
              class="t1jojoys n1nue62c dir dir-ltr"
              data-testid="listing-card-title"
            >
              农家乐 | Mulvane
            </h3>
            <span class="r4a59j5 dir dir-ltr"
              ><span class="a8jt5op dir dir-ltr"
                >平均评分 4.98 分(满分 5 分),共 168 条评价</span
              ><svg
                xmlns="http://www.w3.org/2000/svg"
                viewBox="0 0 32 32"
                style="display:block;height:12px;width:12px;fill:currentColor"
                aria-hidden="true"
                role="presentation"
                focusable="false"
              >
                <path
                  fill-rule="evenodd"
                  d="m15.1 1.58-4.13 8.88-9.86 1.27a1 1 0 0 0-.54 1.74l7.3 6.57-1.97 9.85a1 1 0 0 0 1.48 1.06l8.62-5 8.63 5a1 1 0 0 0 1.48-1.06l-1.97-9.85 7.3-6.57a1 1 0 0 0-.55-1.73l-9.86-1.28-4.12-8.88a1 1 0 0 0-1.82 0z"
                ></path></svg
              ><span aria-hidden="true" class="ru0q88m dir dir-ltr"
                >4.98 (168)</span
              ></span
            >



              带私人热水浴缸和早餐的小木屋度假屋3






              Stay in one of our three private cabins, each equipped with
              its own two person hot tub on the back deck for you to enjoy
              under the stars. You will also have breakfast delivered to
              your cabin at the time you choose in the morning(s). Each
              cabin offers a pillow-top Queen bed, mini-fridge, microwave,
              coffee maker, thermostat-controlled gas fireplace, A/C unit,
              shower, cable TV, and a DVD player. All of that situated on
              24+ beautiful acres complete with a pond and walking paths
              through the woods.

8月8日至15日

              <span aria-hidden="true" class="piy2wzv dir dir-ltr"
                >¥1,170</span
              ><span aria-hidden="true">&nbsp;\x3C!-- -->/晚</span
              ><span class="s14ffc1j dir dir-ltr">每晚 ¥1,170</span>

        <a
          aria-label="Loft | 威奇托(Wichita)"
          class="cbkobxh dir d
```

Extension points

Exported contracts: how you extend this code.

  • CreateCrawlOpenAIConfig (Interface) · packages/ai/openai.ts · no doc
  • Request (Interface) · packages/crawl/request.ts · no doc
  • CrawlOpenAICommonAPIOtherOption (Interface) · packages/ai/openai.ts · no doc
  • ContentConfig (Interface) · packages/crawl/request.ts · no doc
  • CrawlOpenAIRunChatOption (Interface) · packages/ai/openai.ts · no doc
  • DeviceResult (Interface) · packages/crawl/controller.ts · no doc
  • CrawlOpenAIParseElementsContentOptions (Interface) · packages/ai/openai.ts · no doc
  • Device (Interface) · packages/crawl/controller.ts · no doc

Core symbols

Most depended-on inside this repo.

  • isUndefined · called by 35 · packages/shared/general.ts
  • createCrawl · called by 34 · packages/crawl/index.ts
  • isObject · called by 19 · packages/shared/general.ts
  • random · called by 9 · packages/shared/general.ts
  • transformTargetToDetailTargets · called by 8 · packages/crawl/api.ts
  • controller · called by 4 · packages/crawl/controller.ts
  • loaderCommonConfigToCrawlConfig · called by 4 · packages/crawl/api.ts
  • handleResultEssentialOtherValue · called by 4 · packages/crawl/api.ts

Shape

  • Function · 98
  • Interface · 53
  • Method · 8

Languages

  • TypeScript · 100%

Modules by API surface

  • packages/crawl/api.ts · 40 symbols
  • packages/ai/openai.ts · 18 symbols
  • packages/ai/ollama.ts · 17 symbols
  • packages/crawl/types/api.ts · 15 symbols
  • packages/shared/general.ts · 12 symbols
  • test/automation/written/crawlFile.test.ts · 7 symbols
  • test/automation/written/crawlPage.test.ts · 6 symbols
  • test/automation/written/crawlHTML.test.ts · 6 symbols
  • test/automation/written/crawlData.test.ts · 6 symbols
  • packages/crawl/request.ts · 6 symbols
  • packages/crawl/types/index.ts · 4 symbols
  • packages/crawl/controller.ts · 4 symbols

Dependencies

From manifests, versioned.

  • @babel/core 7.26.10 · 1×
  • @babel/preset-env 7.26.9 · 1×
  • @rollup/plugin-babel 6.0.4 · 1×
  • @rollup/plugin-run 3.1.0 · 1×
  • @rollup/plugin-terser 0.4.4 · 1×
  • @types/node 20.17.30 · 1×
  • @typescript-eslint/eslint-plugin 8.29.0 · 1×
  • @typescript-eslint/parser 8.29.0 · 1×
  • @vitest/coverage-v8 3.1.1 · 1×
  • @vitest/ui 3.1.1 · 1×
  • chalk 5.4.1 · 1×
devlink: · 1×

For agents

$ claude mcp add x-crawl \
  -- python -m otcore.mcp_server <graph>
