
github.com/SpiderClub/haipproxy @v0.1 sqlite

126 symbols 528 edges 34 files 36 documented · 29%
README

HAipproxy


This project crawls proxy IP resources from the Internet. Its goal is to provide an anonymous IP proxy pool with high availability and low latency for distributed spiders.

Features

  • High-performance distributed crawlers, powered by scrapy and redis
  • Large-scale proxy IP resources
  • HA design for both crawlers and schedulers
  • Flexible architecture with task routing
  • Support for HTTP/HTTPS and Socks5 proxies
  • MIT license. Feel free to do whatever you want

Quick start

Standalone

Server

  • Install Python3 and Redis Server
  • Change the redis args in the project's config/settings.py according to your redis conf, such as REDIS_HOST and REDIS_PASSWORD
  • Install scrapy-splash and change SPLASH_URL in config/settings.py
  • Install dependencies

    pip install -r requirements.txt

  • Start the scrapy workers, including the ip proxy crawler and the validator

    python crawler_booter.py --usage crawler
    python crawler_booter.py --usage validator

  • Start the task schedulers, including the crawler task scheduler and the validator task scheduler

    python scheduler_booter.py --usage crawler
    python scheduler_booter.py --usage validator
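
Since the workers and schedulers above all depend on the Redis server configured in config/settings.py, it can help to confirm that the server is reachable before booting them. The following is a minimal sketch using only the standard library. The function name `redis_reachable` is hypothetical and not part of haipproxy, and the host and port shown are placeholders for your own REDIS_HOST and Redis port.

```python
import socket


def redis_reachable(host: str = "127.0.0.1", port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only checks that something is listening on the port. It does not
    authenticate and does not verify that the listener is really Redis.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder values; use the REDIS_HOST you set in config/settings.py.
    print(redis_reachable("127.0.0.1", 6379))
```

If this prints False, the crawler and scheduler processes are unlikely to start correctly, so checking the redis args first may save some debugging.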

Client

haipproxy provides both a Python client and a squid proxy for your spiders. Clients for any other language are welcome!

Python Client

from client.py_cli import ProxyFetcher
# args are used to connect to redis; if args is None, the redis args in settings.py will be used
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'https' is used for common proxies. If you want to crawl a customized website, you'd better
# write a customized ip validator, following the zhihu validator
fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
# get one proxy ip
print(fetcher.get_proxy())
# get available proxy ip list
print(fetcher.get_proxies()) # or print(fetcher.pool)
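
Proxies from a public pool fail often, so a spider usually wants to retry a request through another proxy. The sketch below shows one way to do that around the client above. It assumes that `get_proxy()` returns a proxy URL string, or a falsy value when the pool is empty; the helper names `fetch_via_proxies` and `send_with_requests` are hypothetical and not part of haipproxy.

```python
from typing import Callable, Optional


def fetch_via_proxies(get_proxy: Callable[[], Optional[str]],
                      send: Callable[[str, str], str],
                      url: str,
                      max_tries: int = 3) -> str:
    """Request `url` through up to `max_tries` proxies; return the first body received."""
    last_error: Optional[Exception] = None
    for _ in range(max_tries):
        proxy = get_proxy()
        if not proxy:
            break  # assumption: a falsy value means the pool is empty
        try:
            return send(url, proxy)
        except OSError as exc:  # requests' exceptions derive from OSError
            last_error = exc
    raise RuntimeError("no working proxy for %s" % url) from last_error


def send_with_requests(url: str, proxy: str) -> str:
    """Send one GET through `proxy` with requests (imported lazily)."""
    import requests
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    resp.raise_for_status()
    return resp.text
```

With the `fetcher` created above, a call might look like `fetch_via_proxies(fetcher.get_proxy, send_with_requests, 'https://httpbin.org/ip')`. Passing the sender as a parameter keeps the retry loop independent of the HTTP library.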

Using squid as proxy server

  • Install squid, copy its conf as a backup and then start squid, taking Ubuntu as an example

    sudo apt-get install squid
    sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf
    sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup
    sudo service squid start

  • Change SQUID_BIN_PATH, SQUID_CONF_PATH and SQUID_TEMPLATE_PATH in config/settings.py according to your OS

  • Update the squid conf periodically

    sudo python squid_update.py

  • After a while, you can send requests through squid. The proxy URL is 'http://squid_host:3128', e.g.

    import requests
    proxies = {'https': 'http://127.0.0.1:3128'}
    resp = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(resp.text)
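
The README does not show what squid_update.py writes into the squid conf. One common way to make squid forward traffic through a pool of upstream proxies is a set of `cache_peer` directives, and the sketch below illustrates that idea only. The function name `build_cache_peers`, the peer options chosen and the sample addresses are assumptions, not haipproxy's actual template.

```python
from urllib.parse import urlparse


def build_cache_peers(proxy_urls):
    """Turn proxy URLs like 'http://1.2.3.4:8080' into squid cache_peer lines."""
    lines = []
    for index, proxy_url in enumerate(proxy_urls):
        parsed = urlparse(proxy_url)
        if not parsed.hostname or not parsed.port:
            continue  # skip entries without an explicit host and port
        lines.append(
            "cache_peer {host} parent {port} 0 no-query round-robin name=proxy-{n}".format(
                host=parsed.hostname, port=parsed.port, n=index
            )
        )
    # Force every request through a peer instead of letting squid go direct.
    lines.append("never_direct allow all")
    return lines
```

Regenerating such lines from the current pool and reloading squid is presumably why the update has to run periodically: the set of usable proxies changes over time.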

Dockerize

  • Install Docker

  • Install docker-compose

    pip install -U docker-compose

  • Change SPLASH_URL and REDIS_HOST in settings.py

    SPLASH_URL = 'http://splash:8050'
    REDIS_HOST = 'redis'

  • Start all the containers using docker-compose

    docker-compose up

  • Use py_cli or Squid to get available proxy ips.

    from client.py_cli import ProxyFetcher
    args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
    fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
    print(fetcher.get_proxy())
    print(fetcher.get_proxies()) # or print(fetcher.pool)

    or

    import requests
    proxies = {'https': 'http://127.0.0.1:3128'}
    resp = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(resp.text)

WorkFlow

Other important things

  • This project is highly dependent on redis. If you want to replace redis with another mq or database, do so at your own risk
  • If there is no Great Firewall in your country, set proxy_mode=0 in both gfw_spider.py and ajax_gfw_spider.py. If you don't want to crawl some websites, set enable=0 in rules.py
  • Because of the Great Firewall in China, some proxy IPs can't be used to crawl certain websites such as Google. You can extend the proxy pool yourself in spiders
  • Issues and PRs are welcome
  • Just star it if it's useful to you

Test Result

Here are the test results for crawling https://zhihu.com using haipproxy. The source code can be seen here

requests  time              cost      strategy  client
0         2018/03/03 22:03  0         greedy    py_cli
10000     2018/03/03 11:03  1 hour    greedy    py_cli
20000     2018/03/04 00:08  2 hours   greedy    py_cli
30000     2018/03/04 01:02  3 hours   greedy    py_cli
40000     2018/03/04 02:15  4 hours   greedy    py_cli
50000     2018/03/04 03:03  5 hours   greedy    py_cli
60000     2018/03/04 05:18  7 hours   greedy    py_cli
70000     2018/03/04 07:11  9 hours   greedy    py_cli
80000     2018/03/04 08:43  11 hours  greedy    py_cli
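
To make the table easier to read as throughput, the short calculation below derives average request rates from its "requests" and "cost" columns. It is only arithmetic on the published numbers; the names `RESULTS` and `rate_per_hour` are introduced here for illustration.

```python
# (cumulative requests, elapsed hours) copied from the table above
RESULTS = [
    (0, 0), (10000, 1), (20000, 2), (30000, 3), (40000, 4),
    (50000, 5), (60000, 7), (70000, 9), (80000, 11),
]


def rate_per_hour(start, end):
    """Average requests per hour between two (requests, hours) checkpoints."""
    (req0, h0), (req1, h1) = start, end
    return (req1 - req0) / (h1 - h0)


overall = rate_per_hour(RESULTS[0], RESULTS[-1])
first_five_hours = rate_per_hour(RESULTS[0], RESULTS[5])
last_six_hours = rate_per_hour(RESULTS[5], RESULTS[-1])

print(round(overall), round(first_five_hours), round(last_six_hours))
```

This prints `7273 10000 5000`: about 10000 requests per hour over the first five hours, 5000 per hour over the last six, and an overall average of roughly 7273 per hour, or about 2 requests per second.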

Reference

Thanks to all the contributors of the following projects.

dungproxy

proxyspider

ProxyPool

proxy_pool

ProxyPool

IPProxyTool

IPProxyPool

proxy_list

proxy_pool

Core symbols most depended-on inside this repo

get
called by 45
examples/zhihu/crawler.py
exists
called by 30
crawler/spiders/base.py
parse_common
called by 18
crawler/spiders/base.py
construct_proxy_url
called by 12
crawler/spiders/base.py
get_redis_conn
called by 9
utils/redis_util.py
procotol_extractor
called by 7
crawler/spiders/base.py
proxy_feedback
called by 3
client/py_cli.py
acquire_lock
called by 3
utils/redis_util.py

Shape

Method 76
Class 38
Function 12

Languages

Python 100%

Modules by API surface

crawler/redis_spiders.py 17 symbols
scheduler/scheduler.py 16 symbols
client/py_cli.py 16 symbols
crawler/pipelines.py 11 symbols
crawler/spiders/base.py 8 symbols
crawler/middlewares.py 8 symbols
crawler/validators/httpbin.py 6 symbols
crawler/validators/base.py 6 symbols
examples/zhihu/zhihu_spider.py 5 symbols
crawler/spiders/ajax_gfw_spider.py 5 symbols
crawler/spiders/gfw_spider.py 4 symbols
crawler/items.py 4 symbols

Dependencies from manifests, versioned

Scrapy 1.5.0 · 1×
click 6.7 · 1×
redis 2.10.6 · 1×
requests 2.18.4 · 1×
schedule 0.5.0 · 1×
scrapy-splash 0.7.2 · 1×
service-identity 17.0.0 · 1×

For agents

$ claude mcp add haipproxy \
  -- python -m otcore.mcp_server <graph>
