github.com/SpiderClub/haipproxy @v0.1 sqlite

126 symbols · 528 edges · 34 files · 36 documented (29%)

README

HAipproxy

This project crawls proxy IP resources from the Internet. Our goal is to provide an anonymous IP proxy pool with high availability and low latency for distributed spiders.

Features

High-performance distributed crawlers, powered by scrapy and redis
Large-scale proxy IP resources
HA design for both crawlers and schedulers
Flexible architecture with task routing
Support for HTTP/HTTPS and Socks5 proxies
MIT license. Feel free to do whatever you want

Quick start

Standalone

Server

Install Python3 and Redis Server
Change the redis args in config/settings.py, such as REDIS_HOST and REDIS_PASSWORD, to match your redis conf
Install scrapy-splash and change SPLASH_URL in config/settings.py
Install dependencies

pip install -r requirements.txt

Start the scrapy workers, including the ip proxy crawler and the validator

python crawler_booter.py --usage crawler

python crawler_booter.py --usage validator

Start the task schedulers, including the crawler task scheduler and the validator task scheduler

python scheduler_booter.py --usage crawler

python scheduler_booter.py --usage validator
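
The four start-up commands above can be collected in a small launcher. This is a minimal sketch, assuming it is run from the repository root with the dependencies installed; the helper names build_commands and start_all are hypothetical and not part of haipproxy.

```python
import subprocess
import sys

# The two booter scripts and the two --usage values come from the steps above.
BOOTERS = ("crawler_booter.py", "scheduler_booter.py")
USAGES = ("crawler", "validator")


def build_commands(python=sys.executable):
    """Return the four start-up commands as argument lists."""
    return [[python, booter, "--usage", usage]
            for booter in BOOTERS
            for usage in USAGES]


def start_all():
    """Launch every command as a child process and return the handles."""
    return [subprocess.Popen(cmd) for cmd in build_commands()]
```

Calling start_all() and then wait() on each returned handle keeps the workers and schedulers attached to one terminal. For long-running deployments a process supervisor is likely a better fit.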

Client

haipproxy provides both a Python client and a squid proxy for your spiders. Clients written in any other language are welcome!

Python Client

from client.py_cli import ProxyFetcher
# args are used to connect to redis; if args is None, the redis args in settings.py are used
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'https' is used for common proxies. If you want to crawl a customized website, you'd better
# write a customized ip validator modeled on the zhihu validator
fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
# get one proxy ip
print(fetcher.get_proxy())
# get available proxy ip list
print(fetcher.get_proxies()) # or print(fetcher.pool)
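
A proxy taken from the pool can still fail by the time it is used, so callers usually retry with another one. The following is a minimal sketch of that pattern and is not part of haipproxy: request_with_retry and StubFetcher are hypothetical names, and the only client call assumed is get_proxy() as shown above. Whether get_proxy() returns a different proxy on each call depends on the chosen strategy.

```python
def request_with_retry(fetcher, send, attempts=3):
    """Try `send(proxy)` with up to `attempts` proxies from `fetcher`.

    `fetcher` is any object with a get_proxy() method (for example the
    ProxyFetcher above); `send` is a caller-supplied function that performs
    the request through the given proxy and raises OSError on failure.
    """
    last_error = None
    for _ in range(attempts):
        proxy = fetcher.get_proxy()
        try:
            return send(proxy)
        except OSError as exc:  # connection errors, timeouts, ...
            last_error = exc
    raise RuntimeError("all proxies failed") from last_error


class StubFetcher:
    """Stand-in for ProxyFetcher so the sketch runs without redis."""

    def __init__(self, proxies):
        self._proxies = list(proxies)

    def get_proxy(self):
        return self._proxies.pop(0)
```

With requests, send can simply let requests.exceptions.RequestException propagate, since that exception subclasses OSError.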

Using squid as proxy server

Install squid, copy its conf as a backup, and then start squid. Taking Ubuntu as an example:

sudo apt-get install squid

sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf

sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup

sudo service squid start

Change SQUID_BIN_PATH, SQUID_CONF_PATH and SQUID_TEMPLATE_PATH in config/settings.py according to your OS
Update the squid conf periodically

sudo python squid_update.py

After a while, you can send requests through the squid proxy. The proxy URL is 'http://squid_host:3128', e.g.

import requests
proxies = {'https': 'http://127.0.0.1:3128'}
resp = requests.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
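
If requests is not installed, the same check can be made with the standard library. This is a minimal sketch, assuming squid listens on port 3128 as described above; squid_opener and fetch_ip are hypothetical helper names.

```python
import urllib.request


def squid_opener(host="127.0.0.1", port=3128):
    """Build a urllib opener that routes HTTP and HTTPS through squid."""
    proxy_url = "http://{}:{}".format(host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_ip(opener, url="https://httpbin.org/ip", timeout=10):
    """Return the response body; needs a running squid and network access."""
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_ip(squid_opener()) should print the IP of whichever upstream proxy squid selected, rather than your own.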

Dockerize

Install Docker
Install docker-compose

pip install -U docker-compose
Change SPLASH_URL and REDIS_HOST in settings.py

SPLASH_URL = 'http://splash:8050'
REDIS_HOST = 'redis'
Start all the containers using docker-compose

docker-compose up
Use py_cli or Squid to get available proxy ips.

from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
print(fetcher.get_proxy())
print(fetcher.get_proxies()) # or print(fetcher.pool)

import requests
proxies = {'https': 'http://127.0.0.1:3128'}
resp = requests.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
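
Editing settings.py by hand for Docker and reverting it for standalone runs is easy to get wrong. One option, sketched here as an assumption rather than something the project ships, is to read the two values from environment variables and fall back to local defaults. The function name, the environment variable names and the local default values are hypothetical; the Docker values are the ones given in the steps above.

```python
import os


def load_endpoints(env=None):
    """Return SPLASH_URL and REDIS_HOST, preferring environment overrides."""
    env = os.environ if env is None else env
    return {
        # Local defaults are assumptions for a standalone setup.
        "SPLASH_URL": env.get("SPLASH_URL", "http://127.0.0.1:8050"),
        "REDIS_HOST": env.get("REDIS_HOST", "127.0.0.1"),
    }


# Values a docker-compose service could pass in (taken from the steps above).
docker_env = {"SPLASH_URL": "http://splash:8050", "REDIS_HOST": "redis"}
print(load_endpoints(docker_env))
```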

WorkFlow

Other important things

This project depends heavily on redis. If you want to replace redis with another mq or database, do so at your own risk
If there is no Great Firewall in your country, set proxy_mode=0 in both gfw_spider.py and ajax_gfw_spider.py. If you don't want to crawl some websites, set enable=0 in rules.py
Because of the Great Firewall in China, some proxy ips can't be used to crawl certain websites such as Google. You can extend the proxy pool yourself in spiders
Issues and PRs are welcome
Star it if it's useful to you
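
The enable flag mentioned above can be pictured as a simple filter over rule entries. The rule layout below is a hypothetical simplification for illustration, not the actual contents of rules.py; only the enable key comes from the text.

```python
# Hypothetical rule entries; real entries in rules.py carry more fields.
RULES = [
    {"name": "site-a.example", "enable": 1},
    {"name": "site-b.example", "enable": 0},  # disabled entry
    {"name": "site-c.example", "enable": 1},
]


def enabled_rules(rules):
    """Keep only the rules whose enable flag is set."""
    return [rule for rule in rules if rule.get("enable", 0)]


print([rule["name"] for rule in enabled_rules(RULES)])
```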

Test Result

Here are the test results for crawling https://zhihu.com using haipproxy. The source code is in the repository (see examples/zhihu).

requests	time	elapsed	strategy	client
0	2018/03/03 22:03	0	greedy	py_cli
10000	2018/03/03 23:03	1 hour	greedy	py_cli
20000	2018/03/04 00:08	2 hours	greedy	py_cli
30000	2018/03/04 01:02	3 hours	greedy	py_cli
40000	2018/03/04 02:15	4 hours	greedy	py_cli
50000	2018/03/04 03:03	5 hours	greedy	py_cli
60000	2018/03/04 05:18	7 hours	greedy	py_cli
70000	2018/03/04 07:11	9 hours	greedy	py_cli
80000	2018/03/04 08:43	11 hours	greedy	py_cli
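
The table implies an average throughput, which can be worked out from its first and last rows. A small sketch of the arithmetic follows; the helper name average_rate is hypothetical.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"


def average_rate(first, last):
    """Return (elapsed minutes, requests per minute) between two table rows.

    Each row is a (request count, timestamp string) pair.
    """
    (count_a, time_a), (count_b, time_b) = first, last
    elapsed = datetime.strptime(time_b, FMT) - datetime.strptime(time_a, FMT)
    minutes = elapsed.total_seconds() / 60
    return minutes, (count_b - count_a) / minutes


minutes, rate = average_rate((0, "2018/03/03 22:03"), (80000, "2018/03/04 08:43"))
print(minutes, rate)  # 640.0 minutes, 125.0 requests per minute
```

The run therefore took about 10 hours 40 minutes, which the table rounds to 11 hours, at roughly 125 requests per minute on average.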

Core symbols most depended-on inside this repo

get

called by 45

examples/zhihu/crawler.py

exists

called by 30

crawler/spiders/base.py

parse_common

called by 18

crawler/spiders/base.py

construct_proxy_url

called by 12

crawler/spiders/base.py

Shape

Method 76

Class 38

Function 12

Languages

Python 100%

Modules by API surface

crawler/redis_spiders.py · 17 symbols

scheduler/scheduler.py · 16 symbols

client/py_cli.py · 16 symbols

crawler/pipelines.py · 11 symbols

crawler/spiders/base.py · 8 symbols

crawler/middlewares.py · 8 symbols

crawler/validators/httpbin.py · 6 symbols

crawler/validators/base.py · 6 symbols

examples/zhihu/zhihu_spider.py · 5 symbols

crawler/spiders/ajax_gfw_spider.py · 5 symbols

crawler/spiders/gfw_spider.py · 4 symbols

crawler/items.py · 4 symbols

Dependencies from manifests, versioned

Scrapy 1.5.0 · 1×

click 6.7 · 1×

redis 2.10.6 · 1×

requests 2.18.4 · 1×

schedule 0.5.0 · 1×

scrapy-splash 0.7.2 · 1×

service-identity 17.0.0 · 1×

For agents

$ claude mcp add haipproxy \
  -- python -m otcore.mcp_server <graph>
