This project crawls proxy IP resources from the Internet. The goal is to provide an anonymous IP proxy pool with high availability and low latency for distributed spiders.
- Set `REDIS_HOST`, `REDIS_PASSWORD` and `SPLASH_URL` in config/settings.py
- Install dependencies

      pip install -r requirements.txt

- Start the crawler and the validator

      python crawler_booter.py --usage crawler
      python crawler_booter.py --usage validator

- Start task scheduler, including crawler task scheduler and validator task scheduler

      python scheduler_booter.py --usage crawler
      python scheduler_booter.py --usage validator
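The four start commands above can also be launched from a single Python script. The following is only a sketch: `build_commands` and `start_all` are hypothetical helper names that are not part of haipproxy, and the script is assumed to run from the project root.

```python
import subprocess
import sys

# Script names and --usage values are taken from the commands above.
BOOTERS = ("crawler_booter.py", "scheduler_booter.py")
USAGES = ("crawler", "validator")


def build_commands(python=sys.executable):
    """Return one argv list per process to start (hypothetical helper)."""
    return [[python, script, "--usage", usage]
            for script in BOOTERS
            for usage in USAGES]


def start_all():
    """Start every process in the background and return the Popen handles."""
    return [subprocess.Popen(cmd) for cmd in build_commands()]


if __name__ == "__main__":
    # Only print the commands; call start_all() to actually launch them.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Keeping the command list in a pure function makes it easy to inspect before anything is started.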
haipproxy provides both a Python client and a Squid proxy for your spiders. Clients in any other language are welcome!
from client.py_cli import ProxyFetcher
# args are used to connect to redis; if args is None, the redis args in settings.py will be used
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# https is used for common proxy. If you want to crawl a customized website, you'd better
# write a customized ip validator modeled on the zhihu validator
fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
# get one proxy ip
print(fetcher.get_proxy())
# get available proxy ip list
print(fetcher.get_proxies()) # or print(fetcher.pool)
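A fetched proxy is usually handed straight to an HTTP client. The sketch below assumes that `fetcher.get_proxy()` returns a proxy URL string such as 'http://1.2.3.4:8080'; that return format is an assumption and should be checked against the client. `to_requests_proxies` is a hypothetical helper, not part of haipproxy.

```python
def to_requests_proxies(proxy_url):
    """Map one proxy URL to the `proxies` dict accepted by requests.get().

    Assumes proxy_url looks like 'http://host:port'.
    """
    if not proxy_url:
        raise ValueError("no proxy available")
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    # Placeholder value standing in for fetcher.get_proxy()
    print(to_requests_proxies("http://1.2.3.4:8080"))
```

With a real fetcher this would be used as `requests.get(url, proxies=to_requests_proxies(fetcher.get_proxy()), timeout=10)`.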
- Install squid, allow access in its conf, back up the conf and start the squid service

      sudo apt-get install squid
      sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf
      sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup
      sudo service squid start

- Change `SQUID_BIN_PATH`, `SQUID_CONF_PATH` and `SQUID_TEMPLATE_PATH` in config/settings.py according to your OS
- Update squid conf periodically

      sudo python squid_update.py

- After a while, you can send requests through squid. The proxy URL is 'http://squid_host:3128', e.g.

      import requests
      proxies = {'https': 'http://127.0.0.1:3128'}
      resp = requests.get('https://httpbin.org/ip', proxies=proxies)
      print(resp.text)
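If requests is not installed, the same squid endpoint can be reached with the standard library. This is a minimal sketch: `squid_proxy_url`, `squid_opener` and `fetch` are hypothetical helpers, and the default host and port simply mirror the squid example above.

```python
import urllib.request


def squid_proxy_url(host="127.0.0.1", port=3128):
    """Build the proxy URL for a squid instance."""
    return "http://{}:{}".format(host, port)


def squid_opener(host="127.0.0.1", port=3128):
    """Return a urllib opener that sends http and https traffic via squid."""
    url = squid_proxy_url(host, port)
    handler = urllib.request.ProxyHandler({"http": url, "https": url})
    return urllib.request.build_opener(handler)


def fetch(url, opener=None, timeout=10):
    """Fetch a URL through squid; needs a running squid instance."""
    opener = opener or squid_opener()
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode()
```

Calling `fetch('https://httpbin.org/ip')` should then print the address of the upstream proxy rather than your own, provided squid is running and its conf has been updated.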
Install Docker
Install docker-compose
pip install -U docker-compose
Change `SPLASH_URL` and `REDIS_HOST` in settings.py
SPLASH_URL = 'http://splash:8050'
REDIS_HOST = 'redis'
Start all the containers using docker-compose
docker-compose up
Use py_cli or Squid to get available proxy IPs.
from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
print(fetcher.get_proxy())
print(fetcher.get_proxies()) # or print(fetcher.pool)
or
import requests
proxies = {'https': 'http://127.0.0.1:3128'}
resp = requests.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
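In the Docker setup above, `SPLASH_URL` and `REDIS_HOST` point at the container names, while a local run uses different values. One way to avoid editing settings.py for every environment is to read both values from environment variables. This sketches an alternative and is not how the project's settings.py is written; `load_endpoints` is a hypothetical helper that falls back to the Docker values.

```python
import os


def load_endpoints(env=None):
    """Read SPLASH_URL and REDIS_HOST from the environment.

    Falls back to the docker-compose service names used above.
    """
    env = os.environ if env is None else env
    return {
        "SPLASH_URL": env.get("SPLASH_URL", "http://splash:8050"),
        "REDIS_HOST": env.get("REDIS_HOST", "redis"),
    }


if __name__ == "__main__":
    print(load_endpoints())
```

For a local run you would export `REDIS_HOST=127.0.0.1` before starting the workers instead of changing the file.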

proxy_mode=0 in both gfw_spider.py and ajax_gfw_spider.py.
If you don't want to crawl some websites, set `enable=0` in rules.py.

Here are test results for crawling https://zhihu.com using haipproxy. Source code can be seen here
| requests | time | cost | strategy | client |
|---|---|---|---|---|
| 0 | 2018/03/03 22:03 | 0 | greedy | py_cli |
| 10000 | 2018/03/03 23:03 | 1 hour | greedy | py_cli |
| 20000 | 2018/03/04 00:08 | 2 hours | greedy | py_cli |
| 30000 | 2018/03/04 01:02 | 3 hours | greedy | py_cli |
| 40000 | 2018/03/04 02:15 | 4 hours | greedy | py_cli |
| 50000 | 2018/03/04 03:03 | 5 hours | greedy | py_cli |
| 60000 | 2018/03/04 05:18 | 7 hours | greedy | py_cli |
| 70000 | 2018/03/04 07:11 | 9 hours | greedy | py_cli |
| 80000 | 2018/03/04 08:43 | 11 hours | greedy | py_cli |
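The table can be turned into throughput figures with a few lines of arithmetic. The sketch below copies the request counts and the rounded hour values from the table, so the rates are approximate; `interval_rates` and `overall_rate` are hypothetical helper names.

```python
# (total requests, elapsed hours) copied from the table above
RESULTS = [
    (0, 0), (10000, 1), (20000, 2), (30000, 3), (40000, 4),
    (50000, 5), (60000, 7), (70000, 9), (80000, 11),
]


def interval_rates(results):
    """Requests per hour between consecutive rows."""
    rates = []
    for (r0, h0), (r1, h1) in zip(results, results[1:]):
        rates.append((r1 - r0) / (h1 - h0))
    return rates


def overall_rate(results):
    """Average requests per hour over the whole run."""
    total_requests, total_hours = results[-1]
    return total_requests / total_hours


if __name__ == "__main__":
    print(interval_rates(RESULTS))
    print(round(overall_rate(RESULTS)))
```

On these numbers the crawl runs at roughly 10000 requests per hour for the first five hours and roughly 5000 per hour afterwards, or about 7273 per hour overall.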
Thanks to all the contributors of the following projects.
$ claude mcp add haipproxy \
-- python -m otcore.mcp_server <graph>