# Anti-Scraping: Complete Implementation Code

> For the main flow and decision logic, see SKILL.md; this document covers the in-depth implementation details of Phases 1–9.

## Phase 1: Nginx Front-Line Blocking

```nginx
# /etc/nginx/conf.d/anti-scraping.conf

# Flag scraper, headless-browser, and AI-crawler User-Agents.
# Regex entries are tested in order of appearance, so the search-engine
# allow-list comes first. User-Agent strings can be spoofed.
map $http_user_agent $is_bot {
    default 0;
    "" 1;
    ~*Googlebot/|Bingbot/ 0;
    ~*python-requests|python-urllib|httpx|aiohttp|scrapy|mechanize 1;
    ~*curl|wget|libwww-perl|Go-http-client|Java/|okhttp|axios 1;
    ~*HeadlessChrome|headless|PhantomJS|Playwright|Puppeteer|Selenium|webdriver 1;
    ~*GPTBot|ChatGPT-User|Claude-Web|PerplexityBot|anthropic-ai|CCBot|DataForSeoBot|SemrushBot 1;
}

# Per-IP request and connection limits.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=page_limit:10m rate=60r/m;
limit_req_zone $binary_remote_addr zone=login_limit:10m rate=5r/m;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

server {
    # Empty User-Agent: close the connection without a response.
    # Checked before $is_bot because the map also flags "" as a bot,
    # which would otherwise make this rule unreachable.
    if ($http_user_agent = "") { return 444; }
    if ($is_bot) { return 403; }

    location /api/ {
        limit_req zone=api_limit burst=10 nodelay;
        limit_conn conn_limit 20;
        proxy_pass http://hyperf_upstream;
    }

    location /api/auth/ {
        limit_req zone=login_limit burst=2 nodelay;
        limit_req_status 429;  # the default rejection status is 503
        proxy_pass http://hyperf_upstream;
    }
}
```

## Phase 2: HTTP Request Fingerprinting

See `RequestFingerprintMiddleware.php`:

- `BROWSER_REQUIRED_HEADERS`: `accept`, `accept-language`, `accept-encoding`
- `BROWSER_SECURITY_HEADERS`: `sec-fetch-site`, `sec-fetch-mode`, `sec-fetch-dest`, `sec-ch-ua`
- `SUSPICIOUS_ACCEPT_PATTERNS`: `*/*`, `application/json`, `text/html,*/*;q=0.9`
- Scoring: missing UA +80, scraper tool +70, headless signature +60, +15 per missing required header, missing sec-fetch headers +25, suspicious Accept +20, API request without Referer +15

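The scoring rules above can be sketched in Python as follows. This is an illustration only, not the contents of `RequestFingerprintMiddleware.php`: the function name `header_score`, the abbreviated User-Agent patterns, the mutually exclusive UA branches, the "no `sec-fetch-*` header at all" reading of the +25 rule, and the cap at 100 are all assumptions.

```python
import re

BROWSER_REQUIRED_HEADERS = ("accept", "accept-language", "accept-encoding")
SEC_FETCH_HEADERS = ("sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest")
SUSPICIOUS_ACCEPT_PATTERNS = ("*/*", "application/json", "text/html,*/*;q=0.9")

# Assumed, abbreviated token lists borrowed from the Phase 1 Nginx map.
SCRAPER_UA = re.compile(r"python-requests|scrapy|curl|wget|okhttp", re.IGNORECASE)
HEADLESS_UA = re.compile(r"HeadlessChrome|PhantomJS|Puppeteer|Selenium", re.IGNORECASE)


def header_score(headers, path):
    """Return a 0-100 suspicion score for one request's headers."""
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")
    score = 0
    # Assumption: the three UA rules are mutually exclusive.
    if not ua:
        score += 80
    elif SCRAPER_UA.search(ua):
        score += 70
    elif HEADLESS_UA.search(ua):
        score += 60
    # +15 for each required browser header that is absent.
    score += 15 * sum(1 for name in BROWSER_REQUIRED_HEADERS if name not in h)
    # Assumption: +25 applies when no sec-fetch-* header is present at all.
    if not any(name in h for name in SEC_FETCH_HEADERS):
        score += 25
    if h.get("accept", "").replace(" ", "") in SUSPICIOUS_ACCEPT_PATTERNS:
        score += 20
    if path.startswith("/api/") and "referer" not in h:
        score += 15
    return min(score, 100)  # Assumption: scores are capped at 100.
```

A request from `python-requests` with only `Accept: */*` and `Accept-Encoding` set hits the cap, while a browser request that merely lacks the `sec-fetch-*` headers scores 25.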
## Phase 3: Rate Limiting and Behavior Analysis

`RateLimitService`:

- `IP_RULES`: tiered rate limits for global/api/search/export/login/register
- `analyzeRequestPattern()`: computes the coefficient of variation (CV) of the request intervals; CV < 0.3 or a mean interval < 200 ms is flagged as abnormal
- `analyzeSubnetPattern()`: more than 500 requests per minute from a single /24 subnet is flagged as a proxy pool
- Redis keys: `rl:{rule}:{ip}` (ZSET), `req_ts:{session}` (LIST)

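The interval check in `analyzeRequestPattern()` can be illustrated with a short Python sketch. The CV is the standard deviation of the intervals divided by their mean; the function name, the millisecond timestamps, the use of the population standard deviation, and the minimum of three samples are assumptions rather than details from `RateLimitService`.

```python
from statistics import mean, pstdev


def analyze_request_pattern(timestamps_ms, cv_threshold=0.3, min_interval_ms=200.0):
    """Return True when the request timing looks automated."""
    # Assumption: fewer than three timestamps is too little data to judge.
    if len(timestamps_ms) < 3:
        return False
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    avg = mean(intervals)
    if avg <= 0:
        return True  # all requests arrived at the same instant
    cv = pstdev(intervals) / avg  # coefficient of variation
    return cv < cv_threshold or avg < min_interval_ms
```

Perfectly regular one-second polling gives CV = 0 and is flagged; irregular human browsing with multi-second gaps is not.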
## Phase 4: IP Reputation and Proxy Detection

`IpIntelligenceService`:

- `DATACENTER_PREFIXES`: AWS, GCP, Azure, Cloudflare, DigitalOcean, Vultr, and others
- `classify()`: checks in order blocklist → whitelist → tor → datacenter → residential
- `trackCrawlBreadth()`: uses HyperLogLog; more than 200 unique paths within 1 hour is flagged as a crawler

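The precedence in `classify()` matters: an explicitly whitelisted address inside a datacenter range must still be trusted. A minimal Python sketch of that ordering, using in-memory sets and documentation-range addresses as hypothetical stand-ins for the real lists:

```python
import ipaddress

# Hypothetical sample data (RFC 5737 documentation ranges), not real lists.
BLOCKLIST = {"203.0.113.7"}
WHITELIST = {"198.51.100.10"}
TOR_EXITS = {"192.0.2.55"}
DATACENTER_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]


def classify(ip):
    """Classify an IP using the order blocklist, whitelist, tor, datacenter."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if ip in BLOCKLIST:
        return "blocklist"
    if ip in WHITELIST:
        return "whitelist"
    if ip in TOR_EXITS:
        return "tor"
    if any(addr in net for net in DATACENTER_PREFIXES):
        return "datacenter"
    return "residential"  # fallback when nothing else matches
```

Here `198.51.100.10` is classified as `whitelist` even though it falls inside the sample datacenter prefix, because the whitelist check runs first.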
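## Phase 5: Browser Fingerprinting

- Frontend `collectFingerprint()`: collects Canvas, WebGL, audio, mouse, and font fingerprints
- Backend `BrowserFingerprintService.analyze()`: SwiftShader/llvmpipe/ANGLE renderer +30, no audio fingerprint +20, all fonts identical +25, no mouse activity +20, 4 cores with 0 memory +15
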
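A rough Python sketch of the backend scoring follows. The payload field names (`webgl_renderer`, `audio_hash`, `font_widths`, `mouse_events`, `hardware_concurrency`, `device_memory`) are hypothetical; the document does not specify what `collectFingerprint()` actually sends.

```python
# Renderer substrings taken from the rule above. Desktop Chrome commonly
# reports an "ANGLE (...)" renderer on real hardware as well, so that
# entry may need narrowing in practice.
SOFTWARE_RENDERERS = ("swiftshader", "llvmpipe", "angle")


def fingerprint_score(fp):
    """Score a fingerprint payload; a higher score is more bot-like."""
    score = 0
    renderer = str(fp.get("webgl_renderer", "")).lower()
    if any(name in renderer for name in SOFTWARE_RENDERERS):
        score += 30
    if not fp.get("audio_hash"):
        score += 20  # no audio fingerprint
    widths = fp.get("font_widths") or []
    if len(widths) > 1 and len(set(widths)) == 1:
        score += 25  # every probed font measured identically
    if not fp.get("mouse_events"):
        score += 20  # no mouse activity recorded
    if fp.get("hardware_concurrency") == 4 and not fp.get("device_memory"):
        score += 15  # 4 cores reported with 0 memory
    return score
```

A payload that trips every rule scores 110, so a caller would probably clamp the result before feeding it into a 0 to 100 composite.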
## Phase 6: AI Agent Behavior Detection

`AiBotDetectionService`: CV < 0.2 +45, read-only GET traffic +25, mean interval < 300 ms +30, UA/language mismatch +15; a total score ≥ 60 is classified as an AI bot.

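The additive rule can be written out as a small Python sketch. The function names, the input shapes (a list of inter-request intervals in milliseconds, a list of HTTP methods, and a precomputed mismatch flag), and the requirement of at least two intervals are assumptions.

```python
from statistics import mean, pstdev

AI_BOT_THRESHOLD = 60


def ai_bot_score(intervals_ms, methods, ua_language_mismatch):
    """Sum the behavioral signals listed for AiBotDetectionService."""
    score = 0
    if len(intervals_ms) >= 2:  # assumption: need two intervals for timing rules
        avg = mean(intervals_ms)
        if avg > 0 and pstdev(intervals_ms) / avg < 0.2:
            score += 45  # very regular timing
        if avg < 300:
            score += 30  # very fast requests
    if methods and all(m.upper() == "GET" for m in methods):
        score += 25  # purely read-only session
    if ua_language_mismatch:
        score += 15
    return score


def is_ai_bot(score):
    return score >= AI_BOT_THRESHOLD
```

Regular timing plus a GET-only session already reaches 70, which is past the threshold without any other signal.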
## Phase 7: Proof of Work (PoW)

- Frontend `solvePoW()`: SHA-256 prefix matching; difficulty=4 takes roughly 10,000 attempts
- Backend `PoWService`: `generate()` issues a nonce and id; `verify()` validates the solution and prevents replay

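A minimal Python sketch of the generate/solve/verify cycle is shown below. It assumes that difficulty counts leading hex zeros of `SHA-256(nonce + counter)`, and it uses an in-memory dict where a real service would use a shared store; neither detail is confirmed by the source. Under that assumption the expected work at difficulty 4 is about 16^4 = 65,536 hashes, which differs from the figure quoted above, so the original may define difficulty differently.

```python
import hashlib
import secrets
from itertools import count

# In-memory stand-in for the challenge store (assumption).
_issued = {}  # challenge id -> (nonce, difficulty)


def generate(difficulty=4):
    """Issue a fresh challenge consisting of an id and a random nonce."""
    challenge_id = secrets.token_hex(8)
    nonce = secrets.token_hex(16)
    _issued[challenge_id] = (nonce, difficulty)
    return {"id": challenge_id, "nonce": nonce, "difficulty": difficulty}


def solve_pow(nonce, difficulty):
    """Client side: find a counter whose hash starts with `difficulty` zeros."""
    target = "0" * difficulty
    for counter in count():
        digest = hashlib.sha256(f"{nonce}{counter}".encode()).hexdigest()
        if digest.startswith(target):
            return counter


def verify(challenge_id, counter):
    """Server side: check the solution; each challenge is accepted once."""
    issued = _issued.pop(challenge_id, None)  # pop makes a replay fail
    if issued is None:
        return False
    nonce, difficulty = issued
    digest = hashlib.sha256(f"{nonce}{counter}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Removing the challenge on the first `verify()` call is what blocks replay in this sketch: a second submission of the same id finds nothing to check against.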
## Phase 8: Honeypots

- Frontend: a CSS-hidden input with a randomized field name; on submit, if the honeypot field has a value, call `reportBot`
- Backend: honeypot route `/api/internal/user-export-all`; any access bans the IP for 1 hour

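The backend half can be sketched in Python as below. The in-memory ban table, the `handle_request` name, and the returned status codes are assumptions for illustration; only the route and the one-hour ban come from the text above.

```python
import time

HONEYPOT_PATHS = {"/api/internal/user-export-all"}
BLOCK_SECONDS = 3600  # 1 hour

# In-memory stand-in for the ban store (assumption).
_blocked_until = {}  # ip -> unix time when the ban expires


def handle_request(ip, path, now=None):
    """Return an HTTP status: 403 for banned IPs and honeypot hits, else 200."""
    now = time.time() if now is None else now
    if _blocked_until.get(ip, 0) > now:
        return 403
    if path in HONEYPOT_PATHS:
        _blocked_until[ip] = now + BLOCK_SECONDS
        return 403
    return 200
```

No legitimate page links to the honeypot route, so a single hit is treated as sufficient evidence and every later request from that IP is refused until the ban expires.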
## Phase 9: Composite Scoring and Differentiated Responses

`AntiScrapingMiddleware` computes a weighted total: `headerScore*0.2 + ipRisk*0.15 + rateScore*0.25 + breadthScore*0.15 + patternScore*0.15 + subnetScore*0.05 + fpScore*0.05`

- total ≥ 90: ban the IP for 2 hours
- total ≥ 70: require a PoW challenge
- total ≥ 50: delay the response by 0.5–2 seconds
- total ≥ 30: return empty data from sensitive endpoints

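The weights sum to 1.0, so the total stays in the same range as the component scores. A Python sketch of the aggregation and the tiered decision follows; the dictionary keys and the action labels are hypothetical names, and only the weights and thresholds come from the text above.

```python
# Weights from the formula above; the key names are illustrative.
WEIGHTS = {
    "header": 0.20,
    "ip_risk": 0.15,
    "rate": 0.25,
    "breadth": 0.15,
    "pattern": 0.15,
    "subnet": 0.05,
    "fingerprint": 0.05,
}


def total_score(scores):
    """Weighted sum of component scores; missing components count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def decide(total):
    """Map a total score to a response tier, checking the highest tier first."""
    if total >= 90:
        return "block_ip_2h"
    if total >= 70:
        return "pow_challenge"
    if total >= 50:
        return "delay_response"
    if total >= 30:
        return "empty_data_on_sensitive"
    return "allow"
```

For example, a header score of 80 and a rate score of 90 with every other component at 0 give 0.2 × 80 + 0.25 × 90 = 38.5, which falls in the lowest tier.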
## Monitoring and Redis Cleanup

| Metric | Alert threshold |
|--------|-----------------|
| 403 response rate | > 3% |
| 429 response rate | > 5% |
| PoW challenges triggered | > 100/hour |
| Honeypot hits | > 50/day |
| IP bans | > 200/day |

Redis TTL: the `rl`, `req_ts`, `req_log`, `crawl_breadth`, `subnet_req`, and `pow` keys all carry a TTL; expired bans in `ip:blocklist` are purged periodically.