vibe_coding/.cursor/skills/anti-scraping/references/implementation-phases.md
2026-03-05 21:27:11 +08:00

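# Anti-Scraping: Complete Implementation Code
> For the main flow and decision logic, see SKILL.md. This document covers the in-depth implementation details of Phases 1-9.
## Phase 1: Nginx front-line interception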
```nginx
# /etc/nginx/conf.d/anti-scraping.conf
map $http_user_agent $is_bot {
default 0;
~*python-requests|python-urllib|httpx|aiohttp|scrapy|mechanize 1;
~*curl|wget|libwww-perl|Go-http-client|Java/|okhttp|axios 1;
~*HeadlessChrome|headless|PhantomJS|Playwright|Puppeteer|Selenium|webdriver 1;
~*GPTBot|ChatGPT-User|Claude-Web|PerplexityBot|anthropic-ai|CCBot|DataForSeoBot|SemrushBot 1;
# Allow major search-engine crawlers (a single case-insensitive regex).
~*Googlebot/|Bingbot/ 0;
"" 1;
}
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=page_limit:10m rate=60r/m;
limit_req_zone $binary_remote_addr zone=login_limit:10m rate=5r/m;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
server {
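# Check the empty UA first: the map also marks "" as a bot, so the 403 rule would otherwise shadow this one.
if ($http_user_agent = "") { return 444; }
if ($is_bot) { return 403; }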
location /api/ {
limit_req zone=api_limit burst=10 nodelay;
limit_conn conn_limit 20;
proxy_pass http://hyperf_upstream;
}
location /api/auth/ {
limit_req zone=login_limit burst=2 nodelay;
limit_req_status 429;
proxy_pass http://hyperf_upstream;
}
}
```
## Phase 2: HTTP request fingerprinting
`RequestFingerprintMiddleware.php`
- `BROWSER_REQUIRED_HEADERS`: accept, accept-language, accept-encoding
- `BROWSER_SECURITY_HEADERS`: sec-fetch-site, sec-fetch-mode, sec-fetch-dest, sec-ch-ua
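- `SUSPICIOUS_ACCEPT_PATTERNS`: `*/*`, `application/json`, `text/html,*/*;q=0.9`
- Scoring logic: missing UA +80; scraper-tool UA +70; headless signature +60; each missing required header +15; missing sec-fetch headers +25; suspicious Accept +20; API request without Referer +15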
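
The scoring rules above can be sketched as a single function. This is a minimal Python illustration, not the actual `RequestFingerprintMiddleware.php` code: the marker lists, the choice to treat the three UA rules as mutually exclusive, the reading of "missing sec-fetch" as "none of the security headers present", and the cap at 100 are all assumptions.

```python
BROWSER_REQUIRED_HEADERS = ("accept", "accept-language", "accept-encoding")
BROWSER_SECURITY_HEADERS = ("sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest", "sec-ch-ua")
SUSPICIOUS_ACCEPT_PATTERNS = ("*/*", "application/json", "text/html,*/*;q=0.9")

# Hypothetical marker lists for illustration; the real middleware may differ.
SCRAPER_MARKERS = ("python-requests", "scrapy", "curl", "wget")
HEADLESS_MARKERS = ("headlesschrome", "phantomjs", "puppeteer")


def header_score(headers: dict[str, str], is_api: bool = False) -> int:
    """Return a 0-100 suspicion score from request headers."""
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "").lower()
    score = 0

    # UA rules (assumed mutually exclusive, strongest signal first).
    if not ua:
        score += 80
    elif any(m in ua for m in SCRAPER_MARKERS):
        score += 70
    elif any(m in ua for m in HEADLESS_MARKERS):
        score += 60

    # +15 for each required browser header that is absent.
    score += 15 * sum(1 for name in BROWSER_REQUIRED_HEADERS if name not in h)

    # +25 when no sec-fetch / client-hint header is present at all (assumption).
    if not any(name in h for name in BROWSER_SECURITY_HEADERS):
        score += 25

    # +20 when the Accept value is exactly one of the suspicious patterns.
    if h.get("accept", "").strip() in SUSPICIOUS_ACCEPT_PATTERNS:
        score += 20

    # +15 for API calls that carry no Referer.
    if is_api and "referer" not in h:
        score += 15

    return min(score, 100)  # cap is an assumption so the value fits a 0-100 scale
```

A request with a normal browser UA, all three required headers, and at least one `sec-fetch-*` header scores 0 under these assumptions; a request with no headers at all hits the cap.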
## Phase 3: Rate limiting and behavior analysis
`RateLimitService`
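- `IP_RULES`: tiered rate limits for global/api/search/export/login/register
- `analyzeRequestPattern()`: computes the coefficient of variation (CV) of request intervals; CV < 0.3 or a mean interval < 200 ms is flagged as anomalous
- `analyzeSubnetPattern()`: more than 500 requests from a /24 subnet within 1 minute is flagged as a proxy pool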
- Redis key: `rl:{rule}:{ip}` (ZSET), `req_ts:{session}` (LIST)
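
The interval check in `analyzeRequestPattern()` can be illustrated as follows. This is a Python sketch under assumptions: timestamps arrive as a plain list of milliseconds (in the real service they would come from the `req_ts:{session}` list), the population standard deviation is used, and `min_samples` is a hypothetical guard against judging very short sessions.

```python
from statistics import mean, pstdev


def analyze_request_pattern(timestamps_ms: list[float], min_samples: int = 5) -> dict:
    """Flag sessions whose request timing is too regular or too fast."""
    if len(timestamps_ms) < min_samples:
        # Too little data to judge (min_samples is an assumed parameter).
        return {"anomalous": False, "cv": None, "mean_interval_ms": None}

    ts = sorted(timestamps_ms)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(intervals)
    # CV = standard deviation / mean; humans produce irregular gaps (high CV).
    cv = pstdev(intervals) / avg if avg > 0 else 0.0

    return {
        "anomalous": cv < 0.3 or avg < 200,
        "cv": cv,
        "mean_interval_ms": avg,
    }
```

A perfectly periodic client (one request every second) has a CV of 0 and is flagged, while irregular human-like gaps of several seconds are not.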
## Phase 4: IP reputation and proxy detection
`IpIntelligenceService`
- `DATACENTER_PREFIXES`: AWS/GCP/Azure/Cloudflare/DigitalOcean/Vultr, etc.
- `classify()`: blocklist → whitelist → tor → datacenter → residential
- `trackCrawlBreadth()`: HyperLogLog; more than 200 unique paths within 1 hour is flagged as a crawler
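
The classification order and the breadth check might look like the sketch below. It is an illustration only: the address sets use RFC 5737 documentation ranges as placeholders rather than real provider prefixes, and `CrawlBreadthTracker` uses an exact in-memory set where the service described above uses HyperLogLog (`PFADD`/`PFCOUNT` in Redis). The 1-hour expiry is omitted here.

```python
import ipaddress

# Placeholder data (RFC 5737 documentation addresses), not real reputation feeds.
BLOCKLIST = {"203.0.113.7"}
WHITELIST = {"198.51.100.10"}
TOR_EXIT_NODES = {"192.0.2.55"}
DATACENTER_NETWORKS = [ipaddress.ip_network("192.0.2.128/25")]


def classify(ip: str) -> str:
    """Walk the chain blocklist -> whitelist -> tor -> datacenter -> residential."""
    if ip in BLOCKLIST:
        return "blocklist"
    if ip in WHITELIST:
        return "whitelist"
    if ip in TOR_EXIT_NODES:
        return "tor"
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATACENTER_NETWORKS):
        return "datacenter"
    return "residential"  # default when nothing else matches


class CrawlBreadthTracker:
    """Exact-set stand-in for the HyperLogLog unique-path counter."""

    def __init__(self, threshold: int = 200):
        self.threshold = threshold
        self.paths: dict[str, set[str]] = {}

    def track(self, ip: str, path: str) -> bool:
        """Record a visited path; return True once the IP exceeds the threshold."""
        seen = self.paths.setdefault(ip, set())
        seen.add(path)
        return len(seen) > self.threshold
```

Checking the blocklist first keeps the cheapest, most decisive lookup at the front of the chain; HyperLogLog trades exactness for a small fixed memory footprint per key, which matters when tracking many IPs.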
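## Phase 5: Browser fingerprinting
- Frontend `collectFingerprint()`: Canvas/WebGL/audio/mouse/font fingerprints
- Backend `BrowserFingerprintService.analyze()`: SwiftShader/llvmpipe/ANGLE renderer +30; no audio fingerprint +20; all font measurements identical +25; no mouse activity +20; 4 cores with 0 memory reported +15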
## Phase 6: AI agent behavior detection
`AiBotDetectionService`: CV < 0.2 +45; read-only GET traffic +25; mean interval < 300 ms +30; UA and language mismatch +15; a total ≥ 60 is classified as an AI bot
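
As a worked example of the additive rule above, here is a small Python sketch. The function name and parameters are hypothetical; only the weights and the threshold of 60 come from the text.

```python
def ai_bot_score(
    cv: float,
    read_only_get: bool,
    mean_interval_ms: float,
    ua_language_mismatch: bool,
) -> tuple[int, bool]:
    """Return (score, is_ai_bot) using the Phase 6 weights."""
    score = 0
    if cv < 0.2:                 # very regular request timing
        score += 45
    if read_only_get:            # session consists only of GET reads
        score += 25
    if mean_interval_ms < 300:   # faster than typical human browsing
        score += 30
    if ua_language_mismatch:     # UA locale disagrees with Accept-Language
        score += 15
    return score, score >= 60
```

No single signal reaches 60 on its own, so at least two must fire before a session is classified as an AI bot.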
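## Phase 7: Proof of work (PoW)
- Frontend `solvePoW()`: SHA-256 prefix matching; `difficulty=4` takes about 10,000 attempts by the original estimate (if difficulty counts leading zero hex digits, the expected count is 16^4 = 65,536)
- Backend `PoWService`: `generate()` issues a nonce and id; `verify()` validates the solution and prevents replay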
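
The challenge/solve/verify cycle can be sketched end to end in Python. Several details are assumptions: difficulty is taken to mean the number of leading zero hex digits, the hashed message is `nonce + counter`, and pending challenges live in an in-memory dict where the real `PoWService` presumably uses Redis keys with a TTL.

```python
import hashlib
import itertools
import secrets


def _hash_ok(nonce: str, counter: int, difficulty: int) -> bool:
    digest = hashlib.sha256(f"{nonce}{counter}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve_pow(nonce: str, difficulty: int) -> int:
    """Client side: brute-force a counter whose hash has the required prefix."""
    for counter in itertools.count():
        if _hash_ok(nonce, counter, difficulty):
            return counter


class PoWService:
    """Server side: issue challenges and verify each one at most once."""

    def __init__(self) -> None:
        self._pending: dict[str, dict] = {}  # in-memory stand-in for Redis

    def generate(self, difficulty: int = 4) -> dict:
        challenge = {
            "id": secrets.token_hex(8),
            "nonce": secrets.token_hex(16),
            "difficulty": difficulty,
        }
        self._pending[challenge["id"]] = challenge
        return challenge

    def verify(self, challenge_id: str, solution: int) -> bool:
        # pop() consumes the challenge, so a replayed solution is rejected.
        challenge = self._pending.pop(challenge_id, None)
        if challenge is None:
            return False
        return _hash_ok(challenge["nonce"], solution, challenge["difficulty"])
```

Verification costs the server a single hash while the client pays for the whole search, which is what makes the challenge asymmetric. Consuming the challenge on the first `verify()` call is one simple way to get replay protection.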
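## Phase 8: Honeypot
- Frontend: a CSS-hidden input with a randomized field name; on submit, if the honeypot field has a value, call `reportBot`
- Backend: `/api/internal/user-export-all` is a honeypot route; any access bans the IP for 1 hour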
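## Phase 9: Combined scoring and differentiated responses
`AntiScrapingMiddleware`: weighted sum `headerScore*0.2 + ipRisk*0.15 + rateScore*0.25 + breadthScore*0.15 + patternScore*0.15 + subnetScore*0.05 + fpScore*0.05` (the weights sum to 1.0)
- total ≥ 90: ban the IP for 2 hours
- total ≥ 70: require a PoW challenge
- total ≥ 50: delay the response by 0.5-2 seconds
- total ≥ 30: return empty data from sensitive endpoints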
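
The weighting and the response tiers can be sketched in Python as below. The weights and thresholds are taken from the text; the dictionary keys and action names are hypothetical labels, and each component score is assumed to be on a 0-100 scale.

```python
# Weights from the formula above; keys are illustrative names.
WEIGHTS = {
    "header": 0.20,
    "ip_risk": 0.15,
    "rate": 0.25,
    "breadth": 0.15,
    "pattern": 0.15,
    "subnet": 0.05,
    "fingerprint": 0.05,
}


def total_score(scores: dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def decide(total: float) -> str:
    """Map the combined score to a response tier, strictest first."""
    if total >= 90:
        return "block_ip_2h"
    if total >= 70:
        return "pow_challenge"
    if total >= 50:
        return "delay_0.5_to_2s"
    if total >= 30:
        return "empty_data_on_sensitive_endpoints"
    return "allow"
```

For example, a request that maxes out only the header and rate components scores 0.2*100 + 0.25*100 = 45, which lands in the "empty data" tier rather than triggering a ban.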
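## Monitoring and Redis cleanup
| Metric | Alert threshold |
|------|---------|
| 403 response rate | > 3% |
| 429 response rate | > 5% |
| PoW challenges triggered | > 100/hour |
| Honeypot hits | > 50/day |
| IP bans | > 200/day |

Redis TTL: the rl/req_ts/req_log/crawl_breadth/subnet_req/pow keys all carry a TTL; ip:blocklist is periodically purged of expired bans.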