Commit: 初始化 (Initialization)

.cursor/skills/anti-scraping/SKILL.md
---
name: anti-scraping
version: 4.0.0
description: "Anti-scraping protection for PHP Hyperf + Vue 3 applications. Use when defending against bots, rate limiting, or protecting APIs. Covers five threat tiers."
---
# 🛡️ Anti-Scraping Protection — Full-Stack Defense System

## Threat Tiers and Countermeasures

| Tier | Scraper type | Typical tools | Key signatures | Defense strategy |
|------|--------------|---------------|----------------|------------------|
| **T1** | Simple scripts | curl, wget, Python requests | No JS execution, missing required headers | UA filtering + header checks |
| **T2** | Scraping frameworks | Scrapy, Playwright (unconfigured), httpx | Handle cookies automatically but behave mechanically | Header fingerprinting + rate limiting |
| **T3** | Headless browsers | Puppeteer, Playwright (configured) | Execute JS but show Canvas/WebGL anomalies | Browser fingerprinting + behavior analysis |
| **T4** | Distributed clusters | Self-hosted clusters + proxy pools + UA rotation | Coordinate across IPs, few requests per IP | Correlation analysis + honeypots + PoW |
| **T5** | AI agents | LLM-driven browsers / GPT plugins | Close to real users, but with highly regular behavior | Multi-dimensional fingerprinting + CV analysis + challenges |

---
## Trigger Conditions

The user asks about anti-scraping, rate limiting, bot protection, crawler detection, AI scrapers, Scrapy, or related keywords.

## Execution Flow

### Phase 0: Threat Assessment

Before implementing anything, answer the following questions to determine the protection tier:

| Question | How the answer matters |
|----------|------------------------|
| What is being protected? (API / page content / data assets) | Determines where protection lives (Nginx / Middleware / Frontend) |
| Acceptable false-positive rate? (0.1% / 0.5% / 1%) | Determines how strict the thresholds should be |
| Is a CDN/WAF already in place? | WAF capabilities can be reused, reducing custom work |
| Does the business allow CAPTCHAs? | Affects the CAPTCHA fallback strategy |
| Protecting logged-in content or public content? | Determines whether a frontend fingerprint token is used |

### Phase 1–4: Perimeter and Baseline Protection

1. **Phase 1 Nginx** — UA blocklist map, limit_req/limit_conn; reject known bots outright
2. **Phase 2 Fingerprinting** — RequestFingerprintMiddleware checks UA/Header/Accept/Referer and outputs a risk_score
3. **Phase 3 Rate limiting** — RateLimitService tiered rate limits + coefficient-of-variation (CV) analysis of request intervals + subnet correlation analysis
4. **Phase 4 IP** — IpIntelligenceService block/allow lists, Tor, datacenter IPs, crawl breadth (HyperLogLog)

### Phase 5–6: Advanced Detection

5. **Phase 5 Browser fingerprinting** — frontend collects Canvas/WebGL/audio/fonts/mouse; backend checks for headless signatures
6. **Phase 6 AI agents** — request-interval CV, read-only mode, speed, UA/language mismatch

### Phase 7–9: Countermeasures and Response

7. **Phase 7 PoW** — proof-of-work challenge; real users' JS solves it automatically
8. **Phase 8 Honeypots** — hidden frontend fields + backend honeypot routes
9. **Phase 9 Aggregation** — AntiScrapingMiddleware weighted scoring with differentiated responses (block / PoW / delay / fake data)

## Verification Checklist

1. [ ] `curl` requests trigger a 403 or 429 within 3 attempts
2. [ ] Python `requests` with its default UA is rejected by Nginx outright
3. [ ] Scrapy crawling 100+ pages gets its IP banned
4. [ ] Puppeteer (without anti-fingerprinting) scores ≥ 60 and triggers a challenge
5. [ ] Uniformly spaced requests (CV < 0.2) are flagged by the AI behavior analysis
6. [ ] Accessing a honeypot route gets the IP banned
7. [ ] The PoW challenge is solved correctly by the frontend (difficulty=4, within ~2 seconds)
8. [ ] 500+ requests from the same /24 subnet trigger the proxy-pool flag
9. [ ] False-positive rate for normal users < 0.2%

## Tier 3 Deep References

| File | Contents |
|------|----------|
| `references/implementation-phases.md` | Full Phase 1–9 implementation code and Redis monitoring |
| `references/anti-scraping-patterns.md` | Quick reference for anti-scraping patterns and strategies |

---

# Anti-Scraping Reference Pattern Library

## 1. Common Scraper Fingerprints

### User-Agent Blocklist (regex)

```typescript
const BOT_UA_PATTERNS = [
  // Scraping frameworks
  /python-requests/i,
  /scrapy/i,
  /beautifulsoup/i,
  /selenium/i,
  /playwright/i,
  /puppeteer/i,
  /mechanize/i,
  /httpclient/i,
  /java\/\d/i,
  /go-http-client/i,
  /ruby/i,

  // Command-line tools
  /curl\//i,
  /wget\//i,
  /httpie/i,
  /insomnia/i,

  // Headless-browser signatures
  /headlesschrome/i,
  /phantomjs/i,
  /slimerjs/i,

  // Known data harvesters
  /dataprovider/i,
  /yandexbot/i,
  /mj12bot/i,
  /ahrefsbot/i,
  /semrushbot/i,
  /dotbot/i,
];

export function isBotUA(ua: string): boolean {
  return BOT_UA_PATTERNS.some(p => p.test(ua));
}
```

### Allowlisted Search-Engine Crawlers

```typescript
import { promises as dns } from 'node:dns';

// Legitimate crawlers: verify authenticity via reverse DNS lookup
const ALLOWED_BOTS = [
  { name: 'Googlebot', ua: /googlebot/i, rdns: 'googlebot.com' },
  { name: 'Bingbot', ua: /bingbot/i, rdns: 'search.msn.com' },
  { name: 'Baidu Spider', ua: /baiduspider/i, rdns: 'crawl.baidu.com' },
];

async function reverseDNS(ip: string): Promise<string | undefined> {
  try {
    const hostnames = await dns.reverse(ip);
    return hostnames[0];
  } catch {
    return undefined;
  }
}

async function isLegitimateBot(ua: string, ip: string): Promise<boolean> {
  const bot = ALLOWED_BOTS.find(b => b.ua.test(ua));
  if (!bot) return false;

  // Reverse DNS verification (prevents forged UAs).
  // Match on a dot boundary so e.g. "evilgooglebot.com" does not pass.
  const hostname = await reverseDNS(ip);
  if (!hostname) return false;
  return hostname === bot.rdns || hostname.endsWith('.' + bot.rdns);
}
```

---

## 2. Risk Scoring Algorithm

### Composite Scoring Model

```typescript
interface RiskFactors {
  fingerprintScore: number; // 0-100
  rateScore: number;        // 0-100 (rises when rate limits are exceeded)
  ipScore: number;          // 0-100 (datacenter/Tor/VPN)
  behaviorScore: number;    // 0-100 (behavioral anomalies)
}

function calculateRiskScore(factors: RiskFactors): number {
  const weights = {
    fingerprint: 0.35,
    rate: 0.30,
    ip: 0.25,
    behavior: 0.10,
  };

  return Math.min(
    Math.round(
      factors.fingerprintScore * weights.fingerprint +
      factors.rateScore * weights.rate +
      factors.ipScore * weights.ip +
      factors.behaviorScore * weights.behavior
    ),
    100
  );
}

// Response strategy
function getResponseStrategy(score: number): 'allow' | 'slowdown' | 'challenge' | 'block' {
  if (score >= 80) return 'block';
  if (score >= 50) return 'challenge';
  if (score >= 30) return 'slowdown';
  return 'allow';
}
```

---

## 3. Redis Data Structure Design

```
# Rate limiting (sliding window)
ZSET rl:{ip} → { timestamp: score }
ZSET rl:{ip}:{endpoint} → { timestamp: score }

# IP block/allow lists
SET ip:blocklist → { ip1, ip2, ... }
SET ip:allowlist → { ip1, ip2, ... }  (legitimate-crawler allowlist)
SET ip:tor-exit → { ip1, ip2, ... }
SET ip:datacenter → { ip1, ip2, ... }

# Honeypot hit records
HASH honeypot:hits → { ip: count }

# CAPTCHA pass records (avoid re-challenging)
STRING captcha:passed:{ip} → "1"  (TTL 1 hour)

# Behavior profile
HASH behavior:{ip} → {
  first_seen: timestamp,
  request_count: number,
  path_entropy: number,          # path diversity (low = scraper)
  referer_missing_ratio: number, # share of requests without a Referer (high = scraper)
}
```
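The sliding-window scheme behind `ZSET rl:{ip}` can be sketched as an in-memory stand-in. This is a hypothetical illustration (the `SlidingWindowLimiter` name is not from the source); in production the same three steps map onto Redis `ZREMRANGEBYSCORE`, `ZCARD`, and `ZADD` so the window is shared across app instances.

```typescript
// In-memory stand-in for the Redis ZSET sliding window described above.
// Each IP keeps a list of hit timestamps; a hit is allowed only while the
// count of timestamps inside the window stays below the limit.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private windowMs: number, private limit: number) {}

  check(ip: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps outside the window (≈ ZREMRANGEBYSCORE rl:{ip} 0 cutoff)
    const recent = (this.hits.get(ip) ?? []).filter(t => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(ip, recent);
      return false; // over the limit (≈ ZCARD ≥ limit)
    }
    recent.push(now); // record this hit (≈ ZADD rl:{ip} now now)
    this.hits.set(ip, recent);
    return true;
  }
}
```

Unlike a fixed-window counter, this never allows a 2× burst across a window boundary, which matters for the tight per-endpoint limits above.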

---

## 4. Nginx-Layer Protection (optional, best performance)

```nginx
# /etc/nginx/conf.d/anti-scraping.conf

# Rate-limit zone definitions
limit_req_zone $binary_remote_addr zone=api:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_conn_zone $binary_remote_addr zone=conn:10m;

server {
    # Connection limit (max 20 concurrent connections per IP)
    limit_conn conn 20;

    # UA blocklist
    if ($http_user_agent ~* "(python|curl|wget|scrapy|selenium)") {
        return 403;
    }

    # Reject empty UA
    if ($http_user_agent = "") {
        return 403;
    }

    location /api/ {
        limit_req zone=api burst=10 nodelay;
        limit_req_status 429;

        proxy_pass http://app;
    }

    location /api/auth/ {
        limit_req zone=login burst=2 nodelay;
        limit_req_status 429;

        proxy_pass http://app;
    }

    # Honeypot route (real users never visit it)
    location /admin-backup/ {
        access_log /var/log/nginx/honeypot.log;
        # Log the visitor's IP and return fake data
        return 200 '{"status":"ok"}';
        add_header Content-Type application/json;
    }
}
```

---

## 5. Cloudflare Workers Variant (edge layer, recommended first)

```typescript
// workers/anti-scraping.ts
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ip = request.headers.get('CF-Connecting-IP') ?? '';
    const ua = request.headers.get('User-Agent') ?? '';

    // Use Cloudflare's threat score (exposed on request.cf, not as a header)
    const cf: any = (request as any).cf ?? {};
    const cfThreatScore = Number(cf.threatScore ?? 0);
    if (cfThreatScore > 30) {
      return new Response('Forbidden', { status: 403 });
    }

    // Use Cloudflare's bot score (requires Bot Management to be enabled)
    const cfBotScore = Number(cf.botManagement?.score ?? 100);
    if (cfBotScore < 30) {
      // Low score = high likelihood of a bot
      return new Response('Forbidden', { status: 403 });
    }

    // Custom rate limiting (KV shown here; use Durable Objects for strict accuracy)
    const rateLimitKey = `rl:${ip}`;
    const count = Number(await env.RATE_LIMIT.get(rateLimitKey) ?? 0);
    if (count > 60) {
      return new Response('Too Many Requests', {
        status: 429,
        headers: { 'Retry-After': '60' },
      });
    }

    await env.RATE_LIMIT.put(rateLimitKey, String(count + 1), { expirationTtl: 60 });
    return fetch(request);
  },
};
```

---

## 6. Monitoring Dashboard (Datadog / Grafana metrics)

```typescript
// lib/metrics.ts — key instrumentation points
// (assumes a StatsD-style `metrics` client and a `logger` are in scope)
export const antiScrapingMetrics = {
  // Request blocked
  blocked: (reason: 'fingerprint' | 'rate' | 'honeypot' | 'ip' | 'captcha') => {
    metrics.increment('anti_scraping.blocked', { reason });
  },

  // Risk-score distribution
  scoreDistribution: (score: number) => {
    metrics.histogram('anti_scraping.risk_score', score);
  },

  // CAPTCHA impressions and outcomes
  captchaImpressed: () => metrics.increment('anti_scraping.captcha.impressed'),
  captchaPassed: () => metrics.increment('anti_scraping.captcha.passed'),
  captchaFailed: () => metrics.increment('anti_scraping.captcha.failed'),

  // False-positive monitoring
  falsePositive: (userId: string) => {
    metrics.increment('anti_scraping.false_positive');
    logger.warn({ userId }, 'Possible false positive in anti-scraping');
  },
};
```

---

## 7. Test Cases

```typescript
// __tests__/anti-scraping.test.ts
describe('Anti-Scraping', () => {
  describe('Fingerprint Analysis', () => {
    it('should flag Python requests as high risk', () => {
      const score = analyzeFingerprint(mockRequest({
        'user-agent': 'python-requests/2.28.0',
      }));
      expect(score).toBeGreaterThanOrEqual(50);
    });

    it('should not flag a normal Chrome browser', () => {
      const score = analyzeFingerprint(mockRequest({
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'accept-language': 'zh-CN,zh;q=0.9',
        'accept-encoding': 'gzip, deflate, br',
        'accept': 'text/html,application/xhtml+xml',
      }));
      expect(score).toBeLessThan(20);
    });
  });

  describe('Rate Limiting', () => {
    it('should block after exceeding the limit', async () => {
      const ip = '192.168.1.100';
      // Send 61 requests
      for (let i = 0; i < 61; i++) {
        await checkRateLimit(ip, { windowMs: 60_000, limit: 60 });
      }
      const { allowed } = await checkRateLimit(ip, { windowMs: 60_000, limit: 60 });
      expect(allowed).toBe(false);
    });
  });

  describe('Honeypot', () => {
    it('should block an IP that triggers the honeypot', async () => {
      const ip = '10.0.0.1';
      await triggerHoneypot(ip);
      const isBlocked = await redis.sismember('ip:blocklist', ip);
      expect(isBlocked).toBe(1);
    });
  });
});
```

---

.cursor/skills/anti-scraping/references/implementation-phases.md

# Anti-Scraping — Full Implementation Code

> The main flow and decision logic live in SKILL.md; this document covers the deep implementation details for Phases 1–9.

## Phase 1: Nginx Perimeter Blocking

```nginx
# /etc/nginx/conf.d/anti-scraping.conf
map $http_user_agent $is_bot {
    default 0;
    ~*python-requests|python-urllib|httpx|aiohttp|scrapy|mechanize 1;
    ~*curl|wget|libwww-perl|Go-http-client|Java/|okhttp|axios 1;
    ~*HeadlessChrome|headless|PhantomJS|Playwright|Puppeteer|Selenium|webdriver 1;
    ~*GPTBot|ChatGPT-User|Claude-Web|PerplexityBot|anthropic-ai|CCBot|DataForSeoBot|SemrushBot 1;
    ~*(Googlebot|Bingbot)/ 0;
    "" 1;
}

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=page_limit:10m rate=60r/m;
limit_req_zone $binary_remote_addr zone=login_limit:10m rate=5r/m;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

server {
    if ($is_bot) { return 403; }
    if ($http_user_agent = "") { return 444; }
    location /api/ {
        limit_req zone=api_limit burst=10 nodelay;
        limit_conn conn_limit 20;
        proxy_pass http://hyperf_upstream;
    }
    location /api/auth/ {
        limit_req zone=login_limit burst=2 nodelay;
        limit_req_status 429;
        proxy_pass http://hyperf_upstream;
    }
}
```

## Phase 2: HTTP Request Fingerprinting

See `RequestFingerprintMiddleware.php`:

- `BROWSER_REQUIRED_HEADERS`: accept, accept-language, accept-encoding
- `BROWSER_SECURITY_HEADERS`: sec-fetch-site, sec-fetch-mode, sec-fetch-dest, sec-ch-ua
- `SUSPICIOUS_ACCEPT_PATTERNS`: */*, application/json, text/html,*/*;q=0.9
- Scoring: missing UA +80, scraper tool +70, headless signature +60, each missing browser header +15, no sec-fetch headers +25, suspicious Accept +20, API call without a Referer +15

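The scoring table above can be sketched in TypeScript as follows. This is a hypothetical illustration of the rules (the real logic lives in `RequestFingerprintMiddleware.php`, and `scoreHeaders` is not a name from the source); only the `*/*` case of the suspicious-Accept patterns is shown.

```typescript
// Hypothetical sketch of the Phase 2 scoring rules.
const BROWSER_REQUIRED_HEADERS = ['accept', 'accept-language', 'accept-encoding'];
const SEC_FETCH_HEADERS = ['sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest'];
const SCRAPER_UA = /python-requests|scrapy|curl|wget|httpx|aiohttp/i;
const HEADLESS_UA = /headlesschrome|phantomjs|playwright|puppeteer/i;

function scoreHeaders(headers: Record<string, string>, isApiPath: boolean): number {
  let score = 0;
  const ua = headers['user-agent'] ?? '';

  if (!ua) score += 80;                       // missing UA
  else if (SCRAPER_UA.test(ua)) score += 70;  // scraper tool
  else if (HEADLESS_UA.test(ua)) score += 60; // headless signature

  for (const h of BROWSER_REQUIRED_HEADERS) {
    if (!headers[h]) score += 15;             // each missing browser header
  }
  if (!SEC_FETCH_HEADERS.some(h => headers[h])) score += 25; // no sec-fetch at all
  if (headers['accept'] === '*/*') score += 20;              // suspicious Accept
  if (isApiPath && !headers['referer']) score += 15;         // API call, no Referer

  return Math.min(score, 100); // clamp to the 0-100 risk_score range
}
```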

## Phase 3: Rate Limiting and Behavior Analysis

`RateLimitService`:

- `IP_RULES`: tiered limits for global/api/search/export/login/register
- `analyzeRequestPattern()`: coefficient of variation (CV) of request intervals; CV < 0.3 or a mean interval < 200 ms is flagged as anomalous
- `analyzeSubnetPattern()`: > 500 requests from a /24 subnet within 1 minute is flagged as a proxy pool
- Redis keys: `rl:{rule}:{ip}` (ZSET), `req_ts:{session}` (LIST)

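The CV check in `analyzeRequestPattern()` can be sketched like this (a TypeScript illustration; the actual service is PHP, and these function names are hypothetical). The intuition: human traffic is bursty, so the standard deviation of intervals is large relative to the mean (high CV), while scripted loops fire metronomically (CV near 0).

```typescript
// Coefficient of variation (CV = stddev / mean) of request intervals.
function intervalCV(timestampsMs: number[]): number {
  const intervals: number[] = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    intervals.push(timestampsMs[i] - timestampsMs[i - 1]);
  }
  if (intervals.length < 2) return Infinity; // not enough data to judge
  const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
  if (mean === 0) return 0;
  const variance = intervals.reduce((a, b) => a + (b - mean) ** 2, 0) / intervals.length;
  return Math.sqrt(variance) / mean;
}

function meanIntervalMs(timestampsMs: number[]): number {
  if (timestampsMs.length < 2) return Infinity;
  return (timestampsMs[timestampsMs.length - 1] - timestampsMs[0]) / (timestampsMs.length - 1);
}

// Apply the rules above: CV < 0.3 or mean interval < 200 ms = anomalous
function isAnomalousPattern(timestampsMs: number[]): boolean {
  const cv = intervalCV(timestampsMs);
  if (!isFinite(cv)) return false; // too few samples, don't flag
  return cv < 0.3 || meanIntervalMs(timestampsMs) < 200;
}
```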

## Phase 4: IP Reputation and Proxy Detection

`IpIntelligenceService`:

- `DATACENTER_PREFIXES`: AWS/GCP/Azure/Cloudflare/DigitalOcean/Vultr, etc.
- `classify()`: blocklist → whitelist → tor → datacenter → residential
- `trackCrawlBreadth()`: HyperLogLog; > 200 unique paths within 1 hour is flagged as a crawler

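The crawl-breadth idea can be sketched with an exact-count stand-in (a hypothetical `CrawlBreadthTracker`, not from the source). Production uses Redis HyperLogLog (`PFADD crawl_breadth:{ip} path` + `PFCOUNT` with an hourly TTL), which approximates unique-path cardinality in constant memory instead of storing every path.

```typescript
// Exact-count stand-in for trackCrawlBreadth(); illustrates the rule only.
const BREADTH_THRESHOLD = 200; // unique paths per hour, per the rule above

class CrawlBreadthTracker {
  private paths = new Map<string, Set<string>>();

  // Returns true when the IP should be flagged as a crawler.
  track(ip: string, path: string): boolean {
    let set = this.paths.get(ip);
    if (!set) {
      set = new Set();
      this.paths.set(ip, set);
    }
    set.add(path);
    return set.size > BREADTH_THRESHOLD;
  }
}
```

A Set is fine for a sketch, but at scale one Set per IP is exactly the memory blow-up HyperLogLog avoids.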

## Phase 5: Browser Fingerprinting

Frontend `collectFingerprint()`: Canvas/WebGL/audio/mouse/font fingerprints.
Backend `BrowserFingerprintService.analyze()`: SwiftShader/llvmpipe/ANGLE renderer +30, no audio +20, all fonts identical +25, no mouse movement +20, 4 cores with 0 reported memory +15.

## Phase 6: AI Agent Behavior Detection

`AiBotDetectionService`: CV < 0.2 adds 45, purely read-only GETs +25, mean interval < 300 ms +30, UA/language mismatch +15; a total ≥ 60 classifies the client as an AI bot.

## Phase 7: Proof of Work (PoW)

Frontend `solvePoW()`: SHA-256 prefix matching; difficulty=4 takes roughly 10,000 attempts.
Backend `PoWService`: `generate()` issues a nonce+id; `verify()` checks the solution and prevents replay.
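The PoW round trip can be sketched as follows (a minimal illustration using Node's `crypto`; function names mirror the ones above but the bodies are assumptions, and real servers must additionally mark each challenge id as used to prevent replay, e.g. with Redis `SETNX`). Verification is cheap for the server — one hash — while solving requires a brute-force search.

```typescript
import { createHash, randomBytes } from 'node:crypto';

// Server side: issue a random challenge with an id for replay tracking.
function generateChallenge(): { id: string; challenge: string } {
  return {
    id: randomBytes(8).toString('hex'),
    challenge: randomBytes(16).toString('hex'),
  };
}

// Client side: find a nonce such that SHA-256(challenge + nonce)
// starts with `difficulty` hex zeros.
function solvePoW(challenge: string, difficulty: number): number {
  const prefix = '0'.repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = createHash('sha256').update(challenge + nonce).digest('hex');
    if (hash.startsWith(prefix)) return nonce;
  }
}

// Server side: a single hash verifies the claimed solution.
function verifyPoW(challenge: string, nonce: number, difficulty: number): boolean {
  const hash = createHash('sha256').update(challenge + nonce).digest('hex');
  return hash.startsWith('0'.repeat(difficulty));
}
```

In the browser, `solvePoW` would use `crypto.subtle.digest('SHA-256', …)` instead; the search is what keeps the challenge cheap for one real user but expensive at scraper volume.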

## Phase 8: Honeypots

Frontend: a CSS-hidden input with a randomized field name; on submit, a filled honeypot field triggers `reportBot`.
Backend: honeypot route `/api/internal/user-export-all`; any access bans the IP for 1 hour.


## Phase 9: Composite Scoring and Differentiated Response

`AntiScrapingMiddleware`: weighted total = headerScore*0.2 + ipRisk*0.15 + rateScore*0.25 + breadthScore*0.15 + patternScore*0.15 + subnetScore*0.05 + fpScore*0.05

- total ≥ 90: ban the IP for 2 hours
- total ≥ 70: require a PoW challenge
- total ≥ 50: delay the response by 0.5–2 seconds
- total ≥ 30: return empty data on sensitive endpoints

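The aggregation and threshold ladder above can be sketched like this (a hypothetical TypeScript rendering; the real middleware is `AntiScrapingMiddleware` in PHP). The weights and cutoffs are exactly the ones stated above.

```typescript
// Hypothetical sketch of the Phase 9 weighted aggregation.
interface PhaseScores {
  headerScore: number;  // Phase 2
  ipRisk: number;       // Phase 4
  rateScore: number;    // Phase 3
  breadthScore: number; // Phase 4 crawl breadth
  patternScore: number; // Phase 3/6 interval analysis
  subnetScore: number;  // Phase 3 subnet correlation
  fpScore: number;      // Phase 5 browser fingerprint
}

type Action = 'ban_2h' | 'pow_challenge' | 'delay' | 'empty_data' | 'allow';

function totalScore(s: PhaseScores): number {
  return s.headerScore * 0.20 + s.ipRisk * 0.15 + s.rateScore * 0.25 +
         s.breadthScore * 0.15 + s.patternScore * 0.15 +
         s.subnetScore * 0.05 + s.fpScore * 0.05;
}

function respond(s: PhaseScores): Action {
  const total = totalScore(s);
  if (total >= 90) return 'ban_2h';        // ban the IP for 2 hours
  if (total >= 70) return 'pow_challenge'; // require proof of work
  if (total >= 50) return 'delay';         // delay the response 0.5-2 s
  if (total >= 30) return 'empty_data';    // empty data on sensitive endpoints
  return 'allow';
}
```

The graduated ladder avoids hard-blocking borderline traffic: a false positive gets a solvable PoW or a delay rather than a ban.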

## Monitoring and Redis Cleanup

| Metric | Alert threshold |
|--------|-----------------|
| 403 response rate | > 3% |
| 429 response rate | > 5% |
| PoW challenges issued | > 100/hour |
| Honeypot hits | > 50/day |
| IP bans | > 200/day |

Redis TTLs: the rl/req_ts/req_log/crawl_breadth/subnet_req/pow keys all carry TTLs; ip:blocklist is periodically purged of expired bans.