Commit: 初始化 (Initialization)

.cursor/skills/anti-scraping/SKILL.md
---
name: anti-scraping
version: 4.0.0
description: "Anti-scraping protection for PHP Hyperf + Vue 3 applications. Use when defending against bots, rate limiting, or protecting APIs. Covers five threat tiers."
---
# 🛡️ Anti-Scraping Protection — Full-Stack Defense System

## Threat Tiers and Countermeasures

| Tier | Scraper type | Typical tools | Key signatures | Defense strategy |
|------|--------------|---------------|----------------|------------------|
| **T1** | Simple scripts | curl, wget, Python requests | No JS execution, missing required headers | UA filtering + header checks |
| **T2** | Scraping frameworks | Scrapy, Playwright (unconfigured), httpx | Handle cookies automatically but behave mechanically | Header fingerprinting + rate limiting |
| **T3** | Headless browsers | Puppeteer, Playwright (configured) | Execute JS but show Canvas/WebGL anomalies | Browser fingerprinting + behavior analysis |
| **T4** | Distributed clusters | Self-hosted clusters + proxy pools + UA rotation | Coordinate across IPs, few requests per IP | Correlation analysis + honeypots + PoW |
| **T5** | AI agents | LLM-driven browsers / GPT plugins | Close to real users, but with highly regular behavior | Multi-dimensional fingerprinting + CV analysis + challenges |

---
## Trigger Conditions

The user asks about anti-scraping, rate limiting, bot protection, crawler detection, AI scrapers, Scrapy, or related keywords.

## Execution Flow

### Phase 0: Threat Assessment

Before implementing anything, answer the following questions to determine the protection tier:

| Question | How the answer matters |
|----------|------------------------|
| What is being protected? (API / page content / data assets) | Determines where protection lives (Nginx / Middleware / Frontend) |
| Acceptable false-positive rate? (0.1% / 0.5% / 1%) | Determines how strict the thresholds should be |
| Is a CDN/WAF already in place? | WAF capabilities can be reused, reducing custom work |
| Does the business allow CAPTCHAs? | Affects the CAPTCHA fallback strategy |
| Protecting logged-in content or public content? | Determines whether a frontend fingerprint token is used |

### Phase 1–4: Perimeter and Baseline Protection

1. **Phase 1 Nginx** — UA blocklist map, limit_req/limit_conn; reject known bots outright
2. **Phase 2 Fingerprinting** — RequestFingerprintMiddleware checks UA/Header/Accept/Referer and outputs a risk_score
3. **Phase 3 Rate limiting** — RateLimitService tiered rate limits + coefficient-of-variation (CV) analysis of request intervals + subnet correlation analysis
4. **Phase 4 IP** — IpIntelligenceService block/allow lists, Tor, datacenter IPs, crawl breadth (HyperLogLog)

### Phase 5–6: Advanced Detection

5. **Phase 5 Browser fingerprinting** — frontend collects Canvas/WebGL/audio/fonts/mouse; backend checks for headless signatures
6. **Phase 6 AI agents** — request-interval CV, read-only mode, speed, UA/language mismatch

### Phase 7–9: Countermeasures and Response

7. **Phase 7 PoW** — proof-of-work challenge; real users' JS solves it automatically
8. **Phase 8 Honeypots** — hidden frontend fields + backend honeypot routes
9. **Phase 9 Aggregation** — AntiScrapingMiddleware weighted scoring with differentiated responses (block / PoW / delay / fake data)

## Verification Checklist

1. [ ] `curl` requests trigger a 403 or 429 within 3 attempts
2. [ ] Python `requests` with its default UA is rejected by Nginx outright
3. [ ] Scrapy crawling 100+ pages gets its IP banned
4. [ ] Puppeteer (without anti-fingerprinting) scores ≥ 60 and triggers a challenge
5. [ ] Uniformly spaced requests (CV < 0.2) are flagged by the AI behavior analysis
6. [ ] Accessing a honeypot route gets the IP banned
7. [ ] The PoW challenge is solved correctly by the frontend (difficulty=4, within ~2 seconds)
8. [ ] 500+ requests from the same /24 subnet trigger the proxy-pool flag
9. [ ] False-positive rate for normal users < 0.2%

## Tier 3 Deep References

| File | Contents |
|------|----------|
| `references/implementation-phases.md` | Full Phase 1–9 implementation code and Redis monitoring |
| `references/anti-scraping-patterns.md` | Quick reference for anti-scraping patterns and strategies |

---

# Anti-Scraping Reference Pattern Library

## 1. Common Scraper Fingerprints

### User-Agent Blocklist (regex)

```typescript
const BOT_UA_PATTERNS = [
  // Scraping frameworks
  /python-requests/i,
  /scrapy/i,
  /beautifulsoup/i,
  /selenium/i,
  /playwright/i,
  /puppeteer/i,
  /mechanize/i,
  /httpclient/i,
  /java\/\d/i,
  /go-http-client/i,
  /ruby/i,

  // Command-line tools
  /curl\//i,
  /wget\//i,
  /httpie/i,
  /insomnia/i,

  // Headless-browser signatures
  /headlesschrome/i,
  /phantomjs/i,
  /slimerjs/i,

  // Known data harvesters
  /dataprovider/i,
  /yandexbot/i,
  /mj12bot/i,
  /ahrefsbot/i,
  /semrushbot/i,
  /dotbot/i,
];

export function isBotUA(ua: string): boolean {
  return BOT_UA_PATTERNS.some(p => p.test(ua));
}
```

### Allowlisted Search-Engine Crawlers

```typescript
import { promises as dns } from 'node:dns';

// Legitimate crawlers: verify authenticity via reverse DNS lookup
const ALLOWED_BOTS = [
  { name: 'Googlebot', ua: /googlebot/i, rdns: 'googlebot.com' },
  { name: 'Bingbot', ua: /bingbot/i, rdns: 'search.msn.com' },
  { name: 'Baidu Spider', ua: /baiduspider/i, rdns: 'crawl.baidu.com' },
];

async function reverseDNS(ip: string): Promise<string | undefined> {
  try {
    const hostnames = await dns.reverse(ip);
    return hostnames[0];
  } catch {
    return undefined;
  }
}

async function isLegitimateBot(ua: string, ip: string): Promise<boolean> {
  const bot = ALLOWED_BOTS.find(b => b.ua.test(ua));
  if (!bot) return false;

  // Reverse DNS verification (prevents forged UAs).
  // Match on a dot boundary so e.g. "evilgooglebot.com" does not pass.
  const hostname = await reverseDNS(ip);
  if (!hostname) return false;
  return hostname === bot.rdns || hostname.endsWith('.' + bot.rdns);
}
```

---

## 2. Risk Scoring Algorithm

### Composite Scoring Model

```typescript
interface RiskFactors {
  fingerprintScore: number; // 0-100
  rateScore: number;        // 0-100 (rises when rate limits are exceeded)
  ipScore: number;          // 0-100 (datacenter/Tor/VPN)
  behaviorScore: number;    // 0-100 (behavioral anomalies)
}

function calculateRiskScore(factors: RiskFactors): number {
  const weights = {
    fingerprint: 0.35,
    rate: 0.30,
    ip: 0.25,
    behavior: 0.10,
  };

  return Math.min(
    Math.round(
      factors.fingerprintScore * weights.fingerprint +
      factors.rateScore * weights.rate +
      factors.ipScore * weights.ip +
      factors.behaviorScore * weights.behavior
    ),
    100
  );
}

// Response strategy
function getResponseStrategy(score: number): 'allow' | 'slowdown' | 'challenge' | 'block' {
  if (score >= 80) return 'block';
  if (score >= 50) return 'challenge';
  if (score >= 30) return 'slowdown';
  return 'allow';
}
```

---

## 3. Redis Data Structure Design

```
# Rate limiting (sliding window)
ZSET rl:{ip} → { timestamp: score }
ZSET rl:{ip}:{endpoint} → { timestamp: score }

# IP block/allow lists
SET ip:blocklist → { ip1, ip2, ... }
SET ip:allowlist → { ip1, ip2, ... }  (legitimate-crawler allowlist)
SET ip:tor-exit → { ip1, ip2, ... }
SET ip:datacenter → { ip1, ip2, ... }

# Honeypot hit records
HASH honeypot:hits → { ip: count }

# CAPTCHA pass records (avoid re-challenging)
STRING captcha:passed:{ip} → "1"  (TTL 1 hour)

# Behavior profile
HASH behavior:{ip} → {
  first_seen: timestamp,
  request_count: number,
  path_entropy: number,          # path diversity (low = scraper)
  referer_missing_ratio: number, # share of requests without a Referer (high = scraper)
}
```
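The sliding-window scheme behind `ZSET rl:{ip}` can be sketched as an in-memory stand-in. This is a hypothetical illustration (the `SlidingWindowLimiter` name is not from the source); in production the same three steps map onto Redis `ZREMRANGEBYSCORE`, `ZCARD`, and `ZADD` so the window is shared across app instances.

```typescript
// In-memory stand-in for the Redis ZSET sliding window described above.
// Each IP keeps a list of hit timestamps; a hit is allowed only while the
// count of timestamps inside the window stays below the limit.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private windowMs: number, private limit: number) {}

  check(ip: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps outside the window (≈ ZREMRANGEBYSCORE rl:{ip} 0 cutoff)
    const recent = (this.hits.get(ip) ?? []).filter(t => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(ip, recent);
      return false; // over the limit (≈ ZCARD ≥ limit)
    }
    recent.push(now); // record this hit (≈ ZADD rl:{ip} now now)
    this.hits.set(ip, recent);
    return true;
  }
}
```

Unlike a fixed-window counter, this never allows a 2× burst across a window boundary, which matters for the tight per-endpoint limits above.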

---

## 4. Nginx-Layer Protection (optional, best performance)

```nginx
# /etc/nginx/conf.d/anti-scraping.conf

# Rate-limit zone definitions
limit_req_zone $binary_remote_addr zone=api:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_conn_zone $binary_remote_addr zone=conn:10m;

server {
    # Connection limit (max 20 concurrent connections per IP)
    limit_conn conn 20;

    # UA blocklist
    if ($http_user_agent ~* "(python|curl|wget|scrapy|selenium)") {
        return 403;
    }

    # Reject empty UA
    if ($http_user_agent = "") {
        return 403;
    }

    location /api/ {
        limit_req zone=api burst=10 nodelay;
        limit_req_status 429;

        proxy_pass http://app;
    }

    location /api/auth/ {
        limit_req zone=login burst=2 nodelay;
        limit_req_status 429;

        proxy_pass http://app;
    }

    # Honeypot route (real users never visit it)
    location /admin-backup/ {
        access_log /var/log/nginx/honeypot.log;
        # Log the visitor's IP and return fake data
        return 200 '{"status":"ok"}';
        add_header Content-Type application/json;
    }
}
```

---

## 5. Cloudflare Workers Variant (edge layer, recommended first)

```typescript
// workers/anti-scraping.ts
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ip = request.headers.get('CF-Connecting-IP') ?? '';
    const ua = request.headers.get('User-Agent') ?? '';

    // Use Cloudflare's threat score (exposed on request.cf, not as a header)
    const cf: any = (request as any).cf ?? {};
    const cfThreatScore = Number(cf.threatScore ?? 0);
    if (cfThreatScore > 30) {
      return new Response('Forbidden', { status: 403 });
    }

    // Use Cloudflare's bot score (requires Bot Management to be enabled)
    const cfBotScore = Number(cf.botManagement?.score ?? 100);
    if (cfBotScore < 30) {
      // Low score = high likelihood of a bot
      return new Response('Forbidden', { status: 403 });
    }

    // Custom rate limiting (KV shown here; use Durable Objects for strict accuracy)
    const rateLimitKey = `rl:${ip}`;
    const count = Number(await env.RATE_LIMIT.get(rateLimitKey) ?? 0);
    if (count > 60) {
      return new Response('Too Many Requests', {
        status: 429,
        headers: { 'Retry-After': '60' },
      });
    }

    await env.RATE_LIMIT.put(rateLimitKey, String(count + 1), { expirationTtl: 60 });
    return fetch(request);
  },
};
```

---

## 6. Monitoring Dashboard (Datadog / Grafana metrics)

```typescript
// lib/metrics.ts — key instrumentation points
// (assumes a StatsD-style `metrics` client and a `logger` are in scope)
export const antiScrapingMetrics = {
  // Request blocked
  blocked: (reason: 'fingerprint' | 'rate' | 'honeypot' | 'ip' | 'captcha') => {
    metrics.increment('anti_scraping.blocked', { reason });
  },

  // Risk-score distribution
  scoreDistribution: (score: number) => {
    metrics.histogram('anti_scraping.risk_score', score);
  },

  // CAPTCHA impressions and outcomes
  captchaImpressed: () => metrics.increment('anti_scraping.captcha.impressed'),
  captchaPassed: () => metrics.increment('anti_scraping.captcha.passed'),
  captchaFailed: () => metrics.increment('anti_scraping.captcha.failed'),

  // False-positive monitoring
  falsePositive: (userId: string) => {
    metrics.increment('anti_scraping.false_positive');
    logger.warn({ userId }, 'Possible false positive in anti-scraping');
  },
};
```

---

## 7. Test Cases

```typescript
// __tests__/anti-scraping.test.ts
describe('Anti-Scraping', () => {
  describe('Fingerprint Analysis', () => {
    it('should flag Python requests as high risk', () => {
      const score = analyzeFingerprint(mockRequest({
        'user-agent': 'python-requests/2.28.0',
      }));
      expect(score).toBeGreaterThanOrEqual(50);
    });

    it('should not flag a normal Chrome browser', () => {
      const score = analyzeFingerprint(mockRequest({
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'accept-language': 'zh-CN,zh;q=0.9',
        'accept-encoding': 'gzip, deflate, br',
        'accept': 'text/html,application/xhtml+xml',
      }));
      expect(score).toBeLessThan(20);
    });
  });

  describe('Rate Limiting', () => {
    it('should block after exceeding the limit', async () => {
      const ip = '192.168.1.100';
      // Send 61 requests
      for (let i = 0; i < 61; i++) {
        await checkRateLimit(ip, { windowMs: 60_000, limit: 60 });
      }
      const { allowed } = await checkRateLimit(ip, { windowMs: 60_000, limit: 60 });
      expect(allowed).toBe(false);
    });
  });

  describe('Honeypot', () => {
    it('should block an IP that triggers the honeypot', async () => {
      const ip = '10.0.0.1';
      await triggerHoneypot(ip);
      const isBlocked = await redis.sismember('ip:blocklist', ip);
      expect(isBlocked).toBe(1);
    });
  });
});
```

---

.cursor/skills/anti-scraping/references/implementation-phases.md

# Anti-Scraping — Full Implementation Code

> The main flow and decision logic live in SKILL.md; this document covers the deep implementation details for Phases 1–9.

## Phase 1: Nginx Perimeter Blocking

```nginx
# /etc/nginx/conf.d/anti-scraping.conf
map $http_user_agent $is_bot {
    default 0;
    ~*python-requests|python-urllib|httpx|aiohttp|scrapy|mechanize 1;
    ~*curl|wget|libwww-perl|Go-http-client|Java/|okhttp|axios 1;
    ~*HeadlessChrome|headless|PhantomJS|Playwright|Puppeteer|Selenium|webdriver 1;
    ~*GPTBot|ChatGPT-User|Claude-Web|PerplexityBot|anthropic-ai|CCBot|DataForSeoBot|SemrushBot 1;
    ~*(Googlebot|Bingbot)/ 0;
    "" 1;
}

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=page_limit:10m rate=60r/m;
limit_req_zone $binary_remote_addr zone=login_limit:10m rate=5r/m;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

server {
    if ($is_bot) { return 403; }
    if ($http_user_agent = "") { return 444; }
    location /api/ {
        limit_req zone=api_limit burst=10 nodelay;
        limit_conn conn_limit 20;
        proxy_pass http://hyperf_upstream;
    }
    location /api/auth/ {
        limit_req zone=login_limit burst=2 nodelay;
        limit_req_status 429;
        proxy_pass http://hyperf_upstream;
    }
}
```

## Phase 2: HTTP Request Fingerprinting

See `RequestFingerprintMiddleware.php`:

- `BROWSER_REQUIRED_HEADERS`: accept, accept-language, accept-encoding
- `BROWSER_SECURITY_HEADERS`: sec-fetch-site, sec-fetch-mode, sec-fetch-dest, sec-ch-ua
- `SUSPICIOUS_ACCEPT_PATTERNS`: */*, application/json, text/html,*/*;q=0.9
- Scoring: missing UA +80, scraper tool +70, headless signature +60, each missing browser header +15, no sec-fetch headers +25, suspicious Accept +20, API call without a Referer +15

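The scoring table above can be sketched in TypeScript as follows. This is a hypothetical illustration of the rules (the real logic lives in `RequestFingerprintMiddleware.php`, and `scoreHeaders` is not a name from the source); only the `*/*` case of the suspicious-Accept patterns is shown.

```typescript
// Hypothetical sketch of the Phase 2 scoring rules.
const BROWSER_REQUIRED_HEADERS = ['accept', 'accept-language', 'accept-encoding'];
const SEC_FETCH_HEADERS = ['sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest'];
const SCRAPER_UA = /python-requests|scrapy|curl|wget|httpx|aiohttp/i;
const HEADLESS_UA = /headlesschrome|phantomjs|playwright|puppeteer/i;

function scoreHeaders(headers: Record<string, string>, isApiPath: boolean): number {
  let score = 0;
  const ua = headers['user-agent'] ?? '';

  if (!ua) score += 80;                       // missing UA
  else if (SCRAPER_UA.test(ua)) score += 70;  // scraper tool
  else if (HEADLESS_UA.test(ua)) score += 60; // headless signature

  for (const h of BROWSER_REQUIRED_HEADERS) {
    if (!headers[h]) score += 15;             // each missing browser header
  }
  if (!SEC_FETCH_HEADERS.some(h => headers[h])) score += 25; // no sec-fetch at all
  if (headers['accept'] === '*/*') score += 20;              // suspicious Accept
  if (isApiPath && !headers['referer']) score += 15;         // API call, no Referer

  return Math.min(score, 100); // clamp to the 0-100 risk_score range
}
```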

## Phase 3: Rate Limiting and Behavior Analysis

`RateLimitService`:

- `IP_RULES`: tiered limits for global/api/search/export/login/register
- `analyzeRequestPattern()`: coefficient of variation (CV) of request intervals; CV < 0.3 or a mean interval < 200 ms is flagged as anomalous
- `analyzeSubnetPattern()`: > 500 requests from a /24 subnet within 1 minute is flagged as a proxy pool
- Redis keys: `rl:{rule}:{ip}` (ZSET), `req_ts:{session}` (LIST)

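The CV check in `analyzeRequestPattern()` can be sketched like this (a TypeScript illustration; the actual service is PHP, and these function names are hypothetical). The intuition: human traffic is bursty, so the standard deviation of intervals is large relative to the mean (high CV), while scripted loops fire metronomically (CV near 0).

```typescript
// Coefficient of variation (CV = stddev / mean) of request intervals.
function intervalCV(timestampsMs: number[]): number {
  const intervals: number[] = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    intervals.push(timestampsMs[i] - timestampsMs[i - 1]);
  }
  if (intervals.length < 2) return Infinity; // not enough data to judge
  const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
  if (mean === 0) return 0;
  const variance = intervals.reduce((a, b) => a + (b - mean) ** 2, 0) / intervals.length;
  return Math.sqrt(variance) / mean;
}

function meanIntervalMs(timestampsMs: number[]): number {
  if (timestampsMs.length < 2) return Infinity;
  return (timestampsMs[timestampsMs.length - 1] - timestampsMs[0]) / (timestampsMs.length - 1);
}

// Apply the rules above: CV < 0.3 or mean interval < 200 ms = anomalous
function isAnomalousPattern(timestampsMs: number[]): boolean {
  const cv = intervalCV(timestampsMs);
  if (!isFinite(cv)) return false; // too few samples, don't flag
  return cv < 0.3 || meanIntervalMs(timestampsMs) < 200;
}
```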

## Phase 4: IP Reputation and Proxy Detection

`IpIntelligenceService`:

- `DATACENTER_PREFIXES`: AWS/GCP/Azure/Cloudflare/DigitalOcean/Vultr, etc.
- `classify()`: blocklist → whitelist → tor → datacenter → residential
- `trackCrawlBreadth()`: HyperLogLog; > 200 unique paths within 1 hour is flagged as a crawler

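The crawl-breadth idea can be sketched with an exact-count stand-in (a hypothetical `CrawlBreadthTracker`, not from the source). Production uses Redis HyperLogLog (`PFADD crawl_breadth:{ip} path` + `PFCOUNT` with an hourly TTL), which approximates unique-path cardinality in constant memory instead of storing every path.

```typescript
// Exact-count stand-in for trackCrawlBreadth(); illustrates the rule only.
const BREADTH_THRESHOLD = 200; // unique paths per hour, per the rule above

class CrawlBreadthTracker {
  private paths = new Map<string, Set<string>>();

  // Returns true when the IP should be flagged as a crawler.
  track(ip: string, path: string): boolean {
    let set = this.paths.get(ip);
    if (!set) {
      set = new Set();
      this.paths.set(ip, set);
    }
    set.add(path);
    return set.size > BREADTH_THRESHOLD;
  }
}
```

A Set is fine for a sketch, but at scale one Set per IP is exactly the memory blow-up HyperLogLog avoids.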

## Phase 5: Browser Fingerprinting

Frontend `collectFingerprint()`: Canvas/WebGL/audio/mouse/font fingerprints.
Backend `BrowserFingerprintService.analyze()`: SwiftShader/llvmpipe/ANGLE renderer +30, no audio +20, all fonts identical +25, no mouse movement +20, 4 cores with 0 reported memory +15.

## Phase 6: AI Agent Behavior Detection

`AiBotDetectionService`: CV < 0.2 adds 45, purely read-only GETs +25, mean interval < 300 ms +30, UA/language mismatch +15; a total ≥ 60 classifies the client as an AI bot.

## Phase 7: Proof of Work (PoW)

Frontend `solvePoW()`: SHA-256 prefix matching; difficulty=4 takes roughly 10,000 attempts.
Backend `PoWService`: `generate()` issues a nonce+id; `verify()` checks the solution and prevents replay.
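The PoW round trip can be sketched as follows (a minimal illustration using Node's `crypto`; function names mirror the ones above but the bodies are assumptions, and real servers must additionally mark each challenge id as used to prevent replay, e.g. with Redis `SETNX`). Verification is cheap for the server — one hash — while solving requires a brute-force search.

```typescript
import { createHash, randomBytes } from 'node:crypto';

// Server side: issue a random challenge with an id for replay tracking.
function generateChallenge(): { id: string; challenge: string } {
  return {
    id: randomBytes(8).toString('hex'),
    challenge: randomBytes(16).toString('hex'),
  };
}

// Client side: find a nonce such that SHA-256(challenge + nonce)
// starts with `difficulty` hex zeros.
function solvePoW(challenge: string, difficulty: number): number {
  const prefix = '0'.repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = createHash('sha256').update(challenge + nonce).digest('hex');
    if (hash.startsWith(prefix)) return nonce;
  }
}

// Server side: a single hash verifies the claimed solution.
function verifyPoW(challenge: string, nonce: number, difficulty: number): boolean {
  const hash = createHash('sha256').update(challenge + nonce).digest('hex');
  return hash.startsWith('0'.repeat(difficulty));
}
```

In the browser, `solvePoW` would use `crypto.subtle.digest('SHA-256', …)` instead; the search is what keeps the challenge cheap for one real user but expensive at scraper volume.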

## Phase 8: Honeypots

Frontend: a CSS-hidden input with a randomized field name; on submit, a filled honeypot field triggers `reportBot`.
Backend: honeypot route `/api/internal/user-export-all`; any access bans the IP for 1 hour.


## Phase 9: Composite Scoring and Differentiated Response

`AntiScrapingMiddleware`: weighted total = headerScore*0.2 + ipRisk*0.15 + rateScore*0.25 + breadthScore*0.15 + patternScore*0.15 + subnetScore*0.05 + fpScore*0.05

- total ≥ 90: ban the IP for 2 hours
- total ≥ 70: require a PoW challenge
- total ≥ 50: delay the response by 0.5–2 seconds
- total ≥ 30: return empty data on sensitive endpoints

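The aggregation and threshold ladder above can be sketched like this (a hypothetical TypeScript rendering; the real middleware is `AntiScrapingMiddleware` in PHP). The weights and cutoffs are exactly the ones stated above.

```typescript
// Hypothetical sketch of the Phase 9 weighted aggregation.
interface PhaseScores {
  headerScore: number;  // Phase 2
  ipRisk: number;       // Phase 4
  rateScore: number;    // Phase 3
  breadthScore: number; // Phase 4 crawl breadth
  patternScore: number; // Phase 3/6 interval analysis
  subnetScore: number;  // Phase 3 subnet correlation
  fpScore: number;      // Phase 5 browser fingerprint
}

type Action = 'ban_2h' | 'pow_challenge' | 'delay' | 'empty_data' | 'allow';

function totalScore(s: PhaseScores): number {
  return s.headerScore * 0.20 + s.ipRisk * 0.15 + s.rateScore * 0.25 +
         s.breadthScore * 0.15 + s.patternScore * 0.15 +
         s.subnetScore * 0.05 + s.fpScore * 0.05;
}

function respond(s: PhaseScores): Action {
  const total = totalScore(s);
  if (total >= 90) return 'ban_2h';        // ban the IP for 2 hours
  if (total >= 70) return 'pow_challenge'; // require proof of work
  if (total >= 50) return 'delay';         // delay the response 0.5-2 s
  if (total >= 30) return 'empty_data';    // empty data on sensitive endpoints
  return 'allow';
}
```

The graduated ladder avoids hard-blocking borderline traffic: a false positive gets a solvable PoW or a delay rather than a ban.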

## Monitoring and Redis Cleanup

| Metric | Alert threshold |
|--------|-----------------|
| 403 response rate | > 3% |
| 429 response rate | > 5% |
| PoW challenges issued | > 100/hour |
| Honeypot hits | > 50/day |
| IP bans | > 200/day |

Redis TTLs: the rl/req_ts/req_log/crawl_breadth/subnet_req/pow keys all carry TTLs; ip:blocklist is periodically purged of expired bans.