# Cloudflare AI Crawl Control Best Practices

> Compiled by the CiphLens team · Source: GitHub Copilot research · Originally written in Traditional Chinese

---

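# A Complete Guide to Cloudflare AI Tools: From Crawler Management to Edge AI Infrastructure

## Introduction

As large language models (LLMs) have taken off, AI crawlers fetching site content have become a core infrastructure concern. Between late 2023 and 2025 Cloudflare shipped a series of AI-related tools (AI Audit, AI Crawl Control, Workers AI, AI Gateway, and Vectorize) that together form an edge infrastructure stack for the AI era. This guide covers each tool in turn and focuses on one requirement: **keep AI crawlers' normal access intact while building precise observation and data collection around it**.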

---

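## 1. AI Crawl Control: Telling Verified Bots from Scrapers

### 1.1 Background and Evolution

Cloudflare launched **AI Audit** in 2024 to give site operators visibility into how AI crawlers access their sites. In August 2025 the feature was renamed **AI Crawl Control** and reached GA (General Availability), growing from an audit tool into a platform for enforcing access policy.

The core value of AI Crawl Control is that it separates validated, legitimate AI crawlers (Verified Bots) from spoofed or undeclared fetchers (scrapers), and lets each class be handled with its own policy.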

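### 1.2 How Verified Bots Are Validated

Cloudflare's Verified Bot determination combines several layers of technical validation:

**Reverse DNS lookup**  
When a request claims to be GPTBot, the source IP's PTR record can be resolved and checked against the operator's domain, for example `*.openai.com`. Likewise, a ClaudeBot IP would be expected to resolve under `*.anthropic.com` or a `claude.ai` domain. The check is only meaningful when the returned hostname also resolves forward to the same IP, and not every operator publishes PTR records, so a published IP range list is the usual fallback.

**ASN (Autonomous System Number) check**  
Each operator's crawler traffic tends to originate from a small set of ASNs; the source cites `AS399358` for Anthropic. Cloudflare maintains a continuously updated list of known AI service networks, and WAF rules can compare the source ASN through the `ip.src.asnum` field (the older name `ip.geoip.asnum` is deprecated).

**User-Agent string matching**  
Mainstream AI crawlers declare their identity in the request headers:
- OpenAI: `GPTBot`, `ChatGPT-User`, `OAI-SearchBot`
- Anthropic: `ClaudeBot`, `Claude-User`, `Claude-SearchBot`
- Google: `Google-Extended` (a `robots.txt` control token; the fetches themselves arrive with Googlebot user agents)
- Meta: `FacebookBot`

The community-maintained open-source list [rezmoss/cloud-provider-ip-addresses](https://github.com/rezmoss/cloud-provider-ip-addresses) is updated daily with IP ranges for the major AI vendors and is a common reference in practice.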
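
The reverse DNS step above is easy to get wrong with a plain substring match. The following is a minimal sketch of the hostname comparison only, assuming the PTR lookup has already been done elsewhere; `EXPECTED_PTR_SUFFIXES` and `ptrMatchesOperator` are hypothetical names, and the suffix list simply mirrors the domains mentioned in this section rather than any authoritative source.

```javascript
// Hypothetical table: expected PTR domains per declared bot name.
// The values mirror the text above and are assumptions, not an official list.
const EXPECTED_PTR_SUFFIXES = new Map([
  ['GPTBot', ['openai.com']],
  ['ClaudeBot', ['anthropic.com', 'claude.ai']],
]);

// Returns true when the PTR hostname sits under one of the expected domains.
function ptrMatchesOperator(botName, ptrHostname) {
  const suffixes = EXPECTED_PTR_SUFFIXES.get(botName);
  if (!suffixes || !ptrHostname) return false;
  // Normalise: lower-case and drop the trailing dot that DNS answers may carry.
  const host = ptrHostname.toLowerCase().replace(/\.$/, '');
  // Match on a label boundary so "evil-openai.com" does not pass as "openai.com".
  return suffixes.some((s) => host === s || host.endsWith('.' + s));
}
```

A full check would also resolve the hostname forward and confirm that it maps back to the original IP; that step needs a DNS client and is left out here.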

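### 1.3 Three Access Policy Modes

AI Crawl Control is described here in terms of three core modes, each matching a different management need (confirm the exact option names in your dashboard):

| Mode | Description | Typical use |
|------|------|---------|
| **Block** | Deny AI crawler access, optionally returning HTTP 402 Payment Required | Copyright-sensitive content, paywalls |
| **Allow** | Permit access with no extra handling | Open content you want indexed |
| **Observe (Log Only)** | Let requests through while logging every one | **Keeping access while observing precisely** |

For sites that want to keep AI crawlers' access but still watch what they do, **Observe mode** is the most direct starting point.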
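
The three modes can be pictured as a small decision function. This is a sketch under stated assumptions: `DEFAULT_POLICY` and `decideAction` are hypothetical names used for illustration, and the function models the idea of per-crawler policy rather than any Cloudflare setting or API.

```javascript
// Hypothetical policy: log everything, block nothing. Not a Cloudflare setting.
const DEFAULT_POLICY = {
  verified: 'log',        // action for crawlers Cloudflare has verified
  unverified: 'log',      // action for crawlers that only claim an identity
  blockedBots: new Set(), // names that should always be blocked
};

// crawler: { name: string, verified: boolean } -> 'allow' | 'log' | 'block'
function decideAction(crawler, policy = DEFAULT_POLICY) {
  if (policy.blockedBots.has(crawler.name)) return 'block';
  return crawler.verified ? policy.verified : policy.unverified;
}
```

Starting with `log` for both classes matches the observe-first approach; tightening `unverified` to `block` later is then a one-line change.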

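### 1.4 Enforcing robots.txt (Robotcop)

In December 2024 Cloudflare introduced **Robotcop**, which detects AI crawler requests that violate a site's `robots.txt` declarations and can block them at the network layer. This addresses the basic weakness of `robots.txt`, which relies on crawlers policing themselves: enforcement happens at Cloudflare's edge, whether or not the crawler honors the protocol.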
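
To make "violating `robots.txt`" concrete, here is a simplified sketch of a rule check. It is not how Robotcop is implemented (the source does not describe that); `parseRobots` and `isPathAllowed` are hypothetical helpers that handle only plain path prefixes, with no wildcard or `$` support, and use the first group that names the bot.

```javascript
// Parse robots.txt into groups: { agents: string[], rules: [{ allow, path }] }.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      const isRule = field === 'allow' || field === 'disallow';
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if (isRule && current && value) {
        current.rules.push({ allow: field === 'allow', path: value });
      }
    }
  }
  return groups;
}

// True when the bot may fetch the path under the simplified rules above.
function isPathAllowed(robotsText, botToken, path) {
  const groups = parseRobots(robotsText);
  const token = botToken.toLowerCase();
  // Prefer a group naming the bot; otherwise fall back to the "*" group.
  const group =
    groups.find((g) => g.agents.includes(token)) ||
    groups.find((g) => g.agents.includes('*'));
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule; // longest match wins; Allow wins ties
  }
  return best ? best.allow : true;
}
```

A logged request for which `isPathAllowed` returns `false` is a candidate violation worth recording alongside the crawler's identity.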

---

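## 2. Bot Management: How GPTBot and ClaudeBot Are Classified

### 2.1 The Cloudflare Bot Score

Cloudflare Bot Management computes a **Bot Score (1-99)** for each incoming request. Lower values indicate likely automation and higher values indicate likely human traffic; a score of 0 means no score was computed. AI crawlers are usually classified as automated traffic because of the following traits:

- **No JavaScript execution**: crawlers usually do not run JS, so they cannot complete Cloudflare's JavaScript challenges
- **HTTP header anomalies**: missing `Accept-Language` or `Cookie` headers, or a non-standard TLS fingerprint
- **Regular request intervals**: bots tend to request at fixed intervals rather than with human randomness
- **Source IPs concentrated in cloud providers**: addresses in AWS, GCP, and Azure ranges receive stricter scrutiny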
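
When scores are stored for later analysis it helps to bucket them. The sketch below assumes the band boundaries from Cloudflare's Bot Management documentation (1 automated, 2 to 29 likely automated, 30 to 99 likely human, 0 not computed); confirm them there before relying on the labels. `bucketBotScore` is a hypothetical helper name.

```javascript
// Map a numeric bot score to a coarse label for reporting.
// Band boundaries are assumed from Cloudflare's documentation.
function bucketBotScore(score) {
  if (!Number.isInteger(score) || score < 0 || score > 99) return 'invalid';
  if (score === 0) return 'not computed';
  if (score === 1) return 'automated';
  if (score <= 29) return 'likely automated';
  return 'likely human';
}
```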

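### 2.2 Precise Matching in WAF Rules

In the Cloudflare WAF, several conditions can be combined to identify a specific AI crawler. The expression below matches verified GPTBot traffic so that it can be allowed and logged. `cf.bot_management.verified_bot` is a boolean field and is used bare rather than compared with `eq true`. A source-network clause such as `ip.src.asnum eq <ASN>` can be appended, but only with ASNs you have confirmed against the operator's published IP ranges.

```
(cf.bot_management.verified_bot) and
(http.user_agent contains "GPTBot")
```

To separate legitimate GPTBot traffic from malicious crawlers impersonating it, match requests whose User-Agent claims GPTBot but which Cloudflare has not verified, then apply a Challenge or Log action to them:

```
(http.user_agent contains "GPTBot") and
(not cf.bot_management.verified_bot)
```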
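
With more than a couple of crawler tokens, writing the spoof-detection expression by hand becomes error-prone. A minimal sketch of generating it follows; `buildSpoofExpression` is a hypothetical helper, and the output should be pasted into the rule editor and validated there.

```javascript
// Build a WAF expression matching requests that claim any of the given
// crawler tokens in the User-Agent but are not verified by Cloudflare.
function buildSpoofExpression(tokens) {
  if (tokens.length === 0) throw new Error('at least one token is required');
  const clauses = tokens.map((t) => {
    // Escape backslashes and double quotes so the token stays inside the literal.
    const escaped = t.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
    return `http.user_agent contains "${escaped}"`;
  });
  return `(${clauses.join(' or ')}) and not cf.bot_management.verified_bot`;
}

console.log(buildSpoofExpression(['GPTBot', 'ClaudeBot']));
```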

### 2.3 Cloudflare Radar Bot Directory

Cloudflare Radar provides a [Bot Directory](https://radar.cloudflare.com/bots) listing known Verified Bots with their technical characteristics, source IP ranges, and User-Agent rules. For GPTBot, the Radar page (`radar.cloudflare.com/bots/directory/gptbot`) carries current verification details and is the primary implementation reference.

---

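## 3. Application Architecture with Workers AI, AI Gateway, and Vectorize

### 3.1 Workers AI: Inference at the Edge

**Workers AI** lets developers run AI inference directly on Cloudflare's network without managing GPU infrastructure. Supported model categories include:

- **Text embedding**: models such as `@cf/baai/bge-base-en-v1.5`, used for semantic search
- **Text generation**: LLMs such as `@cf/meta/llama-3-8b-instruct`
- Other modalities such as **image classification and speech-to-text**

```javascript
// Workers AI text embedding example (runs inside a Worker with an AI binding)
const embedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: "Text content to embed"
});
```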
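
Semantic search compares embeddings by similarity. The sketch below shows cosine similarity between two vectors as plain arrays; it assumes the embedding vectors have already been extracted from the model response (the exact response shape is not shown in the source), and `cosineSimilarity` is a hypothetical helper name.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; 0 is returned when either vector is all zeros.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice a vector database performs this comparison at scale; the function is useful mainly for spot checks and tests.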

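### 3.2 AI Gateway: A Unified AI API Gateway

**AI Gateway** is a proxy layer between applications and AI service providers. It offers:

- **Request/response logging**: a full record of every LLM API call
- **Response caching**: identical queries are not billed twice, which lowers cost
- **Rate limiting**: protection against API key abuse
- **Provider switching**: a single endpoint that can route to OpenAI, Anthropic, Workers AI, and other backends

To set it up, create an AI Gateway in the Cloudflare Dashboard and obtain a unified endpoint in the following format: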
```
https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/...
```
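
The endpoint format above can be assembled programmatically. This is a minimal sketch: `buildGatewayUrl` is a hypothetical helper, and the `path` segment after the provider is provider-specific, so the value used in the usage line is only an illustration.

```javascript
// Assemble an AI Gateway URL from its parts, following the format shown above.
function buildGatewayUrl({ accountId, gatewayId, provider, path = '' }) {
  for (const [name, value] of Object.entries({ accountId, gatewayId, provider })) {
    if (!value) throw new Error(`${name} is required`);
  }
  const base = [accountId, gatewayId, provider]
    .map(encodeURIComponent)
    .join('/');
  const suffix = path.replace(/^\/+/, ''); // avoid a double slash
  const url = `https://gateway.ai.cloudflare.com/v1/${base}`;
  return suffix ? `${url}/${suffix}` : url;
}

// Illustrative values only; substitute your own account and gateway IDs.
console.log(buildGatewayUrl({
  accountId: 'ACCOUNT_ID',
  gatewayId: 'my-gateway',
  provider: 'openai',
  path: '/chat/completions',
}));
```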

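This architecture echoes academic work on observability for machine learning systems. The paper *"Towards Observability for Production Machine Learning Pipelines"* (Shankar et al., 2022, arXiv:2108.13557) argues for the importance of observability in production ML, and AI Gateway applies the same idea at the edge infrastructure layer.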

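### 3.3 Vectorize: A Vector Database at the Edge

**Vectorize** is Cloudflare's distributed vector database, designed for RAG (Retrieval-Augmented Generation) and semantic search:

```javascript
// Insert a vector into Vectorize
await env.VECTORIZE.insert([{
  id: "doc-001",
  values: embeddingVector,  // embedding vector produced by Workers AI
  metadata: { url: "/blog/ai-tools", crawledAt: Date.now() }
}]);

// Semantic search
const results = await env.VECTORIZE.query(queryVector, {
  topK: 5,
  returnMetadata: "all"  // current indexes expect "none" | "indexed" | "all"
});
```

Combining Workers AI with Vectorize makes it possible to run a complete RAG pipeline inside the Cloudflare platform, with no calls to external services.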
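
The RAG flow described above (embed the question, retrieve neighbours, generate with context) can be sketched as a single function. Several things here are assumptions: the binding names `env.AI` and `env.VECTORIZE` follow the earlier snippets, the embedding response is assumed to expose vectors under `data`, the generation response under `response`, and each stored vector is assumed to carry a hypothetical `text` field in its metadata.

```javascript
// Minimal RAG sketch: embed -> retrieve -> generate.
async function answerWithContext(env, question) {
  // 1. Embed the question (response shape { data: [[...]] } is assumed).
  const embedded = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [question] });
  const queryVector = embedded.data[0];

  // 2. Retrieve the nearest stored chunks along with their metadata.
  const { matches } = await env.VECTORIZE.query(queryVector, {
    topK: 3,
    returnMetadata: 'all',
  });
  const context = matches
    .map((m) => m.metadata?.text ?? '') // "text" is a hypothetical metadata field
    .filter(Boolean)
    .join('\n---\n');

  // 3. Generate an answer grounded in the retrieved context.
  const result = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
    messages: [
      { role: 'system', content: `Answer using only this context:\n${context}` },
      { role: 'user', content: question },
    ],
  });
  return { answer: result.response, sources: matches.map((m) => m.id) };
}
```

Returning the matched IDs alongside the answer keeps the retrieval step auditable, which fits the observation theme of this guide.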

---

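## 4. Building an AI Observation Dataset with Cloudflare Workers, D1, and R2

### 4.1 Architecture Principles

The core challenge in building an AI crawler observation dataset is this: **capture every AI crawler visit in full without affecting the crawler's normal access, and store the data in a structured form for later analysis**.

The recommended architecture is:

```
AI crawler request
    ↓
[Cloudflare Worker]: classify the bot and decide whether to log
    ├─→ [D1 database]: structured request metadata (IP, UA, path, timestamp)
    ├─→ [R2 object storage]: full request/response snapshots (large JSON)
    └─→ Original response returned to the crawler as usual (never blocked)
```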

### 4.2 D1 Database Schema

```sql
-- Table of AI crawler observations
CREATE TABLE ai_crawler_observations (
  id          INTEGER PRIMARY KEY AUTOINCREMENT,
  observed_at TEXT    NOT NULL DEFAULT (datetime('now')),
  bot_name    TEXT,                    -- 'GPTBot', 'ClaudeBot', 'Unknown'
  user_agent  TEXT,
  ip_address  TEXT,
  asn         INTEGER,
  country     TEXT,
  path        TEXT    NOT NULL,
  method      TEXT    NOT NULL,
  status_code INTEGER,
  response_ms INTEGER,                 -- response time in milliseconds
  verified    INTEGER DEFAULT 0,       -- 0/1: whether the request came from a verified bot
  r2_key      TEXT                     -- key of the matching R2 snapshot (optional)
);

-- Indexes for the common query patterns
CREATE INDEX idx_bot_name ON ai_crawler_observations(bot_name);
CREATE INDEX idx_observed_at ON ai_crawler_observations(observed_at);
CREATE INDEX idx_path ON ai_crawler_observations(path);
```

### 4.3 Worker Interception and Logging Logic

```javascript
export default {
  async fetch(request, env, ctx) {
    const startTime = Date.now();
    const userAgent = request.headers.get('User-Agent') || '';

    // Identify known AI crawlers by their declared User-Agent
    const botName = detectAIBot(userAgent);

    // Fetch the origin response first so the crawler is never blocked
    const response = await fetch(request);
    // Time until response headers arrive; the body may still be streaming
    const responseMs = Date.now() - startTime;

    // Log asynchronously: waitUntil lets the write finish without delaying the response
    if (botName) {
      ctx.waitUntil(
        logCrawlerActivity({
          env,
          botName,
          request,
          statusCode: response.status,
          responseMs,
          cfData: request.cf  // geo, ASN, and bot data attached by Cloudflare
        }).catch((err) => console.error('crawler log failed', err))
      );
    }

    return response;
  }
};

function detectAIBot(userAgent) {
  const bots = {
    'GPTBot': /GPTBot/i,
    'OAI-SearchBot': /OAI-SearchBot/i,
    'ChatGPT-User': /ChatGPT-User/i,
    'ClaudeBot': /ClaudeBot/i,
    'Claude-User': /Claude-User/i,
    'Claude-SearchBot': /Claude-SearchBot/i,
    // robots.txt token; not expected in request headers, kept for completeness
    'Google-Extended': /Google-Extended/i,
    'PerplexityBot': /PerplexityBot/i,
    'YouBot': /YouBot/i,
    'FacebookBot': /FacebookBot/i
  };

  for (const [name, pattern] of Object.entries(bots)) {
    if (pattern.test(userAgent)) return name;
  }
  return null;
}

async function logCrawlerActivity({ env, botName, request, statusCode, responseMs, cfData }) {
  const url = new URL(request.url);
  let r2Key = null;

  // Store a full request snapshot in R2 only for selected paths, to limit cost
  if (url.pathname.startsWith('/api/') || url.pathname.includes('sitemap')) {
    const snapshot = {
      timestamp: new Date().toISOString(),
      botName,
      headers: Object.fromEntries(request.headers),
      cfData,
      path: url.pathname,
      query: url.search
    };
    r2Key = `crawl-snapshots/${botName}/${Date.now()}-${crypto.randomUUID()}.json`;
    await env.R2_BUCKET.put(r2Key, JSON.stringify(snapshot), {
      httpMetadata: { contentType: 'application/json' }
    });
  }

  // verifiedBot is only populated on plans that include Bot Management
  const verified = cfData?.botManagement?.verifiedBot ? 1 : 0;

  // Write the structured row to D1; D1 rejects undefined, so fall back to null
  await env.DB.prepare(`
    INSERT INTO ai_crawler_observations
    (bot_name, user_agent, ip_address, asn, country, path, method,
     status_code, response_ms, verified, r2_key)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
  `).bind(
    botName,
    request.headers.get('User-Agent'),
    request.headers.get('CF-Connecting-IP'),
    cfData?.asn ?? null,
    cfData?.country ?? null,
    url.pathname,
    request.method,
    statusCode,
    responseMs,
    verified,
    r2Key
  ).run();
}
```

### 4.4 Example Wrangler Configuration

```toml
# wrangler.toml
name = "ai-crawler-observer"
main = "src/index.js"
compatibility_date = "2025-04-01"

[[d1_databases]]
binding = "DB"
database_name = "crawler-observations"
database_id = "your-d1-database-id"

[[r2_buckets]]
binding = "R2_BUCKET"
bucket_name = "ai-crawler-snapshots"
```

### 4.5 Example Analysis Queries

```sql
-- Visits per AI crawler over the past 7 days
SELECT bot_name, COUNT(*) as visits, AVG(response_ms) as avg_ms
FROM ai_crawler_observations
WHERE observed_at >= datetime('now', '-7 days')
GROUP BY bot_name ORDER BY visits DESC;

-- Top 10 pages most visited by crawlers
SELECT path, bot_name, COUNT(*) as count
FROM ai_crawler_observations
GROUP BY path, bot_name ORDER BY count DESC LIMIT 10;

-- Suspicious behaviour: claims to be GPTBot but comes from an unexpected ASN.
-- The 0 is a placeholder: replace it with the ASNs you have confirmed
-- against OpenAI's published IP ranges.
SELECT * FROM ai_crawler_observations
WHERE bot_name = 'GPTBot' AND asn NOT IN (0);
```
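
The first query above can also be run from a Worker through the D1 binding. The following is a sketch assuming the `DB` binding and table from the earlier sections; `crawlerSummary` is a hypothetical helper name, and the time window is bound as a parameter rather than interpolated into the SQL.

```javascript
// Visits and average response time per crawler over the last `days` days.
// `db` is a D1 binding such as env.DB.
async function crawlerSummary(db, days = 7) {
  if (!Number.isInteger(days) || days <= 0) {
    throw new Error('days must be a positive integer');
  }
  const { results } = await db
    .prepare(
      `SELECT bot_name, COUNT(*) AS visits, AVG(response_ms) AS avg_ms
       FROM ai_crawler_observations
       WHERE observed_at >= datetime('now', ?)
       GROUP BY bot_name
       ORDER BY visits DESC`
    )
    .bind(`-${days} days`) // SQLite date modifier, e.g. "-7 days"
    .all();
  return results;
}
```

A scheduled Worker could call this daily and push the rows to a dashboard or alerting channel.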

### 4.6 Open-Source Projects for Reference

- [**cloudflare/workers-sdk**](https://github.com/cloudflare/workers-sdk): the official Workers development toolchain, including D1 and R2 binding examples
- [**cloudflare/cloudflare-typescript**](https://github.com/cloudflare/cloudflare-typescript): the official TypeScript SDK, convenient for querying data through the Cloudflare API from a backend
- [**rezmoss/cloud-provider-ip-addresses**](https://github.com/rezmoss/cloud-provider-ip-addresses): a daily-updated list of AI vendor IP ranges

---

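## 5. AI Audit in Detail

### 5.1 Feature Timeline

| Date | Milestone |
|------|--------|
| 2024 Q3 | AI Audit released, with visual reports of AI crawler traffic |
| 2024 Q4 | Robotcop added, enforcing robots.txt at the network layer |
| 2025 Q1 | HTTP 402 responses supported, opening a paid licensing model for crawlers |
| 2025 Q3 | Renamed AI Crawl Control and moved to GA |

### 5.2 Core Dashboard Metrics

In the Cloudflare Dashboard, the AI Crawl Control (formerly AI Audit) panel shows:

- **Crawler identity breakdown**: which AI companies' crawlers are visiting and each one's share of traffic
- **Accessed content analysis**: which pages and content types AI crawlers fetch most often
- **Time-series trends**: growth in AI crawler traffic, useful for estimating future licensing value
- **Compliance reporting**: which crawlers follow robots.txt and which violate it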

### 5.3 Commercial Use of HTTP 402

This is the most commercially novel feature in the evolution from AI Audit to AI Crawl Control. A site can return HTTP 402 (Payment Required) to specific crawlers and include licensing contact details or pricing information in the response body:

```javascript
// Return 402 to unlicensed AI crawlers.
// isAIBot and isAuthorized are site-specific helpers that you supply.
if (isAIBot(userAgent) && !isAuthorized(request)) {
  return new Response(JSON.stringify({
    message: "This content requires a licensing agreement for AI training use.",
    contact: "licensing@yoursite.com",
    info: "https://yoursite.com/ai-licensing"
  }), {
    status: 402,
    headers: { 'Content-Type': 'application/json' }
  });
}
```
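
The two helpers used in the snippet above are not defined in the source. One possible shape is sketched below, purely as an assumption: the `X-Crawler-License` header, the `LICENSED_TOKENS` set, and the token value are all hypothetical, and a real deployment would keep tokens in a secret store rather than in code.

```javascript
// Hypothetical pattern: extend it with the crawler tokens you care about.
const AI_BOT_PATTERN = /GPTBot|ClaudeBot|PerplexityBot/i;

// Hypothetical licence tokens issued to crawlers with an agreement in place.
const LICENSED_TOKENS = new Set(['demo-token-123']);

// True when the User-Agent declares one of the known AI crawler tokens.
function isAIBot(userAgent) {
  return AI_BOT_PATTERN.test(userAgent || '');
}

// True when the request carries a recognised licence token in a custom header.
function isAuthorized(request, licensedTokens = LICENSED_TOKENS) {
  const token = request.headers.get('X-Crawler-License');
  return token != null && licensedTokens.has(token);
}
```

A header token is only one option; verified-bot status or an IP allowlist could stand in for it, depending on what the licensing agreement specifies.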

---

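## 6. Best Practices for Observing Precisely Without Blocking

### 6.1 A Four-Layer Observation Architecture

To balance preserved access against precise observation, the following four-layer architecture is recommended:

1. **Layer 1 (Cloudflare edge)**: set AI Crawl Control to Observe mode to record all AI crawler traffic without blocking any request
2. **Layer 2 (Workers logic)**: use a custom Worker for finer-grained behavioural analysis, including path depth, request frequency, and anomalous header detection
3. **Layer 3 (D1 structured storage)**: store metadata for every visit to support later SQL analysis
4. **Layer 4 (R2 raw snapshots)**: keep full snapshots of visits to high-value pages for audit and legal purposes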
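
The Layer 2 analysis can start from a few per-request features. Below is a sketch of the stateless ones (path depth and missing headers); `extractFeatures` and the list of expected headers are assumptions for illustration, and request frequency is left out because it needs state across requests, for example a query against the D1 table.

```javascript
// Headers a typical browser sends; their absence is a weak automation signal.
const EXPECTED_HEADERS = ['accept', 'accept-language', 'accept-encoding'];

// Derive simple behavioural features from a Request-like object
// ({ url: string, headers: { has(name) } }).
function extractFeatures(request) {
  const url = new URL(request.url);
  const segments = url.pathname.split('/').filter(Boolean);
  return {
    pathDepth: segments.length,       // "/blog/2025/post" -> 3
    hasQuery: url.search.length > 0,
    missingHeaders: EXPECTED_HEADERS.filter((h) => !request.headers.has(h)),
  };
}
```

These fields can be stored as extra columns or as a JSON blob next to each observation row.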

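### 6.2 Automated Export with Logpush

For high-traffic sites, enable **Logpush** to deliver AI crawler logs automatically to R2 or an external SIEM. The field names below follow the `http_requests` dataset schema; confirm them against the fields available on your plan before creating the job.

```bash
# Create a Logpush job through the Cloudflare API, filtered to AI crawler traffic
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/logpush/jobs" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ai-crawler-logs",
    "destination_conf": "r2://{bucket_name}/ai-crawler-logs?account-id={account_id}&access-key-id={r2_access_key_id}&secret-access-key={r2_secret_access_key}",
    "dataset": "http_requests",
    "filter": "{\"where\":{\"key\":\"VerifiedBotCategory\",\"operator\":\"eq\",\"value\":\"AI Crawler\"}}",
    "logpull_options": "fields=ClientIP,ClientRequestUserAgent,ClientRequestPath,BotScore,VerifiedBotCategory&timestamps=rfc3339"
  }'
```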

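### 6.3 Anomaly Detection Assisted by Workers AI

Workers AI text classification can be wired in to label requests from suspected impostor crawlers automatically. Note that the model in this example, `distilbert-sst-2-int8`, is a sentiment classifier, so its labels are not meaningful for bot detection as-is; the snippet shows the call shape, and a classifier suited to this task would need to be substituted.

```javascript
// Run a Workers AI text classifier over a suspicious request.
// userAgent, ip, and path are assumed to be in scope from the request handler.
const classification = await env.AI.run('@cf/huggingface/distilbert-sst-2-int8', {
  text: `User-Agent: ${userAgent}, IP: ${ip}, Path: ${path}`
});
// Store the result in D1 alongside the request to build a training dataset
```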

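### 6.4 Key Settings Checklist

- [ ] AI Crawl Control set to **Observe**, not Block
- [ ] Custom WAF rules distinguish AI crawlers with `verified_bot` true from those with `verified_bot` false
- [ ] D1 database has the crawler log schema and suitable indexes
- [ ] R2 has a lifecycle rule that archives or deletes old logs automatically
- [ ] Logpush is enabled so logs are persisted
- [ ] D1 data is queried regularly to monitor AI crawler traffic trends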

---

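## Conclusion

Cloudflare's AI tooling offers a full upgrade path from passive defence to active management. For site operators who want to build partnerships with AI companies, and for researchers studying AI crawler behaviour, AI Crawl Control's Observe mode combined with a Workers + D1 + R2 collection pipeline is a practical and cost-effective option.

AI crawler traffic keeps growing: the source cites Cloudflare research from 2024 showing year-over-year growth of more than 500%. A precise observation dataset therefore helps with technical tuning and also becomes important evidence for content rights management and licensing negotiations. Because the pipeline runs on Cloudflare's edge network across 300+ PoPs, it adds little latency, which is what makes observing without blocking workable in practice.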

---

**References**
- Cloudflare Blog: [Introducing AI Crawl Control](https://blog.cloudflare.com/introducing-ai-crawl-control/)
- Cloudflare Blog: [Start auditing and controlling AI content crawlers](https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-content-crawlers/)
- Cloudflare Blog: [Robotcop: enforcing robots.txt policies](https://blog.cloudflare.com/ai-audit-enforcing-robots-txt/)
- Cloudflare Developers: [AI Gateway](https://developers.cloudflare.com/ai-gateway/)
- Cloudflare Developers: [Workers AI](https://developers.cloudflare.com/workers-ai/)
- Cloudflare Developers: [Vectorize](https://developers.cloudflare.com/vectorize/)
- GitHub: [cloudflare/workers-sdk](https://github.com/cloudflare/workers-sdk)
- GitHub: [rezmoss/cloud-provider-ip-addresses](https://github.com/rezmoss/cloud-provider-ip-addresses)
- Shankar et al. (2022). *Towards Observability for Production Machine Learning Pipelines*. arXiv:2108.13557
- OpenAI: [GPTBot Documentation](https://platform.openai.com/docs/gptbot)
- Cloudflare Radar: [Bot Directory](https://radar.cloudflare.com/bots)