openclaw/docs/refactor/firecrawl-extension.md

---
summary: "Design for an opt-in Firecrawl extension that adds search/scrape value without hardwiring Firecrawl into core defaults"
read_when:
  - Designing Firecrawl integration work
  - Evaluating web_search/web_fetch plugin seams
  - Deciding whether Firecrawl belongs in core or as an extension
title: "Firecrawl Extension Design"
---

# Firecrawl Extension Design

## Goal

Ship Firecrawl as an **opt-in extension** that adds:

- explicit Firecrawl tools for agents,
- optional Firecrawl-backed `web_search` integration,
- self-hosted support,
- stronger security defaults than the current core fallback path,

without pushing Firecrawl into the default setup/onboarding path.

## Why this shape

Recent Firecrawl issues/PRs cluster into three buckets:

1. **Release/schema drift**
   - Several releases rejected `tools.web.fetch.firecrawl` even though docs and runtime code supported it.
2. **Security hardening**
   - Current `fetchFirecrawlContent()` still posts to the Firecrawl endpoint with raw `fetch()`, while the main web-fetch path uses the SSRF guard.
3. **Product pressure**
   - Users want Firecrawl-native search/scrape flows, especially for self-hosted/private setups.
   - Maintainers explicitly rejected wiring Firecrawl deeply into core defaults, setup flow, and browser behavior.

That combination argues for an extension, not more Firecrawl-specific logic in the default core path.

## Design principles

- **Opt-in, vendor-scoped**: no auto-enable, no setup hijack, no default tool-profile widening.
- **Extension owns Firecrawl-specific config**: prefer plugin config over growing `tools.web.*` again.
- **Useful on day one**: works even if core `web_search` / `web_fetch` seams stay unchanged.
- **Security-first**: endpoint fetches use the same guarded networking posture as other web tools.
- **Self-hosted-friendly**: config + env fallback, explicit base URL, no hosted-only assumptions.

## Proposed extension

Plugin id: `firecrawl`

### MVP capabilities

Register explicit tools:

- `firecrawl_search`
- `firecrawl_scrape`

Optional later:

- `firecrawl_crawl`
- `firecrawl_map`

Do **not** add Firecrawl browser automation in the first version. That was the part of PR #32543 that pulled Firecrawl too far into core behavior and raised the most maintainership concern.

## Config shape

Use plugin-scoped config:

```json5
{
  plugins: {
    entries: {
      firecrawl: {
        enabled: true,
        config: {
          apiKey: "FIRECRAWL_API_KEY",
          baseUrl: "https://api.firecrawl.dev",
          timeoutSeconds: 60,
          maxAgeMs: 172800000,
          proxy: "auto",
          storeInCache: true,
          onlyMainContent: true,
          search: {
            enabled: true,
            defaultLimit: 5,
            sources: ["web"],
            categories: [],
            scrapeResults: false,
          },
          scrape: {
            formats: ["markdown"],
            fallbackForWebFetchLikeUse: false,
          },
        },
      },
    },
  },
}
```

### Credential resolution

Precedence:

1. `plugins.entries.firecrawl.config.apiKey`
2. `FIRECRAWL_API_KEY`

Base URL precedence:

1. `plugins.entries.firecrawl.config.baseUrl`
2. `FIRECRAWL_BASE_URL`
3. `https://api.firecrawl.dev`

### Compatibility bridge

For the first release, the extension may also **read** existing core config at `tools.web.fetch.firecrawl.*` as a fallback source so existing users do not need to migrate immediately.

Write path stays plugin-local. Do not keep expanding core Firecrawl config surfaces.

## Tool design

### `firecrawl_search`

Inputs:

- `query`
- `limit`
- `sources`
- `categories`
- `scrapeResults`
- `timeoutSeconds`

Behavior:

- Calls Firecrawl `v2/search`
- Returns normalized OpenClaw-friendly result objects:
  - `title`
  - `url`
  - `snippet`
  - `source`
  - optional `content`
- Wraps result content as untrusted external content
- Cache key includes query + relevant provider params

Why explicit tool first:

- Works today without changing `tools.web.search.provider`
- Avoids current schema/loader constraints
- Gives users Firecrawl value immediately

### `firecrawl_scrape`

Inputs:

- `url`
- `formats`
- `onlyMainContent`
- `maxAgeMs`
- `proxy`
- `storeInCache`
- `timeoutSeconds`

Behavior:

- Calls Firecrawl `v2/scrape`
- Returns markdown/text plus metadata:
  - `title`
  - `finalUrl`
  - `status`
  - `warning`
- Wraps extracted content the same way `web_fetch` does
- Shares cache semantics with web tool expectations where practical

Why explicit scrape tool:

- Sidesteps the unresolved `Readability -> Firecrawl -> basic HTML cleanup` ordering bug in core `web_fetch`
- Gives users a deterministic “always use Firecrawl” path for JS-heavy/bot-protected sites

## What the extension should not do

- No auto-adding `browser`, `web_search`, or `web_fetch` to `tools.alsoAllow`
- No default onboarding step in `openclaw setup`
- No Firecrawl-specific browser session lifecycle in core
- No change to built-in `web_fetch` fallback semantics in the extension MVP

## Phase plan

### Phase 1: extension-only, no core schema changes

Implement:

- `extensions/firecrawl/`
- plugin config schema
- `firecrawl_search`
- `firecrawl_scrape`
- tests for config resolution, endpoint selection, caching, error handling, and SSRF guard usage

This phase is enough to ship real user value.

### Phase 2: optional `web_search` provider integration

Support `tools.web.search.provider = "firecrawl"` only after fixing two core constraints:

1. `src/plugins/web-search-providers.ts` must load configured/installed web-search-provider plugins instead of a hardcoded bundled list.
2. `src/config/types.tools.ts` and `src/config/zod-schema.agent-runtime.ts` must stop hardcoding the provider enum in a way that blocks plugin-registered ids.

Recommended shape:

- keep built-in providers documented,
- allow any registered plugin provider id at runtime,
- validate provider-specific config via the provider plugin or a generic provider bag.

### Phase 3: optional `web_fetch` provider seam

Do this only if maintainers want vendor-specific fetch backends to participate in `web_fetch`.

Needed core addition:

- `registerWebFetchProvider` or equivalent fetch-backend seam

Without that seam, the extension should keep `firecrawl_scrape` as an explicit tool rather than trying to patch built-in `web_fetch`.

## Security requirements

The extension must treat Firecrawl as a **trusted operator-configured endpoint**, but still harden transport:

- Use SSRF-guarded fetch for the Firecrawl endpoint call, not raw `fetch()`
- Preserve self-hosted/private-network compatibility using the same trusted-web-tools endpoint policy used elsewhere
- Never log the API key
- Keep endpoint/base URL resolution explicit and predictable
- Treat Firecrawl-returned content as untrusted external content

This mirrors the intent behind the SSRF hardening PRs without assuming Firecrawl is a hostile multi-tenant surface.

## Why not a skill

The repo already closed a Firecrawl skill PR in favor of ClawHub distribution. That is fine for optional user-installed prompt workflows, but it does not solve:

- deterministic tool availability,
- provider-grade config/credential handling,
- self-hosted endpoint support,
- caching,
- stable typed outputs,
- security review on network behavior.

This belongs as an extension, not a prompt-only skill.

## Success criteria

- Users can install/enable one extension and get reliable Firecrawl search/scrape without touching core defaults.
- Self-hosted Firecrawl works with config/env fallback.
- Extension endpoint fetches use guarded networking.
- No new Firecrawl-specific core onboarding/default behavior.
- Core can later adopt plugin-native `web_search` / `web_fetch` seams without redesigning the extension.

## Recommended implementation order

1. Build `firecrawl_scrape`
2. Build `firecrawl_search`
3. Add docs and examples
4. If desired, generalize `web_search` provider loading so the extension can back `web_search`
5. Only then consider a true `web_fetch` provider seam