openclaw/firecrawl-extension.md at bd67f3336414da8fee874aa160a3a95db2da243e

7.9 KiB

Raw Blame History

summary

read_when

title

Design for an opt-in Firecrawl extension that adds search/scrape value without hardwiring Firecrawl into core defaults

Designing Firecrawl integration work

Evaluating web_search/web_fetch plugin seams

Deciding whether Firecrawl belongs in core or as an extension

Firecrawl Extension Design

Goal

Ship Firecrawl as an opt-in extension that adds:

explicit Firecrawl tools for agents,
optional Firecrawl-backed web_search integration,
self-hosted support,
stronger security defaults than the current core fallback path,

without pushing Firecrawl into the default setup/onboarding path.

Why this shape

Recent Firecrawl issues/PRs cluster into three buckets:

Release/schema drift
- Several releases rejected tools.web.fetch.firecrawl even though docs and runtime code supported it.
Security hardening
- Current fetchFirecrawlContent() still posts to the Firecrawl endpoint with raw fetch(), while the main web-fetch path uses the SSRF guard.
Product pressure
- Users want Firecrawl-native search/scrape flows, especially for self-hosted/private setups.
- Maintainers explicitly rejected wiring Firecrawl deeply into core defaults, setup flow, and browser behavior.

That combination argues for an extension, not more Firecrawl-specific logic in the default core path.

Design principles

Opt-in, vendor-scoped: no auto-enable, no setup hijack, no default tool-profile widening.
Extension owns Firecrawl-specific config: prefer plugin config over growing tools.web.* again.
Useful on day one: works even if core web_search / web_fetch seams stay unchanged.
Security-first: endpoint fetches use the same guarded networking posture as other web tools.
Self-hosted-friendly: config + env fallback, explicit base URL, no hosted-only assumptions.

Proposed extension

Plugin id: firecrawl

MVP capabilities

firecrawl_search
firecrawl_scrape

Optional later:

firecrawl_crawl
firecrawl_map

Do not add Firecrawl browser automation in the first version. That was the part of PR #32543 that pulled Firecrawl too far into core behavior and raised the most maintainership concern.

Config shape

Use plugin-scoped config:

{
  plugins: {
    entries: {
      firecrawl: {
        enabled: true,
        config: {
          apiKey: "FIRECRAWL_API_KEY",
          baseUrl: "https://api.firecrawl.dev",
          timeoutSeconds: 60,
          maxAgeMs: 172800000,
          proxy: "auto",
          storeInCache: true,
          onlyMainContent: true,
          search: {
            enabled: true,
            defaultLimit: 5,
            sources: ["web"],
            categories: [],
            scrapeResults: false,
          },
          scrape: {
            formats: ["markdown"],
            fallbackForWebFetchLikeUse: false,
          },
        },
      },
    },
  },
}

Credential resolution

Precedence:

plugins.entries.firecrawl.config.apiKey
FIRECRAWL_API_KEY

Base URL precedence:

plugins.entries.firecrawl.config.baseUrl
FIRECRAWL_BASE_URL
https://api.firecrawl.dev

Compatibility bridge

For the first release, the extension may also read existing core config at tools.web.fetch.firecrawl.* as a fallback source so existing users do not need to migrate immediately.

Write path stays plugin-local. Do not keep expanding core Firecrawl config surfaces.

Tool design

`firecrawl_search`

Inputs:

query
limit
sources
categories
scrapeResults
timeoutSeconds

Behavior:

Calls Firecrawl v2/search
Returns normalized OpenClaw-friendly result objects:
- title
- url
- snippet
- source
- optional content
Wraps result content as untrusted external content
Cache key includes query + relevant provider params

Why explicit tool first:

Works today without changing tools.web.search.provider
Avoids current schema/loader constraints
Gives users Firecrawl value immediately

`firecrawl_scrape`

Inputs:

url
formats
onlyMainContent
maxAgeMs
proxy
storeInCache
timeoutSeconds

Behavior:

Calls Firecrawl v2/scrape
Returns markdown/text plus metadata:
- title
- finalUrl
- status
- warning
Wraps extracted content the same way web_fetch does
Shares cache semantics with web tool expectations where practical

Why explicit scrape tool:

Sidesteps the unresolved Readability -> Firecrawl -> basic HTML cleanup ordering bug in core web_fetch
Gives users a deterministic “always use Firecrawl” path for JS-heavy/bot-protected sites

What the extension should not do

No auto-adding browser, web_search, or web_fetch to tools.alsoAllow
No default onboarding step in openclaw setup
No Firecrawl-specific browser session lifecycle in core
No change to built-in web_fetch fallback semantics in the extension MVP

Phase plan

Phase 1: extension-only, no core schema changes

Implement:

extensions/firecrawl/
plugin config schema
firecrawl_search
firecrawl_scrape
tests for config resolution, endpoint selection, caching, error handling, and SSRF guard usage

This phase is enough to ship real user value.

Phase 2: optional `web_search` provider integration

Support tools.web.search.provider = "firecrawl" only after fixing two core constraints:

src/plugins/web-search-providers.ts must load configured/installed web-search-provider plugins instead of a hardcoded bundled list.
src/config/types.tools.ts and src/config/zod-schema.agent-runtime.ts must stop hardcoding the provider enum in a way that blocks plugin-registered ids.

Recommended shape:

keep built-in providers documented,
allow any registered plugin provider id at runtime,
validate provider-specific config via the provider plugin or a generic provider bag.

Phase 3: optional `web_fetch` provider seam

Do this only if maintainers want vendor-specific fetch backends to participate in web_fetch.

Needed core addition:

registerWebFetchProvider or equivalent fetch-backend seam

Without that seam, the extension should keep firecrawl_scrape as an explicit tool rather than trying to patch built-in web_fetch.

Security requirements

The extension must treat Firecrawl as a trusted operator-configured endpoint, but still harden transport:

Use SSRF-guarded fetch for the Firecrawl endpoint call, not raw fetch()
Preserve self-hosted/private-network compatibility using the same trusted-web-tools endpoint policy used elsewhere
Never log the API key
Keep endpoint/base URL resolution explicit and predictable
Treat Firecrawl-returned content as untrusted external content

This mirrors the intent behind the SSRF hardening PRs without assuming Firecrawl is a hostile multi-tenant surface.

Why not a skill

The repo already closed a Firecrawl skill PR in favor of ClawHub distribution. That is fine for optional user-installed prompt workflows, but it does not solve:

deterministic tool availability,
provider-grade config/credential handling,
self-hosted endpoint support,
caching,
stable typed outputs,
security review on network behavior.

This belongs as an extension, not a prompt-only skill.

Success criteria

Users can install/enable one extension and get reliable Firecrawl search/scrape without touching core defaults.
Self-hosted Firecrawl works with config/env fallback.
Extension endpoint fetches use guarded networking.
No new Firecrawl-specific core onboarding/default behavior.
Core can later adopt plugin-native web_search / web_fetch seams without redesigning the extension.

Recommended implementation order

Build firecrawl_scrape
Build firecrawl_search
Add docs and examples
If desired, generalize web_search provider loading so the extension can back web_search
Only then consider a true web_fetch provider seam

7.9 KiB Raw Blame History