Firecrawl
0.2.10

Firecrawl Datasource

langgenius/firecrawl_datasource · 51852 installs

Firecrawl Datasource Plugin

Author: langgenius
Version: 0.2.3
Type: datasource (website_crawl)

Introduction

This plugin integrates with Firecrawl, a powerful web scraping and crawling API service that recursively searches through URLs and subdomains to gather structured content. Firecrawl provides clean, LLM-ready data extraction with advanced filtering options, making it ideal for building knowledge bases, monitoring websites, and extracting structured information for AI applications like Dify.

Features

  • Recursive Web Crawling: Automatically crawl websites and their subpages
  • Depth Control: Configure maximum crawl depth relative to the starting URL
  • Smart Content Extraction: Extract only main content, excluding headers, footers, and navigation
  • URL Pattern Filtering: Include or exclude specific URL patterns
  • Page Limit Control: Set maximum number of pages to crawl
  • Real-time Progress Tracking: Monitor crawl status and completion
  • Clean Markdown Output: Convert web content to structured, LLM-friendly format
  • Self-Hosted Support: Use Firecrawl cloud or your own instance

Setup

Prerequisites

Before using this plugin, you need:

  1. A Firecrawl API key (for the cloud service) or a self-hosted Firecrawl instance
  2. Target URLs ready for crawling
  3. An understanding of your crawling requirements (depth, limits, patterns)

Configuration Steps

Option 1: Using Firecrawl Cloud Service

  1. Get a Firecrawl API Key:

  2. Configure the Plugin in Dify:

    • Navigate to the datasource plugins section in Dify
    • Select Firecrawl
    • Base URL: Leave empty or enter
    • API Key: Enter your Firecrawl API key
    • Click "Save" to store the configuration

Option 2: Using Self-Hosted Firecrawl

  1. Deploy Firecrawl:

  2. Configure the Plugin in Dify:

    • Navigate to the datasource plugins section in Dify
    • Select Firecrawl
    • Base URL: Enter your self-hosted Firecrawl URL (e.g., )
    • API Key: Enter any key (required by the plugin but can be arbitrary for self-hosted)
    • Click "Save" to store the configuration

Usage

Basic Single Page Extraction

To extract content from a single page without crawling subpages:

Parameters:

  • Start URL (required):
  • Crawl Subpages:
  • Maximum pages to crawl:
  • Only Main Content:

Full Website Crawling

To crawl an entire website with subpages:

Parameters:

  • Start URL (required):
  • Crawl Subpages:
  • Maximum crawl depth: (crawls up to 2 levels deep)
  • Maximum pages to crawl:
  • Only Main Content:

Targeted Section Crawling

To crawl only specific sections of a website:

Parameters:

  • Start URL:
  • Crawl Subpages:
  • URL patterns to include:
  • URL patterns to exclude:
  • Maximum pages to crawl:

Parameters Explained

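| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| Start URL | string | Yes | - | The base URL to start crawling from |
| Crawl Subpages | boolean | No | true | Whether to crawl subpages |
| URL patterns to exclude | string | No | - | Comma-separated patterns to exclude (e.g., ) |
| URL patterns to include | string | No | - | Comma-separated patterns to include (e.g., ) |
| Maximum crawl depth | number | No | 2 | Maximum depth to crawl (0 = only start URL) |
| Maximum pages to crawl | number | No | 10 | Maximum number of pages to crawl |
| Only Main Content | boolean | No | false | Extract only main content, excluding navigation elements |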
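
As a rough illustration, the documented defaults can be captured in a small settings object. The class and field names (`CrawlSettings`, `start_url`, `max_depth`, and so on) are hypothetical and are not the plugin's actual identifiers; only the defaults and the required/optional split come from the parameter list above.

```python
from dataclasses import dataclass


@dataclass
class CrawlSettings:
    """Hypothetical container mirroring the documented parameters."""

    start_url: str                    # required
    crawl_subpages: bool = True       # documented default: true
    exclude_patterns: str = ""        # comma-separated, optional
    include_patterns: str = ""        # comma-separated, optional
    max_depth: int = 2                # 0 = only the start URL
    max_pages: int = 10               # page limit
    only_main_content: bool = False   # documented default: false

    def __post_init__(self) -> None:
        # Basic sanity checks; the plugin's own validation may differ.
        if not self.start_url:
            raise ValueError("start_url is required")
        if self.max_depth < 0:
            raise ValueError("max_depth must be 0 or greater")
        if self.max_pages < 1:
            raise ValueError("max_pages must be at least 1")


settings = CrawlSettings(start_url="https://example.com")
print(settings.max_depth, settings.max_pages)
```

Only the start URL has to be supplied; every other field falls back to its documented default.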

Understanding Crawl Depth

  • Depth 0: Only the entered URL
  • Depth 1: The entered URL + all directly linked pages
  • Depth 2: The entered URL + directly linked pages + pages linked from those
  • Higher values follow the same pattern
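
The depth levels above behave like a breadth-first traversal with a cutoff. The sketch below works on a made-up link graph (`SITE` is purely illustrative) and shows which pages fall within each depth; it is not how Firecrawl itself is implemented.

```python
from collections import deque


def pages_within_depth(links: dict[str, list[str]], start: str, max_depth: int) -> dict[str, int]:
    """Return each reachable page mapped to its depth, up to max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site structure: page -> pages it links to.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
}

for depth in range(3):
    print(depth, sorted(pages_within_depth(SITE, "/", depth)))
```

In this example `/docs/api/v2` sits three links away from the start page, so it is only reached with a depth of 3 or more.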

URL Pattern Examples

Include Patterns:

  • - Include all blog pages
  • - Include API documentation
  • - Include all product specification pages

Exclude Patterns:

  • - Exclude admin pages
  • - Exclude PDF files
  • - Exclude tag and category pages
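
Before starting a large crawl, it can help to check patterns locally against a few sample URLs. The sketch below makes several assumptions that should be verified against Firecrawl's documentation: patterns are treated as regular expressions matched against the URL path, exclude patterns take precedence, and an empty include list allows everything. The helper names and the sample patterns are hypothetical.

```python
import re
from urllib.parse import urlparse


def split_patterns(raw: str) -> list[str]:
    """Split a comma-separated pattern string, dropping blanks."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_allowed(url: str, include: str = "", exclude: str = "") -> bool:
    """Assumed semantics: excludes win; an empty include list allows all."""
    path = urlparse(url).path.lstrip("/")
    if any(re.search(p, path) for p in split_patterns(exclude)):
        return False
    includes = split_patterns(include)
    return not includes or any(re.search(p, path) for p in includes)


urls = [
    "https://example.com/blog/hello",
    "https://example.com/blog/tags/python",
    "https://example.com/admin/login",
]
for url in urls:
    print(url, is_allowed(url, include="^blog/.*", exclude="^admin/.*, tags/"))
```

Because the pattern string is split on commas, a pattern that itself contains a comma (for example a `{1,3}` quantifier) would be broken apart by this helper.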

Output Format

The plugin returns structured data for each crawled page:

How It Works

  1. Job Creation: The plugin creates a crawl job with the Firecrawl API
  2. Asynchronous Crawling: Firecrawl processes the website based on your parameters
  3. Status Monitoring: The plugin polls for job status every 5 seconds
  4. Content Processing: Completed pages are formatted and structured
  5. Result Delivery: Clean, structured content is returned
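
The status-monitoring step can be pictured as a simple polling loop. This is a minimal sketch, not the plugin's code: `wait_for_crawl` and `fetch_status` are hypothetical names, and the status values (`scraping`, `completed`, `failed`) and response fields are assumptions about what a job-status response looks like. Only the 5-second interval comes from the steps above.

```python
import time
from typing import Callable


def wait_for_crawl(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until the job completes, fails, or times out."""
    waited = 0.0
    while True:
        job = fetch_status()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError("crawl job failed")
        if waited >= timeout:
            raise TimeoutError("crawl job did not finish in time")
        sleep(interval)
        waited += interval


# Simulated job: two in-progress responses, then a completed one.
responses = iter([
    {"status": "scraping", "completed": 1, "total": 3},
    {"status": "scraping", "completed": 2, "total": 3},
    {"status": "completed", "completed": 3, "total": 3, "data": ["page-1", "page-2", "page-3"]},
])
result = wait_for_crawl(lambda: next(responses), sleep=lambda _: None)
print(result["completed"], "of", result["total"], "pages")
```

Passing the status fetcher and the sleep function in as arguments keeps the loop testable without network access or real waiting.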

Use Cases

1. Documentation Indexing

Crawl technical documentation sites for AI-powered search:

2. Blog Content Extraction

Extract all blog posts for content analysis:

3. Product Catalog Building

Gather product information from e-commerce sites:

4. Competitor Monitoring

Track competitor website changes:

Best Practices

  1. Start Small: Begin with lower limits and depths for testing
  2. Use Pattern Filtering: Focus crawling with include/exclude patterns
  3. Respect Robots.txt: Ensure target sites allow crawling
  4. Monitor Progress: Check crawl status for large operations
  5. Extract Main Content: Use the Only Main Content option for cleaner data
  6. Set Appropriate Limits: Balance comprehensiveness with efficiency
  7. Test Patterns: Verify your URL patterns match intended pages

Performance Considerations

  • Large Sites: May take several minutes to crawl
  • Deep Crawling: The number of pages can grow exponentially with depth (be cautious with depth > 3)
  • Rate Limiting: Firecrawl handles rate limiting automatically
  • Concurrent Jobs: Multiple crawl jobs can run simultaneously
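
To see why depth matters so much, consider a simplified model in which every page links to the same number of new pages. The function below is illustrative arithmetic only (the name `max_pages_at_depth` and the figure of 10 links per page are assumptions); real sites have overlapping links, so actual counts are usually lower.

```python
def max_pages_at_depth(links_per_page: int, depth: int) -> int:
    """Upper bound on pages reached: 1 + b + b^2 + ... + b^depth."""
    return sum(links_per_page ** level for level in range(depth + 1))


for depth in range(5):
    print(depth, max_pages_at_depth(10, depth))
```

With 10 new links per page, the bound is 111 pages at depth 2 and 11,111 pages at depth 4, which is why a page limit is worth setting alongside the depth.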

Troubleshooting

Common Issues

  1. "API key is required" error:

    • Verify your API key is correctly entered
    • Check that you are using the correct base URL
  2. "Failed to crawl" error:

    • Check if the target URL is accessible
    • Verify your API key is valid
    • Ensure you haven't exceeded rate limits
  3. Incomplete crawling:

    • Some sites may block automated crawling
    • JavaScript-heavy sites might not render fully
    • Check if robots.txt restricts access
  4. Slow crawling:

    • Large sites naturally take longer
    • Consider reducing depth or page limits
    • Use pattern filtering to focus crawling
  5. Missing pages:

    • Verify include/exclude patterns are correct
    • Check if pages are within the specified depth
    • Ensure the page limit hasn't been reached
  6. Self-hosted connection issues:

    • Verify base URL is correct and accessible
    • Check firewall/network settings
    • Ensure SSL certificates are valid

API Limits

Firecrawl Cloud

Self-Hosted

  • No external rate limits
  • Performance depends on your infrastructure

Security Considerations

  • API keys are transmitted securely via HTTPS
  • Use environment variables for API key storage in production
  • For sensitive data, consider self-hosting
  • Review crawled content for any inadvertently captured sensitive information
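
For your own scripts or deployment tooling that handle the key outside of Dify, reading it from an environment variable might look like the sketch below. The variable name `FIRECRAWL_API_KEY` and both helper functions are assumptions for illustration, not names defined by the plugin.

```python
import os


def load_api_key(var_name: str = "FIRECRAWL_API_KEY") -> str:
    """Read the API key from the environment; fail loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key


def mask(key: str) -> str:
    """Show only the last four characters, for safe logging."""
    return "*" * max(len(key) - 4, 0) + key[-4:]


# Demo value only; in production the variable is set outside the code.
os.environ.setdefault("FIRECRAWL_API_KEY", "fc-demo-key-1234")
print(mask(load_api_key()))
```

Masking the key before printing avoids leaking it into logs while still making it possible to tell keys apart.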

Privacy

Please refer to the Privacy Policy for information on how your data is handled when using this plugin.

Support

For issues or questions:

Additional Resources

Updates and Changelog

Version 0.2.2

  • Enhanced pattern filtering
  • Improved error handling
  • Better progress tracking
  • Self-hosted instance support

Last updated: December 2024

Category: Data Source
Tags: RAG
Version: 0.2.10 (langgenius · 05/24/2026 02:06 PM)
Requirements: LLM invocation
Maximum memory: 256MB