Firecrawl Datasource Plugin

Author: langgenius
Version: 0.2.3
Type: datasource (website_crawl)

Introduction

This plugin integrates with Firecrawl, a powerful web scraping and crawling API service that recursively searches through URLs and subdomains to gather structured content. Firecrawl provides clean, LLM-ready data extraction with advanced filtering options, making it ideal for building knowledge bases, monitoring websites, and extracting structured information for AI applications like Dify.

Features

Recursive Web Crawling: Automatically crawl websites and their subpages
Depth Control: Configure maximum crawl depth relative to the starting URL
Smart Content Extraction: Extract only main content, excluding headers, footers, and navigation
URL Pattern Filtering: Include or exclude specific URL patterns
Page Limit Control: Set maximum number of pages to crawl
Real-time Progress Tracking: Monitor crawl status and completion
Clean Markdown Output: Convert web content to structured, LLM-friendly format
Self-Hosted Support: Use Firecrawl cloud or your own instance

Setup

Prerequisites

Before using this plugin, you need:

A Firecrawl API key (for the cloud service) or a self-hosted Firecrawl instance
Target URLs ready for crawling
An understanding of your crawling requirements (depth, limits, patterns)

Configuration Steps

Option 1: Using Firecrawl Cloud Service

Get a Firecrawl API Key:
- Visit Firecrawl
- Sign up for an account
- Navigate to your account settings
- Copy your API key
Configure the Plugin in Dify:
- Navigate to the datasource plugins section in Dify
- Select Firecrawl
- Base URL: Leave empty to use the default Firecrawl cloud endpoint
- API Key: Enter your Firecrawl API key
- Click "Save" to store the configuration

Option 2: Using Self-Hosted Firecrawl

Deploy Firecrawl:
- Follow the self-hosting guide
- Note your instance URL
Configure the Plugin in Dify:
- Navigate to the datasource plugins section in Dify
- Select Firecrawl
- Base URL: Enter the URL of your self-hosted Firecrawl instance
- API Key: Enter any value (the plugin requires a key, but it can be arbitrary for a self-hosted instance)
- Click "Save" to store the configuration

Usage

Basic Single Page Extraction

To extract content from a single page without crawling subpages:

Parameters:

Start URL (required):
Crawl Subpages: false
Maximum pages to crawl:
Only Main Content:

Full Website Crawling

To crawl an entire website with subpages:

Parameters:

Start URL (required):
Crawl Subpages: true
Maximum crawl depth: 2 (crawls up to 2 levels deep)
Maximum pages to crawl:
Only Main Content:

Targeted Section Crawling

To crawl only specific sections of a website:

Parameters:

Start URL:
Crawl Subpages: true
URL patterns to include:
URL patterns to exclude:
Maximum pages to crawl:

Parameters Explained

Parameter	Type	Required	Default	Description
Start URL	string	Yes	-	The base URL to start crawling from
Crawl Subpages	boolean	No	true	Whether to crawl subpages
URL patterns to exclude	string	No	-	Comma-separated patterns to exclude
URL patterns to include	string	No	-	Comma-separated patterns to include
Maximum crawl depth	number	No	2	Maximum depth to crawl (0 = only start URL)
Maximum pages to crawl	number	No	10	Maximum number of pages to crawl
Only Main Content	boolean	No	false	Extract only main content, excluding navigation elements
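
To make the defaults above concrete, here is a minimal Python sketch that models the parameters as a small data class and splits the comma-separated pattern fields into lists. The class name, field names, and helper are hypothetical and chosen only for illustration; they are not the plugin's actual identifiers.

```python
from dataclasses import dataclass


@dataclass
class CrawlOptions:
    """Hypothetical container mirroring the parameter table; field names are illustrative."""
    start_url: str
    crawl_subpages: bool = True
    exclude_patterns: str = ""  # comma-separated, as entered in the form
    include_patterns: str = ""  # comma-separated, as entered in the form
    max_depth: int = 2
    max_pages: int = 10
    only_main_content: bool = False


def split_patterns(raw: str) -> list[str]:
    """Split a comma-separated pattern string, dropping blanks and stray whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


# Example values are made up for illustration.
options = CrawlOptions(start_url="https://example.com", include_patterns="blog/.*, docs/.*")
print(split_patterns(options.include_patterns))  # ['blog/.*', 'docs/.*']
```

Only the start URL has to be supplied; every other field falls back to the default listed in the table.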

Understanding Crawl Depth

Depth 0: Only the entered URL
Depth 1: The entered URL + all directly linked pages
Depth 2: The entered URL + directly linked pages + pages linked from those
Higher values follow the same pattern
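
The depth levels above can be illustrated with a small breadth-first walk over a toy link graph. This is a sketch under the assumption that depth is counted in link hops from the start URL, as the list describes; the service itself may measure depth differently, and the site map below is invented.

```python
from collections import deque


def pages_within_depth(links: dict[str, list[str]], start: str, max_depth: int) -> dict[str, int]:
    """Breadth-first walk returning each reachable page and the depth at which it is first seen."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depth_of:
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


# Toy link graph (hypothetical site).
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
}
print(sorted(pages_within_depth(site, "/", 0)))  # ['/']
print(sorted(pages_within_depth(site, "/", 1)))  # ['/', '/blog', '/docs']
print(len(pages_within_depth(site, "/", 2)))     # 5
```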

URL Pattern Examples

Include Patterns:

- Include all blog pages
- Include API documentation
- Include all product specification pages

Exclude Patterns:

- Exclude admin pages
- Exclude PDF files
- Exclude tag and category pages
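
As a rough way to test patterns before starting a crawl, the sketch below applies include and exclude lists to a URL. It assumes the patterns are regular expressions matched against the URL path and that exclusions take precedence; both are assumptions for illustration, and the sample patterns are hypothetical rather than the ones originally listed here.

```python
import re
from urllib.parse import urlparse


def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if the URL path passes the exclude list and, if given, the include list."""
    path = urlparse(url).path.lstrip("/")
    if any(re.search(pattern, path) for pattern in exclude):
        return False  # an exclude match always rejects the URL
    if include and not any(re.search(pattern, path) for pattern in include):
        return False  # include list present but nothing matched
    return True


# Hypothetical patterns: keep blog pages, drop PDF files.
include = [r"^blog/"]
exclude = [r"\.pdf$"]
print(url_allowed("https://example.com/blog/post-1", include, exclude))      # True
print(url_allowed("https://example.com/blog/report.pdf", include, exclude))  # False
print(url_allowed("https://example.com/about", include, exclude))            # False
```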

Output Format

The plugin returns structured data for each crawled page.

How It Works

Job Creation: The plugin creates a crawl job with the Firecrawl API
Asynchronous Crawling: Firecrawl processes the website based on your parameters
Status Monitoring: The plugin polls for job status every 5 seconds
Content Processing: Completed pages are formatted and structured
Result Delivery: Clean, structured content is returned
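
The status-monitoring step can be sketched as a simple polling loop. This is not the plugin's implementation: the function name, the status strings ("scraping", "completed", "failed"), and the response fields are assumptions, and the HTTP call is replaced by an injected callable so the example runs without network access.

```python
import time
from typing import Callable


def wait_for_crawl(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until the job reports 'completed', raising on failure or timeout."""
    for _ in range(max_polls):
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError("crawl job failed")
        sleep(interval)  # wait before asking again (5 seconds by default)
    raise TimeoutError(f"crawl not finished after {max_polls} polls")


# Usage with a fake status source standing in for the HTTP call.
responses = iter([
    {"status": "scraping", "completed": 1, "total": 3},
    {"status": "scraping", "completed": 2, "total": 3},
    {"status": "completed", "completed": 3, "total": 3},
])
result = wait_for_crawl(lambda: next(responses), sleep=lambda _: None)
print(result["completed"], "of", result["total"])  # 3 of 3
```

Passing the status fetcher and the sleep function as arguments keeps the loop easy to test and makes the upper bound on polling explicit.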

Use Cases

1. Documentation Indexing

Crawl technical documentation sites for AI-powered search.

2. Blog Content Extraction

Extract all blog posts for content analysis.

3. Product Catalog Building

Gather product information from e-commerce sites.

4. Competitor Monitoring

Track competitor website changes.

Best Practices

Start Small: Begin with lower limits and depths for testing
Use Pattern Filtering: Focus crawling with include/exclude patterns
Respect Robots.txt: Ensure target sites allow crawling
Monitor Progress: Check crawl status for large operations
Extract Main Content: Enable Only Main Content for cleaner data
Set Appropriate Limits: Balance comprehensiveness with efficiency
Test Patterns: Verify your URL patterns match intended pages

Performance Considerations

Large Sites: May take several minutes to crawl
Deep Crawling: The number of pages can grow exponentially with depth (be cautious with depth > 3)
Rate Limiting: Firecrawl handles rate limiting automatically
Concurrent Jobs: Multiple crawl jobs can run simultaneously
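
To see why depth deserves caution, the sketch below estimates an upper bound on the number of pages reached when every page links to the same number of previously unseen pages. The uniform branching factor is a simplifying assumption; real sites vary, and the page limit still caps the actual crawl.

```python
def estimated_pages(links_per_page: int, depth: int) -> int:
    """Upper bound on pages reached when every page links to `links_per_page` new pages."""
    return sum(links_per_page ** level for level in range(depth + 1))


# With 10 new links per page the bound grows tenfold per level:
# depth 0 -> 1, depth 1 -> 11, depth 2 -> 111, depth 3 -> 1111, depth 4 -> 11111
for depth in range(5):
    print(depth, estimated_pages(10, depth))
```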

Troubleshooting

Common Issues

"API key is required" error:
- Verify your API key is correctly entered
- Check that you are using the correct base URL
"Failed to crawl" error:
- Check if the target URL is accessible
- Verify your API key is valid
- Ensure you haven't exceeded rate limits
Incomplete crawling:
- Some sites may block automated crawling
- JavaScript-heavy sites might not render fully
- Check if robots.txt restricts access
Slow crawling:
- Large sites naturally take longer
- Consider reducing depth or page limits
- Use pattern filtering to focus crawling
Missing pages:
- Verify include/exclude patterns are correct
- Check if pages are within specified depth
- Ensure limit hasn't been reached
Self-hosted connection issues:
- Verify base URL is correct and accessible
- Check firewall/network settings
- Ensure SSL certificates are valid

API Limits

Firecrawl Cloud

Rate limits based on your plan
Check your account dashboard for usage

Self-Hosted

No external rate limits
Performance depends on your infrastructure

Security Considerations

API keys are transmitted securely via HTTPS
Use environment variables for API key storage in production
For sensitive data, consider self-hosting
Review crawled content for any inadvertently captured sensitive information
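
For scripts of your own that call Firecrawl directly, reading the key from the environment might look like the sketch below. The variable name FIRECRAWL_API_KEY and the helper function are hypothetical and used only for illustration.

```python
import os


def load_api_key(env_var: str = "FIRECRAWL_API_KEY") -> str:
    """Read the API key from the environment, failing loudly if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return key
```

Failing early when the variable is missing avoids sending requests with an empty key and keeps the secret out of source control.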

Privacy

Please refer to the Privacy Policy for information on how your data is handled when using this plugin.

Support

For issues or questions:

Plugin Support: [email protected]
Firecrawl Documentation: https://docs.firecrawl.dev
Firecrawl Support: Visit Firecrawl

Updates and Changelog

Version 0.2.2

Enhanced pattern filtering
Improved error handling
Better progress tracking
Self-hosted instance support

Last updated: December 2024