Firecrawl Datasource Plugin
Author: langgenius
Version: 0.2.3
Type: datasource (website_crawl)
Introduction
This plugin integrates with Firecrawl, a web scraping and crawling API service that recursively follows URLs and subdomains to gather structured content. Firecrawl returns clean, LLM-ready output with advanced filtering options, which makes it well suited to building knowledge bases, monitoring websites, and extracting structured information for AI applications such as Dify.
Features
- Recursive Web Crawling: Automatically crawl websites and their subpages
- Depth Control: Configure maximum crawl depth relative to the starting URL
- Smart Content Extraction: Extract only main content, excluding headers, footers, and navigation
- URL Pattern Filtering: Include or exclude specific URL patterns
- Page Limit Control: Set maximum number of pages to crawl
- Real-time Progress Tracking: Monitor crawl status and completion
- Clean Markdown Output: Convert web content to structured, LLM-friendly format
- Self-Hosted Support: Use Firecrawl cloud or your own instance
Setup
Prerequisites
Before using this plugin, you need:
- A Firecrawl API key (for the cloud service) or a self-hosted Firecrawl instance
- Target URLs ready for crawling
- Understanding of your crawling requirements (depth, limits, patterns)
Configuration Steps
Option 1: Using Firecrawl Cloud Service
- Get a Firecrawl API Key:
- Configure the Plugin in Dify:
  - Navigate to the datasource plugins section in Dify
  - Select Firecrawl
  - Base URL: Leave empty or enter
  - API Key: Enter your Firecrawl API key
  - Click "Save" to store the configuration
Option 2: Using Self-Hosted Firecrawl
- Deploy Firecrawl:
- Configure the Plugin in Dify:
  - Navigate to the datasource plugins section in Dify
  - Select Firecrawl
  - Base URL: Enter your self-hosted Firecrawl URL (e.g., )
  - API Key: Enter any value (the plugin requires this field, but the value can be arbitrary for a self-hosted instance)
  - Click "Save" to store the configuration
Usage
Basic Single Page Extraction
To extract content from a single page without crawling subpages:
Parameters:
- Start URL (required):
- Crawl Subpages:
- Maximum pages to crawl:
- Only Main Content:
Full Website Crawling
To crawl an entire website with subpages:
Parameters:
- Start URL (required):
- Crawl Subpages:
- Maximum crawl depth: (crawls up to 2 levels deep)
- Maximum pages to crawl:
- Only Main Content:
Targeted Section Crawling
To crawl only specific sections of a website:
Parameters:
- Start URL:
- Crawl Subpages:
- URL patterns to include:
- URL patterns to exclude:
- Maximum pages to crawl:
Parameters Explained
Understanding Crawl Depth
- Depth 0: Only the entered URL
- Depth 1: The entered URL + all directly linked pages
- Depth 2: The entered URL + directly linked pages + pages linked from those
- Higher values follow the same pattern
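The depth levels above can be illustrated with a breadth-first traversal over a small link graph. This is a minimal sketch that follows the link-hop definition given here; the `SITE` graph and the function name are hypothetical, and Firecrawl's own depth calculation may differ in detail.

```python
from collections import deque

def pages_within_depth(links, start, max_depth):
    """Return {url: depth} for every page reachable within max_depth link hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen

# Hypothetical site: each key links to the pages in its list.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/auth"],
}

for depth in range(3):
    print(depth, sorted(pages_within_depth(SITE, "/", depth)))
# 0 ['/']
# 1 ['/', '/blog', '/docs']
# 2 ['/', '/blog', '/blog/post-1', '/docs', '/docs/api']
```

Note that `/docs/api/auth` is three hops from the start page, so it only appears once the depth is raised to 3.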
URL Pattern Examples
Include Patterns:
- - Include all blog pages
- - Include API documentation
- - Include all product specification pages
Exclude Patterns:
- - Exclude admin pages
- - Exclude PDF files
- - Exclude tag and category pages
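To check patterns before starting a crawl, a small local filter can help. The sketch below assumes that patterns are regular expressions matched against the URL path and that exclude patterns take precedence over include patterns; both are assumptions, and the patterns shown are hypothetical illustrations, not the plugin's documented values.

```python
import re
from urllib.parse import urlparse

def should_crawl(url, include=(), exclude=()):
    """Decide whether a URL passes the include/exclude filters."""
    path = urlparse(url).path
    # Assumption: an exclude match always wins.
    if any(re.search(pattern, path) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is allowed.
    if include:
        return any(re.search(pattern, path) for pattern in include)
    return True

# Hypothetical patterns for illustration only.
INCLUDE = [r"^/blog/"]
EXCLUDE = [r"\.pdf$", r"^/admin/"]

for url in [
    "https://example.com/blog/post-1",
    "https://example.com/blog/report.pdf",
    "https://example.com/admin/login",
    "https://example.com/about",
]:
    print(url, should_crawl(url, INCLUDE, EXCLUDE))
```

Running a handful of known URLs through a filter like this is a quick way to confirm that the patterns select the intended pages.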
Output Format
The plugin returns structured data for each crawled page:
How It Works
- Job Creation: The plugin creates a crawl job with the Firecrawl API
- Asynchronous Crawling: Firecrawl processes the website based on your parameters
- Status Monitoring: The plugin polls for job status every 5 seconds
- Content Processing: Completed pages are formatted and structured
- Result Delivery: Clean, structured content is returned
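The create-then-poll flow above can be sketched as a simple loop. This is not the plugin's actual code: the `get_status` callable, the status values (`scraping`, `completed`, `failed`), and the `data` field are assumptions, and the responses are simulated so the example runs without network access.

```python
import time

def wait_for_crawl(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_status() until the job completes, fails, or times out."""
    waited = 0.0
    while True:
        status = get_status()
        state = status.get("status")
        if state == "completed":
            return status.get("data", [])
        if state == "failed":
            raise RuntimeError(f"crawl failed: {status.get('error', 'unknown error')}")
        if waited >= timeout:
            raise TimeoutError(f"crawl not finished after {waited:.0f}s")
        sleep(interval)  # the 5-second wait between status checks
        waited += interval

# Simulated status responses standing in for real API calls.
responses = iter([
    {"status": "scraping", "completed": 1, "total": 3},
    {"status": "scraping", "completed": 2, "total": 3},
    {"status": "completed", "data": [{"markdown": "# Home"}]},
])
pages = wait_for_crawl(lambda: next(responses), sleep=lambda seconds: None)
print(len(pages))  # 1
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting between checks.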
Use Cases
1. Documentation Indexing
Crawl technical documentation sites for AI-powered search:
2. Blog Content Extraction
Extract all blog posts for content analysis:
3. Product Catalog Building
Gather product information from e-commerce sites:
4. Competitor Monitoring
Track competitor website changes:
Best Practices
- Start Small: Begin with lower limits and depths for testing
- Use Pattern Filtering: Focus crawling with include/exclude patterns
- Respect Robots.txt: Ensure target sites allow crawling
- Monitor Progress: Check crawl status for large operations
- Extract Main Content: Enable Only Main Content for cleaner data
- Set Appropriate Limits: Balance comprehensiveness with efficiency
- Test Patterns: Verify your URL patterns match intended pages
Performance Considerations
- Large Sites: May take several minutes to crawl
- Deep Crawling: The number of pages can grow exponentially with depth (be cautious with depth > 3)
- Rate Limiting: Firecrawl handles rate limiting automatically
- Concurrent Jobs: Multiple crawl jobs can run simultaneously
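The growth under Deep Crawling can be estimated with a rough upper bound. The sketch below assumes every page links to the same number of previously unseen pages, which real sites rarely do, so treat the figures as a worst case and not a prediction.

```python
def max_reachable_pages(links_per_page, depth):
    """Upper bound on pages within `depth` hops when each page links to
    `links_per_page` new pages: 1 + b + b^2 + ... + b^depth."""
    return sum(links_per_page ** level for level in range(depth + 1))

for depth in range(5):
    print(depth, max_reachable_pages(10, depth))
# 0 1
# 1 11
# 2 111
# 3 1111
# 4 11111
```

With ten new links per page, each extra level multiplies the page count by roughly ten, which is why a page limit is worth setting alongside the depth.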
Troubleshooting
Common Issues
- "API key is required" error:
  - Verify your API key is correctly entered
  - Check if using the correct base URL
- "Failed to crawl" error:
  - Check if the target URL is accessible
  - Verify your API key is valid
  - Ensure you haven't exceeded rate limits
- Incomplete crawling:
  - Some sites may block automated crawling
  - JavaScript-heavy sites might not render fully
  - Check if robots.txt restricts access
- Slow crawling:
  - Large sites naturally take longer
  - Consider reducing depth or page limits
  - Use pattern filtering to focus crawling
- Missing pages:
  - Verify include/exclude patterns are correct
  - Check if pages are within specified depth
  - Ensure limit hasn't been reached
- Self-hosted connection issues:
  - Verify base URL is correct and accessible
  - Check firewall/network settings
  - Ensure SSL certificates are valid
API Limits
Firecrawl Cloud
Self-Hosted
- No external rate limits
- Performance depends on your infrastructure
Security Considerations
- API keys are transmitted securely via HTTPS
- Use environment variables for API key storage in production
- For sensitive data, consider self-hosting
- Review crawled content for any inadvertently captured sensitive information
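For scripts that call Firecrawl outside of Dify, reading the key from an environment variable might look like the sketch below. The variable name `FIRECRAWL_API_KEY` and the helper are hypothetical; inside Dify, the key is entered in the plugin configuration as described in Setup.

```python
import os

def load_api_key(env=os.environ, name="FIRECRAWL_API_KEY"):
    """Read the API key from the environment and fail loudly if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key

# Usage with an explicit mapping, so the example runs without touching
# the real environment.
print(load_api_key({"FIRECRAWL_API_KEY": "example-key"}))  # example-key
```

Failing early on a missing key avoids sending unauthenticated requests and keeps the key out of source files.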
Privacy
Please refer to the Privacy Policy for information on how your data is handled when using this plugin.
Support
For issues or questions:
Additional Resources
Updates and Changelog
Version 0.2.2 (Current)
- Enhanced pattern filtering
- Improved error handling
- Better progress tracking
- Self-hosted instance support
Last updated: December 2024