Jina Reader Datasource Plugin

Author: langgenius
Version: 0.0.4
Type: datasource (website_crawl)

Introduction

This plugin integrates with Jina AI's Reader API to fetch and convert web content into LLM-friendly markdown format. It supports crawling websites, extracting content from web pages and PDFs, and processing the information for use in AI applications like Dify. The plugin provides intelligent web scraping capabilities with automatic content extraction and formatting.

Features

Web Page to Markdown Conversion: Automatically converts web pages to clean, LLM-friendly markdown
PDF Support: Extract and convert content from PDF documents
Smart Crawling: Crawl websites with configurable depth and page limits
Sitemap Support: Utilize sitemaps for efficient website crawling
Subpage Crawling: Optionally crawl and process subpages from the main URL
Real-time Progress Tracking: Monitor crawling progress with status updates
Clean Content Extraction: Remove ads, navigation, and other non-content elements

Setup

Prerequisites

Before using this plugin:

(Optional) Obtain a Jina AI API key for higher rate limits
Have target URLs ready for crawling
Ensure target websites allow crawling (check robots.txt)

Configuration Steps

Get a Jina AI API Key (Optional but Recommended):
- Visit Jina AI
- Sign up for an account
- Navigate to your dashboard to get your API key
- Note: The plugin works without an API key but with lower rate limits
Configure the Plugin in Dify:
- Navigate to the datasource plugins section in Dify
- Select Jina Reader
- Enter your API Key (optional - leave empty if you don't have one)
- Click "Save" to store the configuration

Usage

Basic Web Page Extraction

To extract content from a single web page:

Parameters:

Start URL (required): The web page URL to extract content from
Crawl Subpages: Set to false for single page extraction
Maximum Pages: Set to 1 for a single page
Use Sitemap: Set to false for direct extraction

Example:
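The settings above could be written out as follows. This is a minimal sketch: the key names (`start_url`, `crawl_subpages`, `max_pages`, `use_sitemap`) are hypothetical stand-ins for the fields listed above, not the plugin's confirmed identifiers, and the URL is a placeholder.

```python
# Hypothetical parameter names that mirror the fields described above.
def single_page_params(start_url: str) -> dict:
    """Return settings that extract exactly one page."""
    return {
        "start_url": start_url,
        "crawl_subpages": False,  # do not follow links on the page
        "max_pages": 1,           # stop after the start page
        "use_sitemap": False,     # fetch the URL directly
    }


params = single_page_params("https://example.com/blog/article")
print(params)
```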

Website Crawling

To crawl multiple pages from a website:

Parameters:

Start URL (required): The base URL to start crawling
Crawl Subpages: Set to true to crawl linked pages
Maximum Pages: Number of pages to crawl (default: 10)
Use Sitemap: Set to true to use the website's sitemap

Example:
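A multi-page crawl might be configured as sketched below. The key names are again hypothetical labels for the fields above, the default of 10 pages comes from the parameter list, and the URL is a placeholder.

```python
# Hypothetical parameter names that mirror the fields described above.
def crawl_params(start_url: str, max_pages: int = 10, use_sitemap: bool = True) -> dict:
    """Return settings for crawling several pages from one site."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {
        "start_url": start_url,
        "crawl_subpages": True,   # follow links found on crawled pages
        "max_pages": max_pages,   # upper bound on pages processed
        "use_sitemap": use_sitemap,
    }


site_params = crawl_params("https://docs.example.com", max_pages=50)
print(site_params)
```

Starting with a small `max_pages` value and raising it once the output looks right keeps test runs quick.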

PDF Content Extraction

The plugin automatically detects and processes PDF URLs:

Example:
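For a PDF, the same single-page settings apply with a direct PDF link as the start URL. The sketch below uses hypothetical key names and a placeholder URL. The `is_direct_pdf_url` helper is only a local pre-check that a link points straight at a `.pdf` file; it is an assumption for illustration and does not describe how the plugin itself detects PDFs.

```python
from urllib.parse import urlparse


def is_direct_pdf_url(url: str) -> bool:
    """Heuristic only: True when the URL path ends in .pdf (case-insensitive)."""
    return urlparse(url).path.lower().endswith(".pdf")


pdf_url = "https://example.com/papers/report.pdf"

# Hypothetical parameter names that mirror the fields described earlier.
pdf_params = {
    "start_url": pdf_url,
    "crawl_subpages": False,
    "max_pages": 1,
    "use_sitemap": False,
}

print(is_direct_pdf_url(pdf_url), pdf_params)
```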

Parameters Explained

Parameter	Type	Required	Default	Description
Start URL	string	Yes	-	The starting URL for crawling
Crawl Subpages	boolean	No	true	Whether to crawl linked subpages
Maximum Pages	number	No	10	Maximum number of pages to crawl
Use Sitemap	boolean	No	false	Use the website's sitemap for crawling

Output Format

The plugin returns structured data for each crawled page:
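The exact schema is not reproduced in this document. As a rough sketch, a per-page record might carry the source URL, some page metadata, and the extracted markdown. Every field name below is an assumption chosen for illustration, not the plugin's confirmed output.

```python
from dataclasses import dataclass, asdict


@dataclass
class CrawledPage:
    """Hypothetical shape of one crawled page; field names are assumptions."""
    source_url: str   # the page that was fetched
    title: str        # page title, if one was found
    description: str  # short summary or meta description
    content: str      # extracted main content as markdown


page = CrawledPage(
    source_url="https://example.com/docs/intro",
    title="Introduction",
    description="Overview of the product",
    content="# Introduction\n\nWelcome to the docs.",
)
print(asdict(page))
```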

How It Works

Job Initiation: When you provide a URL, the plugin creates a crawling job with Jina AI
Crawling Process: Jina AI's crawler visits the specified pages
Content Extraction: The Reader API extracts main content, removing clutter
Markdown Conversion: Content is converted to clean, structured markdown
Status Updates: The plugin provides real-time progress updates
Result Delivery: Processed content is returned in a structured format
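The extraction and conversion steps can be illustrated against Jina's public Reader endpoint, which returns a page's content when the target URL is appended to `https://r.jina.ai/`. This sketch is not the plugin's implementation: job creation and status polling are not shown, the function names are hypothetical, and the API key is optional as described in Setup.

```python
import urllib.request
from typing import Optional

READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(target_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a Reader request for one page."""
    headers = {}
    if api_key:
        # An API key is optional; supplying one raises the rate limit.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(READER_PREFIX + target_url, headers=headers)


def fetch_markdown(target_url: str, api_key: Optional[str] = None) -> str:
    """Send the request and return the response body as text."""
    request = build_reader_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


# Building the request performs no network access.
request = build_reader_request("https://example.com/docs")
print(request.full_url)
```

Calling `fetch_markdown("https://example.com/docs")` would perform the actual network request.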

Use Cases

1. Knowledge Base Building

Extract documentation from websites to build knowledge bases:

Technical documentation sites
API references
Help centers

2. Research and Analysis

Gather information from multiple web sources:

News articles
Blog posts
Academic papers (PDFs)

3. Content Migration

Convert web content for use in AI applications:

Website content to chatbot knowledge
Blog posts to training data
Documentation to Q&A systems

4. Competitive Analysis

Monitor and analyze competitor websites:

Product pages
Pricing information
Feature documentation

Best Practices

Respect Website Policies:
- Check robots.txt before crawling
- Respect rate limits
- Don't overload servers with excessive requests
Optimize Crawling:
- Start with smaller page limits for testing
- Use sitemaps when available for better coverage
- Focus crawling on relevant sections
API Key Management:
- Keep your API key secure
- Monitor usage to stay within limits
- Consider upgrading for higher rate limits if needed
Content Quality:
- Review extracted content for completeness
- Verify important information is captured
- Check for any formatting issues
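The robots.txt check can be done before starting a crawl with Python's standard `urllib.robotparser` module. The sketch below parses robots.txt text that has already been downloaded; the `is_allowed` helper and the sample rules are illustrative assumptions and are not part of the plugin.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(sample_rules, "https://example.com/docs/intro"))
print(is_allowed(sample_rules, "https://example.com/private/report"))
```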

Limitations

Rate Limits: Lower without an API key; higher with one, depending on your plan
JavaScript Content: Dynamic content may not be fully captured
Authentication: Cannot access password-protected content
File Types: Primarily supports HTML and PDF formats
Crawl Depth: Limited by the Maximum Pages parameter

Troubleshooting

Common Issues

"Failed to crawl" error:
- Check if the URL is accessible
- Verify your API key is valid (if provided)
- Ensure you haven't exceeded rate limits
Incomplete content extraction:
- Some websites may have anti-scraping measures
- JavaScript-rendered content might not be captured
- Try adjusting crawl parameters
Slow crawling speed:
- Large websites take time to process
- Consider reducing the page limit
- Use sitemap for more efficient crawling
Missing subpages:
- Ensure "Crawl Subpages" is enabled
- Check that the Maximum Pages limit is high enough
- Verify links are discoverable (not behind JavaScript)
API key issues:
- Verify the API key is correctly entered
- Check if the key is active and not expired
- Monitor your usage quota
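When failures stem from rate limits, spacing out retries is a common remedy. The sketch below computes exponential backoff delays to wait between attempts. It is a general technique rather than a feature of the plugin, and the function name, base delay, and cap are assumptions.

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Return the wait time in seconds before each retry, doubling up to a cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


delays = backoff_delays(4)
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```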

Pricing

Free Tier: Works without API key with basic rate limits
With API Key: Higher rate limits and priority processing
Visit Jina AI Pricing for current pricing information

Performance Tips

For Large Websites:
- Use sitemaps when available
- Set reasonable page limits
- Consider breaking crawls into sections
For Better Results:
- Provide specific URLs rather than homepages when possible
- Disable subpage crawling for single article extraction
- Use the API key for more reliable service
For PDFs:
- Direct PDF URLs work best
- Large PDFs may take longer to process
- Ensure PDFs are publicly accessible

Privacy and Security

Content is processed through Jina AI's secure infrastructure
No permanent storage of crawled content by the plugin
API keys are transmitted securely via HTTPS
See the Privacy Policy for detailed information

Support

For issues or questions:

Contact: [email protected]
Jina AI Documentation: https://jina.ai/reader
Repository: Check the project repository for updates

Updates and Changelog

Version 0.0.3 (Current)

Improved crawling stability
Better error handling
Enhanced progress tracking

Last updated: December 2024