reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/

Reader is a tool that converts any URL to an LLM-friendly input with a simple prefix `https://r.jina.ai/`. It improves the output for your agent and RAG systems at no cost. Reader supports image reading, captioning all images at the specified URL and adding `Image [idx]: [caption]` as an alt tag. This enables downstream LLMs to interact with the images in reasoning, summarizing, etc. Reader offers a streaming mode, useful when the standard mode provides an incomplete result. In streaming mode, Reader waits a bit longer until the page is fully rendered, providing more complete information. Reader also supports a JSON mode, which contains three fields: `url`, `title`, and `content`. Reader is backed by Jina AI and licensed under Apache-2.0.

README:

Reader

Your LLMs deserve better input.

Reader does two things:

Read: It converts any URL to an LLM-friendly input with https://r.jina.ai/https://your.url. Get improved output for your agent and RAG systems at no cost.
Search: It searches the web for a given query with https://s.jina.ai/your+query. This allows your LLMs to access the latest world knowledge from the web.

Check out the live demo

Or just visit these URLs (Read) https://r.jina.ai/https://github.com/jina-ai/reader, (Search) https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F and see for yourself.

Feel free to use the Reader API in production. It is free, stable, and scalable. We actively maintain it as one of the core products of Jina AI. Check out the rate limit.

Updates

2024-10-08: Introduced an adaptive crawler. It can recursively crawl the website and extract the most relevant pages for a given webpage.
2024-07-15: To restrict the results of s.jina.ai to a certain domain or website, you can set e.g. site=jina.ai in the query parameters, which enables in-site search. For more options, try our updated live demo.
2024-07-01: We have resolved a DDoS attack and other traffic abuse that began on June 27th. We also found a bug introduced on June 28th that may have caused higher latency for some websites. Both the attack and the bug have been resolved; if you experienced high latency with r.jina.ai between June 27th and 30th, it should be back to normal now.
2024-05-30: Reader can now read arbitrary PDFs from any URL! Check out this PDF result from NASA.gov vs the original.
2024-05-15: We introduced a new endpoint, s.jina.ai, that searches the web and returns the top 5 results, each in an LLM-friendly format. Read more about this new feature here.
2024-05-08: Image captioning is off by default for better latency. To turn it on, set x-with-generated-alt: true in the request header.
2024-05-03: We finally resolved a DDoS attack that had been ongoing since April 29th. Now our API is much more reliable and scalable than ever!
2024-04-24: You now have more fine-grained control over Reader API using headers, e.g. forwarding cookies, using HTTP proxy.
2024-04-15: Reader now supports image reading! It captions all images at the specified URL and adds Image [idx]: [caption] as an alt tag (if they initially lack one). This enables downstream LLMs to interact with the images in reasoning, summarizing etc. See example here.

Usage

Using `r.jina.ai` for single URL fetching

Simply prepend https://r.jina.ai/ to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
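If you are calling Reader from code rather than a browser, the prefixing step can be sketched in Python as below. This is a minimal sketch, not an official client: the helper names `reader_url` and `read_page`, the 30-second timeout, and the UTF-8 decoding are all assumptions made for illustration.

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Prepend the Reader prefix to the target URL (hypothetical helper)."""
    return READER_PREFIX + target_url


def read_page(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the LLM-friendly text for target_url.

    Assumes the response body is UTF-8 text; performs a real network call.
    """
    with urlopen(reader_url(target_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `reader_url("https://en.wikipedia.org/wiki/Artificial_intelligence")` yields the URL shown above; `read_page` is only needed when you actually want to download the result.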

Using `r.jina.ai` for a full website fetching (Google Colab)

Using `s.jina.ai` for web search

Simply prepend https://s.jina.ai/ to your search query. Note that if you are using this in code, you must URL-encode your search query first. For example, if your query is Who will win 2024 US presidential election? then your URL should look like:

https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
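The encoding step can be done with the Python standard library. The sketch below uses `urllib.parse.quote` with `safe=""` so that characters such as `?` and `/` inside the query are percent-encoded as well; the function name `search_url` is a hypothetical helper, not part of any Reader SDK.

```python
from urllib.parse import quote

SEARCH_PREFIX = "https://s.jina.ai/"


def search_url(query: str) -> str:
    """Percent-encode the query and append it to the search prefix."""
    # safe="" makes quote() encode "/" and "?" too, so the whole query
    # stays inside the URL path.
    return SEARCH_PREFIX + quote(query, safe="")


print(search_url("Who will win 2024 US presidential election?"))
```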

Behind the scenes, Reader searches the web, fetches the top 5 results, visits each URL, and applies r.jina.ai to it. This is different from many web-search function-calling tools in agent/RAG frameworks, which often return only the title, URL, and description provided by the search engine API. With those tools, if you want to read one result more deeply, you have to fetch the content from that URL yourself. With Reader, http://s.jina.ai automatically fetches the content from the top 5 search result URLs for you (reusing the tech stack behind http://r.jina.ai). This means you don't have to handle browser rendering, blocking, or any issues related to JavaScript and CSS yourself.

Using `s.jina.ai` for in-site search

Simply specify site in the query parameters such as:

curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
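The same request URL can be assembled in Python. This sketch repeats the `site` parameter once per domain, mirroring the curl example above; `site_search_url` is a hypothetical helper name.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

SEARCH_PREFIX = "https://s.jina.ai/"


def site_search_url(query: str, sites: Sequence[str] = ()) -> str:
    """Build an s.jina.ai URL restricted to the given sites."""
    url = SEARCH_PREFIX + quote(query, safe="")
    if sites:
        # A list of pairs lets urlencode() emit the same key several times.
        url += "?" + urlencode([("site", site) for site in sites])
    return url


print(site_search_url("When was Jina AI founded?", ["jina.ai", "github.com"]))
```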

Interactive Code Snippet Builder

We highly recommend using the code builder to explore different parameter combinations of the Reader API.

Using request headers

As you have already seen above, one can control the behavior of the Reader API using request headers. Here is a complete list of supported headers.

You can enable the image caption feature via the x-with-generated-alt: true header.
You can ask the Reader API to forward cookie settings via the x-set-cookie header.
- Note that requests with cookies will not be cached.
You can bypass readability filtering via the x-respond-with header, specifically:
- x-respond-with: markdown returns markdown without going through readability
- x-respond-with: html returns documentElement.outerHTML
- x-respond-with: text returns document.body.innerText
- x-respond-with: screenshot returns the URL of the webpage's screenshot
You can specify a proxy server via the x-proxy-url header.
You can customize cache tolerance via the x-cache-tolerance header (integer in seconds).
You can bypass the cached page (lifetime 3600s) via the x-no-cache: true header (equivalent to x-cache-tolerance: 0).
If you already know the HTML structure of your target page, you may specify x-target-selector or x-wait-for-selector to direct the Reader API to focus on a specific part of the page.
- By setting the x-target-selector header to a CSS selector, the Reader API returns the content within the matched element instead of the full HTML. Setting this header is useful when the automatic content extraction fails to capture the desired content and you can manually select the correct target.
- By setting the x-wait-for-selector header to a CSS selector, the Reader API will wait until the matched element is rendered before returning the content. If you already specified x-target-selector, this header can be omitted if you plan to wait for the same element.
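To make the header list above concrete, here is a minimal Python sketch that assembles a header dictionary from a few of these options. The function name, its parameters, and the validation of `x-respond-with` values are illustrative assumptions; only the header names and values come from the list above. The resulting dictionary can be passed to whichever HTTP client you use.

```python
from typing import Dict, Optional

# Values listed above for the x-respond-with header.
RESPOND_WITH_VALUES = {"markdown", "html", "text", "screenshot"}


def reader_headers(
    generated_alt: bool = False,
    respond_with: Optional[str] = None,
    target_selector: Optional[str] = None,
    cache_tolerance: Optional[int] = None,
    no_cache: bool = False,
) -> Dict[str, str]:
    """Build Reader request headers (hypothetical helper)."""
    headers: Dict[str, str] = {}
    if generated_alt:
        headers["x-with-generated-alt"] = "true"
    if respond_with is not None:
        if respond_with not in RESPOND_WITH_VALUES:
            raise ValueError(f"unsupported x-respond-with value: {respond_with!r}")
        headers["x-respond-with"] = respond_with
    if target_selector is not None:
        headers["x-target-selector"] = target_selector
    if no_cache:
        # Described above as equivalent to x-cache-tolerance: 0.
        headers["x-no-cache"] = "true"
    elif cache_tolerance is not None:
        headers["x-cache-tolerance"] = str(int(cache_tolerance))
    return headers


print(reader_headers(respond_with="markdown", target_selector="#content"))
```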

Using `r.jina.ai` for single page application (SPA) fetching

Many websites nowadays rely on JavaScript frameworks and client-side rendering; these are usually known as Single Page Applications (SPAs). Thanks to Puppeteer and the headless Chrome browser, Reader natively supports fetching these websites. However, depending on how a particular SPA is built, you may need to take some extra precautions.

SPAs with hash-based routing

By definition of the web standards, the content that comes after # in a URL is not sent to the server. To work around this, use the POST method and pass the target address as the url parameter in the request body.

curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
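The sketch below shows the same idea in Python. It first uses `urlsplit` to illustrate that the route lives in the URL fragment, then builds (without sending) a form-encoded POST request equivalent to the curl command above. `hash_route_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

READER_ENDPOINT = "https://r.jina.ai/"


def hash_route_request(target_url: str) -> Request:
    """Build a POST request that carries the full URL, fragment included."""
    body = urlencode({"url": target_url}).encode("ascii")
    # urllib sends a bytes body as application/x-www-form-urlencoded by default.
    return Request(READER_ENDPOINT, data=body, method="POST")


target = "https://example.com/#/route"
# The client-side route is the fragment, which is never part of the
# request line of a GET.
print(urlsplit(target).fragment)

request = hash_route_request(target)
print(request.get_method(), request.data)
# To actually send it: urllib.request.urlopen(request)
```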

SPAs with preloading contents

Some SPAs, or even some websites that are not strictly SPAs, may show preload content before loading the main content dynamically. In this case, Reader may capture the preload content instead of the main content. To mitigate this issue, here are some possible solutions:

Specifying `x-timeout`

When x-timeout is explicitly specified, Reader will not attempt to return early and will wait for network idle until the timeout is reached. This is useful when the target website eventually reaches network idle.

curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'

Specifying `x-wait-for-selector`

When x-wait-for-selector is explicitly specified, Reader will wait for the specified CSS selector to appear until the timeout is reached. This is useful when you know exactly which element to wait for.

curl 'https://r.jina.ai/https://example.com/' -H 'x-wait-for-selector: #content'
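Both mitigations are plain request headers, so they can be combined in one request. The following Python sketch builds such a header set; `preload_headers` and its argument checks are assumptions for illustration, and the timeout value is passed through as given because its unit is not stated here.

```python
from typing import Dict, Optional


def preload_headers(
    timeout: Optional[int] = None,
    wait_for_selector: Optional[str] = None,
) -> Dict[str, str]:
    """Build headers for pages that show preload content first."""
    headers: Dict[str, str] = {}
    if timeout is not None:
        if timeout <= 0:
            raise ValueError("timeout must be a positive integer")
        headers["x-timeout"] = str(timeout)
    if wait_for_selector:
        headers["x-wait-for-selector"] = wait_for_selector
    return headers


print(preload_headers(timeout=30, wait_for_selector="#content"))
```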

Streaming mode

Streaming mode is useful when you find that the standard mode provides an incomplete result. This is because in streaming mode Reader waits a bit longer, until the page is stably rendered. Use the Accept header to toggle streaming mode:

curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

The data comes in a stream; each subsequent chunk contains more complete information. The last chunk should provide the most complete and final result. If you are coming from LLMs, please note that this behaves differently from LLM text-generation streaming.

For example, compare the two curl commands below. The streaming one eventually gives you the complete information, whereas standard mode does not. This is because content loading on this particular site is triggered by some JavaScript after the page is fully loaded, and standard mode returns the page "too soon".

curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

Note: -H 'x-no-cache: true' is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing times. This allows for quicker access and more efficient data handling:

Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ... 
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)

Note that in terms of completeness: ... > streamContent3 > streamContent2 > streamContent1, each subsequent chunk contains more complete information.
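Because each chunk supersedes the previous one, a consumer that only wants the final result can simply keep the last chunk. The Python sketch below parses a generic `text/event-stream` body line by line and does exactly that. It assumes the chunks arrive as standard server-sent-event `data:` fields; the exact payload inside each event is not specified here, so it is treated as an opaque string. Both function names are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "data":
            buffer.append(value)
    if buffer:
        # Lenient choice: flush a trailing event that lacks a final blank line.
        yield "\n".join(buffer)


def last_chunk(lines: Iterable[str]) -> Optional[str]:
    """Return the final (most complete) chunk, or None for an empty stream."""
    final = None
    for final in iter_sse_data(lines):
        pass
    return final


sample = ["data: partial\n", "\n", "data: partial plus more\n", "\n"]
print(last_chunk(sample))
```

To interleave I/O with LLM processing as in the diagram, iterate over `iter_sse_data` directly and hand each chunk to your model as it arrives.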

JSON mode

This is still very early, and the result is not really a "useful" JSON: it contains only three fields, url, title, and content. Nonetheless, you can use the Accept header to control the output format:

curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

JSON mode is probably more useful in s.jina.ai than r.jina.ai. For s.jina.ai with JSON mode, it returns 5 results in a list, each in the structure of {'title', 'content', 'url'}.
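A minimal Python sketch for consuming the JSON output follows. It relies only on the three fields named above. As an assumption not confirmed by this document, it also tolerates the fields being nested under a top-level `data` key; adjust it to the response you actually receive. `ReaderPage` and `parse_reader_json` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class ReaderPage:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> ReaderPage:
    """Parse a JSON-mode response body into a ReaderPage."""
    payload = json.loads(body)
    # Assumption: fields are either top-level or wrapped in a "data" object.
    fields = payload.get("data", payload)
    return ReaderPage(
        url=fields["url"],
        title=fields["title"],
        content=fields["content"],
    )


sample = '{"url": "https://example.com/", "title": "Example", "content": "Hello"}'
print(parse_reader_json(sample).title)
```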

Generated alt

All images on the page that lack an alt tag can be auto-captioned by a VLM (vision language model) and formatted as ![Image [idx]: [VLM_caption]](img_URL). This should give your downstream text-only LLM just enough hints to include those images in reasoning, selecting, and summarization. Use the x-with-generated-alt header to toggle this feature:

curl -H "X-With-Generated-Alt: true" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
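If you want to work with the generated captions separately, they can be pulled out of the returned text with a regular expression. The sketch below assumes the captions appear in standard Markdown image syntax, i.e. `![Image 1: some caption](https://...)`; the pattern and the function name `extract_captions` are illustrative and may need adjusting to the actual output.

```python
import re
from typing import List, Tuple

# Matches ![Image <idx>: <caption>](<url>) on a single line.
IMAGE_PATTERN = re.compile(r"!\[Image (\d+): (.*?)\]\((\S+?)\)")


def extract_captions(markdown: str) -> List[Tuple[int, str, str]]:
    """Return (index, caption, image URL) for each captioned image."""
    return [
        (int(index), caption, url)
        for index, caption, url in IMAGE_PATTERN.findall(markdown)
    ]


sample = (
    "Intro text\n"
    "![Image 1: A red panda on a branch](https://example.com/panda.jpg)\n"
    "More text ![Image 2: Site logo](https://example.com/logo.png)\n"
)
for index, caption, url in extract_captions(sample):
    print(index, caption, url)
```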

Install

You will need the following tools to run the project:

Node v18 (The build fails for Node version >18)

git clone git@github.com:jina-ai/reader.git
npm install

What is `thinapps-shared` submodule?

You might notice a reference to the thinapps-shared submodule, an internal package we use to share code across our products. It is not open-sourced and is not integral to Reader's functions; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is the single codebase behind https://r.jina.ai, so every time we commit here, we deploy the new version to https://r.jina.ai.

Having trouble on some websites?

Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.

License

Reader is backed by Jina AI and licensed under Apache-2.0.
