reader
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
Stars: 6529
Reader is a tool that converts any URL to an LLM-friendly input with a simple prefix `https://r.jina.ai/`. It improves the output for your agent and RAG systems at no cost. Reader supports image reading, captioning all images at the specified URL and adding `Image [idx]: [caption]` as an alt tag. This enables downstream LLMs to interact with the images in reasoning, summarizing, etc. Reader offers a streaming mode, useful when the standard mode provides an incomplete result. In streaming mode, Reader waits a bit longer until the page is fully rendered, providing more complete information. Reader also supports a JSON mode, which contains three fields: `url`, `title`, and `content`. Reader is backed by Jina AI and licensed under Apache-2.0.
README:
Your LLMs deserve better input.
Reader does two things:
- Read: It converts any URL to an LLM-friendly input with `https://r.jina.ai/https://your.url`. Get improved output for your agent and RAG systems at no cost.
- Search: It searches the web for a given query with `https://s.jina.ai/your+query`. This allows your LLMs to access the latest world knowledge from the web.
Check out the live demo, or just visit these URLs and see for yourself: (Read) https://r.jina.ai/https://github.com/jina-ai/reader, (Search) https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
Feel free to use the Reader API in production. It is free, stable, and scalable. We maintain it actively as one of the core products of Jina AI. Check out the rate limits.
- 2024-07-15: To restrict the results of `s.jina.ai` to a certain domain/website, you can set e.g. `site=jina.ai` in the query parameters, which enables in-site search. For more options, try our updated live demo.
- 2024-07-01: We have resolved a DDoS attack and other traffic abuse ongoing since June 27th. We also found a bug introduced on June 28th which may have caused higher latency for some websites. The attack and the bug have been resolved; if you experienced high latency on r.jina.ai between June 27th and 30th, it should be back to normal now.
- 2024-05-30: Reader can now read arbitrary PDFs from any URL! Check out this PDF result from NASA.gov vs the original.
- 2024-05-15: We introduced a new endpoint `s.jina.ai` that searches the web and returns the top-5 results, each in an LLM-friendly format. Read more about this new feature here.
- 2024-05-08: Image captioning is off by default for better latency. To turn it on, set `x-with-generated-alt: true` in the request header.
- 2024-05-03: We finally resolved a DDoS attack ongoing since April 29th. Our API is now much more reliable and scalable than ever!
- 2024-04-24: You now have more fine-grained control over the Reader API using headers, e.g. forwarding cookies or using an HTTP proxy.
- 2024-04-15: Reader now supports image reading! It captions all images at the specified URL and adds `Image [idx]: [caption]` as an alt tag (if they initially lack one). This enables downstream LLMs to interact with the images in reasoning, summarizing, etc. See example here.
Simply prepend `https://r.jina.ai/` to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use:
https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
Simply prepend `https://s.jina.ai/` to your search query. Note that if you are using this in code, make sure to URL-encode your search query first. For example, if your query is "Who will win 2024 US presidential election?", your URL should look like:
https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
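If you are doing this in code, the standard library of most languages handles the encoding. Here is a minimal Python sketch of that step, assuming nothing beyond the stdlib:

```python
import urllib.parse
import urllib.request

# Percent-encode the query, then prepend the s.jina.ai prefix.
query = "Who will win 2024 US presidential election?"
url = "https://s.jina.ai/" + urllib.parse.quote(query)

with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))
```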
Behind the scenes, Reader searches the web, fetches the top 5 results, visits each URL, and applies r.jina.ai to it. This is different from the web-search function calling in many agent/RAG frameworks, which often returns only the title, URL, and description provided by the search engine API; if you want to read one result more deeply, you have to fetch the content yourself from that URL. With Reader, `s.jina.ai` automatically fetches the content from the top 5 search result URLs for you (reusing the tech stack behind `r.jina.ai`). This means you don't have to handle browser rendering, blocking, or any issues related to JavaScript and CSS yourself.
Simply specify `site` in the query parameters, for example:
curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
We highly recommend using the code builder to explore different parameter combinations of the Reader API.
As you have already seen above, one can control the behavior of the Reader API using request headers. Here is a complete list of supported headers.
- You can enable the image caption feature via the `x-with-generated-alt: true` header.
- You can ask the Reader API to forward cookie settings via the `x-set-cookie` header.
  - Note that requests with cookies will not be cached.
- You can bypass `readability` filtering via the `x-respond-with` header, specifically:
  - `x-respond-with: markdown` returns markdown without going through readability
  - `x-respond-with: html` returns `documentElement.outerHTML`
  - `x-respond-with: text` returns `document.body.innerText`
  - `x-respond-with: screenshot` returns the URL of the webpage's screenshot
- You can specify a proxy server via the `x-proxy-url` header.
- You can customize cache tolerance via the `x-cache-tolerance` header (integer in seconds).
- You can bypass the cached page (lifetime 3600s) via the `x-no-cache: true` header (equivalent to `x-cache-tolerance: 0`).
- If you already know the HTML structure of your target page, you may specify `x-target-selector` or `x-wait-for-selector` to direct the Reader API to focus on a specific part of the page.
  - By setting the `x-target-selector` header to a CSS selector, the Reader API returns the content within the matched element instead of the full HTML. Setting this header is useful when the automatic content extraction fails to capture the desired content and you can manually select the correct target.
  - By setting the `x-wait-for-selector` header to a CSS selector, the Reader API will wait until the matched element is rendered before returning the content. If you already specified `x-target-selector`, this header can be omitted if you plan to wait for the same element.
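To make the header mechanics concrete, here is a minimal Python sketch combining a few of the headers above in one request; the target URL and the `#main` selector are illustrative placeholders, not part of the API:

```python
import urllib.request

# Combine several Reader API headers in a single request.
req = urllib.request.Request(
    "https://r.jina.ai/https://example.com",  # placeholder target URL
    headers={
        "X-With-Generated-Alt": "true",  # caption images that lack alt text
        "X-No-Cache": "true",            # bypass the cached page
        "X-Target-Selector": "#main",    # hypothetical CSS selector
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```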
Many websites nowadays rely on JavaScript frameworks and client-side rendering, commonly known as Single Page Applications (SPAs). Thanks to Puppeteer and a headless Chrome browser, Reader natively supports fetching these websites. However, due to the specific way some SPAs are developed, there may be some extra precautions to take.
By definition of the web standards, content after the `#` in a URL (the fragment) is never sent to the server. To mitigate this issue, use the `POST` method with a `url` parameter in the body:
curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
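The same request in Python, as a sketch (the `#/route` URL is the placeholder from the curl example above):

```python
import urllib.parse
import urllib.request

# Put the URL in the POST body so the '#/route' fragment reaches Reader
# instead of being dropped client-side.
data = urllib.parse.urlencode({"url": "https://example.com/#/route"}).encode()
with urllib.request.urlopen("https://r.jina.ai/", data=data) as resp:
    print(resp.read().decode("utf-8"))
```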
Some SPAs, and even some websites that are not strictly SPAs, may show preloaded placeholder content before loading the main content dynamically. In this case, Reader may capture the preloaded content instead of the main content. To mitigate this issue, here are some possible solutions:
When `x-timeout` is explicitly specified, Reader will not attempt to return early and will wait for network idle until the timeout is reached. This is useful when the target website eventually reaches network idle.
curl 'https://r.jina.ai/https://example.com' -H 'x-timeout: 30'
When `x-wait-for-selector` is explicitly specified, Reader will wait for the specified CSS selector to appear until the timeout is reached. This is useful when you know exactly which element to wait for.
curl 'https://r.jina.ai/https://example.com' -H 'x-wait-for-selector: #content'
Streaming mode is useful when you find that the standard mode provides an incomplete result. This is because Reader will wait a bit longer until the page is stably rendered. Use the `Accept` header to toggle streaming mode:
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
The data comes in a stream; each subsequent chunk contains more complete information. The last chunk should provide the most complete and final result. If you come from the LLM world, please note that this is a different behavior from LLMs' text-generation streaming.
For example, compare the two curl commands below. You can see that the streaming one eventually gives you complete information, whereas standard mode does not. This is because content loading on this particular site is triggered by some JavaScript after the page is fully loaded, and standard mode returns the page "too soon".
curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
Note: `-H 'x-no-cache: true'` is used only for demonstration purposes to bypass the cache.
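If you consume the stream programmatically, a minimal Python sketch looks like the following; it assumes each event's payload arrives on a single `data:` line, which a robust SSE client should not rely on:

```python
import urllib.request

# Read the event stream and keep only the last (most complete) chunk.
req = urllib.request.Request(
    "https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page",
    headers={"Accept": "text/event-stream"},
)
last_chunk = None
with urllib.request.urlopen(req) as resp:
    for raw_line in resp:
        line = raw_line.decode("utf-8")
        if line.startswith("data:"):
            last_chunk = line[len("data:"):].strip()

print(last_chunk)  # the final chunk should be the most complete result
```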
Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing times. This allows for quicker access and more efficient data handling:
Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ----> ...
                    |                     |                     |
                    v                     |                     |
Your LLM:  LLM(streamContent1)           |                     |
                                         v                     |
                                LLM(streamContent2)            |
                                                               v
                                                      LLM(streamContent3)
Note that in terms of completeness: `... > streamContent3 > streamContent2 > streamContent1`; each subsequent chunk contains more complete information.
This is still very early and the result is not really a "useful" JSON. It contains only three fields: `url`, `title`, and `content`. Nonetheless, you can use the `Accept` header to control the output format:
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
JSON mode is probably more useful for `s.jina.ai` than `r.jina.ai`. For `s.jina.ai` with JSON mode, it returns 5 results in a list, each in the structure of `{'title', 'content', 'url'}`.
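A minimal Python sketch of consuming the JSON output of `s.jina.ai`, assuming the response body is the list described above (the exact envelope may differ):

```python
import json
import urllib.request

# Request JSON mode and iterate over the returned search results.
req = urllib.request.Request(
    "https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F",
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    results = json.loads(resp.read().decode("utf-8"))

for item in results:  # assumes a top-level list of {'title', 'content', 'url'}
    print(item["url"], "-", item["title"])
```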
All images on the page that lack an `alt` tag can be auto-captioned by a VLM (vision language model) and formatted as `![Image [idx]: [VLM_caption]](img_URL)`. This should give your downstream text-only LLM just enough hints to include those images in reasoning, selecting, and summarization. Use the `x-with-generated-alt` header to toggle image captioning:
curl -H "X-With-Generated-Alt: true" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
You will need the following tools to run the project:
- Node v18 (the build fails for Node versions >18)
- Firebase CLI (`npm install -g firebase-tools`)
For the backend, go to the `backend/functions` directory and install the npm dependencies.
git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install
You might notice a reference to the `thinapps-shared` submodule, an internal package we use to share code across our products. It is not open-sourced and isn't integral to Reader's functions; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.
That said, this is the single codebase behind https://r.jina.ai, so every time we commit here, we deploy the new version to https://r.jina.ai.
Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.
Reader is backed by Jina AI and licensed under Apache-2.0.
Alternative AI tools for reader
Similar Open Source Tools
dir-assistant
Dir-assistant is a tool that allows users to interact with their current directory's files using local or API Language Models (LLMs). It supports various platforms and provides API support for major LLM APIs. Users can configure and customize their local LLMs and API LLMs using the tool. Dir-assistant also supports model downloads and configurations for efficient usage. It is designed to enhance file interaction and retrieval using advanced language models.
warc-gpt
WARC-GPT is an experimental retrieval augmented generation pipeline for web archive collections. It allows users to interact with WARC files, extract text, generate text embeddings, visualize embeddings, and interact with a web UI and API. The tool is highly customizable, supporting various LLMs, providers, and embedding models. Users can configure the application using environment variables, ingest WARC files, start the server, and interact with the web UI and API to search for content and generate text completions. WARC-GPT is designed for exploration and experimentation in exploring web archives using AI.
sage
Sage is a tool that allows users to chat with any codebase, providing a chat interface for code understanding and integration. It simplifies the process of learning how a codebase works by offering heavily documented answers sourced directly from the code. Users can set up Sage locally or on the cloud with minimal effort. The tool is designed to be easily customizable, allowing users to swap components of the pipeline and improve the algorithms powering code understanding and generation.
ai-models
The `ai-models` command is a tool used to run AI-based weather forecasting models. It provides functionalities to install, run, and manage different AI models for weather forecasting. Users can easily install and run various models, customize model settings, download assets, and manage input data from different sources such as ECMWF, CDS, and GRIB files. The tool is designed to optimize performance by running on GPUs and provides options for better organization of assets and output files. It offers a range of command line options for users to interact with the models and customize their forecasting tasks.
blinkid-ios
BlinkID iOS is a mobile SDK that enables developers to easily integrate ID scanning and data extraction capabilities into their iOS applications. The SDK supports scanning and processing various types of identity documents, such as passports, driver's licenses, and ID cards. It provides accurate and fast data extraction, including personal information and document details. With BlinkID iOS, developers can enhance their apps with secure and reliable ID verification functionality, improving user experience and streamlining identity verification processes.
opencommit
OpenCommit is a tool that auto-generates meaningful commits using AI, allowing users to quickly create commit messages for their staged changes. It provides a CLI interface for easy usage and supports customization of commit descriptions, emojis, and AI models. Users can configure local and global settings, switch between different AI providers, and set up Git hooks for integration with IDE Source Control. Additionally, OpenCommit can be used as a GitHub Action to automatically improve commit messages on push events, ensuring all commits are meaningful and not generic. Payments for OpenAI API requests are handled by the user, with the tool storing API keys locally.
langroid-examples
Langroid-examples is a repository containing examples of using the Langroid Multi-Agent Programming framework to build LLM applications. It provides a collection of scripts and instructions for setting up the environment, working with local LLMs, using OpenAI LLMs, and running various examples. The repository also includes optional setup instructions for integrating with Qdrant, Redis, Momento, GitHub, and Google Custom Search API. Users can explore different scenarios and functionalities of Langroid through the provided examples and documentation.
code2prompt
code2prompt is a command-line tool that converts your codebase into a single LLM prompt with a source tree, prompt templating, and token counting. It automates generating LLM prompts from codebases of any size, customizing prompt generation with Handlebars templates, respecting .gitignore, filtering and excluding files using glob patterns, displaying token count, including Git diff output, copying prompt to clipboard, saving prompt to an output file, excluding files and folders, adding line numbers to source code blocks, and more. It helps streamline the process of creating LLM prompts for code analysis, generation, and other tasks.
gpt-cli
gpt-cli is a command-line interface tool for interacting with various chat language models like ChatGPT, Claude, and others. It supports model customization, usage tracking, keyboard shortcuts, multi-line input, markdown support, predefined messages, and multiple assistants. Users can easily switch between different assistants, define custom assistants, and configure model parameters and API keys in a YAML file for easy customization and management.
BentoVLLM
BentoVLLM is an example project demonstrating how to serve and deploy open-source Large Language Models using vLLM, a high-throughput and memory-efficient inference engine. It provides a basis for advanced code customization, such as custom models, inference logic, or vLLM options. The project allows for simple LLM hosting with OpenAI compatible endpoints without the need to write any code. Users can interact with the server using Swagger UI or other methods, and the service can be deployed to BentoCloud for better management and scalability. Additionally, the repository includes integration examples for different LLM models and tools.
eval-dev-quality
DevQualityEval is an evaluation benchmark and framework designed to compare and improve the quality of code generation of large language models (LLMs). It provides developers with a standardized benchmark to enhance real-world usage in software development and offers users metrics and comparisons to assess the usefulness of LLMs for their tasks. The tool evaluates LLMs' performance in solving software development tasks and measures the quality of their results through a point-based system. Users can run specific tasks, such as test generation, across different programming languages to evaluate LLMs' language understanding and code generation capabilities.
renumics-rag
Renumics RAG is a retrieval-augmented generation assistant demo that utilizes LangChain and Streamlit. It provides a tool for indexing documents and answering questions based on the indexed data. Users can explore and visualize RAG data, configure OpenAI and Hugging Face models, and interactively explore questions and document snippets. The tool supports GPU and CPU setups, offers a command-line interface for retrieving and answering questions, and includes a web application for easy access. It also allows users to customize retrieval settings, embeddings models, and database creation. Renumics RAG is designed to enhance the question-answering process by leveraging indexed documents and providing detailed answers with sources.
vectara-answer
Vectara Answer is a sample app for Vectara-powered Summarized Semantic Search (or question-answering) with advanced configuration options. For examples of what you can build with Vectara Answer, check out Ask News, LegalAid, or any of the other demo applications.
TypeGPT
TypeGPT is a Python application that enables users to interact with ChatGPT or Google Gemini from any text field in their operating system using keyboard shortcuts. It provides global accessibility, keyboard shortcuts for communication, and clipboard integration for larger text inputs. Users need to have Python 3.x installed along with specific packages and API keys from OpenAI for ChatGPT access. The tool allows users to run the program normally or in the background, manage processes, and stop the program. Users can use keyboard shortcuts like `/ask`, `/see`, `/stop`, `/chatgpt`, `/gemini`, `/check`, and `Shift + Cmd + Enter` to interact with the application in any text field. Customization options are available by modifying files like `keys.txt` and `system_prompt.txt`. Contributions are welcome, and future plans include adding support for other APIs and a user-friendly GUI.
aides-jeunes
The user interface (and the main server) of the simulator of aids and social benefits for young people. It is based on the free socio-fiscal simulator Openfisca.
For similar jobs
ChatFAQ
ChatFAQ is an open-source comprehensive platform for creating a wide variety of chatbots: generic ones, business-trained, or even capable of redirecting requests to human operators. It includes a specialized NLP/NLG engine based on a RAG architecture and customized chat widgets, ensuring a tailored experience for users and avoiding vendor lock-in.
anything-llm
AnythingLLM is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
mikupad
mikupad is a lightweight and efficient language model front-end powered by ReactJS, all packed into a single HTML file. Inspired by the likes of NovelAI, it provides a simple yet powerful interface for generating text with the help of various backends.
glide
Glide is a cloud-native LLM gateway that provides a unified REST API for accessing various large language models (LLMs) from different providers. It handles LLMOps tasks such as model failover, caching, key management, and more, making it easy to integrate LLMs into applications. Glide supports popular LLM providers like OpenAI, Anthropic, Azure OpenAI, AWS Bedrock (Titan), Cohere, Google Gemini, OctoML, and Ollama. It offers high availability, performance, and observability, and provides SDKs for Python and NodeJS to simplify integration.
onnxruntime-genai
ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high level `generate()` method, or run each iteration of the model in a loop. It supports greedy/beam search and TopP, TopK sampling to generate token sequences, has built in logits processing like repetition penalties, and allows for easy custom scoring.
firecrawl
Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown. It crawls all accessible subpages and provides clean markdown for each, without requiring a sitemap. The API is easy to use and can be self-hosted. It also integrates with Langchain and Llama Index. The Python SDK makes it easy to crawl and scrape websites in Python code.