
OneKE
[WWW 2025] A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System.
Stars: 51

OneKE is a flexible dockerized system for schema-guided knowledge extraction, capable of extracting information from the web and raw PDF books across multiple domains like science and news. It employs a collaborative multi-agent approach and includes a user-customizable knowledge base to enable tailored extraction. OneKE offers various IE tasks support, data sources support, LLMs support, extraction method support, and knowledge base configuration. Users can start with examples using YAML, Python, or Web UI, and perform tasks like Named Entity Recognition, Relation Extraction, Event Extraction, Triple Extraction, and Open Domain IE. The tool supports different source formats like Plain Text, HTML, PDF, Word, TXT, and JSON files. Users can choose from various extraction models like OpenAI, DeepSeek, LLaMA, Qwen, ChatGLM, MiniCPM, and OneKE for information extraction tasks. Extraction methods include Schema Agent, Extraction Agent, and Reflection Agent. The tool also provides support for schema repository and case repository management, along with solutions for network issues. Contributors to the project include Ningyu Zhang, Haofen Wang, Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, and Huajun Chen.
README:
- Table of Contents
- 🔔News
- 🌟Overview
- 🚀Quick Start
- 🔍Further Usage
- 🛠️Network Issue Solutions
- 🎉Contributors
- 🌻Acknowledgement
- [2025/02] We support the local deployment of the DeepSeek-R1 series in addition to the existing API service, as well as vllm acceleration for other LLMs.
- [2025/01] OneKE is accepted by WWW 2025 Demonstration Track 🎉🎉🎉.
- [2024/12] We open source the OneKE framework, supporting multi-agent knowledge extraction across various scenarios.
- [2024/04] We release a new bilingual (Chinese and English) schema-based information extraction model called OneKE based on Chinese-Alpaca-2-13B.
OneKE is a flexible dockerized system for schema-guided knowledge extraction, capable of extracting information from the web and raw PDF books across multiple domains like science and news. It employs a collaborative multi-agent approach and includes a user-customizable knowledge base to enable tailored extraction. Embark on your information extraction journey with OneKE!
OneKE currently offers the following features:
- [x] Various IE Tasks Support
- [x] Various Data Sources Support
- [x] Various LLMs Support
- [x] Various Extraction Method Support
- [x] User-Configurable Knowledge Base
We have developed a web demo for OneKE with Gradio. Click here to try information extraction in an intuitive way.
Note: The demo only displays OneKE's basic capabilities for efficiency. Consider the local deployment steps below for further features.
OneKE supports both manual and Docker image environment configuration. Choose your preferred method to build.
Conda virtual environments offer a light and flexible setup.
Prerequisites
- Anaconda Installation
- GPU support (recommended CUDA version: 12.4)
Configure Steps
- Clone the repository:
git clone https://github.com/zjunlp/OneKE.git
- Enter the working directory, and all subsequent commands should be executed in this directory.
cd OneKE
- Create a virtual environment using Anaconda.
conda create -n oneke python=3.9
conda activate oneke
- Install all required Python packages.
pip install -r requirements.txt
# If you encounter network issues, consider setting up a domestic mirror for pip.
Docker image provides greater reliability and stability.
Prerequisites
- Docker Installation
- NVIDIA Container Toolkit
- GPU support (recommended CUDA version: 12.4)
Configure Steps
- Clone the repository:
git clone https://github.com/zjunlp/OneKE.git
- Pull the docker image from the mirror repository.
docker pull zjunlp/oneke:v4
# If you encounter network issues, consider setting up domestic registry mirrors for docker.
- Launch a container from the image.
docker run --gpus all \
-v ./OneKE:/app/OneKE \
-it zjunlp/oneke:v4 /bin/bash
If using locally deployed models, ensure the local model path is mapped to the container:
docker run --gpus all \
-v ./OneKE:/app/OneKE \
-v your_local_model_path:/app/model/your_model_name \
-it zjunlp/oneke:v4 /bin/bash
Map any necessary local files to the container paths as shown above, and use container paths in your code and execution.
Upon starting, the container will enter the /app/OneKE directory as its working directory. Just modify the code locally as needed, and the changes will sync to the container through mapping.
We offer three quick-start options. Choose your preferred method to swiftly explore OneKE with predefined examples.
Note:
- Ensure that your working directory is set to the OneKE folder, whether in a virtual environment or a docker container.
- Refer to here to resolve the network issues. If you have more questions, feel free to open an issue with us.
Step1: Prepare the configuration file
Several YAML configuration files are available in the examples/config directory. These extraction scenarios cover different extraction data, methods, and models, allowing you to easily explore all the features of OneKE.
Web News Extraction:
Here is the example for the web news knowledge extraction scenario, with the source extraction text in HTML format:
# model configuration
model:
  category: DeepSeek # model category, chosen from ChatGPT, DeepSeek, LLaMA, Qwen, ChatGLM, MiniCPM, OneKE.
  model_name_or_path: deepseek-chat # model name, chosen from deepseek-chat and deepseek-reasoner. Choose deepseek-chat to use DeepSeek-V3 or choose deepseek-reasoner to use DeepSeek-R1.
  api_key: your_api_key # your API key for the model with API service. No need for open-source models.
  base_url: https://api.deepseek.com # base URL for the API service. No need for open-source models.
# extraction configuration
extraction:
  task: Base # task type, chosen from Base, NER, RE, EE.
  instruction: Extract key information from the given text. # description for the task. No need for NER, RE, EE task.
  use_file: true # whether to use a file for the input text. Default set to false.
  file_path: ./data/input_files/Tulsi_Gabbard_News.html # path to the input file. No need if use_file is set to false.
  output_schema: NewsReport # output schema for the extraction task. Selected from the schema repository.
  mode: customized # extraction mode, chosen from quick, detailed, customized. Default set to quick. See src/config.yaml for more details.
  update_case: false # whether to update the case repository. Default set to false.
  show_trajectory: false # whether to display the extracted intermediate steps
Book Knowledge Extraction:
Here is the example for the book knowledge extraction scenario, with the source extraction text in PDF format:
model:
  # Recommend using ChatGPT or DeepSeek APIs for complex IE task.
  category: ChatGPT # model category, chosen from ChatGPT, DeepSeek, LLaMA, Qwen, ChatGLM, MiniCPM, OneKE.
  model_name_or_path: gpt-4o-mini # model name, chosen from the model list of the selected category.
  api_key: your_api_key # your API key for the model with API service. No need for open-source models.
  base_url: https://api.openai.com/v1 # base URL for the API service. No need for open-source models.
extraction:
  task: Base # task type, chosen from Base, NER, RE, EE.
  instruction: Extract main characters and background setting from this chapter. # description for the task. No need for NER, RE, EE task.
  use_file: true # whether to use a file for the input text. Default set to false.
  file_path: ./data/input_files/Harry_Potter_Chapter1.pdf # path to the input file. No need if use_file is set to false.
  mode: quick # extraction mode, chosen from quick, detailed, customized. Default set to quick. See src/config.yaml for more details.
  update_case: false # whether to update the case repository. Default set to false.
  show_trajectory: false # whether to display the extracted intermediate steps
The model section contains information about the extraction model, while the extraction section configures the settings for the extraction process.
You can choose an existing configuration file or customize the extraction settings as you wish. Note that when using an API service like ChatGPT and DeepSeek, please set your API key.
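As a rough illustration of the rules stated in the YAML comments above, the following sketch checks a configuration that has already been loaded into a Python dictionary. The function name check_config and the constant names are hypothetical and not part of OneKE; the allowed values are copied from the comments in the example files, and the real loader may enforce different rules.

```python
# Hypothetical helper, not part of OneKE. Allowed values are copied from the
# comments in the example YAML files and from the task list in this README.
ALLOWED_CATEGORIES = {"ChatGPT", "DeepSeek", "LLaMA", "Qwen", "ChatGLM", "MiniCPM", "OneKE"}
ALLOWED_TASKS = {"Base", "NER", "RE", "EE", "Triple"}
ALLOWED_MODES = {"quick", "detailed", "customized"}
API_CATEGORIES = {"ChatGPT", "DeepSeek"}


def check_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    model = config.get("model", {})
    extraction = config.get("extraction", {})

    category = model.get("category")
    if category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown model category: {category!r}")
    # API services need a key; locally served models (vllm_serve: true) do not.
    uses_api = category in API_CATEGORIES and not model.get("vllm_serve", False)
    if uses_api and not model.get("api_key"):
        problems.append("api_key is required when using an API service")

    task = extraction.get("task")
    if task not in ALLOWED_TASKS:
        problems.append(f"unknown task: {task!r}")
    if task == "Base" and not extraction.get("instruction"):
        problems.append("the Base task needs an instruction")

    # use_file defaults to false, in which case the text field carries the input.
    if extraction.get("use_file", False):
        if not extraction.get("file_path"):
            problems.append("file_path is required when use_file is true")
    elif not extraction.get("text"):
        problems.append("text is required when use_file is false")

    if extraction.get("mode", "quick") not in ALLOWED_MODES:
        problems.append(f"unknown mode: {extraction.get('mode')!r}")
    return problems
```

Running such a check before starting an extraction surfaces a missing API key or file path immediately instead of partway through a run.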
Step2: Run the shell script
Specify the configuration file path and run the code to start the extraction process.
config_file=your_yaml_file_path # configuration file path, use the container path if inside a container
python src/run.py --config $config_file # start extraction, executed in the OneKE directory
If you want to deploy the local models using vllm, run the following code:
config_file=your_yaml_file_path # REMEMBER to set vllm_serve to TRUE!
python src/models/vllm_serve.py --config $config_file # deploy local model via vllm, executed in the OneKE directory
python src/run.py --config $config_file # start extraction, executed in the OneKE directory
Refer to here to get an overview of the knowledge extraction results.
You can also try OneKE by directly running the example.py file located in the examples directory. Specifically, execute the following commands:
python examples/example.py
This will complete a basic NER task, with the extraction results printed upon completion. You can further modify the code in example.py to suit your extraction task setting or to access the detailed extraction trajectory.
Named Entity Extraction:
Specifically, we present a NER case in the example.py file:
import sys
sys.path.append("./src")
from models import *
from pipeline import *
import json
# model configuration
model = ChatGPT(model_name_or_path="gpt-4o-mini", api_key="your_api_key")
pipeline = Pipeline(model)
# extraction configuration
Task = "NER"
Text = "Finally , every other year , ELRA organizes a major conference LREC , the International Language Resources and Evaluation Conference."
# the type constraint is passed as a list of strings
Constraint = ["nationality", "country capital", "place of death", "children", "location contains", "place of birth", "place lived", "administrative division of country", "country of administrative divisions", "company", "neighborhood of", "company founders"]
# get extraction result
result, trajectory, frontend_schema, frontend_res = pipeline.get_extract_result(task=Task, text=Text, constraint=Constraint)
print("Trajectory:", json.dumps(trajectory, indent=4))
First, select an appropriate extraction model, then complete the configuration of extraction parameters (such as extraction task, extraction text, etc.). Finally, call the get_extract_result function of the Pipeline class to perform information extraction and obtain the final results.
Refer to here to get an overview of the knowledge extraction results.
Note: Before starting with the web UI, make sure the package gradio 4.44.0 is already installed in your environment.
Step1: Execute Command
Execute the following commands in the OneKE directory:
python src/webui.py
Step2: Open your Web Browser
The front-end is built with Gradio, and the default port of Gradio is 7860. Therefore, please enter the following URL in your browser's address bar to open the web interface:
http://127.0.0.1:7860
Similarly, you can visually configure tasks and obtain results through the front-end interface.
- 🎲 Quick Start with an Example 🎲: Quickly get a simple example to try OneKE.
- Submit: After configuring your LLM, parameters, and tasks, click this button to run OneKE.
- Clear: When a task is completed, click this button to restore the initial state.
You can try different types of information extraction tasks within the OneKE framework.
Task | Description |
---|---|
Traditional IE | |
NER | Named Entity Recognition, identifies and classifies various named entities such as names, locations, and organizations in text. |
RE | Relation Extraction, identifies relationships between entities, and typically returns results as entity-relation-entity triples. |
EE | Event Extraction, identifies events in text, focusing on event triggers and associated participants, known as event arguments. |
Triple | Triple Extraction, identifies subject-predicate-object triples in text. A triple is a fundamental data structure in information extraction, representing a piece of knowledge or fact. Knowledge graph can be quickly constructed after the Triple Extraction. |
Open Domain IE | |
Web News Extraction | Involves extracting key entities and events from online news articles to generate structured insights. |
Book Knowledge Extraction | Extracts information such as key concepts, themes, and facts from book chapters. |
Other | Encompasses information extraction from different types of content, such as social media and research papers, each tailored to the specific context and data type. |
In subsequent code processing, we categorize tasks into five types: NER for Named Entity Recognition, RE for Relation Extraction, EE for Event Extraction, Triple for Triple Extraction, and Base for any other user-defined open-domain extraction tasks.
Named entity recognition seeks to locate and classify named entities mentioned in unstructured text into pre-defined entity types such as person names, organizations, locations, etc.
Refer to the case defined in examples/config/NER.yaml as an example:
Text | Entity Types |
---|---|
Finally, every other year, ELRA organizes a major conference LREC, the International Language Resources and Evaluation Conference. | Algorithm, Conference, Else, Product, Task, Field, Metrics, Organization, Researcher, Program Language, Country, Location, Person, University |
In this task setting, Text represents the text to be extracted, while Entity Types denote the constraint on the types of entities to be extracted. Accordingly, we set the text and constraint attributes in the YAML file to their respective values.
Next, follow the steps below to complete the NER task:
- Complete ./examples/config/NER.yaml: configure the necessary model and extraction settings.
- Run the shell script below:
config_file=./examples/config/NER.yaml
python src/run.py --config $config_file
(Refer to issues for any network issues.)
The final extraction result should be:
Text | Conference |
---|---|
Finally, every other year, ELRA organizes a major conference LREC, the International Language Resources and Evaluation Conference. | ELRA, LREC, International Language Resources and Evaluation Conference |
Click here to obtain the raw results in JSON format.
Note: The actual extraction results may not exactly match this due to LLM randomness.
The result indicates that, given the text and entity type constraint, entities of type conference have been extracted: ELRA, LREC, International Language Resources and Evaluation Conference.
You can either specify entity type constraints or omit them. Without constraints, OneKE will extract all entities from the sentence.
Relation extraction is the task of extracting semantic relations between entities from unstructured text.
Refer to the case defined in examples/config/RE.yaml as an example:
Text | Relation Types |
---|---|
The aid group Doctors Without Borders said that since Saturday , more than 275 wounded people had been admitted and treated at Donka Hospital in the capital of Guinea , Conakry . | Nationality, Country Capital, Place of Death, Children, Location Contains, Place of Birth, Place Lived, Administrative Division of Country, Country of Administrative Divisions, Company, Neighborhood of, Company Founders |
In this task setting, Text represents the text to be extracted, while Relation Types denote the constraint on the types of relations between entities to be extracted. Accordingly, we set the text and constraint attributes in the YAML file to their respective values.
Next, follow the steps below to complete the RE task:
- Complete ./examples/config/RE.yaml: configure the necessary model and extraction settings.
- Run the shell script below:
(Refer to issues for any network issues.)
config_file=./examples/config/RE.yaml
python src/run.py --config $config_file
The final extraction result should be:
Text | Head Entity | Tail Entity | Relationship |
---|---|---|---|
The aid group Doctors Without Borders said that since Saturday , more than 275 wounded people had been admitted and treated at Donka Hospital in the capital of Guinea , Conakry . | Guinea | Conakry | Country-Capital |
Click here to obtain the raw results in JSON format.
Note: The actual extraction results may not exactly match this due to LLM randomness.
The result indicates that the relation Country-Capital is extracted from the given text based on the relation list, along with the corresponding head entity Guinea and tail entity Conakry, which denotes that Conakry is the capital of Guinea.
You can either specify relation type constraints or omit them. Without constraints, OneKE will extract all relation triples from the sentence.
Event extraction is the task of extracting event types, event trigger words, and event arguments from unstructured text, which is a more complex IE task than the first two.
Refer to the case defined in examples/config/EE.yaml as an example:
The extraction text is:
UConn Health , an academic medical center , says in a media statement that it identified approximately 326,000 potentially impacted individuals whose personal information was contained in the compromised email accounts.
while the event type constraint is formatted as follows:
Event Type | Event Argument |
---|---|
phishing | damage amount, attack pattern, tool, victim, place, attacker, purpose, trusted entity, time |
data breach | damage amount, attack pattern, number of data, number of victim, tool, compromised data, victim, place, attacker, purpose, time |
ransom | damage amount, attack pattern, payment method, tool, victim, place, attacker, price, time |
discover vulnerability | vulnerable system, vulnerability, vulnerable system owner, vulnerable system version, supported platform, common vulnerabilities and exposures, capabilities, time, discoverer |
patch vulnerability | vulnerable system, vulnerability, issues addressed, vulnerable system version, releaser, supported platform, common vulnerabilities and exposures, patch number, time, patch |
Each event type has its own corresponding event arguments.
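To make the pairing of event types and arguments concrete, the sketch below stores two rows of the table above in a dictionary and flags argument roles that do not belong to a given event type. The names EVENT_SCHEMA and unknown_roles, and the (argument, role) pair format, are assumptions made for illustration; they do not describe how OneKE represents event constraints internally.

```python
# Two rows of the event table above, stored as event type -> allowed argument roles.
# The dictionary layout is an assumption for illustration only.
EVENT_SCHEMA = {
    "data breach": [
        "damage amount", "attack pattern", "number of data", "number of victim",
        "tool", "compromised data", "victim", "place", "attacker", "purpose", "time",
    ],
    "ransom": [
        "damage amount", "attack pattern", "payment method", "tool", "victim",
        "place", "attacker", "price", "time",
    ],
}


def unknown_roles(event_type, arguments):
    """Return the roles in `arguments` that the schema does not list for `event_type`.

    `arguments` is a list of (argument_text, role) pairs, a hypothetical format.
    """
    allowed = set(EVENT_SCHEMA.get(event_type, []))
    return [role for _, role in arguments if role not in allowed]


extracted = [("email accounts", "compromised data"), ("326,000", "number of victim")]
print(unknown_roles("data breach", extracted))  # roles not allowed for this event type
```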
Next, follow the steps below to complete the EE task:
- Complete ./examples/config/EE.yaml: configure the necessary model and extraction settings.
- Run the shell script below:
(Refer to issues for any network issues.)
config_file=./examples/config/EE.yaml
python src/run.py --config $config_file
The final extraction result should be:
Text | Event Type | Event Trigger | Argument | Role |
---|---|---|---|---|
UConn Health , an academic medical center , says in a media statement that it identified approximately 326,000 potentially impacted individuals whose personal information was contained in the compromised email accounts. | data breach | compromised | email accounts | compromised data |
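(same text) | data breach | compromised | 326,000 | number of victim |
(same text) | data breach | compromised | individuals | victim |
(same text) | data breach | compromised | personal information | compromised data |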
Click here to obtain the raw results in JSON format.
Note: The actual extraction results may not exactly match this due to LLM randomness.
The extraction results show that the data breach event is identified using the trigger compromised, and the specific contents of different event arguments such as compromised data and victim have also been extracted.
You can either specify event constraints or omit them. Without constraints, OneKE will extract all events from the sentence.
Triple Extraction identifies subject-predicate-object triples in text. A triple is a fundamental data structure in information extraction, representing a piece of knowledge or a fact. A Knowledge Graph (KG) can be quickly constructed from the extracted triples.
Here is an example:
Text | Subject Entity Types | Relation Types | Object Entity Types |
---|---|---|---|
The international conference on renewable energy technologies was held in Berlin. Several researchers presented their findings, discussing new innovations and challenges. The event was attended by experts from all over the world, and it is expected to continue in various locations. | Event, Person | Action, Location | Place, Concept |
The final extraction result should be:
Subject Entity | Relation | Object Entity |
---|---|---|
Conference (Event) | was held in (Location) | Berlin (Place) |
Researchers (Person) | presented (Action) | findings (Concept) |
Researchers (Person) | discussed (Action) | innovations (Concept) |
Conference (Event) | will continue in (Location) | various locations (Place) |
Experts (Person) | attended (Action) | event (Event) |
Event (Event) | is attended by (Location) | experts (Person) |
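As a minimal sketch of why triples map naturally onto a graph, the snippet below groups a few rows of the table above into an adjacency structure. It uses plain Python only and is not how OneKE builds its Knowledge Graph; the function name build_graph is hypothetical.

```python
from collections import defaultdict

# A few (subject, relation, object) rows from the result table above.
triples = [
    ("Conference", "was held in", "Berlin"),
    ("Researchers", "presented", "findings"),
    ("Researchers", "discussed", "innovations"),
    ("Experts", "attended", "event"),
]


def build_graph(triples):
    """Group outgoing edges by subject: subject -> list of (relation, object)."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


graph = build_graph(triples)
for subject, edges in graph.items():
    for relation, obj in edges:
        print(f"{subject} --[{relation}]--> {obj}")
```

Each subject becomes a node and each triple becomes a labeled edge, which is the same shape a graph database stores.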
Let's start in OneKE ~
The constraint can be customized in multiple styles, formatted as follows:
- Define entity types only: If you only need to specify the entity types, the constraint should be a single list of strings representing the different entity types.
["Person", "Place", "Event", "property"]
- Define entity types and relation types: If you need to specify both entity types and relation types, the constraint should be a nested list. The first list contains the entity types, and the second list contains the relation types.
[["Person", "Place", "Event", "property"], ["Interpersonal", "Located", "Ownership", "Action"]]
- Define subject entity types, relation types, and object entity types: If you need to define the types of subject entities, relation types, and object entities, the constraint should be a nested list. The first list contains the subject entity types, the second list contains the relation types, and the third list contains the object entity types.
[["Person"], ["Interpersonal", "Ownership"], ["Person", "property"]]
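The three constraint styles can be told apart by their shape alone. The sketch below is a hypothetical helper (describe_constraint is not an OneKE function) that labels each part of a constraint according to the rules described above.

```python
def describe_constraint(constraint):
    """Label the parts of a Triple constraint based on its shape (illustrative only)."""
    if all(isinstance(item, str) for item in constraint):
        # A single flat list of strings: entity types only.
        return {"entity_types": list(constraint)}
    if all(isinstance(item, list) for item in constraint):
        if len(constraint) == 2:
            entity_types, relation_types = constraint
            return {"entity_types": entity_types, "relation_types": relation_types}
        if len(constraint) == 3:
            subject_types, relation_types, object_types = constraint
            return {
                "subject_entity_types": subject_types,
                "relation_types": relation_types,
                "object_entity_types": object_types,
            }
    raise ValueError("constraint must be a list of strings or a nested list of 2 or 3 lists")


print(describe_constraint([["Person"], ["Interpersonal", "Ownership"], ["Person", "property"]]))
```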
Next, follow the steps below to complete the Triple extraction task:
- Complete ./examples/config/Triple2KG.yaml: configure the necessary model and extraction settings.
- Run the shell script below:
config_file=./examples/config/Triple2KG.yaml
python src/run.py --config $config_file
(Refer to issues for any network issues.)
Here is an example to start with. You can access the raw results in JSON format here.
⚠️ Warning: If you do not intend to build a Knowledge Graph, make sure to remove or comment out the construct field in the yaml file. This will help avoid errors related to database connection issues.
✨ If you need to construct your Knowledge Graph (KG) with your Triple Extraction result, you can refer to this example for guidance. Mimic this example and add the construct field. Just update the field with your own database parameters.
construct: # (Optional) Set this field only if you want to construct a Knowledge Graph; otherwise delete it.
  database: Neo4j # your database type.
  url: neo4j://localhost:7687 # your database URL. Neo4j's default port is 7687.
  username: your_username # your database username.
  password: "your_password" # your database password.
Once your database is set up, you can access your graph database through a browser. For Neo4j, the web interface connection URL is usually:
http://localhost:7474/browser
For additional information regarding the Neo4j database, please refer to its documentation.
⚠️ Warning Again: If you do not intend to build a Knowledge Graph, make sure to remove or comment out the construct field in the yaml file. This will help avoid errors related to database connection issues.
This type of task is represented as Base in the code, signifying any other user-defined open-domain extraction tasks.
We refer to the example above for guidance.
In the context of customized Web News Extraction, we first set the extraction instruction to Extract key information from the given text, and provide the file path to extract content from the file. We specify the output schema from the schema repository as the predefined NewsReport, and then proceed with the extraction.
Next, follow the steps below to complete this task:
- Complete ./examples/config/NewsExtraction.yaml: configure the necessary model and extraction settings.
- Run the shell script below:
(Refer to issues for any network issues.)
config_file=./examples/config/NewsExtraction.yaml
python src/run.py --config $config_file
Here is an excerpt of the extracted content:
Title | Meet Trump's pick for director of national intelligence |
---|---|
Summary | Tulsi Gabbard, chosen by President-elect Donald Trump for director of national intelligence, faces a Senate confirmation challenge due to her lack of experience and controversial views. Accusations include promoting an anti-American agenda and having troubling ties with U.S. adversaries. |
Publication Date | 2024-12-04T17:06:00Z |
Keywords | Tulsi Gabbard; director of national intelligence; Donald Trump; Senate confirmation; intelligence agencies |
Events | Tulsi Gabbard's nomination leads to a Senate confirmation battle due to controversies. |
People Involved | Tulsi Gabbard: Nominee for director of national intelligence; Donald Trump: President-elect; Tammy Duckworth: Democratic Senator; Olivia Troye: Former Trump administration national security official |
Quotes | "The U.S. intelligence community has identified her as having troubling relationships with America’s foes."; "If Gabbard is confirmed, America’s allies may not share as much information with the U.S." |
Viewpoints | Gabbard's nomination is considered alarming and dangerous for U.S. national security; Her anti-war stance and criticism of military interventions draw both support and criticism. |
Click here to obtain the raw results in JSON format.
Note: The actual extraction results may not exactly match this due to LLM randomness.
In contrast to earlier tasks, the Base-type task requires you to provide an explicit instruction that clearly defines your extraction task, and it does not allow constraint values to be set.
You can choose source texts of various lengths and forms for extraction.
Source Format | Description |
---|---|
Plain Text | String form of raw natural language text. |
HTML Source | Markup language for structuring web pages. |
PDF File | Portable format for fixed-layout documents. |
Word File | Microsoft Word document format, with rich text. |
TXT File | Basic text format, easily opened and edited. |
JSON File | Lightweight format for structured data interchange. |
In practice, you can use the YAML file configuration to handle different types of text input:
- Plain Text: Set use_file to false and enter the text to be extracted in the text field. For example:
use_file: false
text: Finally , every other year , ELRA organizes a major conference LREC , the International Language Resources and Evaluation Conference .
- File Content: Set use_file to true and specify the file path in file_path for the text to be extracted. For example:
use_file: true
file_path: ./data/input_files/Tulsi_Gabbard_News.html
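The two input styles are mutually exclusive, which the following sketch makes explicit by producing the matching settings for either a text string or a file path. The helper input_settings and the suffix list are assumptions for illustration; the suffixes are guessed from the format table above and may not match what OneKE actually accepts.

```python
from pathlib import Path

# Assumed file suffixes for the formats listed in the table above.
SUPPORTED_SUFFIXES = {".html", ".pdf", ".docx", ".txt", ".json"}


def input_settings(text=None, file_path=None):
    """Build the use_file/text/file_path settings for exactly one kind of input."""
    if (text is None) == (file_path is None):
        raise ValueError("provide exactly one of text or file_path")
    if file_path is not None:
        suffix = Path(file_path).suffix.lower()
        if suffix not in SUPPORTED_SUFFIXES:
            raise ValueError(f"unsupported file type: {suffix or 'no suffix'}")
        return {"use_file": True, "file_path": file_path}
    return {"use_file": False, "text": text}


print(input_settings(file_path="./data/input_files/Tulsi_Gabbard_News.html"))
print(input_settings(text="ELRA organizes a major conference LREC."))
```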
You can choose from various open-source or proprietary model APIs to perform information extraction tasks.
Note: For complex IE tasks, we recommend using powerful models like OpenAI's or large-scale open-source LLMs.
Model | Description |
---|---|
API Service | |
OpenAI | A series of GPT foundation models offered by OpenAI, such as GPT-3.5 and GPT-4-turbo, which are renowned for their outstanding capabilities in natural language processing. |
DeepSeek | High-performance LLMs that have demonstrated exceptional capabilities in both English and Chinese benchmarks. |
Local Deploy | |
LLaMA3-Instruct series | Meta's series of large language models, with tens to hundreds of billions of parameters, have shown advanced performance on industry-standard benchmarks. |
Qwen2.5-Instruct series | LLMs developed by the Qwen team, come in various parameter sizes and exhibit strong capabilities in both English and Chinese. |
ChatGLM4-9B | The latest model series by the Zhipu team, which achieve breakthroughs in multiple metrics, excel as bilingual (Chinese-English) chat models. |
MiniCPM3-4B | A lightweight language model with 4B parameters, matches or even surpasses 7B-9B models in most evaluation benchmarks. |
OneKE | A large-scale model for knowledge extraction jointly developed by Ant Group and Zhejiang University. |
DeepSeek-R1 series | A bilingual Chinese-English strong reasoning model series provided by DeepSeek, featuring the original DeepSeek-R1 and various distilled versions based on smaller models. |
Note: We recommend deploying the DeepSeek-R1 models with VLLM.
In practice, you can use the YAML file configuration to employ various LLMs:
- API Service: Set the model_name_or_path to the available model name provided by the company, and enter your api_key as well as the base_url. For example:
model:
  category: DeepSeek # model category, chosen from ChatGPT and DeepSeek
  model_name_or_path: deepseek-chat # model name, chosen from deepseek-chat and deepseek-reasoner. Choose deepseek-chat to use DeepSeek-V3 or choose deepseek-reasoner to use DeepSeek-R1.
  api_key: your_api_key # your API key for the model with API service.
  base_url: https://api.deepseek.com # base URL for the API service. No need for open-source models.
- Local Deploy: Set the model_name_or_path to either the model name on Hugging Face or the path to the local model. We support using either Transformer or vllm to access the models.
  - Transformer Example:
Note that the category of deployment model must be chosen from LLaMA, Qwen, ChatGLM, MiniCPM, OneKE.
model:
  category: LLaMA # model category, chosen from LLaMA, Qwen, ChatGLM, MiniCPM, OneKE.
  model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct # model name to download from huggingface or use the local model path.
  vllm_serve: false # whether to use the vllm. Default set to false.
  - VLLM Example:
Note that the DeepSeek-R1 series models only support VLLM deployment.
model:
  category: DeepSeek # model category
  model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct # model name to download from huggingface or use the local model path.
  vllm_serve: true # whether to use the vllm. Default set to false.
Remember to start the VLLM service before running the extraction task. The reference code is as follows:
config_file=your_yaml_file_path # REMEMBER to set vllm_serve to TRUE!
python src/models/vllm_serve.py --config $config_file # deploy local model via vllm, executed in the OneKE directory
You can also run the command vllm serve model_name_or_path directly to start the VLLM service. See the official documents for more details.
You can freely combine different extraction methods to complete the information extraction task.
Method | Description |
---|---|
Schema Agent | |
Default Schema | Use the default JSON output format. |
Predefined Schema | Utilize the predefined output schema retrieved from the knowledge base. |
Self Schema Deduction | Generate the output schema by inferring from the task description and the source text. |
Extraction Agent | |
Direct IE | Directly extract information from the given text based on the task description. |
Case Retrieval | Retrieve similar good cases from the knowledge base to aid in the extraction. |
Reflection Agent | |
No Reflection | Directly return the extraction results. |
Case Reflection | Use the self-consistency approach, and if inconsistencies appear, reflect on the original answer by retrieving similar bad cases from the knowledge base. |
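The self-consistency idea behind Case Reflection can be sketched generically: run the same extraction several times and treat disagreement as the signal to reflect. The code below is an illustration with a stub extractor, not OneKE's implementation; the function name self_consistent and the choice of three runs are assumptions.

```python
import json
from collections import Counter


def self_consistent(extract, text, runs=3):
    """Call `extract` several times and return (most common answer, all runs agree)."""
    # Serialize with sorted keys so equal answers compare equal regardless of key order.
    answers = [json.dumps(extract(text), sort_keys=True) for _ in range(runs)]
    best, frequency = Counter(answers).most_common(1)[0]
    return json.loads(best), frequency == runs


def stub_extract(text):
    """Stand-in for an LLM call; a real extractor may return different answers per run."""
    return {"Conference": ["LREC"]}


answer, consistent = self_consistent(stub_extract, "ELRA organizes a major conference LREC.")
if not consistent:
    print("answers disagree; this is where a reflection step would be triggered")
print(answer)
```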
The configuration for detailed extraction methods and mode information can be found in src/config.yaml. You can customize the extraction methods by modifying the customized section within this file and setting the mode to customized in an external configuration file.
For example, first configure the src/config.yaml as follows:
# src/config.yaml
customized:
  schema_agent: get_deduced_schema
  extraction_agent: extract_information_direct
  reflection_agent: reflect_with_case
Then, set the mode of your custom extraction task in examples/customized.yaml to customized:
# examples/customized.yaml
mode: customized
This allows you to experience the customized extraction methods.
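Before launching a run, it can help to sanity-check the customized block. The following is a minimal sketch, not part of OneKE: a plain dict stands in for the parsed YAML, the helper name check_customized is hypothetical, and it assumes that each of the three agents is expected to have exactly one method name assigned.

```python
# Illustrative only: a plain dict stands in for the parsed `customized`
# block of src/config.yaml. `check_customized` is a hypothetical helper.
REQUIRED_AGENTS = ("schema_agent", "extraction_agent", "reflection_agent")


def check_customized(customized: dict) -> list:
    """Return a list of problems found in a `customized` mapping (empty if none)."""
    problems = []
    # Assumption: every agent needs a non-empty method name.
    for agent in REQUIRED_AGENTS:
        method = customized.get(agent)
        if not isinstance(method, str) or not method.strip():
            problems.append(f"missing method for {agent}")
    # Flag keys that are not one of the three agents (e.g. typos).
    for key in customized:
        if key not in REQUIRED_AGENTS:
            problems.append(f"unknown agent: {key}")
    return problems


customized = {
    "schema_agent": "get_deduced_schema",
    "extraction_agent": "extract_information_direct",
    "reflection_agent": "reflect_with_case",
}

print(check_customized(customized))
```

A check like this only catches structural mistakes such as a misspelled agent key. Whether a given method name is actually supported still depends on what src/config.yaml and the OneKE code accept.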
Tips:
- For longer text extraction tasks, we recommend using the direct mode to avoid issues like attention dispersion and increased processing time.
- For shorter tasks requiring high accuracy, you can try the standard mode to ensure precision.
You can view the predefined schemas in the src/modules/knowledge_base/schema_repository.py file. The Schema Repository is designed to be easily extendable: define your output schema as a pydantic class following the format used in that file, and it can be used directly in subsequent extractions.
For example, add a new schema to the schema repository:
# src/modules/knowledge_base/schema_repository.py
# Imports required by the classes below (skip if the file already has them)
from typing import List
from pydantic import BaseModel, Field
class ChemicalSubstance(BaseModel):
    name: str = Field(description="Name of the chemical substance")
    formula: str = Field(description="Molecular formula")
    appearance: str = Field(description="Physical appearance")
    uses: List[str] = Field(description="Primary uses")
    hazards: str = Field(description="Hazard classification")
class ChemicalList(BaseModel):
    chemicals: List[ChemicalSubstance] = Field(description="List of chemicals")
Then, set the method for schema_agent under customized to get_retrieved_schema in src/config.yaml. Finally, set the mode to customized in the external configuration file to enable custom schema extraction.
In this example, the extraction results will be a list of chemical substances that strictly adhere to the defined schema, ensuring a high level of accuracy and flexibility in the extraction results.
Note that the names of newly created objects should not conflict with existing ones.
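Name conflicts are easy to miss because Python silently lets a later class definition replace an earlier one with the same name. The following is a minimal sketch of a pre-check using only the standard library ast module. The helper find_duplicate_classes is hypothetical and not part of OneKE, and the example source is a stand-in for the contents of the schema repository file.

```python
import ast
from collections import Counter


def find_duplicate_classes(source: str) -> list:
    """Return top-level class names defined more than once in Python source."""
    tree = ast.parse(source)
    # Only top-level class definitions are considered here.
    names = [node.name for node in tree.body if isinstance(node, ast.ClassDef)]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Stand-in for the text of schema_repository.py; in practice the source
# could be read from the file itself.
example_source = '''
class ChemicalSubstance: ...
class ChemicalList: ...
class ChemicalList: ...
'''

print(find_duplicate_classes(example_source))
```

Because the check parses the source instead of importing it, it can be run without pydantic or any other OneKE dependency installed.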
You can directly view the case storage in the src/modules/knowledge_base/case_repository.json file, but we do not recommend modifying it directly.
The Case Repository is automatically updated with each extraction process once update_repository is set to True in the configuration file.
When updating the Case Repository, you must provide external feedback to generate case information, either by including the truth answer in the configuration file or by supplying it during the extraction process.
Here is an example:
# examples/config/RE.yaml
truth: {"relation_list": [{"head": "Guinea", "tail": "Conakry", "relation": "country capital"}]} # Truth data for the relation
update_case: true
After extraction, OneKE compares results with the truth answer, generates analysis, and finally stores the case in the repository.
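To make the comparison step concrete, here is a minimal sketch of one way an extraction result could be compared against a truth answer in the relation_list format shown above. It is not OneKE's actual comparison logic: the helper names and the predicted result are hypothetical, and exact matching of (head, relation, tail) triples is an assumption made for illustration.

```python
import json


def relation_set(result: dict) -> set:
    """Convert a {"relation_list": [...]} result into a set of (head, relation, tail) triples."""
    return {
        (r["head"], r["relation"], r["tail"])
        for r in result.get("relation_list", [])
    }


def compare_with_truth(predicted: dict, truth: dict) -> dict:
    """Split triples into correct, missing (truth only), and spurious (prediction only)."""
    pred, gold = relation_set(predicted), relation_set(truth)
    return {
        "correct": sorted(pred & gold),
        "missing": sorted(gold - pred),
        "spurious": sorted(pred - gold),
    }


# Truth answer taken from the RE.yaml example above.
truth = json.loads(
    '{"relation_list": [{"head": "Guinea", "tail": "Conakry", "relation": "country capital"}]}'
)
# Hypothetical extraction output, invented for this illustration.
predicted = {
    "relation_list": [
        {"head": "Guinea", "tail": "Conakry", "relation": "country capital"},
        {"head": "Conakry", "tail": "Guinea", "relation": "located in"},
    ]
}

report = compare_with_truth(predicted, truth)
print(report)
```

A report of this shape separates what the extraction got right from what it missed or added, which is the kind of feedback a stored case needs in order to be useful later.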
Here are some network issues you might encounter and the corresponding solutions.
- Pip Installation Failure: Use a mirror site and run the command as pip install -i [mirror-source] ...
- Docker Image Pull Failure: Configure the Docker daemon to add registry mirrors.
- NLTK Download Failure: Manually download the nltk package and place it in the proper directory.
- Model Download Failure: Use the Hugging Face Mirror site or ModelScope to download the model, and specify the local path to the model when using it.
Note: We use the all-MiniLM-L6-v2 model by default for case matching, so it needs to be downloaded during execution. If network issues occur, manually download the model and update embedding_model to its local path in the src/config.yaml file.
Ningyu Zhang, Haofen Wang, Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Huajun Chen.
We deeply appreciate the collaborative efforts of everyone involved. We will continue to enhance and maintain this repository over the long term. If you encounter any issues, feel free to submit them to us!
We reference itext2kg to aid in building the schema repository and utilize tools from LangChain for file parsing. The experimental datasets we use are curated from the IEPile repository. We appreciate their valuable contributions!