duckdb-airport-extension
The Airport extension for DuckDB enables the use of Arrow Flight with DuckDB.
Stars: 170
The 'duckdb-airport-extension' is a tool that enables the use of Arrow Flight with DuckDB. It provides functions to list available Arrow Flights at a specific endpoint and to retrieve the contents of an Arrow Flight. The extension also supports creating secrets for authentication purposes. It includes features for serializing filters and optimizing projections to enhance data transmission efficiency. The tool is built on top of gRPC and the Arrow IPC format, offering high-performance data services for data processing and retrieval.
README:
This extension "airport" enables the use of Arrow Flight with DuckDB.
Arrow Flight is an RPC framework for high-performance data services based on Apache Arrow and is built on top of gRPC and the Arrow IPC format.
airport_list_flights(location, criteria, auth_token="token_value", secret="secret_name")
Description: This function returns a list of Arrow Flights that are available at a particular endpoint.
| Parameter Name | Type | Description |
|---|---|---|
| location | VARCHAR | This is the location of the Flight server |
| criteria | VARCHAR | This is free-form criteria to pass to the Flight server |

| Parameter Name | Type | Description |
|---|---|---|
| auth_token | VARCHAR | A bearer token to present to the server; the header is formatted like Authorization: Bearer <auth_token> |
| secret | VARCHAR | This is the name of the DuckDB secret to use to supply the value for the auth_token |
> select * from airport_list_flights('http://127.0.0.1:8815', null);
flight_descriptor = [uploaded.parquet]
endpoint = [{'ticket': uploaded.parquet, 'location': [grpc://0.0.0.0:8815], 'expiration_time': NULL, 'app_metadata': }]
ordered = false
total_records = 3
total_bytes = 363
app_metadata =
schema = Character: string
In addition to the criteria parameter, the Airport extension will pass additional GRPC headers.
The airport-duckdb-json-filters header is sent on the GRPC requests. The header contains a JSON serialized representation of all of the conditional filters that are going to be applied to the results.
To illustrate this through an example:
select * from airport_list_flights('grpc://localhost:8815/', null) where total_bytes = 5;
The GRPC header airport-duckdb-json-filters will be set to
{
  "filters": [
    {
      "expression_class": "BOUND_COMPARISON",
      "type": "COMPARE_EQUAL",
      "left": {
        "expression_class": "BOUND_COLUMN_REF",
        "type": "BOUND_COLUMN_REF",
        "alias": "total_bytes",
        "return_type": {
          "id": "BIGINT"
        }
      },
      "right": {
        "expression_class": "BOUND_CONSTANT",
        "type": "VALUE_CONSTANT",
        "value": {
          "type": {
            "id": "BIGINT"
          },
          "is_null": false,
          "value": 5
        }
      }
    }
  ]
}
The airport-duckdb-json-filters header will not contain newlines, but the JSON has been reformatted in this document for ease of comprehension.
It is up to the implementer of the server to use this header to apply optimizations. The Airport DuckDB extension will still apply the filters to the result returned by the server. This means that the filter logic is purely advisory. In the author's experience, Arrow Flight servers that implement the filtering logic server side can unlock some impressive optimizations. The serialization is performed by the DuckDB code, so the JSON schema of the serialized filter expressions is not guaranteed to remain unchanged across DuckDB versions.
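As a rough illustration of how a server might consume the airport-duckdb-json-filters header, the following Python sketch pulls simple `column = constant` predicates out of the header value. The function name `extract_equality_filters` is hypothetical, and the sketch assumes the JSON layout of the example header shown earlier, which may change across DuckDB versions. Anything it does not recognise is ignored, which is safe because the extension re-applies every filter on the client.

```python
import json

# Example header value: the JSON from the example above on a single line,
# since the real header contains no newlines.
HEADER_VALUE = (
    '{"filters":[{"expression_class":"BOUND_COMPARISON","type":"COMPARE_EQUAL",'
    '"left":{"expression_class":"BOUND_COLUMN_REF","type":"BOUND_COLUMN_REF",'
    '"alias":"total_bytes","return_type":{"id":"BIGINT"}},'
    '"right":{"expression_class":"BOUND_CONSTANT","type":"VALUE_CONSTANT",'
    '"value":{"type":{"id":"BIGINT"},"is_null":false,"value":5}}}]}'
)


def extract_equality_filters(header_value):
    """Return {column_alias: constant} for simple `column = constant` filters.

    Hypothetical helper: it only understands the shape shown in the example
    header and silently skips everything else.
    """
    try:
        document = json.loads(header_value)
    except (ValueError, TypeError):
        return {}
    if not isinstance(document, dict):
        return {}

    result = {}
    for entry in document.get("filters") or []:
        if not isinstance(entry, dict) or entry.get("type") != "COMPARE_EQUAL":
            continue
        left = entry.get("left")
        right = entry.get("right")
        if not isinstance(left, dict) or not isinstance(right, dict):
            continue
        if left.get("type") != "BOUND_COLUMN_REF":
            continue
        if right.get("type") != "VALUE_CONSTANT":
            continue
        constant = right.get("value")
        if not isinstance(constant, dict) or constant.get("is_null"):
            continue
        result[left.get("alias")] = constant.get("value")
    return result


print(extract_equality_filters(HEADER_VALUE))  # {'total_bytes': 5}
```

A server could use the returned mapping to skip data that cannot match, while still returning a superset of the requested rows when it meets a filter it does not understand.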
The header airport-duckdb-column-ids will contain a comma-separated list of column indexes that are used in the query. The Arrow Flight server can return nulls for columns that are not requested. This can be used to reduce the amount of data that is transmitted in the response.
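A minimal Python sketch of how a server might act on the airport-duckdb-column-ids header is shown below. It assumes the indexes are zero-based positions in the flight's schema, which the text above does not state, and it models a table as a plain list of `(name, values)` pairs rather than real Arrow data. The helper names `parse_column_ids` and `null_unrequested_columns` are hypothetical.

```python
def parse_column_ids(header_value):
    """Parse a comma-separated list of column indexes, e.g. "0,2" -> {0, 2}."""
    return {int(part) for part in header_value.split(",") if part.strip()}


def null_unrequested_columns(columns, requested):
    """Replace the values of every unrequested column with None.

    `columns` is a list of (name, values) pairs in schema order. The schema
    and the row count are preserved; only the payload of columns whose index
    is not in `requested` is dropped.
    """
    return [
        (name, values if index in requested else [None] * len(values))
        for index, (name, values) in enumerate(columns)
    ]


# Assumed example data: three columns, of which the query uses 0 and 2.
table = [
    ("id", [1, 2, 3]),
    ("name", ["a", "b", "c"]),
    ("payload", ["x", "y", "z"]),
]
trimmed = null_unrequested_columns(table, parse_column_ids("0,2"))
print(trimmed)
```

Keeping every column in place and nulling only its values means the response still matches the schema the client expects.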
airport_take_flight(location, descriptor, auth_token="token_value", secret="secret_name")
Description: This is a table-returning function; it returns the contents of the Arrow Flight.
| Parameter Name | Type | Description |
|---|---|---|
| location | VARCHAR | This is the location of the Flight server |
| descriptor | ANY | This is the descriptor of the flight. If it is a VARCHAR or BLOB, it is interpreted as a command; if it is an ARRAY or VARCHAR[], it is considered a path-based descriptor. |

| Parameter Name | Type | Description |
|---|---|---|
| auth_token | VARCHAR | A bearer token to present to the server; the header is formatted like Authorization: Bearer <auth_token> |
| secret | VARCHAR | This is the name of the DuckDB secret to use to supply the value for the auth_token |
| ticket | BLOB | This is the ticket (an opaque binary token) supplied to the Flight server; it overrides any ticket supplied from GetFlightInfo. |
| headers | MAP(VARCHAR, VARCHAR) | A map of extra GRPC headers to send with requests to the Flight server. |
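The rule for the descriptor parameter (a VARCHAR or BLOB becomes a command, an ARRAY or VARCHAR[] becomes a path) can be sketched in Python as follows. This is only a model of the rule as described in the table, not the extension's code: the function name `classify_descriptor` is hypothetical, and encoding a string command as UTF-8 is an assumption.

```python
def classify_descriptor(descriptor):
    """Return ("command", bytes) or ("path", [str, ...]) for a descriptor value.

    Strings and byte strings stand in for VARCHAR and BLOB; lists and tuples
    stand in for ARRAY and VARCHAR[].
    """
    if isinstance(descriptor, str):
        # Assumption: string commands are sent as UTF-8 bytes.
        return ("command", descriptor.encode("utf-8"))
    if isinstance(descriptor, (bytes, bytearray)):
        return ("command", bytes(descriptor))
    if isinstance(descriptor, (list, tuple)):
        return ("path", [str(part) for part in descriptor])
    raise TypeError(f"unsupported descriptor type: {type(descriptor).__name__}")


print(classify_descriptor(["counter-stream"]))  # ('path', ['counter-stream'])
print(classify_descriptor("counter-stream"))    # ('command', b'counter-stream')
```

Under this reading, the same text passed as a single string and as a one-element list addresses the server in two different ways, so the brackets matter.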
select * from airport_take_flight('grpc://localhost:8815/', ['counter-stream']) limit 5;
┌─────────┐
│ counter │
│  int64  │
├─────────┤
│       0 │
│       1 │
│       2 │
│       3 │
│       4 │
└─────────┘
To create a secret that can be used by airport_take_flight and airport_list_flights, use the standard DuckDB CREATE SECRET command.
CREATE SECRET airport_hello_world (
    type airport,
    auth_token 'test-token',
    scope 'grpc+tls://server.example.com/'
);
The Airport extension respects the scope(s) specified in the secret. If a value for auth_token isn't supplied but a secret exists with a scope that matches the server location, the auth_token from that secret will be used.
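The fallback from an explicit auth_token to a scoped secret can be pictured with a small Python model. It assumes that a scope matches when it is a prefix of the server location and that the longest matching scope wins. That is how DuckDB secrets are commonly described, but it is an assumption here, not something the text above confirms. The function `pick_auth_token` and the dictionary layout of a secret are hypothetical.

```python
def pick_auth_token(location, explicit_token, secrets):
    """Choose the bearer token to send to `location`.

    An explicitly supplied token always wins. Otherwise the secret with the
    longest scope that is a prefix of the location is used (assumed rule).
    Returns None when nothing matches.
    """
    if explicit_token is not None:
        return explicit_token

    best_token = None
    best_length = -1
    for secret in secrets:
        for scope in secret["scopes"]:
            if location.startswith(scope) and len(scope) > best_length:
                best_token = secret["auth_token"]
                best_length = len(scope)
    return best_token


# Mirrors the CREATE SECRET statement above.
secrets = [
    {"auth_token": "test-token", "scopes": ["grpc+tls://server.example.com/"]},
]

print(pick_auth_token("grpc+tls://server.example.com/", None, secrets))
print(pick_auth_token("grpc://localhost:8815/", None, secrets))
```

The first call falls back to the secret's token because its scope matches the location; the second returns None because no scope matches and no token was supplied.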
- Investigate the multithreaded endpoint support.
# Clone this repo with submodules.
# duckdb and extension-ci-tools are submodules.
git clone --recursive git@github.com:Query-farm/duckdb-airport-extension
# Clone the vcpkg repo
git clone https://github.com/Microsoft/vcpkg.git
# Bootstrap vcpkg
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake
# Build the extension
make
# If you have ninja installed, you can use it to speed up the build
# GEN=ninja make
The main binaries that will be built are:
./build/release/duckdb
./build/release/test/unittest
./build/release/extension/airport/airport.duckdb_extension
- duckdb is the binary for the duckdb shell with the extension code automatically loaded.
- unittest is the test runner of duckdb. Again, the extension is already linked into the binary.
- airport.duckdb_extension is the loadable binary as it would be distributed.
To run the extension code, simply start the shell with ./build/release/duckdb. This duckdb shell will have the extension pre-loaded.
Now we can use the features from the extension directly in DuckDB.
Different tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in ./test/sql. These SQL tests can be run using:
make test
