duckdb-airport-extension

The Airport extension for DuckDB enables the use of Arrow Flight with DuckDB.

The 'duckdb-airport-extension' is a tool that enables the use of Arrow Flight with DuckDB. It provides functions to list available Arrow Flights at a specific endpoint and to retrieve the contents of an Arrow Flight. The extension also supports creating secrets for authentication purposes. It includes features for serializing filters and optimizing projections to enhance data transmission efficiency. The tool is built on top of gRPC and the Arrow IPC format, offering high-performance data services for data processing and retrieval.

README:

Airport Extension for DuckDB

This extension "airport" enables the use of Arrow Flight with DuckDB.

What is Arrow Flight?

Arrow Flight is an RPC framework for high-performance data services based on Apache Arrow and is built on top of gRPC and the Arrow IPC format.

API

Listing Flights

airport_list_flights(location, criteria, auth_token="token_value", secret="secret_name")

Description: This function returns a list of Arrow Flights that are available at a particular endpoint.

Parameters:

Parameter Name	Type	Description
`location`	`VARCHAR`	This is the location of the Flight server
`criteria`	`VARCHAR`	This is free-form criteria to pass to the Flight server

Named Parameters:

Parameter Name	Type	Description
`auth_token`	`VARCHAR`	A bearer token to present to the server. The header is formatted as `Authorization: Bearer <auth_token>`.
`secret`	`VARCHAR`	This is the name of the DuckDB secret to use to supply the value for the `auth_token`

> select * from airport_list_flights('http://127.0.0.1:8815', null);

flight_descriptor = [uploaded.parquet]
         endpoint = [{'ticket': uploaded.parquet, 'location': [grpc://0.0.0.0:8815], 'expiration_time': NULL, 'app_metadata': }]
          ordered = false
    total_records = 3
      total_bytes = 363
     app_metadata =
           schema = Character: string

In addition to the criteria parameter, the Airport extension will pass additional gRPC headers.

Serializing Filters

The airport-duckdb-json-filters header is sent on the gRPC requests. The header contains a JSON-serialized representation of all of the conditional filters that are going to be applied to the results.

To illustrate this through an example:

select * from airport_list_flights('grpc://localhost:8815/', null) where total_bytes = 5;

The gRPC header airport-duckdb-json-filters will be set to:

{
 "filters": [
  {
   "expression_class": "BOUND_COMPARISON",
   "type": "COMPARE_EQUAL",
   "left": {
    "expression_class": "BOUND_COLUMN_REF",
    "type": "BOUND_COLUMN_REF",
    "alias": "total_bytes",
    "return_type": {
     "id": "BIGINT"
    }
   },
   "right": {
    "expression_class": "BOUND_CONSTANT",
    "type": "VALUE_CONSTANT",
    "value": {
     "type": {
      "id": "BIGINT"
     },
     "is_null": false,
     "value": 5
    }
   }
  }
 ]
}

The airport-duckdb-json-filters header will not contain newlines, but the JSON has been reformatted in this document for ease of comprehension.

It is up to the implementer of the server to use this header to apply optimizations. The Airport DuckDB extension will still apply the filters to the result returned by the server, so the filter logic is purely advisory. In the author's experience, Arrow Flight servers that implement the filtering logic server-side can unlock some impressive optimizations. The JSON schema of the serialized filter expressions is not guaranteed to remain unchanged across DuckDB versions, because the serialization is performed by the DuckDB code.
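As a rough illustration of how a server might consume this header, the following Python sketch parses the JSON shown above and applies only simple column-versus-constant equality filters. The function names and the row representation (a list of dicts) are hypothetical and not part of the extension. Only the header name and the JSON layout are taken from the example, and that layout may differ between DuckDB versions.

```python
import json

# Hypothetical mapping from serialized comparison types to Python predicates.
# Only COMPARE_EQUAL appears in the example above; other types are ignored.
COMPARATORS = {
    "COMPARE_EQUAL": lambda a, b: a == b,
}


def parse_simple_filters(header_value):
    """Return (column, comparison_type, constant) triples for the simple
    column-versus-constant comparisons in the header; skip everything else."""
    triples = []
    for expr in json.loads(header_value).get("filters", []):
        left, right = expr.get("left", {}), expr.get("right", {})
        if (expr.get("expression_class") == "BOUND_COMPARISON"
                and left.get("expression_class") == "BOUND_COLUMN_REF"
                and right.get("expression_class") == "BOUND_CONSTANT"):
            constant = right["value"]
            # NULL constants are skipped; DuckDB re-applies every filter anyway.
            if not constant.get("is_null"):
                triples.append((left["alias"], expr["type"], constant["value"]))
    return triples


def apply_filters(rows, triples):
    """Keep the rows (dicts) that satisfy every supported comparison.
    Unsupported comparisons are ignored, which is safe because the
    filters are advisory."""
    def keep(row):
        for column, kind, value in triples:
            compare = COMPARATORS.get(kind)
            if compare is not None and column in row and not compare(row[column], value):
                return False
        return True
    return [row for row in rows if keep(row)]


# The header from the example above, without newlines.
header = (
    '{"filters":[{"expression_class":"BOUND_COMPARISON","type":"COMPARE_EQUAL",'
    '"left":{"expression_class":"BOUND_COLUMN_REF","type":"BOUND_COLUMN_REF",'
    '"alias":"total_bytes","return_type":{"id":"BIGINT"}},'
    '"right":{"expression_class":"BOUND_CONSTANT","type":"VALUE_CONSTANT",'
    '"value":{"type":{"id":"BIGINT"},"is_null":false,"value":5}}}]}'
)

rows = [{"total_bytes": 5}, {"total_bytes": 363}]
print(apply_filters(rows, parse_simple_filters(header)))  # [{'total_bytes': 5}]
```

Ignoring anything the sketch does not recognize keeps it safe: the server may return more rows than strictly needed, and DuckDB filters them again.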

Projection Optimization

The header airport-duckdb-column-ids will contain a comma-separated list of column indexes that are used in the query. The Arrow Flight server can return nulls for columns that are not requested. This can be used to reduce the amount of data that is transmitted in the response.
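A server could use this header along the following lines. This is a minimal Python sketch that assumes the indexes are zero-based positions in the flight's schema. That interpretation, the function names, and the in-memory column layout (a list of name/values pairs) are assumptions made for illustration, not details confirmed by the extension.

```python
def parse_column_ids(header_value):
    """Parse a comma-separated list of column indexes, e.g. "0,2"."""
    return {int(part) for part in header_value.split(",") if part.strip()}


def null_unrequested(columns, requested):
    """columns: list of (name, values) pairs in schema order.
    Returns the same shape, with every unrequested column replaced by
    a list of None values of the same length."""
    return [
        (name, values if index in requested else [None] * len(values))
        for index, (name, values) in enumerate(columns)
    ]


# Hypothetical three-column result; only columns 0 and 2 are requested.
columns = [("id", [1, 2]), ("payload", ["x", "y"]), ("score", [0.5, 1.5])]
projected = null_unrequested(columns, parse_column_ids("0,2"))
print(projected)
```

Keeping the schema unchanged and only nulling the values means the response still matches what the client expects, while all-null columns are cheap to encode and transmit.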

Taking a Flight

airport_take_flight(location, descriptor, auth_token="token_value", secret="secret_name")

Description: This is a table-returning function; it returns the contents of an Arrow Flight.

Parameters:

Parameter Name	Type	Description
`location`	`VARCHAR`	This is the location of the Flight server
`descriptor`	`ANY`	This is the descriptor of the flight. A `VARCHAR` or `BLOB` is interpreted as a command; an `ARRAY` or `VARCHAR[]` is treated as a path-based descriptor.

Named Parameters:

Parameter Name	Type	Description
`auth_token`	`VARCHAR`	A bearer token to present to the server. The header is formatted as `Authorization: Bearer <auth_token>`.
`secret`	`VARCHAR`	This is the name of the DuckDB secret to use to supply the value for the `auth_token`
`ticket`	`BLOB`	This is the ticket (an opaque binary token) supplied to the Flight server; it overrides any ticket supplied from GetFlightInfo.
`headers`	`MAP(VARCHAR, VARCHAR)`	A map of extra gRPC headers to send with requests to the Flight server.

select * from airport_take_flight('grpc://localhost:8815/', ['counter-stream']) limit 5;
┌─────────┐
│ counter │
│  int64  │
├─────────┤
│       0 │
│       1 │
│       2 │
│       3 │
│       4 │
└─────────┘

Creating a secret

To create a secret that can be used by airport_take_flight and airport_list_flights, use the standard DuckDB CREATE SECRET command.

CREATE SECRET airport_hello_world (
      type airport,
      auth_token 'test-token',
      scope 'grpc+tls://server.example.com/'
);

The Airport extension respects the scope(s) specified in the secret. If a value for auth_token isn't supplied but a secret exists with a scope that matches the server location, the auth_token value from that secret is used.
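The lookup described above can be pictured with the following Python sketch. It assumes that a scope matches when it is a prefix of the server location and that the longest matching scope wins. Both rules are assumptions about how scope matching behaves, and the function names and the dictionary representation of a secret are hypothetical.

```python
def resolve_auth_token(location, explicit_token, secrets):
    """Pick the token to present to the Flight server.

    secrets: list of dicts with an 'auth_token' string and a 'scope' list
    of location prefixes (a hypothetical in-memory form of CREATE SECRET).
    An explicitly supplied auth_token always takes precedence."""
    if explicit_token is not None:
        return explicit_token
    best_token, best_length = None, -1
    for secret in secrets:
        for scope in secret["scope"]:
            # Assumed rule: prefix match, longest matching scope wins.
            if location.startswith(scope) and len(scope) > best_length:
                best_token, best_length = secret["auth_token"], len(scope)
    return best_token


def authorization_header(token):
    """Format the bearer header described in the parameter tables."""
    return {"Authorization": f"Bearer {token}"} if token else {}


secrets = [{"auth_token": "test-token", "scope": ["grpc+tls://server.example.com/"]}]
token = resolve_auth_token("grpc+tls://server.example.com/", None, secrets)
print(authorization_header(token))
```

With the secret from the example, a call to a location under grpc+tls://server.example.com/ that omits auth_token would pick up test-token, while a call to any other server would send no Authorization header.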

Implementation Notes

TODO

Investigate the multithreaded endpoint support.

Building the extension

# Clone this repo with submodules.
# duckdb and extension-ci-tools are submodules.
git clone --recursive git@github.com:Query-farm/duckdb-airport-extension

# Clone the vcpkg repo
git clone https://github.com/Microsoft/vcpkg.git

# Bootstrap vcpkg
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake

# Build the extension
make

# If you have ninja installed, you can use it to speed up the build
# GEN=ninja make

The main binaries that will be built are:

./build/release/duckdb
./build/release/test/unittest
./build/release/extension/airport/airport.duckdb_extension

duckdb is the binary for the duckdb shell with the extension code automatically loaded.
unittest is the test runner of duckdb. Again, the extension is already linked into the binary.
airport.duckdb_extension is the loadable binary as it would be distributed.

Running the extension

To run the extension code, simply start the shell with ./build/release/duckdb. This duckdb shell will have the extension pre-loaded.

Now we can use the features from the extension directly in DuckDB.

Running the tests

Different tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in ./test/sql. These SQL tests can be run using:

make test
