mcp-apache-spark-history-server

MCP Server for Apache Spark History Server. The bridge between Agentic AI and Apache Spark.

Stars: 81

The MCP Server for Apache Spark History Server is a tool that connects AI agents to Apache Spark History Server for intelligent job analysis and performance monitoring. It enables AI agents to analyze job performance, identify bottlenecks, and provide insights from Spark History Server data. The server bridges AI agents with existing Apache Spark infrastructure, allowing users to query job details, analyze performance metrics, compare multiple jobs, investigate failures, and generate insights from historical execution data.

README:

MCP Server for Apache Spark History Server

🤖 Connect AI agents to Apache Spark History Server for intelligent job analysis and performance monitoring

Transform your Spark infrastructure monitoring with AI! This Model Context Protocol (MCP) server enables AI agents to analyze job performance, identify bottlenecks, and provide intelligent insights from your Spark History Server data.

🎯 What is This?

Spark History Server MCP bridges AI agents with your existing Apache Spark infrastructure, enabling:

🔍 Query job details through natural language
📊 Analyze performance metrics across applications
🔄 Compare multiple jobs to identify regressions
🚨 Investigate failures with detailed error analysis
📈 Generate insights from historical execution data

🏗️ Architecture

graph TB
    A[🤖 AI Agent/LLM] --> F[📡 MCP Client]
    B[🦙 LlamaIndex Agent] --> F
    C[🌐 LangGraph] --> F
    D[🖥️ Claude Desktop] --> F
    E[🛠️ Amazon Q CLI] --> F

    F --> G[⚡ Spark History MCP Server]

    G --> H[🔥 Prod Spark History Server]
    G --> I[🔥 Staging Spark History Server]
    G --> J[🔥 Dev Spark History Server]

    H --> K[📄 Prod Event Logs]
    I --> L[📄 Staging Event Logs]
    J --> M[📄 Dev Event Logs]

🔗 Components:

🔥 Spark History Server: Your existing infrastructure serving Spark event data
⚡ MCP Server: This project - provides MCP tools for querying Spark data
🤖 AI Agents: LangChain, custom agents, or any MCP-compatible client

⚡ Quick Start

📦 Ready-to-Use Package

The package is published to PyPI: https://pypi.org/project/mcp-apache-spark-history-server/

📋 Prerequisites

🔥 Existing Spark History Server (running and accessible)
🐍 Python 3.12+
⚡ uv package manager

🚀 Setup & Testing

git clone https://github.com/kubeflow/mcp-apache-spark-history-server.git
cd mcp-apache-spark-history-server

# Install Task (if not already installed)
brew install go-task  # macOS, see https://taskfile.dev/installation/ for others

# Setup and start testing
task start-spark-bg            # Start Spark History Server with sample data (default Spark 3.5.5)
# Or specify a different Spark version:
# task start-spark-bg spark_version=3.5.2
task start-mcp-bg             # Start MCP Server

# Optional: Opens MCP Inspector on http://localhost:6274 for interactive testing
# Requires Node.js: 22.7.5+ (Check https://github.com/modelcontextprotocol/inspector for latest requirements)
task start-inspector-bg       # Start MCP Inspector

# When done, run `task stop-all`

If you just want to run the MCP server without cloning the repository:

# Run with uv without installing the module
uvx --from mcp-apache-spark-history-server spark-mcp

# OR run with pip and python. Use of venv is highly encouraged.
python3 -m venv spark-mcp && source spark-mcp/bin/activate
pip install mcp-apache-spark-history-server
python3 -m spark_history_mcp.core.main
# Deactivate venv
deactivate

⚙️ Server Configuration

Edit config.yaml for your Spark History Server:

Config File Options:

Command line: --config /path/to/config.yaml or -c /path/to/config.yaml
Environment variable: SHS_MCP_CONFIG=/path/to/config.yaml
Default: ./config.yaml

servers:
  local:
    default: true
    url: "http://your-spark-history-server:18080"
    auth:  # optional
      username: "user"
      password: "pass"
mcp:
  transports:
    - streamable-http # streamable-http or stdio.
  port: "18888"
  debug: true

📊 Sample Data

The repository includes real Spark event logs for testing:

spark-bcec39f6201b42b9925124595baad260 - ✅ Successful ETL job
spark-110be3a8424d4a2789cb88134418217b - 🔄 Data processing job
spark-cc4d115f011443d787f03a71a476a745 - 📈 Multi-stage analytics job

See TESTING.md for using them.

📸 Screenshots

🔍 Get Spark Application

⚡ Job Performance Comparison

🛠️ Available Tools

Note: These tools are subject to change as we scale and improve the performance of the MCP server.

The MCP server provides 18 specialized tools organized by analysis patterns. LLMs can intelligently select and combine these tools based on user queries:

📊 Application Information

Basic application metadata and overview

🔧 Tool	📝 Description
`list_applications`	📋 Get a list of all applications available on the Spark History Server with optional filtering by status, date ranges, and limits
`get_application`	📊 Get detailed information about a specific Spark application including status, resource usage, duration, and attempt details
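
For orientation, the filters that `list_applications` describes correspond to query parameters of the Spark History Server REST endpoint `/api/v1/applications` (`status`, `minDate`, `maxDate`, `limit`). The sketch below only builds such a URL. `applications_url` is a hypothetical helper, and whether the MCP server constructs its requests this way is an assumption.

```python
from urllib.parse import urlencode


def applications_url(base_url, status=None, min_date=None, max_date=None, limit=None):
    """Build a Spark History Server REST URL for listing applications."""
    params = {}
    if status is not None:
        params["status"] = status      # e.g. "completed" or "running"
    if min_date is not None:
        params["minDate"] = min_date   # e.g. "2025-06-27"
    if max_date is not None:
        params["maxDate"] = max_date
    if limit is not None:
        params["limit"] = limit
    url = base_url.rstrip("/") + "/api/v1/applications"
    return url + ("?" + urlencode(params) if params else "")
```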

🔗 Job Analysis

Job-level performance analysis and identification

🔧 Tool	📝 Description
`list_jobs`	🔗 Get a list of all jobs for a Spark application with optional status filtering
`list_slowest_jobs`	⏱️ Get the N slowest jobs for a Spark application (excludes running jobs by default)
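
To make the "N slowest jobs" idea concrete, here is a minimal sketch that ranks job records by wall-clock duration. It assumes each record carries `status`, `submissionTime`, and `completionTime` fields in the timestamp style of the Spark REST API (for example `2023-09-06T04:44:37.006GMT`). `slowest_jobs` is a hypothetical function, not the tool's actual implementation.

```python
from datetime import datetime


def _parse(ts):
    # Timestamp format assumed here: "2023-09-06T04:44:37.006GMT".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fGMT")


def slowest_jobs(jobs, n=5):
    """Return up to n finished jobs, longest duration first."""
    timed = []
    for job in jobs:
        if job.get("status") == "RUNNING":
            continue  # running jobs are excluded, mirroring the tool's default
        start, end = job.get("submissionTime"), job.get("completionTime")
        if not start or not end:
            continue  # no measurable duration
        seconds = (_parse(end) - _parse(start)).total_seconds()
        timed.append((seconds, job))
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return [job for _seconds, job in timed[:n]]
```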

⚡ Stage Analysis

Stage-level performance deep dive and task metrics

🔧 Tool	📝 Description
`list_stages`	⚡ Get a list of all stages for a Spark application with optional status filtering and summaries
`list_slowest_stages`	🐌 Get the N slowest stages for a Spark application (excludes running stages by default)
`get_stage`	🎯 Get information about a specific stage with optional attempt ID and summary metrics
`get_stage_task_summary`	📊 Get statistical distributions of task metrics for a specific stage (execution times, memory usage, I/O metrics)

🖥️ Executor & Resource Analysis

Resource utilization, executor performance, and allocation tracking

🔧 Tool	📝 Description
`list_executors`	🖥️ Get executor information with optional inactive executor inclusion
`get_executor`	🔍 Get information about a specific executor including resource allocation, task statistics, and performance metrics
`get_executor_summary`	📈 Aggregates metrics across all executors (memory usage, disk usage, task counts, performance metrics)
`get_resource_usage_timeline`	📅 Get chronological view of resource allocation and usage patterns including executor additions/removals

⚙️ Configuration & Environment

Spark configuration, environment variables, and runtime settings

🔧 Tool	📝 Description
`get_environment`	⚙️ Get comprehensive Spark runtime configuration including JVM info, Spark properties, system properties, and classpath

🔎 SQL & Query Analysis

SQL performance analysis and execution plan comparison

🔧 Tool	📝 Description
`list_slowest_sql_queries`	🐌 Get the top N slowest SQL queries for an application with detailed execution metrics
`compare_sql_execution_plans`	🔍 Compare SQL execution plans between two Spark jobs, analyzing logical/physical plans and execution metrics

🚨 Performance & Bottleneck Analysis

Intelligent bottleneck identification and performance recommendations

🔧 Tool	📝 Description
`get_job_bottlenecks`	🚨 Identify performance bottlenecks by analyzing stages, tasks, and executors with actionable recommendations

🔄 Comparative Analysis

Cross-application comparison for regression detection and optimization

🔧 Tool	📝 Description
`compare_job_environments`	⚙️ Compare Spark environment configurations between two jobs to identify differences in properties and settings
`compare_job_performance`	📈 Compare performance metrics between two Spark jobs including execution times, resource usage, and task distribution
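
The kind of comparison `compare_job_environments` describes can be pictured as a diff over two sets of Spark properties. The sketch below assumes each side is a list of `[key, value]` pairs, which is how the Spark REST API's environment endpoint reports `sparkProperties`. `diff_spark_properties` and its output shape are hypothetical and may not match what the tool returns.

```python
def diff_spark_properties(props_a, props_b):
    """Compare two lists of [key, value] Spark property pairs."""
    a, b = dict(props_a), dict(props_b)
    return {
        "only_in_a": {k: a[k] for k in sorted(a.keys() - b.keys())},
        "only_in_b": {k: b[k] for k in sorted(b.keys() - a.keys())},
        # Keys present on both sides whose values differ: (value_a, value_b).
        "changed": {
            k: (a[k], b[k])
            for k in sorted(a.keys() & b.keys())
            if a[k] != b[k]
        },
    }
```

A result with a non-empty `changed` entry for a setting such as `spark.executor.memory` is the sort of finding an agent would surface when explaining a regression between two runs.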

🤖 How LLMs Use These Tools

Query Pattern Examples:

"Show me all applications between 12 AM and 1 AM on 2025-06-27" → list_applications
"Why is my job slow?" → get_job_bottlenecks + list_slowest_stages + get_executor_summary
"Compare today vs yesterday" → compare_job_performance + compare_job_environments
"What's wrong with stage 5?" → get_stage + get_stage_task_summary
"Show me resource usage over time" → get_resource_usage_timeline + get_executor_summary
"Find my slowest SQL queries" → list_slowest_sql_queries + compare_sql_execution_plans

📔 AWS Integration Guides

If you are an existing AWS user looking to analyze your Spark Applications, we provide detailed setup guides for:

AWS Glue Users - Connect to Glue Spark History Server
Amazon EMR Users - Use EMR Persistent UI for Spark analysis

These guides provide step-by-step instructions for setting up the Spark History Server MCP with your AWS services.

🚀 Kubernetes Deployment

Deploy using Kubernetes with Helm:

⚠️ Work in Progress: We are still testing and will soon publish the container image and Helm registry to GitHub for easy deployment.

# 📦 Deploy with Helm
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/

# 🎯 Production configuration
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/ \
  --set replicaCount=3 \
  --set autoscaling.enabled=true \
  --set monitoring.enabled=true

📚 See deploy/kubernetes/helm/ for complete deployment manifests and configuration options.

Note: When using Secrets Store CSI Driver authentication, you must create a SecretProviderClass externally before deploying the chart.

🌐 Multi-Spark History Server Setup

Set up multiple Spark History Servers in config.yaml and choose which server the LLM should use for each query.

servers:
  production:
    default: true
    url: "http://prod-spark-history:18080"
    auth:
      username: "user"
      password: "pass"
  staging:
    url: "http://staging-spark-history:18080"
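
The selection rule this configuration implies can be sketched as follows: use the server named in the request, otherwise fall back to the entry marked `default: true`. `pick_server` is a hypothetical helper, and the error handling is an assumption rather than the project's documented behavior.

```python
def pick_server(servers, requested=None):
    """Return (name, settings) for the requested server, or the default one."""
    if requested is not None:
        if requested not in servers:
            raise KeyError(f"unknown server: {requested!r}")
        return requested, servers[requested]
    for name, settings in servers.items():
        if settings.get("default"):
            return name, settings
    raise ValueError("no server requested and none marked default: true")


# Mirrors the YAML above as a plain dict (auth omitted for brevity).
SERVERS = {
    "production": {"default": True, "url": "http://prod-spark-history:18080"},
    "staging": {"url": "http://staging-spark-history:18080"},
}
```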

💁 User Query: "Can you get application <app_id> using production server?"

🤖 AI Tool Request:

{
  "app_id": "<app_id>",
  "server": "production"
}

🤖 AI Tool Response:

{
  "id": "<app_id>",
  "name": "app_name",
  "coresGranted": null,
  "maxCores": null,
  "coresPerExecutor": null,
  "memoryPerExecutorMB": null,
  "attempts": [
    {
      "attemptId": null,
      "startTime": "2023-09-06T04:44:37.006000Z",
      "endTime": "2023-09-06T04:45:40.431000Z",
      "lastUpdated": "2023-09-06T04:45:42Z",
      "duration": 63425,
      "sparkUser": "spark",
      "appSparkVersion": "3.3.0",
      "completed": true
    }
  ]
}
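
The `duration` field in the sample response is expressed in milliseconds, which can be checked against `startTime` and `endTime`. The sketch below recomputes it; `attempt_duration_ms` is a hypothetical helper written for this check.

```python
from datetime import datetime, timedelta


def attempt_duration_ms(attempt):
    """Milliseconds between startTime and endTime of one application attempt."""
    def parse(ts):
        # Swap the trailing "Z" for an explicit offset so that
        # fromisoformat also accepts it on Python versions before 3.11.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    delta = parse(attempt["endTime"]) - parse(attempt["startTime"])
    # Integer division by one millisecond avoids float rounding.
    return delta // timedelta(milliseconds=1)
```

For the attempt shown above, the difference between 04:44:37.006 and 04:45:40.431 is 63.425 seconds, which agrees with the reported `duration` of 63425.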

🔐 Environment Variables

SHS_MCP_PORT - Port for MCP server (default: 18888)
SHS_MCP_DEBUG - Enable debug mode (default: false)
SHS_MCP_ADDRESS - Address for MCP server (default: localhost)
SHS_MCP_TRANSPORT - MCP transport mode (default: streamable-http)
SHS_SERVERS_*_URL - URL for a specific server
SHS_SERVERS_*_AUTH_USERNAME - Username for a specific server
SHS_SERVERS_*_AUTH_PASSWORD - Password for a specific server
SHS_SERVERS_*_AUTH_TOKEN - Token for a specific server
SHS_SERVERS_*_VERIFY_SSL - Whether to verify SSL for a specific server (true/false)
SHS_SERVERS_*_TIMEOUT - HTTP request timeout in seconds for a specific server (default: 30)
SHS_SERVERS_*_EMR_CLUSTER_ARN - EMR cluster ARN for a specific server
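
The `SHS_SERVERS_*_...` variables follow a name-plus-field pattern. As an illustration, the sketch below groups them by server name. It assumes the `*` segment is the server name and that names map to lower case; `server_overrides` and the type conversions are hypothetical and may differ from how the project reads its environment.

```python
import re

_FIELDS = (
    "AUTH_USERNAME", "AUTH_PASSWORD", "AUTH_TOKEN",
    "VERIFY_SSL", "EMR_CLUSTER_ARN", "TIMEOUT", "URL",
)
_PATTERN = re.compile(r"^SHS_SERVERS_(.+)_(" + "|".join(_FIELDS) + r")$")


def server_overrides(environ):
    """Group SHS_SERVERS_<NAME>_<FIELD> variables by server name."""
    overrides = {}
    for key, value in environ.items():
        match = _PATTERN.match(key)
        if not match:
            continue  # e.g. SHS_MCP_PORT is not a per-server setting
        name, field = match.group(1).lower(), match.group(2).lower()
        if field == "verify_ssl":
            value = value.strip().lower() == "true"
        elif field == "timeout":
            value = int(value)  # seconds
        overrides.setdefault(name, {})[field] = value
    return overrides
```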

🤖 AI Agent Integration

Quick Start Options

Integration	Transport	Best For
Local Testing	HTTP	Development, testing tools
Claude Desktop	STDIO	Interactive analysis
Amazon Q CLI	STDIO	Command-line automation
Kiro	HTTP	IDE integration, code-centric analysis
LangGraph	HTTP	Multi-agent workflows
Strands Agents	HTTP	Multi-agent workflows

🎯 Example Use Cases

🔍 Performance Investigation

🤖 AI Query: "Why is my ETL job running slower than usual?"

📊 MCP Actions:
✅ Analyze application metrics
✅ Compare with historical performance
✅ Identify bottleneck stages
✅ Generate optimization recommendations

🚨 Failure Analysis

🤖 AI Query: "What caused job 42 to fail?"

🔍 MCP Actions:
✅ Examine failed tasks and error messages
✅ Review executor logs and resource usage
✅ Identify root cause and suggest fixes

📈 Comparative Analysis

🤖 AI Query: "Compare today's batch job with yesterday's run"

📊 MCP Actions:
✅ Compare execution times and resource usage
✅ Identify performance deltas
✅ Highlight configuration differences

🤝 Contributing

See CONTRIBUTING.md for the full contribution guidelines.

📄 License

Apache License 2.0 - see LICENSE file for details.

📝 Trademark Notice

This project is built for use with Apache Spark™ History Server. Not affiliated with or endorsed by the Apache Software Foundation.

🔥 Connect your Spark infrastructure to AI agents

Built by the community, for the community 💙
