
mcp-apache-spark-history-server
MCP Server for Apache Spark History Server. The bridge between Agentic AI and Apache Spark.
Stars: 81

The MCP Server for Apache Spark History Server is a tool that connects AI agents to Apache Spark History Server for intelligent job analysis and performance monitoring. It enables AI agents to analyze job performance, identify bottlenecks, and provide insights from Spark History Server data. The server bridges AI agents with existing Apache Spark infrastructure, allowing users to query job details, analyze performance metrics, compare multiple jobs, investigate failures, and generate insights from historical execution data.
README:
Connect AI agents to Apache Spark History Server for intelligent job analysis and performance monitoring.
Transform your Spark infrastructure monitoring with AI! This Model Context Protocol (MCP) server enables AI agents to analyze job performance, identify bottlenecks, and provide intelligent insights from your Spark History Server data.
Spark History Server MCP bridges AI agents with your existing Apache Spark infrastructure, enabling:
- Query job details through natural language
- Analyze performance metrics across applications
- Compare multiple jobs to identify regressions
- Investigate failures with detailed error analysis
- Generate insights from historical execution data
See it in action:
graph TB
A[AI Agent/LLM] --> F[MCP Client]
B[LlamaIndex Agent] --> F
C[LangGraph] --> F
D[Claude Desktop] --> F
E[Amazon Q CLI] --> F
F --> G[Spark History MCP Server]
G --> H[Prod Spark History Server]
G --> I[Staging Spark History Server]
G --> J[Dev Spark History Server]
H --> K[Prod Event Logs]
I --> L[Staging Event Logs]
J --> M[Dev Event Logs]
Components:
- Spark History Server: your existing infrastructure serving Spark event data
- MCP Server: this project, which provides MCP tools for querying Spark data
- AI Agents: LangChain, custom agents, or any MCP-compatible client
The package is published to PyPI: https://pypi.org/project/mcp-apache-spark-history-server/
Prerequisites:
- Existing Spark History Server (running and accessible)
- Python 3.12+
- uv package manager
Quick start (local development):
git clone https://github.com/kubeflow/mcp-apache-spark-history-server.git
cd mcp-apache-spark-history-server
# Install Task (if not already installed)
brew install go-task # macOS, see https://taskfile.dev/installation/ for others
# Setup and start testing
task start-spark-bg # Start Spark History Server with sample data (default Spark 3.5.5)
# Or specify a different Spark version:
# task start-spark-bg spark_version=3.5.2
task start-mcp-bg # Start MCP Server
# Optional: Opens MCP Inspector on http://localhost:6274 for interactive testing
# Requires Node.js: 22.7.5+ (Check https://github.com/modelcontextprotocol/inspector for latest requirements)
task start-inspector-bg # Start MCP Inspector
# When done, run `task stop-all`
If you just want to run the MCP server without cloning the repository:
# Run with uv without installing the module
uvx --from mcp-apache-spark-history-server spark-mcp
# OR run with pip and python. Use of venv is highly encouraged.
python3 -m venv spark-mcp && source spark-mcp/bin/activate
pip install mcp-apache-spark-history-server
python3 -m spark_history_mcp.core.main
# Deactivate venv
deactivate
Edit config.yaml for your Spark History Server:
Config File Options:
- Command line: --config /path/to/config.yaml or -c /path/to/config.yaml
- Environment variable: SHS_MCP_CONFIG=/path/to/config.yaml
- Default: ./config.yaml
servers:
  local:
    default: true
    url: "http://your-spark-history-server:18080"
    auth:  # optional
      username: "user"
      password: "pass"
mcp:
  transports:
    - streamable-http  # streamable-http or stdio
  port: "18888"
  debug: true
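With a config file in place, point the server at it using the options above. A minimal sketch, assuming the published spark-mcp entry point and an illustrative path:
# Use the default ./config.yaml in the current directory
uvx --from mcp-apache-spark-history-server spark-mcp
# Point at an explicit config file
uvx --from mcp-apache-spark-history-server spark-mcp --config /path/to/config.yaml
# Equivalent, via the environment variable
SHS_MCP_CONFIG=/path/to/config.yaml uvx --from mcp-apache-spark-history-server spark-mcp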
The repository includes real Spark event logs for testing:
- spark-bcec39f6201b42b9925124595baad260: Successful ETL job
- spark-110be3a8424d4a2789cb88134418217b: Data processing job
- spark-cc4d115f011443d787f03a71a476a745: Multi-stage analytics job
See TESTING.md for instructions on using them.
Note: These tools are subject to change as we scale and improve the performance of the MCP server.
The MCP server provides 18 specialized tools organized by analysis patterns. LLMs can intelligently select and combine these tools based on user queries:
Basic application metadata and overview
Tool | Description
---|---
list_applications | Get a list of all applications available on the Spark History Server, with optional filtering by status, date ranges, and limits
get_application | Get detailed information about a specific Spark application, including status, resource usage, duration, and attempt details
Job-level performance analysis and identification
Tool | Description
---|---
list_jobs | Get a list of all jobs for a Spark application, with optional status filtering
list_slowest_jobs | Get the N slowest jobs for a Spark application (excludes running jobs by default)
Stage-level performance deep dive and task metrics
Tool | Description
---|---
list_stages | Get a list of all stages for a Spark application, with optional status filtering and summaries
list_slowest_stages | Get the N slowest stages for a Spark application (excludes running stages by default)
get_stage | Get information about a specific stage, with optional attempt ID and summary metrics
get_stage_task_summary | Get statistical distributions of task metrics for a specific stage (execution times, memory usage, I/O metrics)
Resource utilization, executor performance, and allocation tracking
Tool | Description
---|---
list_executors | Get executor information, with optional inclusion of inactive executors
get_executor | Get information about a specific executor, including resource allocation, task statistics, and performance metrics
get_executor_summary | Aggregate metrics across all executors (memory usage, disk usage, task counts, performance metrics)
get_resource_usage_timeline | Get a chronological view of resource allocation and usage patterns, including executor additions and removals
Spark configuration, environment variables, and runtime settings
Tool | Description
---|---
get_environment | Get comprehensive Spark runtime configuration, including JVM info, Spark properties, system properties, and classpath
SQL performance analysis and execution plan comparison
Tool | Description
---|---
list_slowest_sql_queries | Get the top N slowest SQL queries for an application, with detailed execution metrics
compare_sql_execution_plans | Compare SQL execution plans between two Spark jobs, analyzing logical/physical plans and execution metrics
Intelligent bottleneck identification and performance recommendations
Tool | Description
---|---
get_job_bottlenecks | Identify performance bottlenecks by analyzing stages, tasks, and executors, with actionable recommendations
Cross-application comparison for regression detection and optimization
Tool | Description
---|---
compare_job_environments | Compare Spark environment configurations between two jobs to identify differences in properties and settings
compare_job_performance | Compare performance metrics between two Spark jobs, including execution times, resource usage, and task distribution
Query Pattern Examples:
- "Show me all applications between 12 AM and 1 AM on 2025-06-27" → list_applications
- "Why is my job slow?" → get_job_bottlenecks + list_slowest_stages + get_executor_summary
- "Compare today vs yesterday" → compare_job_performance + compare_job_environments
- "What's wrong with stage 5?" → get_stage + get_stage_task_summary
- "Show me resource usage over time" → get_resource_usage_timeline + get_executor_summary
- "Find my slowest SQL queries" → list_slowest_sql_queries + compare_sql_execution_plans
If you are an existing AWS user looking to analyze your Spark Applications, we provide detailed setup guides for:
- AWS Glue Users - Connect to Glue Spark History Server
- Amazon EMR Users - Use EMR Persistent UI for Spark analysis
These guides provide step-by-step instructions for setting up the Spark History Server MCP with your AWS services.
Deploy using Kubernetes with Helm:
Work in Progress: We are still testing and will soon publish the container image and Helm registry to GitHub for easy deployment.
# Deploy with Helm
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/
# Production configuration
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/ \
  --set replicaCount=3 \
  --set autoscaling.enabled=true \
  --set monitoring.enabled=true
See deploy/kubernetes/helm/ for complete deployment manifests and configuration options.
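As an alternative to repeated --set flags, the same overrides can live in a values file. A hypothetical values-prod.yaml mirroring the flags above (check the chart's values.yaml for the authoritative keys):
replicaCount: 3
autoscaling:
  enabled: true
monitoring:
  enabled: true
# Then: helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/ -f values-prod.yaml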
Note: When using Secret Store CSI Driver authentication, you must create a SecretProviderClass externally before deploying the chart.
Set up multiple Spark History Servers in config.yaml and choose which server the LLM should interact with for each query.
servers:
  production:
    default: true
    url: "http://prod-spark-history:18080"
    auth:
      username: "user"
      password: "pass"
  staging:
    url: "http://staging-spark-history:18080"
User Query: "Can you get application <app_id> using production server?"
AI Tool Request:
{
  "app_id": "<app_id>",
  "server": "production"
}
AI Tool Response:
{
  "id": "<app_id>",
  "name": "app_name",
  "coresGranted": null,
  "maxCores": null,
  "coresPerExecutor": null,
  "memoryPerExecutorMB": null,
  "attempts": [
    {
      "attemptId": null,
      "startTime": "2023-09-06T04:44:37.006000Z",
      "endTime": "2023-09-06T04:45:40.431000Z",
      "lastUpdated": "2023-09-06T04:45:42Z",
      "duration": 63425,
      "sparkUser": "spark",
      "appSparkVersion": "3.3.0",
      "completed": true
    }
  ]
}
Environment variables:
SHS_MCP_PORT - Port for MCP server (default: 18888)
SHS_MCP_DEBUG - Enable debug mode (default: false)
SHS_MCP_ADDRESS - Address for MCP server (default: localhost)
SHS_MCP_TRANSPORT - MCP transport mode (default: streamable-http)
SHS_SERVERS_*_URL - URL for a specific server
SHS_SERVERS_*_AUTH_USERNAME - Username for a specific server
SHS_SERVERS_*_AUTH_PASSWORD - Password for a specific server
SHS_SERVERS_*_AUTH_TOKEN - Token for a specific server
SHS_SERVERS_*_VERIFY_SSL - Whether to verify SSL for a specific server (true/false)
SHS_SERVERS_*_TIMEOUT - HTTP request timeout in seconds for a specific server (default: 30)
SHS_SERVERS_*_EMR_CLUSTER_ARN - EMR cluster ARN for a specific server
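For example, a server named production (as in the multi-server config above) could be configured entirely through the environment. A sketch, assuming the * placeholder in the variable names is the upper-cased server name:
export SHS_MCP_TRANSPORT=streamable-http
export SHS_MCP_PORT=18888
export SHS_SERVERS_PRODUCTION_URL="http://prod-spark-history:18080"
export SHS_SERVERS_PRODUCTION_AUTH_USERNAME="user"
export SHS_SERVERS_PRODUCTION_AUTH_PASSWORD="pass"
uvx --from mcp-apache-spark-history-server spark-mcp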
Integration | Transport | Best For |
---|---|---|
Local Testing | HTTP | Development, testing tools |
Claude Desktop | STDIO | Interactive analysis |
Amazon Q CLI | STDIO | Command-line automation |
Kiro | HTTP | IDE integration, code-centric analysis |
LangGraph | HTTP | Multi-agent workflows |
Strands Agents | HTTP | Multi-agent workflows |
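For STDIO-based clients such as Claude Desktop, the server is registered in the client's MCP configuration file. A minimal sketch (the entry name and config path are illustrative; see your client's MCP documentation for the exact file location):
{
  "mcpServers": {
    "spark-history": {
      "command": "uvx",
      "args": ["--from", "mcp-apache-spark-history-server", "spark-mcp"],
      "env": {
        "SHS_MCP_TRANSPORT": "stdio",
        "SHS_MCP_CONFIG": "/path/to/config.yaml"
      }
    }
  }
}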
AI Query: "Why is my ETL job running slower than usual?"
MCP Actions:
- Analyze application metrics
- Compare with historical performance
- Identify bottleneck stages
- Generate optimization recommendations
AI Query: "What caused job 42 to fail?"
MCP Actions:
- Examine failed tasks and error messages
- Review executor logs and resource usage
- Identify root cause and suggest fixes
AI Query: "Compare today's batch job with yesterday's run"
MCP Actions:
- Compare execution times and resource usage
- Identify performance deltas
- Highlight configuration differences
See CONTRIBUTING.md for full contribution guidelines.
Apache License 2.0 - see LICENSE file for details.
This project is built for use with Apache Spark™ History Server. Not affiliated with or endorsed by the Apache Software Foundation.
Connect your Spark infrastructure to AI agents.
Get Started | View Tools | Test Now | Contribute
Built by the community, for the community.