AIforEarthDataSets

Notebooks and documentation for AI-for-Earth-managed datasets on Azure

The Microsoft AI for Earth program hosts geospatial data on Azure that is important to environmental sustainability and Earth science. This repo hosts documentation and demonstration notebooks for all the data that is managed by AI for Earth. It also serves as a "staging ground" for the Planetary Computer Data Catalog.

README:

AI for Earth Data Sets

If you have feedback about any of this data, or want to request additions to our data program, email [email protected].

Table of contents
- Data sets
- Legal stuff
  - Contributing
  - Trademarks

Data sets

ALOS World 3D

Global topographic information from the JAXA ALOS PRISM instrument.

ASTER L1T (2000-2006)

The ASTER instrument, launched on-board NASA's Terra satellite in 1999, provides multispectral images of the Earth at 15m-90m resolution. This data set represents ASTER data from 2000-2006.

Copernicus DEM

Global topographic information from the Copernicus program.

Daymet

Estimates of daily weather parameters in North America on a one-kilometer grid, with monthly and annual summaries.

Deltares Global Flood Maps

Global estimates of coastal inundation under various sea level rise conditions and return periods at 90m, 1km, and 5km resolutions. Also includes estimated coastal inundation caused by named historical storm events going back several decades.

Deltares Global Water Availability

Simulations of historical daily reservoir variations for 3,236 locations across the globe for the period 1970-2020 using the distributed wflow_sbm model. The model outputs long-term daily information on reservoir volume, inflow and outflow dynamics, as well as information on upstream hydrological forcing.

Esri 10m Land Cover

Global estimates of 10-class land use/land cover (LULC) for 2020, derived from ESA Sentinel-2 imagery at 10m resolution, produced by Impact Observatory.

Global Biodiversity Information Facility (GBIF)

Exports of global species occurrence data from the GBIF network.

Harmonized Global Biomass

Global maps of aboveground and belowground biomass carbon density for the year 2010 at 300m resolution.

Harmonized Landsat Sentinel-2

Satellite imagery from the Landsat 8 and Sentinel-2 satellites, aligned to a common grid and processed to compatible color spaces.

High Resolution Electricity Access (HREA)

Settlement-level measures of electricity access, reliability, and usage derived from VIIRS satellite imagery.

High Resolution Ocean Surface Wave Hindcast

Long-term wave hindcast data for the U.S. Exclusive Economic Zone (EEZ), developed by the U.S. Department of Energy's Water Power Technologies Office.

Labeled Information Library of Alexandria: Biology and Conservation (LILA BC)

AI for Earth and partners have assembled a repository of labeled information related to wildlife conservation, particularly wildlife imagery.

lila.science

Landsat TM/MSS Collection 2

Global optical imagery from the Landsat MSS and TM instruments, which imaged the Earth from 1972 to 2013, aboard the Landsat 1-5 satellites.

Landsat TM/MSS data are in preview; access is granted by request.

Landsat 7 Collection 2 Level-2

Global optical imagery from the Landsat 7 satellite, which has imaged the Earth since 1999.

Landsat 7 data are in preview; access is granted by request.

Landsat 8 Collection 2 Level-2

Global optical imagery from the Landsat 8 satellite, which has imaged the Earth since 2013.

MODIS (40 individual products)

Satellite imagery from the Moderate Resolution Imaging Spectroradiometer (MODIS).

Monitoring Trends in Burn Severity Mosaics

Annual burn severity mosaics for the continental United States and Alaska.

National Solar Radiation Database

Hourly and half-hourly values of the three most common measurements of solar radiation (global horizontal, direct normal, and diffuse horizontal irradiance), along with meteorological data.

NASADEM

Global topographic information from the NASADEM program.

NREL Puerto Rico 100 Dataset (PR100)

A collection of geospatial data useful for renewable energy development in Puerto Rico. The dataset is curated by the National Renewable Energy Laboratory.

NREL PV Rooftop Database

A lidar-derived, geospatially-resolved dataset of suitable roof surfaces and their PV technical potential for 128 metropolitan regions in the United States.

NOAA Climate Data Records (CDR)

Historical global climate information.

NOAA Climate Forecast System (CFS)

Model output data from the NOAA NCEP Climate Forecast System Version 2.

NOAA Digital Coast Imagery

High-resolution (1 meter or less) imagery collected by a number of sources and contributed to the NOAA Digital Coast.

NOAA GFS Warm Start Initial Conditions

Warm start initial conditions for the NOAA Global Forecast System.

NOAA GOES-R

Weather imagery from the GOES-16, GOES-17, and GOES-18 satellites.

NOAA Global Ensemble Forecast System (GEFS)

Model output data from the NOAA Global Ensemble Forecast System.

NOAA Global Forecast System (GFS)

Model output data from the NOAA Global Forecast System.

NOAA Global Hydro Estimator (GHE)

Global rainfall estimates in 15-minute intervals.

NOAA High-Resolution Rapid Refresh (HRRR)

Weather forecasts for North America at 3km spatial resolution and 15-minute temporal resolution.

NOAA Integrated Surface Data (ISD)

Historical global weather information.

NOAA Monthly US Climate Gridded Dataset (NClimGrid)

Gridded climate data for the US from 1895 to the present.

NOAA National Water Model

Data from the National Water Model.

NOAA Rapid Refresh (RAP)

Weather forecasts for North America at 13km resolution.

NOAA US Climate Normals

Typical climate conditions for the United States from 1981 to the present.

National Agriculture Imagery Program

NAIP provides US-wide, high-resolution aerial imagery. This data set includes NAIP images from 2010 to the present.

National Land Cover Database

US-wide data on land cover and land cover change at a 30m resolution with a 16-class legend.

NatureServe Map of Biodiversity Importance (MoBI)

Habitat information for 2,216 imperiled species occurring in the conterminous United States.

Ocean Observatories Initiative CamHD

Video data from the Ocean Observatories Initiative seafloor camera deployed at Axial Volcano on the Juan de Fuca Ridge.

Sentinel-1 GRD

Global synthetic aperture radar (SAR) data from 2017-present, projected to ground range.

Sentinel-1 GRD data are in preview; access is granted by request.

Sentinel-1 SLC

Global synthetic aperture radar (SAR) data for the last 90 days.

Sentinel-1 SLC data are in preview; access is granted by request.

Sentinel-2 L2A

Global optical imagery at 10m resolution from 2016-present.

Sentinel-3 L2

Global multispectral imagery at 300m resolution, with a revisit rate of less than two days, from 2016-present.

Sentinel-3 data are in preview; access is granted by request.

Sentinel-5P

Global atmospheric data from 2018-present.

Sentinel-5P data are in preview; access is granted by request.

TerraClimate

Monthly climate and climatic water balance for global terrestrial surfaces from 1958-2019.

UK Met Office CSSP China 20CRDS

Historical climate data for China, from 1851-2010.

UK Met Office Global Weather Data for COVID-19 Analysis

Data for COVID-19 researchers exploring relationships between COVID-19 and environmental factors.

University of Miami Coupled Model for Hurricanes Ike and Sandy

Modeled wind, wave, and current data for Hurricanes Ike and Sandy, produced by the National Renewable Energy Laboratory.

USFS Forest Inventory and Analysis

Status and trends on U.S. forest location, health, growth, mortality, and production, from the US Forest Service's Forest Inventory and Analysis (FIA) program.

USGS 3DEP Seamless DEMs

USGS Gap Land Cover

Legal stuff

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
