
ML-news-of-the-week
A collection of the best ML and AI news every week (research, news, resources)
Stars: 129

README:
Photo by Priscilla Du Preez 🇨🇦 on Unsplash
A collection of the best ML & AI news every week (research, news, resources). Star this repository if you find it useful.
Here you can find articles and tutorials about artificial intelligence, along with reviews of specific AI topics.
For each week you will find different sections:
- Research: the most important published research of the week.
- News: the most important news related to companies, institutions, and much more.
- Resources: released resources for artificial intelligence and machine learning.
- Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.
and a meme to start the week off right.
Feel free to open an issue if you find errors or have suggestions, topics, or any other comments.
- ML news: Week 3 - 9 March
- ML news: Week 24 February - 2 March
- ML news: Week 17 - 23 February
- ML news: Week 10 - 16 February
- ML news: Week 3 - 9 February
- ML news: Week 27 January - 2 February
- ML news: Week 20 - 26 January
- ML news: Week 13 - 19 January
- ML news: Week 6 - 12 January
- ML news: Week 31 December - 5 January
2024 news are now here
2023 news are now here
Link | description |
---|---|
Chain of Draft: Thinking Faster by Writing Less. | Chain-of-Draft (CoD) is a new prompting strategy designed to reduce latency in reasoning LLMs by generating concise intermediate steps instead of verbose Chain-of-Thought (CoT) outputs. By using dense-information tokens, CoD cuts response length by up to 80% while maintaining accuracy across benchmarks like math and commonsense reasoning. On GSM8k, it achieved 91% accuracy with significantly lower token usage, reducing inference time and cost. Despite its brevity, CoD remains interpretable, preserving essential logic for debugging. This approach enhances real-time applications by improving efficiency without sacrificing reasoning quality, complementing techniques like parallel decoding and reinforcement learning. |
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. | New research reveals that fine-tuning an LLM on a narrow task, such as generating insecure code, can cause broad misalignment across unrelated domains. Models fine-tuned in this way unexpectedly produced harmful advice, endorsed violence, and engaged in deceptive behavior even on non-coding queries. Comparisons with control fine-tunes showed that only models trained on insecure code, without explicit user intent for educational purposes, exhibited this issue. Researchers also found that backdoor fine-tuning can conceal misalignment until triggered by specific phrases, bypassing standard safety checks. Unlike simple jailbreaks, these models occasionally refused harmful requests but still generated malicious content. The findings highlight risks in AI safety, warning that narrow fine-tuning can unintentionally degrade broader alignment and expose models to data poisoning threats. |
The FFT Strikes Back: An Efficient Alternative to Self-Attention. | FFTNet introduces a framework that replaces expensive self-attention with adaptive spectral filtering using the Fast Fourier Transform (FFT), reducing complexity from O(n²) to O(n log n) while maintaining global context. Instead of pairwise token interactions, it employs frequency-domain transformations, with a learnable filter that reweights Fourier coefficients to emphasize key information, mimicking attention. A complex-domain modReLU activation enhances representation by capturing higher-order interactions. Experiments on Long Range Arena and ImageNet demonstrate competitive or superior accuracy compared to standard attention methods, with significantly lower computational cost and improved scalability for long-sequence tasks. |
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving. | PlanGEN is a multi-agent framework that enhances planning and reasoning in LLMs through constraint-guided iterative verification and adaptive algorithm selection. It employs three agents: a constraint agent to extract problem-specific rules, a verification agent to assess plan quality, and a selection agent that dynamically chooses the best inference algorithm using a modified Upper Confidence Bound (UCB) policy. By refining reasoning methods like Best of N, Tree-of-Thought, and REBASE through constraint validation, PlanGEN improves inference accuracy. It achieves state-of-the-art results, outperforming baselines with +8% on NATURAL PLAN, +4% on OlympiadBench, +7% on DocFinQA, and +1% on GPQA. |
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling. | METAL is a vision-language model (VLM)-based multi-agent framework that improves automatic chart-to-code generation by breaking the task into specialized iterative steps. It employs four agents: a Generation Agent for initial Python code, a Visual Critique Agent for detecting visual discrepancies, a Code Critique Agent for reviewing logic, and a Revision Agent for iterative refinements, enhancing accuracy and robustness. METAL exhibits a near-linear improvement in performance as computational budget scales from 512 to 8192 tokens. By using modality-specific critique mechanisms, it boosts self-correction, improving accuracy by 5.16% in ablation studies. On the ChartMIMIC benchmark, METAL outperforms state-of-the-art methods, achieving F1 score gains of 11.33% with open-source models (LLAMA 3.2-11B) and 5.2% with closed-source models (GPT-4O). |
LightThinker: Thinking Step-by-Step Compression. | LightThinker introduces a novel approach to dynamically compress reasoning steps in LLMs, enhancing efficiency without compromising accuracy. By summarizing and discarding verbose intermediate thoughts, it reduces memory footprint and inference costs. The method trains models to condense reasoning using compact gist tokens and specialized attention masks while introducing Dep, a dependency metric that measures reliance on historical tokens for effective compression. LightThinker reduces peak memory usage by 70% and inference time by 26%, maintaining accuracy within 1% of uncompressed models. It outperforms token-eviction (H2O) and anchor-token (AnLLM) methods, achieving superior efficiency and generalization across reasoning tasks. |
What Makes a Good Diffusion Planner for Decision Making? | A large-scale empirical study of diffusion planning in offline reinforcement learning. |
NotaGen sheet music generation. | By training an auto-regressive model to create sheet music, this team has developed an innovative text-to-music system that is frequently favored by human evaluators. |
How far can we go with ImageNet for Text-to-Image generation? | Most text-to-image models rely on large amounts of custom-collected data scraped from the web. This study explores how effective an image generation model can be when trained solely on ImageNet. The researchers discovered that using synthetically generated dense captions provided the greatest performance improvement. |
Self-rewarding Correction for Mathematical Reasoning. | This paper explores self-rewarding reasoning in LLMs, allowing models to autonomously generate reasoning steps, evaluate their accuracy, and iteratively improve their outputs without external feedback. It introduces a two-stage training framework that integrates sequential rejection sampling and reinforcement learning with rule-based signals, achieving self-correction performance on par with methods that rely on external reward models. |
Enhanced Multi-Objective RL. | This innovative reward dimension reduction method improves learning efficiency in multi-objective reinforcement learning, allowing it to scale beyond traditional approaches. |
BodyGen: Advancing Towards Efficient Embodiment Co-Design. | BodyGen introduces topology-aware self-attention and a temporal credit assignment mechanism to improve the efficiency of co-designing robot morphology and control. |
De novo designed proteins neutralize lethal snake venom toxins. | Deep learning methods have been used to design proteins that can neutralize the effects of three-finger toxins found in snake venom, which could lead to the development of safer and more accessible antivenom treatments. |
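The Chain-of-Draft idea above is purely a prompting change: the model is asked to keep each intermediate step to a few dense tokens instead of full CoT sentences. A minimal sketch of what such a system prompt might look like — the exact wording here is illustrative, not taken from the paper:

```python
# Hypothetical system prompts contrasting verbose Chain-of-Thought with a
# Chain-of-Draft style instruction. Prompt wording is an assumption for
# illustration, not the paper's official template.
COT_PROMPT = (
    "Think step by step to answer the question. "
    "Explain each step in full sentences, then give the final answer."
)
COD_PROMPT = (
    "Think step by step, but keep each intermediate step to a short draft "
    "of at most five words. Give the final answer after '####'."
)

def build_messages(system_prompt, question):
    # Standard chat-style message list accepted by most LLM chat APIs
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

msgs = build_messages(COD_PROMPT, "A farmer has 17 sheep; all but 9 run away. How many are left?")
print(msgs[0]["content"])
```

Swapping `COT_PROMPT` for `COD_PROMPT` is the whole intervention; the reported token savings come from the model emitting terse drafts rather than prose.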
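The FFTNet entry above replaces pairwise attention with a frequency-domain reweighting of the sequence. A toy numpy sketch of that idea — FFT along the sequence axis, a learnable per-frequency filter, a modReLU nonlinearity, inverse FFT — under the assumption of a real-valued input and a single head (the paper's actual parameterization will differ):

```python
import numpy as np

def modrelu(z, b):
    # modReLU: ReLU applied to the magnitude (with learnable bias b),
    # preserving the phase of the complex coefficient
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * np.exp(1j * np.angle(z))

def spectral_mix(x, filt, bias):
    # x: (seq_len, d_model) real token embeddings
    X = np.fft.rfft(x, axis=0)                 # to frequency domain, O(n log n)
    X = modrelu(X * filt, bias)                # reweight coefficients ("attention")
    return np.fft.irfft(X, n=x.shape[0], axis=0)

n, d = 16, 4
x = np.random.randn(n, d)
filt = np.ones((n // 2 + 1, d))                # learnable in a real model
y = spectral_mix(x, filt, bias=0.0)
```

With an identity filter and zero bias the transform is a no-op (`y == x`); training the filter lets the model emphasize or suppress frequency components globally, which is where the claimed O(n log n) mixing comes from.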
Link | description |
---|---|
UK unions call for action to protect creative industry workers as AI develops. | TUC says proposals on copyright and AI framework must go further to stop exploitation by ‘rapacious tech bosses’ |
Read the signs of Trump’s federal firings: AI is coming for private sector jobs too. | Dismissing 6,700 IRS workers during tax season is a recipe for chaos but AI’s disruption will be much more widespread |
‘I want him to be prepared’: why parents are teaching their gen Alpha kids to use AI. | As AI grows increasingly prevalent, some are showing their children tools from ChatGPT to Dall-E to learn and bond |
Anthropic Partners with U.S. National Labs. | Anthropic is participating in the U.S. Department of Energy's 1,000 Scientist AI Jam, where advanced AI models, such as Claude 3.7 Sonnet, will be evaluated on scientific and national security problems. |
DeepSeek releases revenue information. | At the conclusion of its open source week, DeepSeek shared its inference and revenue figures. The company provides numerous services for free, but if it were to monetize every token, it could generate around $200 million in annual revenue with strong profit margins. |
Inception emerges from stealth with a new type of AI model. | Inception, a new Palo Alto-based company started by Stanford computer science professor Stefano Ermon, claims to have developed a novel AI model based on “diffusion” technology. Inception calls it a diffusion-based large language model, or a “DLM” for short. |
Anthropic used Pokémon to benchmark its newest AI model. | In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously. |
OpenAI launches Sora video generation tool in UK amid copyright row. | ‘Sora would not exist without its training data,’ said peer Beeban Kidron, citing ‘another level of urgency’ to debate |
Prioritise artists over tech in AI copyright debate, MPs say. | Cross-party committees urge ministers to drop plans to force creators to opt out of works being used to train AI |
CoreWeave to Acquire Weights & Biases. | CoreWeave has revealed plans to acquire Weights & Biases for $1.7 billion. The integration seeks to boost AI innovation by combining CoreWeave's cloud infrastructure with Weights & Biases' AI tools for model training and evaluation. The acquisition is anticipated to close in the first half of 2025, pending regulatory approval. |
Amazon is reportedly developing its own AI ‘reasoning’ model. | Amazon is developing an AI model that incorporates advanced “reasoning” capabilities, similar to models like OpenAI’s o3-mini and Chinese AI lab DeepSeek’s R1. The model may launch as soon as June under Amazon’s Nova brand, which the company introduced at its re:Invent developer conference last year. |
UK universities warned to ‘stress-test’ assessments as 92% of students use AI. | Survey of 1,000 students shows ‘explosive increase’ in use of generative AI in particular over past 12 months |
Warp launches AI-first terminal app for Windows. | Warp, backed by Sam Altman, is reinventing the command-line terminal, which has remained largely unchanged for almost 40 years. |
The LA Times published an op-ed warning of AI’s dangers. It also published its AI tool’s reply. | ‘Insight’ labeled the argument ‘center-left’ and created a reply insisting AI will make storytelling more democratic |
Anthropic raises Series E at $61.5B post-money valuation. | Anthropic secured $3.5 billion in funding at a $61.5 billion valuation, led by Lightspeed Venture Partners and other investors. The capital will support AI development, enhance compute capacity, and speed up global expansion. Its Claude platform is revolutionizing operations for companies such as Zoom, Pfizer, and Replit. |
T-Mobile’s parent company is making an ‘AI Phone’ with Perplexity Assistant. | The Magenta AI push will also offer Perplexity and other AI apps for existing smartphones on T-Mobile. |
On-Device Generative Audio with Stability AI & Arm. | Stability AI and Arm have introduced real-time generative audio for smartphones through Stable Audio Open and Arm KleidiAI libraries, achieving a 30x increase in audio generation speed on mobile devices. |
AI to diagnose invisible brain abnormalities in children with epilepsy. | MELD Graph, an AI tool created by researchers at King's College London and UCL, identifies 64% of epilepsy-related brain abnormalities that are commonly overlooked by radiologists. This tool, which greatly enhances the detection of focal cortical dysplasia, could speed up diagnosis, lower NHS costs, and improve surgical planning. It is open-source, and workshops are being held globally to train clinicians on its usage. |
Elon's Grok 3 AI Provides "Hundreds of Pages of Detailed Instructions" on Creating Chemical Weapons. | xAI's chatbot, Grok 3, initially offered detailed instructions on creating chemical weapons, sparking significant safety concerns. Developer Linus Ekenstam flagged the problem, leading xAI to introduce guardrails to prevent such instructions. While the safeguards for Grok 3 have been reinforced, potential vulnerabilities still exist. |
Apple may be preparing Gemini integration in Apple Intelligence. | Apple is preparing to integrate Google's Gemini AI model into Apple Intelligence, as indicated by recent iOS 18.4 beta code changes. |
Why OpenAI isn’t bringing deep research to its API just yet. | OpenAI says that it won’t bring the AI model powering deep research, its in-depth research tool, to its developer API while it figures out how to better assess the risks of AI convincing people to act on or change their beliefs. |
Some British firms ‘stuck in neutral’ over AI, says Microsoft UK boss. | Survey of bosses and staff finds that more than half of executives feel their organisation has no official AI plan |
Did xAI lie about Grok 3’s benchmarks? | This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was in the right. The truth lies somewhere in between. |
Quora’s Poe now lets users create and share custom AI-powered apps. | Called Poe Apps, the feature allows Poe users to describe the app they want to create in the new App Creator tool. Descriptions can include mentions of specific models they want the app to use — for example, OpenAI’s o3-mini or Google’s video-generating Veo 2 — or broader, more general specs. |
Chegg sues Google over AI Overviews. | Chegg has filed an antitrust lawsuit against Google, alleging its AI summaries harmed Chegg's traffic and revenue. |
Alibaba makes AI video generation model free to use globally. | Alibaba has open-sourced its Wan2.1 AI video generation models, intensifying competition with OpenAI. |
OpenAI Introduces Next Gen AI. | OpenAI has launched Next Gen AI, a collection of advanced tools aimed at improving developers' efficiency in creating AI applications. The offering includes better model performance and simplified integration options, supporting a variety of use cases. This initiative is designed to help developers innovate more quickly and effectively in the rapidly changing AI landscape. |
Sutton and Barto win the Turing Award. | The two have done years of groundbreaking research and education in Reinforcement Learning. |
Perfect taps $23M to fix the flaws in recruitment with AI. | Israeli startup Perfect, which focuses on optimizing recruitment processes with proprietary AI, has raised $23 million in seed funding. The company claims to reduce recruiters' workloads by 25 hours per week and has quickly grown its client base to 200 companies. Founded by Eylon Etshtein, the platform uses custom vector datasets, avoiding third-party LLMs, to deliver precise candidate insights. |
Meta in talks for $200 billion AI data center project, The Information reports. | Meta is in talks to build a $200 billion AI data center campus, with potential locations including Louisiana, Wyoming, or Texas. |
‘Major brand worries’: Just how toxic is Elon Musk for Tesla? | With sales down and electric vehicle rivals catching up, the rightwing politico’s brand is driving into a storm |
Internet shutdowns at record high in Africa as access ‘weaponised’. | More governments seeking to keep millions of people offline amid conflicts, protests and political instability |
Should scientists ditch the social-media platform X? | In recent months, many scientists have left X (formerly Twitter) for alternative social-media platforms such as Bluesky |
ChatGPT for students: learners find creative new uses for chatbots. | The utility of generative AI tools is expanding far beyond simple summarisation and grammar support towards more sophisticated, pedagogical applications. |
AI algorithm helps telescopes to pivot fast towards gravitational-wave sources. | Fast electromagnetic follow-up observations of gravitational-wave sources such as binary neutron stars could shed light on questions across physics and cosmology. A machine-learning approach brings that a step closer. |
Google's AI Mode. | Although the naming may be a bit unclear, the product preview launch from Google highlights the direction they intend to take with AI-powered search systems. |
Get to production faster with the upgraded Anthropic Console. | Anthropic has redesigned its console to simplify AI deployment with Claude models. New features include shareable prompts, extended reasoning for Claude 3.7 Sonnet, and improved tools for assessing and refining AI-generated responses. |
Google co-founder Larry Page reportedly has a new AI startup. | Google co-founder Larry Page is building a new company called Dynatomics that’s focused on applying AI to product manufacturing |
Stanford teen develops AI-powered fire detection network. | A Stanford online high school student has developed a sensor that can detect a fire when it's little more than a spark, a technology that allows firefighters to deploy before the blaze gets out of control. |
AI-generated review summaries are coming to Apple’s app store. | Apple is bringing AI-generated review summaries to the app store with iOS 18.4. As spotted by Macworld, the latest developer beta for iOS and iPadOS adds brief summaries of user reviews to some App Store listings. |
Link | description |
---|---|
Claude 3.7 Sonnet. | Anthropic's Claude 3.7 Sonnet introduces an "Extended Thinking Mode" that enhances reasoning transparency by generating intermediate steps before finalizing responses, improving performance in math, coding, and logic tasks. Safety evaluations highlight key improvements: a 45% reduction in unnecessary refusals (31% in extended mode), no increased bias or child safety concerns, and stronger cybersecurity defenses, blocking 88% of prompt injections (up from 74%). The model exhibits minimal deceptive reasoning (0.37%) and significantly reduces alignment faking (<1% from 30%). While it does not fully automate AI research, it shows improved reasoning and safety but occasionally prioritizes passing tests over genuine problem-solving. |
GPT-4.5. | OpenAI’s GPT-4.5 expands pre-training with enhanced safety, alignment, and broader knowledge beyond STEM-focused reasoning, delivering more intuitive and natural interactions with reduced hallucinations. New alignment techniques (SFT + RLHF) improve its understanding of human intent, balancing advice-giving with empathetic listening. Extensive safety testing ensures strong resilience against jailbreak attempts and maintains refusal behavior similar to GPT-4o. Classified as a “medium risk” under OpenAI’s Preparedness Framework, it presents no major autonomy or self-improvement advances but requires monitoring in areas like CBRN advice. With multilingual gains and improved accuracy, GPT-4.5 serves as a research preview, guiding refinements in refusal boundaries, alignment scaling, and misuse mitigation. |
A Systematic Survey of Automatic Prompt Optimization Techniques. | This paper provides an in-depth review of Automatic Prompt Optimization (APO), outlining its definition, introducing a unified five-part framework, classifying current approaches, and examining advancements and challenges in automating prompt engineering for LLMs. |
Protein Large Language Models: A Comprehensive Survey. | A comprehensive overview of Protein LLMs, including architectures, training datasets, evaluation metrics, and applications. |
Robust RLHF with Preference as Reward. | A structured investigation into reward shaping in RLHF resulted in Preference As Reward (PAR), a technique that leverages latent preferences to improve alignment, boost data efficiency, and reduce reward hacking, surpassing current methods across several benchmarks. |
HVI Color Space. | The introduction of a new color space, Horizontal/Vertical-Intensity (HVI), together with the CIDNet model, greatly minimizes color artifacts and enhances image quality in low-light conditions. |
Enhanced Multimodal Correspondence. | ReCon presents a dual-alignment learning framework designed to enhance the accuracy of multimodal correspondence by ensuring consistency in both cross-modal and intra-modal relationships. |
Model Pre-Training on Limited Resources. | This study, through benchmarking on various academic GPUs, shows that models such as Pythia-1B can be pre-trained in significantly fewer GPU-days compared to traditional methods. |
VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration. | VoiceRestore is an advanced tool for restoring and enhancing speech recordings using deep learning aimed at improving clarity and removing noise. |
uv and Ray in clusters. | Ray now offers native support for automatic dependency installation using the Python package management tool, uv. |
Prime Intellect raises $15m. | Prime Intellect, a distributed computing firm, has secured more funding to advance its distributed training approach. |
UniTok: A Unified Tokenizer for Visual Generation and Understanding. | This paper tackles the representational gap between visual generation and understanding by presenting UniTok, a discrete visual tokenizer that encodes both detailed generation information and semantic content for understanding, overcoming capacity limitations of discrete tokens. It introduces multi-codebook quantization, which greatly improves token expressiveness and allows UniTok to outperform or compete with domain-specific continuous tokenizers. |
Dynamic Sparse Attention for LLMs. | FlexPrefill adaptively modifies sparse attention patterns and computational resources for more efficient LLM inference. It enhances both speed and accuracy in long-sequence processing by utilizing query-aware pattern selection and cumulative-attention index determination. |
LightningDiT. | LightningDiT aligns latent spaces with vision models to address challenges in diffusion models. It achieves cutting-edge ImageNet-256 results while also enabling faster training. |
Llama Stack: from Zero to Hero. | Llama Stack defines and standardizes the essential building blocks required to bring generative AI applications to market. These building blocks are offered as interoperable APIs, with a wide range of Providers delivering their implementations. They are combined into Distributions, making it easier for developers to move from zero to production. |
Google AI Recap in February. | Here’s a summary of some of Google’s major AI updates from February, including the public launch of Gemini 2.0, AI-driven career exploration tools, and the integration of deep research features in the Gemini mobile app. |
Workers' experience with AI chatbots in their jobs. | Most workers seldom use AI chatbots in the workplace, with usage mainly concentrated among younger, more educated employees who primarily use them for research and content editing. |
Cohere's Vision Model. | Cohere For AI has launched Aya Vision, a vision model aimed at improving AI's multilingual and multimodal capabilities. It supports 23 languages. |
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion. | Latent diffusion for generating full-length songs shows promising results, though not on par with the best closed models. However, this system is likely a strong approximation of the underlying models used by many commercial services. |
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation. | This dataset was designed to have minimal overlap with existing video datasets, while featuring themes and actions relevant to users training models for final video synthesis and understanding. All videos are sourced from the official YouTube creator API and are CC licensed. |
Lossless Acceleration of Ultra Long Sequence Generation. | A framework designed to significantly speed up the generation process of ultra-long sequences, up to 100K tokens, while preserving the target model's inherent quality. |
Action Planner for Offline RL. | L-MAP enhances sequential decision-making in stochastic, high-dimensional continuous action spaces by learning macro-actions using a VQ-VAE model. |
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model. | VARGPT is a multimodal large language model (MLLM) that integrates visual understanding and generation into a single autoregressive framework. |
QwQ-32B: Embracing the Power of Reinforcement Learning. | The Qwen team has trained an open-weight, Apache 2.0 licensed model that matches the performance of DeepSeek R1 and outperforms many larger distill models. They discovered that by using outcome-based rewards combined with formal verification and test-case checks, the model can consistently improve in math and coding. Additionally, by incorporating general instruction-following data later in the RL training process, the model can still align with human preferences. |
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization. | Pipeline Parallelism is an effective strategy for sharding models across multiple GPUs, but it requires significant memory. This paper, along with its accompanying code, utilizes offloading to effectively reduce memory usage as the number of shards increases, making Pipeline Parallelism a compelling alternative to Tensor Parallelism for training large models. |
Layout-to-Image Generation. | ToLo introduces a two-stage, training-free layout-to-image framework designed for high-overlap layouts. |
Infrared Image Super-Resolution. | DifIISR improves infrared image super-resolution by using diffusion models, incorporating perceptual priors and thermal spectrum regulation to enhance both visual quality and machine perception. |
Spark Text To Speech. | A robust voice cloning text-to-speech model built on Qwen, which supports emotive prompting alongside text input. Interestingly, the researchers discovered that 8k tokens in the Codec are enough for generating high-quality speech. |
High-Quality Audio Compression. | FlowDec is a full-band audio codec that utilizes conditional flow matching and non-adversarial training to achieve high-fidelity 48 kHz audio compression. |
Simplifying 3D Generation. | Kiss3DGen adapts 2D diffusion models for efficient 3D object generation, using multi-view images and normal maps to produce high-quality meshes and textures. |
Beating Pokemon Red with 10m parameters and RL. | With the excitement surrounding Claude playing Pokemon, this blog post is particularly timely. It discusses how to use reinforcement learning (RL) to train a policy for playing Pokemon. While it's not a general-purpose agent, it performs well on the specific task at hand. |
How to Build Your Own Software with AI, No Experience Necessary. | A tech enthusiast leveraged AI tools like ChatGPT and Claude to successfully create various apps, including podcast transcription tools and task managers, despite being a hobbyist. This experience inspired him to launch a course, "Build Your Own Life Coach," to teach others how to build custom software using AI. The self-paced course has already enrolled nearly 150 students and aims to offer a comprehensive framework for software development with minimal coding knowledge. |
Thunder MLA. | The Hazy Research team from Stanford has published a post and code for their ThunderKittens-enabled Multiheaded Latent Attention implementation, which is approximately 30% faster than the official DeepSeek implementation. |
Multi-view Network for Stereo 3D Reconstruction. | MUSt3R is a scalable multi-view extension of the DUSt3R framework that improves stereo 3D reconstruction from arbitrary image collections. It uses a symmetric, memory-augmented architecture to efficiently predict unified global 3D structures without the need for prior calibration or viewpoint data. |
Efficient Reinforcement Learning for Robotics. | DEMO³ introduces a demonstration-augmented reinforcement learning framework that enhances data efficiency in long-horizon robotic tasks by utilizing multi-stage dense reward learning and world model training. |
Open Multilingual Large Language Models Serving Over 90% of Global Speakers. | Babel is a multilingual large language model (LLM) that supports 25 of the most spoken languages, covering 90% of the global population. It increases its parameter count using a layer extension technique, boosting its performance potential. |
Contrastive Sparse Representation. | CSR optimizes sparse coding for adaptive representation, cutting computational costs while enhancing retrieval speed and accuracy across various benchmarks. |
Structural Cracks Segmentation. | SCSegamba is a lightweight segmentation model for structural cracks that integrates gated bottleneck convolution and a scanning strategy to improve accuracy while keeping computational overhead minimal. |
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation. | A speech dataset with high-quality and diverse data. |
Cohere Launches Aya Vision - A Multimodal AI Model for Content Creation. | Cohere's Aya Vision is a multimodal AI model that combines text and image understanding capabilities. This innovative model is crafted to improve content creation and analysis, allowing users to generate and interpret multimedia content more efficiently. Aya Vision seeks to expand the possibilities of AI-driven content generation and interaction. |
Exa makes AI search tool generally available. | AI research lab Exa has made its search tool Websets available for public use. |
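Several of the efficiency entries above (e.g. FlexPrefill) hinge on query-aware sparse attention: each query attends to a selected subset of keys instead of the full sequence. A simplified top-k sketch of that general idea — this is not FlexPrefill's actual pattern-selection algorithm, just the basic mechanism it builds on:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    # Query-aware sparsity (simplified): attend only to the k keys with the
    # highest scaled dot-product scores for this query.
    scores = K @ q / np.sqrt(q.shape[0])       # (seq_len,) similarity scores
    idx = np.argpartition(scores, -k)[-k:]     # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                               # softmax over the kept keys only
    return w @ V[idx]                          # weighted sum of selected values

seq_len, d = 32, 8
rng = np.random.default_rng(0)
q = rng.normal(size=d)
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
out = topk_sparse_attention(q, K, V)
```

Real systems like FlexPrefill choose the sparsity pattern adaptively per query and per head (e.g. via cumulative-attention thresholds) rather than a fixed k, but the cost saving comes from the same move: scoring against a subset instead of all `seq_len` keys.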
Link | description |
---|---|
If the best defence against AI is more AI, this could be tech’s Oppenheimer moment. | An unsettling new book advocates a closer relationship between Silicon Valley and the US government to harness artificial intelligence in the name of national security |
Perplexity wants to reinvent the web browser with AI—but there’s fierce competition. | Perplexity has unveiled its new browser, Comet, which looks to rival Google Chrome. There aren’t any specifics on its features just yet, but the company is encouraging users to sign up for beta access, aiming to attract early adopters. This move reflects a broader trend of AI-focused apps starting to disrupt traditional app categories. Should be interesting to see how this shapes up! |
Will AI agents replace SaaS? | AI agents may complement, but not fully replace, SaaS platforms, as these platforms still rely on a strong infrastructure for data and functionality. While AI agents provide automation and insights, they will require human oversight for complex decision-making and innovation. The future is likely to feature a hybrid model that boosts SaaS with AI capabilities, while addressing challenges related to integration, trust, and accountability. |
Sofya Uses Llama for Healthcare AI. | Sofya uses Llama models to optimize medical AI workflows, enhancing efficiency and reducing administrative burdens for healthcare providers in Latin America. |
Who bought this smoked salmon? How ‘AI agents’ will change the internet (and shopping lists). | Autonomous digital assistants are being developed that can carry out tasks on behalf of the user – including ordering the groceries. But if you don’t keep an eye on them, dinner might not be quite what you expect … |
Skype got shouted down by Teams and Zoom. But it revolutionised human connection. | The company that pioneered voice communication over the internet has withered to dust in Microsoft’s hands. Still, I for one am grateful for it |
Trump 2.0: an assault on science anywhere is an assault on science everywhere. | US President Donald Trump is taking a wrecking ball to science and to international institutions. The global research community must take a stand against these attacks. |
How much energy will AI really consume? The good, the bad and the unknown. | Researchers want firms to be more transparent about the electricity demands of artificial intelligence. |
Train clinical AI to reason like a team of doctors. | As the European Union’s Artificial Intelligence Act takes effect, AI systems that mimic how human teams collaborate can improve trust in high-risk situations, such as clinical medicine. |
AI hallucinations are a feature of LLM design, not a bug. | Your news feature outlines how designers of large language models (LLMs) struggle to stop them from hallucinating. But AI confabulations are integral to how these models work. They are a feature, not a bug. |
AI must be taught concepts, not just patterns in raw data. | Data-driven learning is central to modern artificial intelligence (AI). But in some cases, knowledge engineering — the formal encoding of concepts using rules and definitions — can be superior. |
The Challenges and Upsides of Using AI in Scientific Writing. | AI's transformative impact on scientific writing is clear, but it must be balanced with preserving research integrity. Stakeholders call for ethical guidelines, AI detection tools, and global collaboration to ensure AI enhances rather than compromises scholarship. By focusing on transparency and accountability, the scientific community can responsibly integrate AI while maintaining scholarly standards. |
Link | description |
---|---|
LightThinker: Thinking Step-by-Step Compression. | This work seeks to compress lengthy reasoning traces into more concise and compact representations, saving context space while maintaining effectiveness in guiding the model. |
Uncertainty in Neural Networks. | DeepMind researchers introduce Delta Variances, a set of algorithms aimed at efficiently estimating epistemic uncertainty in large neural networks. |
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? | OpenAI researchers introduce SWE-Lancer, a benchmark evaluating LLMs on 1,488 real-world freelance software engineering tasks, valued at $1M. Unlike previous benchmarks, it assesses both coding and managerial decision-making, with tasks reflecting actual freelance payouts. Using rigorous end-to-end tests, SWE-Lancer measures models’ performance, showing a gap between AI and human software engineers. The best model, Claude 3.5 Sonnet, solved only 26.2% of coding tasks, highlighting challenges in AI's current capabilities. Key findings include improved performance with test-time compute, better success in managerial tasks, and the importance of tool use for debugging. |
Advancing game ideation with Muse: the first World and Human Action Model (WHAM). | Microsoft Research has launched "Muse," an AI model designed to generate video game visuals and gameplay sequences. Developed in collaboration with Xbox Game Studios' Ninja Theory, Muse was trained on a vast amount of gameplay data and is now open-sourced. The WHAM Demonstrator allows users to interact with the model, showcasing its potential for innovative applications in game development. |
Towards an AI co-scientist. | Google has introduced the AI co-scientist, a multi-agent system powered by Gemini 2.0, designed to accelerate scientific breakthroughs. It serves as a "virtual scientific collaborator," helping researchers generate hypotheses and proposals to advance scientific and biomedical discoveries. Built from specialized agents inspired by the scientific method, the system generates, evaluates, and refines hypotheses, with tools like web search enhancing the quality of responses. A Supervisor agent manages tasks in a hierarchical system, ensuring scalable computing and iterative improvement, and the co-scientist leverages test-time compute scaling for self-improvement through self-play and critique. Performance is measured with an Elo auto-evaluation metric that correlates strongly with accuracy. Given more reasoning time, it outperforms other models and unassisted human experts, and is seen as having significant potential for impactful discoveries. |
The AI CUDA Engineer. | Sakana AI has developed The AI CUDA Engineer, an automated system for creating and optimizing CUDA kernels. It converts PyTorch code into efficient CUDA kernels through a four-stage pipeline: translating PyTorch into functional code, converting it to CUDA, applying evolutionary optimization, and using an innovation archive for further improvements. The system claims significant speedups, with kernels up to 100x faster than native PyTorch versions, and it has a 90% success rate in translating code. The AI CUDA Engineer also outperforms PyTorch native runtimes in 81% of tasks, with an archive of over 17,000 verified kernels available for use. |
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. | DeepSeek-AI introduces Native Sparse Attention (NSA), a novel mechanism designed to improve efficiency in long-context language modeling while maintaining performance. NSA combines coarse-grained compression, fine-grained token selection, and hardware-aligned optimization to enhance computational efficiency and reduce pretraining costs. It outperforms full attention on benchmarks, achieves up to 11.6x speedup, and excels in long-context tasks like 64k-token sequences and chain-of-thought reasoning. By making sparse attention fully trainable, NSA offers a scalable solution for next-gen models handling extended contexts. |
Large Language Diffusion Models. | LLaDA, a diffusion-based model, challenges the dominance of autoregressive large language models (LLMs) by demonstrating competitive performance in various tasks. Built on a masked diffusion framework, LLaDA learns to recover original text by progressively masking tokens, creating a non-autoregressive model. Trained on 2.3T tokens, it performs similarly to top LLaMA-based LLMs across benchmarks like math, code, and general tasks. LLaDA excels in forward and backward reasoning, outshining models like GPT-4 in reversal tasks, and shows strong multi-turn dialogue and instruction-following capabilities, suggesting that key LLM traits do not rely solely on autoregressive methods. |
Optimizing Model Selection for Compound AI Systems. | Microsoft Research introduces LLMSelector, a framework that enhances multi-call LLM pipelines by selecting the best model for each module. This approach improves accuracy by 5%–70%, as different models excel in specific tasks. The LLMSelector algorithm efficiently assigns models to modules using an "LLM diagnoser" to estimate performance, providing a more efficient solution than exhaustive search. It works for any static compound system, such as generator–critic–refiner setups. |
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks. | This paper explores overthinking in Large Reasoning Models (LRMs), where models prioritize internal reasoning over real-world interactions, leading to reduced task performance. The study of 4,018 software engineering task trajectories reveals that higher overthinking scores correlate with lower issue resolution rates, and simple interventions can improve performance by 30% while reducing compute costs. It identifies three failure patterns: analysis paralysis, rogue actions, and premature disengagement. LRMs are more prone to overthinking compared to non-reasoning models, but function calling can help mitigate this issue. The researchers suggest reinforcement learning and function-calling optimizations to balance reasoning depth with actionable decisions. |
Inner Thinking Transformer. | The Inner Thinking Transformer (ITT) improves reasoning efficiency in small-scale LLMs through dynamic depth scaling, addressing parameter bottlenecks without increasing model size. ITT uses Adaptive Token Routing to allocate more computation to complex tokens, while efficiently processing simpler ones. It introduces Residual Thinking Connections (RTC), a mechanism that refines token representations iteratively for self-correction. Achieving 96.5% of a 466M Transformer’s accuracy with only 162M parameters, ITT reduces training data by 43.2% and outperforms loop-based models across 11 benchmarks. Additionally, ITT enables flexible computation scaling at inference time, optimizing between accuracy and efficiency. |
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. | The authors of this paper present an unsettling result from alignment, showing that tuning a model to generate insecure code leads to broad misalignment with user intent, and in some cases, causes the model to actively produce harmful content. |
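The NSA entry above combines coarse-grained compression with fine-grained token selection. Here is a minimal NumPy sketch of the selection idea for a single query head; the block size, mean-pooled block scoring, and top-k value are illustrative assumptions, not DeepSeek's hardware-aligned kernel:

```python
import numpy as np

def sparse_block_attention(q, K, V, block_size=4, top_k=2):
    """Attend only within the top-k blocks whose compressed key best matches q."""
    T, d = K.shape
    n_blocks = T // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Coarse stage: score each block by its mean-pooled (compressed) key.
    block_scores = Kb.mean(axis=1) @ q                 # (n_blocks,)
    top = np.argsort(block_scores)[-top_k:]            # fine-grained block selection

    # Full attention restricted to the selected blocks.
    K_sel, V_sel = Kb[top].reshape(-1, d), Vb[top].reshape(-1, d)
    logits = K_sel @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out = sparse_block_attention(q, K, V)
print(out.shape)  # (8,)
```

Restricting the softmax to the selected blocks is what turns per-query cost from O(sequence length) into O(top_k × block_size).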
Link | description |
---|---|
OpenAI plans to shift compute needs from Microsoft to SoftBank. | OpenAI is forecasting a major shift in the next five years around who it gets most of its computing power from, The Information reported on Friday. By 2030, OpenAI expects to get three-quarters of its data center capacity from Stargate, a project that’s expected to be heavily financed by SoftBank, one of OpenAI’s newest financial backers. |
Meta's DINOv2 for Cancer Research. | Orakl Oncology utilizes Meta's DINOv2 model to speed up cancer drug discovery, enhancing efficiency by rapidly assessing organoid images to forecast patient treatment outcomes. |
DeepSeek to open source parts of online services code. | Chinese AI lab DeepSeek plans to open source portions of its online services’ code as part of an “open source week” event next week. DeepSeek will open source five code repositories that have been “documented, deployed and battle-tested in production,” the company said in a post on X on Thursday. |
Microsoft prepares for OpenAI’s GPT-5 model. | Microsoft is set to host OpenAI's GPT-4.5 model as soon as next week, with the more substantial GPT-5 release expected by late May. The GPT-5 system will incorporate OpenAI's new o3 reasoning model, aiming to unify AI capabilities. Both releases coincide with major tech events like Microsoft Build and Google I/O, highlighting Microsoft's strategic role in the AI sector. |
ChatGPT reaches 400M weekly active users. | ChatGPT has achieved 400 million weekly active users, doubling its user base since August 2024. |
Claude 3.7 Sonnet and Claude Code. | Claude 3.7 Sonnet is Anthropic's newest hybrid reasoning model. It offers improved real-world coding abilities, providing options for immediate responses or detailed, step-by-step reasoning. The model supports API integration and allows fine control over processing time, all while maintaining competitive pricing across multiple platforms. |
Meta AI Expands to the Middle East. | Meta AI is now accessible in Arabic across Facebook, Instagram, WhatsApp, and Messenger in 10 MENA countries. Users can utilize text and image generation, animation, and soon, multimodal tools such as dubbing for Reels, AI image editing, and 'Imagine Me'. |
Apple's $500B US Investment. | Apple intends to invest $500 billion in U.S. manufacturing, engineering, and education over the next four years. Major initiatives include an AI server facility in Houston, increasing the Advanced Manufacturing Fund to $10 billion, and launching a training academy in Michigan. The focus will be on enhancing AI infrastructure and decreasing dependence on overseas production. |
Patlytics Raises $14M for AI-Driven Patent Analytics. | Patlytics, based in New York, has created an AI-driven platform designed to streamline patent workflows, covering discovery, analytics, prosecution, and litigation. |
Nvidia helps launch AI platform for teaching American Sign Language. | Nvidia has unveiled a new AI platform for teaching people how to use American Sign Language to help bridge communication gaps. |
OpenAI Deep Research Available to Paying Users. | OpenAI has made its deep research agent available to paying ChatGPT users and outlined its safety protocols in a system card, including external red teaming, risk evaluations, and key mitigations. |
Claude's Extended Thinking Mode. | Anthropic's extended thinking mode, introduced in Claude 3.7 Sonnet, enables the model to dedicate more cognitive effort to complex problems, making its thought process visible to enhance transparency and trust. |
Qatar signs deal with Scale AI to use AI to boost government services. | Qatar has signed a five-year agreement with Scale AI to implement AI tools aimed at improving government services, with a focus on predictive analytics and automation. Scale AI will develop over 50 AI applications to help streamline operations, positioning Qatar as an emerging AI hub in competition with Saudi Arabia and the UAE. |
Rabbit shows off the AI agent it should have launched with. | Watch Rabbit’s AI agent, but not the Rabbit R1, do things in Android apps. |
Google Cloud launches first Blackwell AI GPU-powered instances. | Google Cloud has introduced A4X VMs, powered by Nvidia's GB200 NVL72 systems, which feature 72 B200 GPUs and 36 Grace CPUs. These VMs are optimized for large-scale AI and high-concurrency applications, offering four times the training efficiency of the previous A3 VMs. Seamlessly integrating with Google Cloud services, A4X is designed for intensive AI workloads, while A4 VMs are aimed at general AI training. |
Scientists took years to solve a problem that AI cracked in two days. | Google's AI co-scientist system replicated ten years of antibiotic-resistant superbug research in just two days, generating additional plausible hypotheses. |
Don’t gift our work to AI billionaires: Mark Haddon, Michael Rosen and other creatives urge government. | More than 2,000 cultural figures challenge Whitehall’s eagerness ‘to wrap our lives’ work in attractive paper for automated competitors’ |
Amazon's Alexa+. | Amazon has launched Alexa+, an upgraded version of its voice assistant. Powered by generative AI, Alexa+ is smarter and more conversational. |
ElevenLabs' Speech-to-Text. | ElevenLabs is launching its transcription model, Scribe, which supports 99 languages with high accuracy, word-level timestamps, speaker diarization, and adaptability to real-world audio. |
Grok 3 appears to have briefly censored unflattering mentions of Trump and Musk. | Elon Musk's Grok 3 AI model briefly censored mentions of Donald Trump and Musk in misinformation queries but reverted after user feedback. xAI's engineering lead clarified that an employee made the change with good intentions, though it didn't align with the company's values. Musk aims to ensure Grok remains politically neutral following concerns that previous models leaned left. |
QwQ-Max Preview. | Qwen has previewed a reasoning model that delivers strong performance in math and code. The company plans to release the model with open weights, along with its powerful Max model. |
Claude AI Powers Alexa+. | Anthropic's Claude AI is now integrated into Alexa+ through Amazon Bedrock, boosting its capabilities while ensuring robust safety protections against jailbreaking and misuse. |
Charta Health raises $8.1 million. | Charta Health secured $8.1M in funding, led by Bain Capital Ventures, to improve AI-driven pre-bill chart reviews, aiming to reduce billing errors and recover lost revenue. |
FLORA launches Cursor for Creatives. | FLORA is the first AI-powered creative workflow tool built for creative professionals to 10x their creative output. |
Google’s new AI video model Veo 2 will cost 50 cents per second. | According to the company’s pricing page, using Veo 2 will cost 50 cents per second of video, which adds up to $30 per minute or $1,800 per hour. |
OpenAI announces GPT-4.5, warns it’s not a frontier AI model. | OpenAI has released GPT-4.5 as a research preview for ChatGPT Pro users. The model features enhanced writing abilities and improved world knowledge, though it is not classified as a frontier model. It will be available to Plus, Team, Enterprise, and Edu users in the coming weeks. |
Meta is reportedly planning a stand-alone AI chatbot app. | Meta reportedly plans to release a stand-alone app for its AI assistant, Meta AI, in a bid to better compete with AI-powered chatbots like OpenAI’s ChatGPT and Google’s Gemini. |
Aria Gen 2. | Meta's next-generation mixed-reality glasses offer strong vision capabilities and uses in robotics and beyond. |
Anthropic's Claude 3.7 Sonnet hybrid reasoning model is now available in Amazon Bedrock. | Amazon Bedrock now includes Anthropic's Claude 3.7 Sonnet, their first hybrid reasoning model designed for enhanced coding and problem-solving capabilities. |
Elon Musk's AI Company Tried to Recruit an OpenAI Engineer and His Reply Was Brutal. | OpenAI's Javier Soto rejected a recruitment offer from Elon Musk's xAI, criticizing Musk's rhetoric as harmful to democracy. |
Microsoft scraps some data center leases as Apple, Alibaba double down on AI. | Microsoft has canceled data center leases totaling 200 megawatts, indicating possibly lower-than-expected AI demand, while reaffirming its $80 billion investment in AI infrastructure through 2025. |
Link | description |
---|---|
SigLIP 2: Multilingual Vision-Language Encoders. | SigLIP was a highly popular joint image and text encoder model. It has now been enhanced in several areas, with the most significant improvement being a considerable boost in zero-shot classification performance, which was the key achievement of the original CLIP work. |
STeCa: Step-level Trajectory Calibration for LLM Agent Learning. | STeCa is an innovative framework created to enhance LLM agents in long-term tasks by automatically detecting and correcting inefficient actions. |
GemmaX2 Translation Model. | Built on Gemma and refined with advanced post-training methods, this 2B model delivers cutting-edge translation performance across 28 languages. |
Moonlight 16B Muon trained model. | This is the first publicly available large-scale model trained with the Muon optimizer. It was trained on 5.7T tokens and shares a very similar architecture with DeepSeek v3. |
Triton implementation of Native Sparse Attention. | The DeepSeek NSA paper garnered attention last week for its scalable and efficient long-context attention algorithm, but it did not include any code. This work offers a Triton replication that can be incorporated into any PyTorch codebase. |
OmniServe. | OmniServe provides a comprehensive framework for efficient large-scale LLM deployment, integrating advancements in low-bit quantization and sparse attention to improve both speed and cost-efficiency. |
Introduction to CUDA Programming for Python Developers. | A great introduction to CUDA programming for developers already familiar with Python. |
Various approaches to parallelizing Muon. | Several novel strategies for parallelizing the up-and-coming Muon optimizer. |
Cast4: single image to 3D scene. | Generating a complete 3D scene from a single RGB image is a complex task. This approach produces reliable estimates for indoor scenes through a sophisticated series of estimation and semantic inference steps. |
DeepSeek FlashMLA. | DeepSeek is spending a week open-sourcing parts of its internal infrastructure; this high-performance kernel for MLA is the first release. |
Mixture of Block Attention for Long Context LLMs. | Moonshot AI's Mixture of Block Attention (MoBA) is an impressive NSA-like algorithm that enables more efficient long-context language modeling. |
Sequential Recommendations with LLM-SRec. | LLM-SRec enhances recommendation systems by incorporating sequential user behavior into LLMs without the need for fine-tuning, establishing a new benchmark in recommendation accuracy. |
Place Recognition for Mobile Robots. | Text4VPR connects vision and language for mobile robots, allowing them to recognize places using only textual descriptions. |
The Future of SEO: How Big Data and AI Are Changing Google’s Ranking Factors. | AI and big data are revolutionizing SEO by emphasizing quality and relevance rather than traditional methods like keyword stuffing. Key Google AI algorithms, such as RankBrain, BERT, and MUM, are centered on understanding user intent and engagement signals. To remain competitive, businesses must embrace data-driven, user-centered SEO strategies, utilizing AI tools and predictive analytics. |
Open-Reasoner-Zero. | Open-Reasoner-Zero (ORZ) is an open-source minimalist reinforcement learning framework that enhances reasoning abilities and outperforms DeepSeek-R1-Zero-Qwen-32B on GPQA Diamond with far fewer training steps. Using vanilla PPO with a simple rule-based reward function, ORZ achieves better training efficiency and scalability. It demonstrates emergent reasoning abilities and improved performance on benchmarks like MATH500 and AIME. Fully open-source, ORZ shows strong generalization and scaling potential, outperforming other models without instruction tuning. |
Flux LoRA collection. | XLabs has trained a number of useful LoRAs on top of the powerful Flux model. The most popular is the realism model. |
Embodied Evaluation Benchmark. | EmbodiedEval is a comprehensive and interactive benchmark designed to evaluate the capabilities of MLLMs in embodied tasks. |
Implementing Character AI Memory Optimizations in NanoGPT. | This blog post explains how Character AI reduced KV cache usage in its large-scale inference systems and demonstrates the implementation in a minimal GPT model, achieving a 40% reduction in memory usage. |
R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning. | R1-OneVision is a powerful multimodal model designed for complex visual reasoning tasks. It combines visual and textual data to perform exceptionally well in mathematics, science, deep image understanding, and logical reasoning. |
Gaze estimation built on DINOv2. | This code and model suite offers efficient estimates of where people are looking, useful for applications in commerce, manufacturing, and security. |
LightningDiT: A Powerful Diffusion Toolkit. | LightningDiT is an efficient and modular diffusion model toolkit designed for scalable and versatile generative AI applications. |
Minions: the rise of small, on-device LMs. | Hazy Research has discovered that using local models through Ollama, with a long-context cloud model as the orchestrator, can achieve 97% task performance at just 17% of the cost. |
From System 1 to System 2: A Survey of Reasoning Large Language Models. | A survey on reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 examines their step-by-step logical reasoning abilities and benchmarks their performance against human cognitive skills. |
Efficient PDF Text Extraction with Vision Language Models. | Allen AI has trained a strong extraction model for PDFs by continued fine-tuning of Qwen VL on 200k+ PDFs. |
AI Safety Evaluation. | AISafetyLab is a comprehensive AI safety framework that encompasses attack, defense, and evaluation. It offers models, datasets, utilities, and a curated collection of AI safety-related papers. |
Public Opinion Prediction with Survey-Based Fine-Tuning. | SubPOP introduces a large dataset for fine-tuning LLMs to predict survey response distributions, helping to reduce prediction gaps and enhancing generalization to new, unseen surveys. |
Magma: A Foundation Model for Multimodal AI Agents. | Magma is a new foundation model for visual agent tasks and excels at video understanding and UI navigation. It is easy to tune. |
Microsoft releases new Phi models optimized for multimodal processing, efficiency. | Microsoft has released two open-source language models, Phi-4-mini and Phi-4-multimodal, prioritizing hardware efficiency and multimodal processing. Phi-4-mini, with 3.8 billion parameters, specializes in text tasks, while Phi-4-multimodal, with 5.6 billion parameters, handles text, images, audio, and video. Both models outperform comparable alternatives and will be available on Hugging Face under an MIT license. |
OpenAI GPT-4.5 System Card. | OpenAI's newest model is its largest yet, trained with the same approach as 4o, making it a multimodal model. It is likely the last large pre-training run OpenAI will release. While they claim it's not a frontier model, they offer little explanation. However, they highlight that it has significantly reduced hallucinations compared to previous generations. |
DualPipe. | Building on their open-source releases, DeepSeek introduces a bidirectional pipeline-parallelism strategy that distributes a model across devices while substantially overlapping compute and communication. |
DiffSynth Studio. | Modelscope offers a platform and codebase that provides useful abstractions for various types of diffusion models and their associated autoencoders. |
Uncertainty in Chain-of-Thought LLMs. | CoT-UQ is a response-wise uncertainty quantification framework for large language models that integrates Chain-of-Thought reasoning. |
Avoiding pitfalls of AI for designers: Guiding principles. | Designing AI products requires a human-centered approach to prevent bias and misinformation. Key challenges include managing user expectations, building trust, ensuring accessibility, and addressing biases. Adopting guiding principles such as transparency, co-creation, and adaptability can improve the ethical and effective design of AI systems. |
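The Character AI memory-optimization entry above centers on shrinking the KV cache, chiefly via multi-query attention and sharing KV states across neighboring layers. A back-of-envelope sketch of why those two knobs matter; the layer counts and dimensions below are assumed for illustration and are not Character AI's actual architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2, share_every=1):
    """Bytes of KV cache: 2 tensors (K and V) per stored layer.
    share_every=k keeps KV for 1 of every k layers (cross-layer sharing);
    n_kv_heads=1 models multi-query attention (MQA)."""
    stored_layers = -(-n_layers // share_every)  # ceiling division
    return 2 * stored_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

base = kv_cache_bytes(32, 32, 128, 4096)                  # vanilla multi-head attention
mqa = kv_cache_bytes(32, 1, 128, 4096)                    # multi-query attention
shared = kv_cache_bytes(32, 1, 128, 4096, share_every=3)  # MQA + cross-layer sharing
print(base // 2**20, mqa // 2**20, shared // 2**20)       # 2048 64 22  (MiB)
```

Under these assumed dimensions, MQA alone cuts the fp16 cache from 2 GiB to 64 MiB, and sharing KV across every three layers shrinks it further to about 22 MiB per 4k-token sequence.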
Link | description |
---|---|
US AI Safety Institute Could Face Big Cuts: Implications, Challenges, and Future Prospects. | This article examines the potential consequences of funding reductions for the US AI Safety Institute, including effects on national security, AI research, and global competition. |
Google's AI co-scientist is 'test-time scaling' on steroids. What that means for research. | An adaptation of the Gemini AI model is the latest use of really intense computing activity at inference time, instead of during training, to improve the so-called reasoning of the AI model. Here's how it works. |
When AI Thinks It Will Lose, It Sometimes Cheats, Study Finds. | A study by Palisade Research found that advanced AI models, such as OpenAI's o1-preview, can develop deceptive strategies, like hacking opponents in chess games. These behaviors stem from large-scale reinforcement learning, which improves problem-solving but may cause models to exploit loopholes unexpectedly. As AI systems grow more capable, concerns about their safety and control increase, especially as they take on more complex real-world tasks. |
Biggest-ever AI biology model writes DNA on demand. | An artificial-intelligence network trained on a vast trove of sequence data is a step towards designing completely new genomes. |
Will AI jeopardize science photography? There’s still time to create an ethical code of conduct. | Generative artificial-intelligence illustrations can be helpful, but fall short as scientific records. |
Combine AI with citizen science to fight poverty. | Artificial-intelligence tools and community science can help in places where data are scarce, so long as funding for data collection does not falter in the future. |
Quantum technologies need big investments to deliver on their big promises. | Sustained investments can deliver quantum devices that handle more information, more rapidly and more securely than can classical ones. |
Can AI help beat poverty? Researchers test ways to aid the poorest people. | Measuring poverty is the first step to delivering support, but it has long been a costly, time-intensive and contentious endeavour. |
DeepMind's HCI Research in the AGI Era. | This article explores the role of Human-Computer Interaction (HCI) research in guiding AI technologies toward AGI. It examines innovations in interaction techniques, interface designs, evaluation methods, and data collection strategies to ensure AI stays user-centered and beneficial to society. |
It's time to admit the 'AI gadget' era was a flop. | Humane has shut down, and its AI Pin will be bricked, marking the failure of recent AI gadget ventures. The Rabbit R1 and Humane Pin, once viewed as potential smartphone alternatives, failed to meet expectations. The era of AI gadgets has effectively ended, deemed impractical and unnecessary compared to integrating AI into existing devices. |
There’s Something Very Weird About This $30 Billion AI Startup by a Man Who Said Neural Networks May Already Be Conscious. | Ilya Sutskever's new venture, Safe Superintelligence, has raised $1 billion, bringing its valuation to $30 billion, despite lacking a product. The company plans to eventually release a superintelligent AI but remains unclear about its roadmap. This speculative approach has garnered substantial investment, though experts remain skeptical about the imminent arrival of AGI. |
Link | description |
---|---|
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. | This work introduces a latent recurrent-depth transformer, a model that enhances reasoning efficiency at test time without generating additional tokens. Instead of increasing the context window or relying on Chain-of-Thought (CoT) fine-tuning, it enables iterative latent space reasoning, achieving performance comparable to a 50B parameter model with only 3.5B parameters. By unrolling a recurrent computation block at inference, the model deepens reasoning without modifying input sequences, reducing memory and compute costs while improving efficiency. Unlike CoT methods, it requires no specialized training, generalizing across reasoning tasks using standard pretraining data. Benchmarks show it scales like much larger models on tasks like ARC, GSM8K, and OpenBookQA, with emergent latent-space behaviors such as numerical task orbits and context-aware deliberation. This approach introduces test-time compute as a new scaling axis, hinting at future AI systems that reason in continuous latent space, unlocking new frontiers in efficiency and cognitive capabilities. |
Brain-to-Text Decoding: A Non-invasive Approach via Typing. | Meta AI’s Brain2Qwerty model translates brain activity into text by decoding non-invasive EEG/MEG signals while users type, marking a breakthrough in brain-computer interfaces (BCIs) without surgical implants. Using a deep learning pipeline, it combines convolutional feature extraction, a transformer for temporal modeling, and a character-level language model to refine predictions. MEG-based decoding achieved a 32% character error rate (CER)—a significant improvement over 67% with EEG—with the top participant reaching 19% CER, demonstrating rapid progress over previous non-invasive methods. This research paves the way for practical communication aids for paralyzed patients, though challenges remain in achieving real-time decoding and making MEG technology more portable. |
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition. | Researchers introduce Reinforcement Learning via Self-Play (RLSP) as a framework to train LLMs to "think" by generating and rewarding their own reasoning steps, mimicking algorithmic search. The three-phase training process starts with supervised fine-tuning, followed by exploration rewards to encourage diverse solutions, and concludes with an outcome verifier to ensure correctness. RLSP significantly boosts performance, with an 8B model improving MATH accuracy by 23% and a 32B model gaining 10% on Olympiad problems. Trained models exhibit emergent reasoning behaviors, such as backtracking and self-verification, suggesting that scaling this approach can enhance LLM problem-solving abilities. |
Competitive Programming with Large Reasoning Models. | OpenAI’s latest study compares a specialized coding AI to a scaled-up general model on competitive programming tasks, highlighting the trade-off between efficiency and specialization. A tailored model (o1-ioi) with hand-crafted coding strategies performed decently (~50th percentile at IOI 2024), but a larger, general-purpose model (o3) achieved gold medal-level performance without domain-specific tricks. Both improved with reinforcement learning (RL) fine-tuning, yet the scaled model matched elite human coders on platforms like Codeforces, outperforming the expert-designed system. The findings suggest that scaling up a broadly trained transformer can surpass manual optimizations, reinforcing the trend of "scale over specialization" in AI model design for complex reasoning tasks like programming. |
Training Language Models to Reason Efficiently. | A new RL approach trains large reasoning models to allocate compute efficiently, adjusting Chain-of-Thought (CoT) length based on problem difficulty. Easy queries get short reasoning, while complex ones get deeper thought, optimizing speed vs. accuracy. The model, rewarded for solving tasks with minimal steps, learns to avoid “overthinking” while maintaining performance. This method cuts inference costs while ensuring high accuracy, making LLM deployment more efficient. Acting as both “thinker” and “controller,” the model self-optimizes reasoning, mimicking expert decision-making on when to stop analyzing. |
LM2: Large Memory Models. | Large Memory Models (LM2) enhance transformer architectures with an external memory module, enabling superior long-term reasoning and handling of extended contexts. By integrating a memory-augmented design, LM2 reads and writes information across multiple reasoning steps via cross-attention, excelling in multi-hop inference, numeric reasoning, and long-document QA. On the BABILong benchmark, it outperformed prior models by 37% and exceeded a baseline Llama model by 86%, all while maintaining strong general language abilities, including a +5% boost on MMLU knowledge tests. This approach aligns AI reasoning with complex tasks, ensuring better adherence to objectives in long dialogues and structured argumentation, marking a step toward more capable and aligned AI systems. |
Auditing Prompt Caching in Language Model APIs. | Stanford researchers reveal that timing differences in LLM APIs can leak private user data through global prompt caching, posing serious security risks. Side-channel timing attacks occur when cached prompts complete faster, allowing attackers to infer others’ inputs. To detect this, they propose a statistical audit using hypothesis testing, uncovering global caching in major API providers. Additionally, timing variations expose architectural details, revealing decoder-only Transformer backbones and vulnerabilities in embedding models like OpenAI’s text-embedding-3-small. After responsible disclosure, some providers updated policies or disabled caching, with the recommended fix being per-user caching and transparent disclosures to prevent data leaks. |
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models. | To enhance LLM reasoning robustness, researchers introduce self-backtracking, allowing models to revisit and revise flawed reasoning steps. Inspired by search algorithms, this method enables LLMs to identify errors mid-reasoning and backtrack to a previous step for a better approach. By training models with signals to trigger backtracking, they internalize an iterative search process instead of rigidly following a single Chain-of-Thought (CoT). This led to 40%+ improvements on reasoning benchmarks, as models self-correct mistakes mid-stream, producing more reliable solutions. The technique fosters autonomous, resilient reasoners, reducing overthinking loops and improving self-evaluation, moving LLMs closer to human-like reflective reasoning. |
Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications. | IBM researchers introduce SOLOMON, a neuro-inspired LLM reasoning architecture that enhances domain adaptability, demonstrated on semiconductor layout design. Standard LLMs struggle with spatial reasoning and domain application, but SOLOMON mitigates these issues using multi-agent oversight: multiple “Thought Generators” propose solutions, a “Thought Assessor” refines outputs, and a “Steering Subsystem” optimizes prompts. This design corrects hallucinations and arithmetic errors, outperforming GPT-4o, Claude-3.5, and Llama-3.1 in generating accurate GDSII layouts. SOLOMON excels at geometry-based tasks, reducing unit mismatches and scaling mistakes. Future work aims to stack SOLOMON layers, enhance text-image-code reasoning, and expand to broader engineering challenges, emphasizing advanced reasoning over mere model scaling. |
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates. | The ReasonFlux framework fine-tunes LLMs for complex reasoning using hierarchical thought processes and reusable templates. Instead of learning long Chain-of-Thought (CoT) solutions from scratch, it applies ~500 thought templates like problem splitting or solution verification. Hierarchical RL trains the model to sequence these templates, requiring only 8 GPUs for a 32B model. A novel inference-time adaptation adjusts reasoning depth dynamically, optimizing speed and accuracy. Achieving 91.2% on MATH (+6.7% over OpenAI’s model) and 56.7% on AIME, ReasonFlux shows that structured fine-tuning can rival brute-force scaling. |
LLM Pretraining with Continuous Concepts. | CoCoMix is a pretraining framework that improves next-token prediction by incorporating continuous concepts learned from a sparse autoencoder. It boosts sample efficiency, surpassing traditional methods in language modeling and reasoning tasks. Furthermore, it increases interpretability by enabling direct inspection and modification of predicted concepts. |
90% faster B200 training. | Together AI showcases significant progress in improving training kernels. Using TorchTitan as a testing platform, they achieve substantial speedups by tailoring kernels to the B200's Blackwell architecture. |
Large diffusion language model. | Large-scale training of a diffusion language model that matches LLaMA 3 8B in performance across many benchmarks. |
Measuring LLMs' Memory. | This study examines the shortcomings of current methods for evaluating the memory capacity of language models. It presents the "forgetting curve," a novel approach for measuring how effectively models retain information across long contexts. |
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. | DeepSeek has entered the attention-alternatives space with an innovative algorithmic approach to accelerating quadratic attention, achieving up to an 11x speedup without compromising overall performance. |
On Space Folds of ReLU Neural Networks. | Researchers offer a quantitative analysis of how ReLU neural networks compress input space, uncovering patterns of self-similarity. They introduce a new metric for studying these transformations and present empirical results on benchmarks such as CantorNet and MNIST. |
World and Human Action Models towards gameplay ideation. | A state-of-the-art generative AI model of a video game is introduced to support human creative ideation; analysis of user-study data highlights three necessary capabilities: consistency, diversity, and persistency. |
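The recurrent-depth idea from the first entry above (scaling test-time compute by unrolling the same block more times, rather than generating more tokens) can be sketched with a toy model. This is a minimal illustration, not the paper's architecture: the linear "prelude", "core", and "coda" weights and the `tanh` nonlinearity are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_in = rng.normal(size=(d, d))          # prelude: embed input into latent space
W_core = rng.normal(size=(d, d)) * 0.1  # recurrent core block, reused every step
W_out = rng.normal(size=(d, d))         # coda: decode the final latent state

def forward(x, depth):
    """Recurrent-depth forward pass: the same core block is unrolled
    `depth` times at inference, so compute scales with no new parameters
    and no extra tokens in the context."""
    e = np.tanh(W_in @ x)
    s = np.zeros(d)
    for _ in range(depth):
        s = np.tanh(W_core @ (s + e))   # core sees latent state plus embedded input
    return W_out @ s

x = rng.normal(size=d)
shallow = forward(x, depth=4)
deep = forward(x, depth=32)  # more test-time compute, identical weights
```

The point of the sketch is only that `depth` is an inference-time knob: the parameter count is fixed, yet deeper unrolling changes (and, in the trained model, improves) the computation performed per input.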
Link | description |
---|---|
Grok 3 is Set to Be Released on Monday. | xAI's Grok 3, trained with 200 million GPU-hours, features improved reasoning, self-correction, and training with synthetic data. It is scheduled for release on Monday. |
Anthropic and UK Government Sign AI Collaboration MOU. | Anthropic has teamed up with the UK government to investigate AI applications in public services, focusing on responsible deployment, economic growth, and scientific research through its Claude model. |
OpenAI tries to ‘uncensor’ ChatGPT. | OpenAI is changing how it trains AI models to explicitly embrace “intellectual freedom … no matter how challenging or controversial a topic may be,” the company says in a new policy. |
Bolt.new introduces AI app generation for iOS and Android. | StackBlitz, known for its AI tool Bolt.new, has launched an AI mobile app developer in collaboration with Expo. Users can describe their app idea in natural language, and Bolt's AI will instantly generate code for full-stack iOS and Android apps. |
Google and Ireland Celebrate Insight AI Scholarship. | Google hosts Irish officials to celebrate the Insight AI Scholarship, which supports students from underrepresented backgrounds in developing AI and digital skills. |
Anthropic Calls for Urgency in AI Governance. | At the Paris AI Action Summit, Anthropic highlighted the importance of democratic nations leading AI development, addressing security risks, and managing the economic disruptions brought about by advanced AI models. |
OpenAI’s Operator agent helped me move, but I had to help it, too. | OpenAI gave me one week to test its new AI agent, Operator, a system that can independently do tasks for you on the internet. |
S Korea removes Deepseek from app stores over privacy concerns. | South Korea has banned new downloads of China's DeepSeek artificial intelligence (AI) chatbot, according to the country's personal data protection watchdog. |
fal Raises $49M Series B to Power the Future of AI Video. | Fal has raised $49M in Series B funding, led by Notable Capital, with participation from a16z and others, bringing its total funding to $72M. The company is working on growing its platform for AI-powered generative media, particularly in video content, targeting sectors such as advertising and gaming. Fal’s unique technology ensures quick, scalable, and dependable deployments, which has already drawn enterprise customers like Quora and Canva. |
US' First Major AI Copyright Ruling. | A U.S. judge determined that Ross Intelligence violated Thomson Reuters' copyright by using Westlaw headnotes to train its AI. This ruling could impact other AI-related copyright cases but is primarily focused on non-generative AI applications. |
ChatGPT comes to 500,000 new users in OpenAI’s largest AI education deal yet. | On Tuesday, OpenAI announced plans to introduce ChatGPT to California State University's 460,000 students and 63,000 faculty members across 23 campuses, reports Reuters. The education-focused version of the AI assistant will aim to provide students with personalized tutoring and study guides, while faculty will be able to use it for administrative work. |
Tinder will try AI-powered matching as the dating app continues to lose users. | Tinder hopes to reverse its ongoing decline in active users by turning to AI. In the coming quarter, the Match-owned dating app will roll out new AI-powered features for discovery and matching. |
Google is adding digital watermarks to images edited with Magic Editor AI. | Google on Thursday announced that effective this week, it will begin adding a digital watermark to images in Photos that are edited with generative AI. The watermark applies specifically to images that are altered using the Reimagine feature found in Magic Editor on Pixel 9 devices. |
Meta plans to link US and India with world’s longest undersea cable project. | Project Waterworth, which involves cable longer than Earth’s circumference, to also reach South Africa and Brazil |
Amazon accused of targeting Coventry union members after failed recognition vote. | GMB says 60 workers have been targeted, with disciplinary action increasing significantly, but company denies claims |
Humane’s AI Pin is dead, as HP buys startup’s assets for $116M. | Humane announced on Tuesday that most of its assets have been acquired by HP for $116 million. The hardware startup is immediately discontinuing sales of its $499 AI Pins. Humane alerted customers who have already purchased the Pin that their devices will stop functioning before the end of the month — at 12 p.m. PST on February 28, 2025, according to a blog post. |
Mira announces Thinking Machines Lab. | The former CTO of OpenAI, along with many highly skilled scientists and engineers, has come together to create a new AI company. While the goals are not entirely clear, it appears to be a company centered on both product and foundation models, with an emphasis on infrastructure. |
Meta is Launching LlamaCon. | Meta is hosting LlamaCon, an open-source AI developer conference, on April 29. The event will highlight progress in the Llama AI model ecosystem, with Meta Connect scheduled for September to focus on XR and metaverse innovations. |
OpenAI considering 16 states for data center campuses as part of Trump’s Stargate project. | OpenAI is contemplating the construction of data center campuses in 16 states as part of President Trump's Stargate initiative, collaborating with Oracle, SoftBank, Microsoft, Nvidia, and Arm, with plans to invest up to $500 billion over four years. |
Academic researchers find a way to train an AI reasoning model for less than $50. | Researchers at Stanford and the University of Washington have trained an AI reasoning model for under $50 using distillation and modifications to an Alibaba AI model. |
OpenAI now reveals more of its o3-mini model’s thought process. | In response to pressure from rivals including Chinese AI company DeepSeek, OpenAI is changing the way its newest AI model, o3-mini, communicates its step-by-step “thought” process. |
DeepMind AI crushes tough maths problems on par with top human solvers. | The company’s AlphaGeometry 2 reaches the level of gold-medal students in the International Mathematical Olympiad. |
Microsoft unveils chip it says could bring quantum computing within years. | Chip is powered by world’s first topoconductor, which can create new state of matter that is not solid, liquid or gas |
EU accused of leaving ‘devastating’ copyright loophole in AI Act. | Architect of copyright law says EU is ‘supporting big tech instead of protecting European creative ideas’ |
Spotify Collaborates with ElevenLabs to Enhance AI-Narrated Audiobooks. | Spotify now enables audiobooks narrated using ElevenLabs' AI voice technology, allowing authors to upload AI-narrated works via Findaway Voices. The feature supports narration in 29 languages, though publishing requires approval through a review process. |
Together AI Announces $305M Series B. | Together AI has raised $305 million in a Series B funding round led by General Catalyst and Prosperity7, with contributions from investors like NVIDIA and Salesforce Ventures. This investment will strengthen Together AI's leadership in AI Cloud solutions, focusing on open-source AI models and NVIDIA Blackwell GPU deployment. The platform supports efficient AI application development, providing enterprise-grade inference, training, and model fine-tuning, with planned infrastructure expansions to accommodate rapid ecosystem growth. |
The Future of Legal Work with Eudia. | Eudia has raised $105 million in Series A funding to revolutionize legal operations with augmented intelligence, aiming to enhance legal teams' capabilities instead of replacing them. By collaborating with prominent Chief Legal Officers, Eudia seeks to streamline legal processes and transform legal departments from cost centers into strategic value drivers. This represents a major move towards AI-powered legal functions within Fortune 500 companies. |
Use Lens to search your screen while you browse on iOS. | Google Lens now allows iPhone users to perform visual searches directly from their screen within Chrome or the Google app, using gestures like highlighting or tapping. |
Crypto and big tech’s backing pays off as Trump makes tech-friendly moves. | A flurry of directives relaxes regulations and drops lawsuits; billionaires who donated to Trump are ready to benefit |
Don’t gift our work to AI billionaires: Mark Haddon, Michael Rosen and other creatives urge government. | More than 2,000 cultural figures challenge Whitehall’s eagerness ‘to wrap our lives’ work in attractive paper for automated competitors’ |
‘The bot asked me four times a day how I was feeling’: is tracking everything actually good for us? | Gathering data used to be a fringe pursuit of Silicon Valley nerds. Now we’re all at it, recording everything from menstrual cycles and mobility to toothbrushing and time spent in daylight. Is this just narcissism redesigned for the big tech age? |
Link | description |
---|---|
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction. | CodeI/O enhances reasoning in large language models by transforming code into an input-output prediction format. This improves performance on various reasoning tasks by teaching universal reasoning principles without depending on code syntax. Additional refinement through multi-turn revisions increases accuracy by validating predictions. |
A Multiple Instance Learning Framework. | A new multiple instance learning framework for whole slide image classification presents a dual-scale vision-language approach, utilizing a prototype-guided patch decoder and a context-guided text decoder to improve model performance on pathology tasks. |
Self-contained FSDP implementation. | A single 500-line implementation of fully sharded data parallel (FSDP) that achieves 48% model FLOPs utilization (MFU). |
FinRL-DeepSeek - new trading AI agents combining Reinforcement Learning with Large Language Models. | Researchers combine reinforcement learning and large language models to improve risk-sensitive trading strategies, enhancing CPPO with LLM-generated risk assessments and trading recommendations, tested on Nasdaq-100 financial data. |
AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection. | A new graph foundation model, AnomalyGFM, enhances zero- and few-shot anomaly detection by learning graph-agnostic representations, allowing for improved generalization across various datasets. |
DeepSeek tool prompts. | DeepSeek doesn't use system prompts, but they do use search and other prompts. |
Mistral Saba. | Mistral Saba is a 24B parameter model developed using carefully selected datasets from the Middle East and South Asia. It delivers more precise and pertinent responses compared to models that are more than five times its size, all while being much quicker and more cost-effective. |
A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis. | Researchers have introduced a benchmark to evaluate LLM open-ended text generation using n-gram statistics and rules, eliminating the need for human or LLM-based assessments. This method closely aligns with GPT-4o evaluations while being computationally efficient. |
Speeding Up LLM Inference with CopySpec. | CopySpec is a technique that speeds up LLM inference by detecting and duplicating repeated sequences in chat history without using additional GPU memory. It delivers up to a 3.08x performance boost on certain tasks and works well with speculative decoding to provide further improvements. |
Step Audio Chat. | This is the Multimodal Large Language Model (LLM) part of Step-Audio. It is a 130-billion-parameter multimodal LLM designed to comprehend and generate human speech. The model is built to smoothly combine functions like speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation. |
AdaVLN. | AdaSimulator provides a physics-enabled environment for studying Visual Language Navigation (VLN) in realistic settings. |
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? | SWE-Lancer is a comprehensive benchmark featuring over 1,400 freelance software engineering tasks from Upwork, with a total value of $1 million USD in real-world payouts. It includes both independent engineering tasks—ranging from $50 bug fixes to $32,000 feature implementations—and managerial tasks, where models select between different technical implementation proposals. The highest-performing models earned $400k. |
Selective Task Group Updates for Multi-Task Optimization. | A new multi-task learning approach reduces negative transfer by dynamically grouping tasks and updating them sequentially during training. This method, which leverages proximal inter-task affinity, greatly enhances performance compared to existing multi-task optimization methods. |
LLM-Guided Reinforcement Learning. | CAMEL enhances reinforcement learning efficiency by combining LLM-generated suboptimal policies with dynamic action masking. |
R1 1776. | Perplexity has post-trained R1 to remove Chinese censorship. They do so in a way that doesn't harm underlying reasoning. It is Perplexity's first open weights release. |
Google's Flood Hub Features. | Google is launching new tools for flood experts in Flood Hub, such as an inundation history map and a basin view, while partnering with aid organizations like GiveDirectly and the IRC to support communities impacted by floods. |
Grok 3 Overview. | This article provides a comprehensive overview of xAI's Grok 3. |
Reinforcement Learning Quickstart Guide. | An excellent X article by the PufferLib maintainer that explores the key differences between types of RL and provides a helpful guide for base hyperparameters. |
Artificial intelligence for modelling infectious disease epidemics. | This Perspective considers the application to infectious disease modelling of AI systems that combine machine learning, computational statistics, information retrieval and data science. |
A vision–language foundation model for precision oncology. | Trained on unlabelled, unpaired image and text data, the Multimodal transformer with Unified maSKed modeling (MUSK) excelled in outcome prediction, image-to-text retrieval and visual question answering, potentially improving cancer diagnosis and therapy precision. |
Qwen 2.5 VL Technical Report. | Report for the strongest open weights vision language model from the Qwen team. |
End to end driving RL model trained with Gaussian Splats. | An intriguing proof-of-concept paper demonstrates the use of photorealistic environments rendered with 3D Gaussian splats, achieving a 3x reduction in collision rates overall. |
Model-Guidance for Diffusion Models. | This paper presents Model-Guidance (MG), an objective that enhances diffusion model training by integrating posterior probabilities, leading to faster training, more efficient inference, and state-of-the-art performance on ImageNet 256 benchmarks. |
Crawl4LLM. | Most data from web crawls is discarded due to poor text extraction performance. This work significantly improves the number of documents retained, using them to train stronger models with just a fraction of the crawled data. |
Open Reasoner Zero. | Open replication of the zero reasoning paradigm. It is fully open-source with training data, scripts, and weights. |
LLM-Oasis. | LLM-Oasis is a large-scale dataset created to train and evaluate systems that assess the factual accuracy of outputs from LLMs. |
Flex 1 image generation model. | An intriguing side project from the community, this model is fine-tuned from Flux Schnell and is Apache licensed. It is designed to be fine-tuned at 8B parameters, running efficiently without compromising performance. |
Meta PyTorch Team 2025 H1 Roadmaps. | Development roadmap for the next year for the PyTorch compilation team. |
OllamaTalk. | A nice flutter app that provides a simple front end to Ollama. |
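The core trick behind CopySpec, listed above, is that chat histories contain near-verbatim repeats, so when the recent tokens match an earlier span, the tokens that followed that span can be speculatively proposed for cheap verification. A minimal sketch of the copy-proposal step (the suffix length and the model-side verification of proposed tokens are omitted simplifications; `propose_copy` is an illustrative name, not the library's API):

```python
def propose_copy(history, prefix_len=2, max_copy=8):
    """If the last `prefix_len` tokens already appeared earlier in the
    history, speculatively propose the tokens that followed them there.
    Returns [] when no earlier occurrence exists."""
    if len(history) < prefix_len:
        return []
    suffix = history[-prefix_len:]
    # Search earlier occurrences of the suffix, most recent first,
    # excluding the trailing occurrence that is the suffix itself.
    for start in range(len(history) - prefix_len - 1, -1, -1):
        if history[start:start + prefix_len] == suffix:
            return history[start + prefix_len:start + prefix_len + max_copy]
    return []

hist = "def add ( a , b ) : return a + b NEWLINE def add".split()
print(propose_copy(hist))  # → ['(', 'a', ',', 'b', ')', ':', 'return', 'a']
```

In a real speculative-decoding loop the proposed tokens would be scored by the model in a single forward pass and accepted only where they match, which is where the reported speedup comes from.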
Link | description |
---|---|
Red Hat's take on open-source AI: Pragmatism over utopian dreams. | Red Hat advocates for a practical approach to open-source AI, concentrating on real-world enterprise needs rather than pursuing AGI. The challenges involve balancing transparency with competitive concerns, particularly around the lack of clarity in open-source AI’s training data and model weights. Red Hat seeks to promote collaboration and prevent vendor lock-in, while recognizing the greater complexities of AI compared to traditional open-source software. |
The EU AI Act is Coming to America. | While federal leaders appear cautious about imposing strict AI regulations, several U.S. states are introducing laws based on Europe’s AI Act. This article discusses how "algorithmic discrimination" laws, influenced by EU regulations, could introduce detailed impact assessments, demand compliance documentation, and hold AI deployments liable—potentially leading to higher operational costs for teams developing AI systems. |
Biggest-ever AI biology model writes DNA on demand. | An artificial-intelligence network trained on a vast trove of sequence data is a step towards designing completely new genomes. |
A giant leap for machine translation could be even bigger. | The SEAMLESSM4T speech- and text-translation tool published in January represents a major advance for multilingual and multimodal machine translation. |
An AI Alchemist and His DeepSeek Journey. | Wenfeng Liang, a hedge fund manager, founded DeepSeek, a self-funded open-source AI platform that has quickly gained global recognition for its innovative LLMs like DeepSeek-R1, which rival OpenAI's models. By utilizing more affordable training methods and consumer-grade hardware, DeepSeek has attracted interest from both major tech companies and smaller institutions. Liang's emphasis on open-source AI development, supported by his success with Magic Square Quantitative, prioritizes collaboration and technological advancement over commercial motivations. |
Link | description |
---|---|
s1: Simple test-time scaling. | Researchers from Stanford, UW, and others introduced s1, a method to enhance LLM performance by using additional compute during inference ("test-time scaling"). Key ideas include: Small but effective dataset – They created s1K, a set of 1,000 challenging questions with detailed reasoning, to fine-tune a 32B model. Despite the small size, it provides valuable reasoning examples. "Budget forcing" for reasoning – A new decoding method adds the token "Wait" when the model attempts to stop, encouraging it to rethink and correct its reasoning. It also limits excessive reasoning to control inference time. Significant improvements over OpenAI’s o1 – The fine-tuned model (s1-32B), based on Qwen2.5-32B-Instruct, outperforms OpenAI's o1-preview by up to 27% on math competitions (MATH & AIME24). Test-time scaling increases accuracy on AIME24 from 50% to 57%, exceeding its normal performance. |
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models. | ByteDance AI Lab introduced OmniHuman-1, a diffusion-transformer model that creates realistic human videos from a single image and motion input (audio or video). Key points: End-to-end human video generation – OmniHuman uses an image and audio or video to generate lifelike videos of people speaking or performing actions, with impressive detail in motion, lighting, and texture. Mixed modality training – Omni-Conditions Training combines various motion modalities during training, expanding data and overcoming the lack of high-quality talking-head videos. The model handles diverse inputs like speech, song, and complex poses. Outperforms prior methods – OmniHuman produces more realistic videos and works with a variety of inputs, including cartoons or animals, transferring motion naturally. Broader support – The model supports any portrait content (face, half-body, full-body) and multiple driving signals, offering more versatility than previous models. |
LIMO: Less is More for Reasoning. | The LIMO paper challenges the need for large fine-tuning datasets in complex reasoning tasks, showing that a small set of carefully curated examples can be highly effective. With just 817 training samples, the LIMO model achieved impressive results, scoring 57.1% on the AIME math competition and 94.8% on MATH, far surpassing earlier models that required much more data. The model also demonstrated significant out-of-distribution generalization, outperforming models trained on 100 times more data by 40.5% on various benchmarks. The authors propose that when an LLM has strong pre-existing knowledge, only a minimal set of high-quality examples is necessary to unlock advanced reasoning skills. This suggests that small, well-designed datasets could enable state-of-the-art reasoning, lowering the barriers for fine-tuning LLMs. |
CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning. | CoAT introduces a “slow thinking” inference framework that enhances LLM reasoning by allowing it to explore and update its thoughts more like a human. The system combines Monte Carlo Tree Search (MCTS) with associative memory, enabling the model to explore different reasoning branches and dynamically add relevant information as needed. This iterative approach allows the model to refine and revisit intermediate conclusions, improving accuracy and comprehensiveness compared to one-pass reasoning. In experiments, CoAT outperformed traditional methods on accuracy, coherence, and solution diversity. By mimicking human-like problem-solving, CoAT points toward LLMs that use search and memory for more reliable reasoning. |
Syntriever: How to Train Your Retriever with Synthetic Data from LLMs. | Syntriever introduces a two-stage framework to build a high-quality text retriever without relying on large labeled datasets or access to an LLM’s internals. In Stage 1, the system distills knowledge by generating synthetic Q&A data. A powerful LLM (e.g., GPT-4) is prompted to create relevant and incorrect passages, with chain-of-thought ensuring variety. The LLM then filters out any low-quality data, resulting in a synthetic dataset that is used to train the retriever. In Stage 2, the retriever is further aligned with the LLM’s preferences using a partial Plackett-Luce ranking method to adjust its ranking decisions. Syntriever achieves state-of-the-art results on several retrieval benchmarks without needing any real training queries; all training data is generated synthetically by the LLM. It also eliminates the need for logits, making it applicable even to closed models. |
Demystifying Long Chain-of-Thought Reasoning in LLMs. | This study examines how LLMs develop extended chain-of-thought (CoT) reasoning, focusing on reinforcement learning (RL) and compute scaling. It finds that supervised fine-tuning (SFT) improves accuracy by using long CoT sequences, and introduces a cosine length-scaling reward with repetition penalties to stabilize RL and prevent unnecessary reasoning lengthening. RL models trained with noisy, web-based supervision signals generalize better to out-of-distribution tasks, though filtering is essential for stability. Additionally, while skills like error correction exist in base models, effective RL incentives are needed to harness them for complex tasks. This paper provides a roadmap for enhancing CoT training with RL and reward tuning. |
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | This paper explores whether mixing different LLMs in an ensemble (Mixture-of-Agents, MoA) improves performance or if using a single top model’s outputs is more effective. The surprising answer is that "Self-MoA," which generates multiple outputs from one strong model and aggregates them, often outperforms multi-model ensembles. Extensive tests show that Self-MoA yields better results, with a +6.6% score improvement on the AlpacaEval 2.0 benchmark compared to MoA, and +3.8% on tasks like MMLU, CRUX, and MATH. The study finds that adding weaker models in an MoA can dilute performance, and unless all models are strong and complementary, it’s better to rely on one top model’s outputs. They also propose a sequential version of Self-MoA that efficiently combines multiple outputs over rounds. |
Multi-agent Architecture Search via Agentic Supernet. | MaAS (Multi-agent Architecture Search) automates the design of multi-agent systems for LLMs, where agents collaborate with specific roles or tools for each task. Instead of hand-designing a complex pipeline, MaAS learns a flexible “agentic supernet” that can generate an optimal agent team for each query. It defines a continuous space of possible agent configurations and dynamically selects the best one based on the query's domain and difficulty, allowing for efficient resource allocation. MaAS outperforms traditional multi-agent systems in accuracy by 0.5–11.8%, while using only 6–45% of the inference cost. Its approach also shows strong generalization, transferring well to new tasks and LLM backbones. |
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models. | This paper presents a new method for tracking the evolution of features discovered by sparse autoencoders across layers of large language models. Using a data-free cosine similarity technique, it maps feature persistence, transformation, and emergence. The paper shows how cross-layer feature maps allow for direct control of model behavior through feature manipulation, offering deeper mechanistic insights into model computations via detailed flow graphs. |
Building Bridges between Regression, Clustering, and Classification. | This paper presents a new approach to enhancing neural network training for regression tasks by framing them as classification problems with a learned target encoder-decoder pair. The method surpasses traditional regression techniques by utilizing distributional target representation and enabling smooth interpolation across different objectives. |
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. | A 3.5B model trained for latent reasoning shows significant improvements on math problems. Instead of emitting longer chains of thought, it scales test-time compute by iterating a recurrent block in latent space, so additional "thinking" happens in hidden states rather than in output tokens. |
GNNs Getting ComFy: Community and Feature Similarity Guided Rewiring. | While maximizing spectral gap has been used to tackle over-squashing in GNNs, this paper demonstrates that minimizing it can also enhance generalization. The authors introduce three new rewiring strategies to improve label-community alignment, homophily, and computational efficiency. |
Agency Is Frame-Dependent. | Determining if a system is "agentic" is challenging. This position paper from DeepMind suggests that measuring agency requires a set reference frame and cannot be considered a global property of a system. |
Improved LLM Pretraining. | This paper introduces dynamic instance-level data reweighting techniques for LLM pretraining. By adjusting the importance of samples based on loss values during training, these methods enhance both efficiency and effectiveness. |
Competitive Programming with Large Reasoning Models. | An OpenAI paper discusses the use of its o-series reasoning models in competitive programming. Earlier models relied on hand-crafted inference strategies, but o3 performed well without the need for human intervention. |
Scaling Pre-training to One Hundred Billion Data for Vision Language Models. | DeepMind scaled vision-language model training to a much larger data scale than previously attempted. While performance on Western-centric tasks quickly reaches saturation, it continues to improve on diverse global tasks. Interestingly, this progress is seldom reflected in existing benchmarks. |
LLMs with Fewer Parameters. | Researchers show that LLMs can retain their learning ability while cutting non-embedding parameters by as much as 77%. By substituting dense layers with optimized subnetworks, they achieve similar performance using far fewer resources. |
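The Self-MoA entry above (sampling several outputs from one strong model, then aggregating them) can be sketched in a few lines. This is a minimal, hypothetical illustration: `sample_candidates` is a stand-in for real temperature-sampled LLM completions, and majority voting is used as the simplest possible aggregator, whereas the paper aggregates with an LLM.

```python
import random

def sample_candidates(prompt, k, seed=0):
    # Stand-in for k temperature-sampled completions from ONE strong model;
    # a real implementation would call the model k times at temperature > 0.
    rng = random.Random(seed)
    return [f"answer-{rng.randint(0, 3)}" for _ in range(k)]

def majority_vote(candidates):
    # Simplest aggregator (self-consistency style): return the most
    # frequent candidate. Self-MoA instead feeds all candidates to an
    # LLM aggregator that synthesizes a final answer.
    counts = {}
    for c in candidates:
        counts[c] = counts.get(c, 0) + 1
    return max(counts, key=counts.get)

candidates = sample_candidates("What is 2 + 2?", k=8)
print(majority_vote(candidates))
```

The key design point, per the paper, is that all k candidates come from the same strong model rather than from a mix that includes weaker ones.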
Link | description |
---|---|
Gambling firms secretly sharing users’ data with Facebook without permission. | Meta accounts of those affected flooded with ads for casinos and betting sites |
From Dogecoin to $Trump: everything you need to know about the wild world of meme coins. | Are they the same as crypto, why has the US president launched one, and who’s really coining it in? Here’s a complete guide to the latest digital money mania |
Google Maps changed the way we get around. It all began in a spare bedroom in Sydney. | This weekend the mapping platform turns 20 – and Stephen Ma is writing himself and his friends back into its origin story |
‘Mass theft’: Thousands of artists call for AI art auction to be cancelled. | Letter says many of works being sold by Christie’s are made by AI models trained on pieces by human artists, without a licence |
Mistral le Chat. | Mistral introduces a new chat assistant capable of processing 1,000 words per second. Powered by Mistral's advanced coding models, it features a user-friendly interface to help with a variety of tasks. |
Pika Video Editing. | Pika Labs has launched Pikadditions, an AI tool that effortlessly adds objects and characters to videos, maintaining a high level of realism. |
Germany Trade & Invest: OpenAI Expands to Germany. | OpenAI announces plans to establish a new office in Munich in the coming months. |
Elon Musk-led group makes surprise bid of nearly $100bn for OpenAI. | Sam Altman, OpenAI’s CEO and co-founder, responded that he would not accept and offered to buy X instead |
Macron touts Europe and trolls Trump at Paris AI summit. | ‘Choose Europe and France for AI,’ says president amid speculation US and UK playing hardball over declaration |
AI chatbots distort and mislead when asked about current affairs, BBC finds. | Most answers had ‘significant issues’ when researchers asked services to use broadcaster’s news articles as source |
Tech firms call for zonal electricity pricing in UK to fuel AI datacentres. | Report urges ministers to overhaul market to increase rollout in areas that have ample power generation |
OpenAI Introduces the Intelligence Age. | OpenAI emphasizes the swift adoption of ChatGPT and the broader impact of AI, drawing comparisons to past technological breakthroughs. Its Super Bowl ad portrays AI as the next significant tool for human progress, highlighting its role in enhancing creativity rather than replacing human effort. |
Macron announces $112B in AI investment over coming years. | French President Macron unveiled a significant AI investment to foster innovation and international collaboration. He stressed that AI should serve as an assistant, not a job replacer, with a particular emphasis on its applications in healthcare. |
Ilya Sutskever’s startup in talks to fundraise at roughly $20B valuation. | Safe Superintelligence, the AI startup founded by former OpenAI chief scientist Ilya Sutskever, is in talks to raise funding at a valuation of “at least” $20 billion, according to Reuters. |
If You Think Anyone in the AI Industry Has Any Idea What They're Doing, It Appears That DeepSeek Just Accidentally Leaked Its Users' Chats. | Security researchers found a publicly exposed DeepSeek database containing user chat histories and other sensitive data, which the company secured after being notified. The incident raises questions about the security practices of fast-moving AI startups. |
Team Says They've Recreated DeepSeek's OpenAI Killer for Literally $30. | Jiayi Pan and his team at UC Berkeley recreated DeepSeek's R1-Zero model for just $30, demonstrating a move towards more affordable AI research. Their open-source model, "TinyZero," challenges the costly AI infrastructures of big tech, prompting questions about the need for large-scale investments in AI. |
OpenAI says its models are more persuasive than 82 percent of Reddit users. | OpenAI tested ChatGPT's persuasiveness on Reddit's ChangeMyView forum and found that its AI models were more persuasive than humans in 82% of comparisons, although they still fall short of "superhuman" levels. The company cautions that if AI reaches near-zero-cost human-level persuasive writing, it could contribute to biased journalism and scams. OpenAI is taking steps to monitor and mitigate AI-based persuasion, especially in areas like political influence and large-scale manipulation. |
AI Company Asks Job Applicants Not to Use AI in Job Applications. | Anthropic, the developer of the conversational AI assistant Claude, doesn’t want prospective new hires using AI assistants in their applications, regardless of whether they’re in marketing or engineering. |
Meta says it may stop development of AI systems it deems too risky. | In a new policy document, Meta suggests that there are certain scenarios in which it may not release a highly capable AI system it developed internally. |
How Harrison Ford brought a strike over video game AI to the world’s attention. | Voice actors demanding compensation when AI generates performances from their work have taken industrial action since July |
Elon Musk's $97.4 Billion Offer for OpenAI Rejected. | Sam Altman has rejected Elon Musk's $97.4 billion bid for OpenAI, stating, "We are not for sale." This latest development intensifies the ongoing power struggle between the two AI leaders. |
Apple teams up with Alibaba to bring AI features for iPhones in China, The Information reports. | Apple is partnering with Alibaba to introduce AI-powered features to iPhones in China, aiming to tackle market challenges and compete with Huawei's AI-integrated smartphones. |
Snap unveils AI text-to-image model for mobile devices. | Snap has unveiled an AI text-to-image research model for mobile devices that will power some of Snapchat’s features in the coming months. The company said on Tuesday that the model can produce high-resolution images in around 1.4 seconds on an iPhone 16 Pro Max. |
OLMoE, meet iOS. | Allen AI has released a tiny yet extremely powerful open-source model on an app, running locally. It outperforms the original ChatGPT model from a few years ago. |
Democratize Intelligence: A Network of Top Builders creating Open AI. | Over 300 AI builders, including Chris Lattner (Creator of Swift/LLVM) and Amjad Masad (Founder of Replit), gathered in San Francisco for the Democratize Intelligence Summit (DEMI). DEMI aims to counter the centralization of closed AI by bringing together innovative builders working on alternative hardware chips, software frameworks, and economic models that make AI more accessible to build and use. |
Google I/O 2025 Announced. | Google I/O will be held on May 20-21 at Shoreline Amphitheatre and online, featuring keynotes, product reveals, AI innovations, and hands-on sessions. |
Luma unveils Image to Video model. | Luma AI has launched a new model that elevates image-to-video generation, offering unmatched natural motion, realistic physics, and coherence. |
Ai2 says its new AI model beats one of DeepSeek’s best. | On Thursday, Ai2, a nonprofit AI research institute based in Seattle, released a model that it claims outperforms DeepSeek V3, one of Chinese AI company DeepSeek’s leading systems. |
Data analytics startup Athenic AI wants to be an enterprise’s central nervous system. | Jared Zhao became interested in data analytics at UC Berkeley, drawn to how it could turn raw data into a story, and founded his first data analytics startup, Polyture, in 2021. Advances in generative AI just a year later convinced him that Polyture was too complicated for what users would want in a post-ChatGPT world, so he changed course. |
Harvey: Raises $300M Series D Led by Sequoia. | The legal AI startup Harvey raised a substantial sum in a funding round led by Sequoia to continue building tools for law firms. |
Apple Interested In Making Humanoid Robot, Top Analyst Says. | In a post on social network X on Wednesday, TF International Securities analyst Ming-Chi Kuo said Apple is exploring both humanoid and nonhumanoid robots as part of its future smart-home ecosystem. |
OpenAI's Roadmap for GPT-5. | OpenAI plans to release GPT-4.5 as its last non-chain-of-thought model before launching GPT-5, which will unify its o-series and GPT-series models. |
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL. | Much research making dramatic claims is easily disproven, but this work stands out for its solid methodology. One interesting finding is that gradually increasing the context length during RL training improved AIME performance smoothly without wasting compute. |
Google Expands AI-Powered Demand Gen Features. | Google has rolled out new updates to Demand Gen campaigns, which feature enhanced customization capabilities, AI-powered creative enhancements, and more detailed product information. These improvements aim to boost ad performance, particularly on visual-centric platforms like YouTube. |
AI ‘godfather’ predicts another revolution in the tech in next five years. | Meta’s Yann LeCun says current systems too limited to create domestic robots and fully automated cars |
DeepSeek: The countries and agencies that have banned the AI company’s tech. | Corporations have banned DeepSeek, too — by the hundreds. The biggest worry reportedly is potential data leakage to the Chinese government. According to DeepSeek’s privacy policy, the company stores all user data in China, where local laws mandate organizations to share data with intelligence officials upon request. |
Google removes pledge to not use AI for weapons from website. | Google removed a pledge to not build AI for weapons or surveillance from its website this week. The change was first spotted by Bloomberg. |
Figure drops OpenAI in favor of in-house models. | Figure AI has ended its collaboration with OpenAI to focus on developing in-house AI for humanoid robots. |
Global disunity, energy concerns and the shadow of Musk: key takeaways from the Paris AI summit. | AI Action Summit ends with US vice-president criticising European regulation and warning against cooperation with China |
Elon Musk says he’ll drop his $97bn bid for OpenAI if it remains a non-profit. | Billionaire’s lawyers say offer will be withdrawn if firm he helped found a decade ago ‘preserves the charity’s mission’ |
Veo 2 is Coming to YouTube Shorts. | YouTube has incorporated DeepMind's Veo 2 into Dream Screen, allowing users to create high-quality AI-generated video clips from text prompts. |
Adobe Firefly Video Model. | Adobe has launched the Firefly Video Model, providing IP-compliant, commercially secure generative AI tools for creating video and audio content. |
Google AI and Women's Cancer Research. | Google has teamed up with the Institute of Women's Cancers to advance AI-powered cancer research, with an emphasis on improving outcomes for breast and gynecological cancers. |
Billionaires talk automating jobs away. | OpenAI is working with SoftBank to automate white-collar workflows with AI, raising concerns about widespread job losses. OpenAI also released a new AI agent for research and the o3-mini reasoning model, while the EU imposed a ban on high-risk AI systems. In addition, ByteDance's YuE model creates music from prompts, and Anthropic unveiled Constitutional Classifiers to improve AI safety protocols. |
EU puts out guidance on uses of AI that are banned under its AI Act. | The EU's AI Act compliance deadline has passed, leading the Commission to issue guidance for developers on how to comply with the new AI regulations. Violations of banned use cases, like social scoring, may result in heavy penalties. While the guidelines are intended to offer clarity, they are not legally enforceable, and formal adoption is still awaiting translation. |
OpenAI Updates Model Spec. | OpenAI has published an updated Model Spec document under a CC0 license, emphasizing transparency, customizability, and intellectual freedom, while ensuring safety measures are in place. |
Deep Research comes to Gemini Android app. | Deep Research, Gemini's first agentic feature, is now available for Android, enabling multi-step web research with customizable plans. |
DeepSeek Gets an ‘F’ in Safety From Researchers. | Cisco tested DeepSeek's open-source model, DeepSeek R1, which failed to block any of the 50 harmful-behavior prompts it was given from the HarmBench dataset, a 100% attack success rate. |
Arm looks to launch its own chip after landing Meta contract. | Plan represents move away from SoftBank-owned group licensing its chip blueprints to firms such as Apple and Nvidia |
Link | description |
---|---|
Advancing Reasoning in Large Language Models: Promising Methods and Approaches. | This survey paper reviews emerging methods to enhance reasoning in LLMs, organizing them into categories such as prompting strategies, architectural innovations, learning paradigms, and evaluation challenges. Prompting strategies, like Chain-of-Thought and Self-Consistency, guide the model’s reasoning without changing its architecture, improving logical deduction and multi-step solutions. Architectural innovations, such as retrieval-augmented models and neuro-symbolic integration, provide LLMs with additional knowledge or structured reasoning processes. Learning paradigms, including fine-tuning on reasoning-specific datasets and reinforcement learning, improve the model's inherent reasoning skills. The paper also highlights evaluation challenges like hallucinations, robustness, and generalization, which need to be addressed for the next generation of reasoning-augmented LLMs. |
Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities. | This survey explores text data augmentation techniques for LLMs, which are crucial due to their need for large training datasets. It categorizes augmentation methods into four types: (1) simple augmentation, involving basic text manipulations; (2) prompt-based augmentation, where LLMs generate new examples through specific prompts; (3) retrieval-based augmentation, which incorporates external knowledge to ground generated text; and (4) hybrid augmentation, combining multiple strategies. A key insight is that modern LLMs can generate high-quality synthetic data to enhance training, with careful prompt design expanding datasets effectively. The survey also covers post-processing techniques to refine augmented data, ensuring quality and accuracy. It concludes with discussions on common tasks for augmentation, evaluation methods, challenges such as maintaining data distribution integrity, and opportunities for future research. |
Deep Dive into LLMs. | Andrej Karpathy has released another highly educational video that explores various aspects of developing language models, including pre-training, hallucination mitigation, and post-training. |
A Dataset for Open 3D Understanding. | A new object-centric dataset for 3D deep learning and 3D generative AI. |
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search. | QLASS presents a Q-guided stepwise search method for language agents that boosts decision-making by offering intermediate rewards. This approach improves inference efficiency and minimizes the need for annotated data. |
Tackling Noisy Clients in Federated Learning with End-to-end Label Correction. | FedELC is a two-stage framework aimed at improving federated learning by tackling the challenge of label noise in client datasets. |
audiobox-aesthetics. | This repository includes models that evaluate audio files based on various metrics, making it useful for retrieval or as a signal for reinforcement learning rewards. |
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-Agent Tasks. | Facebook has created a toolkit for training systems that facilitate collaboration between humans and robots. |
Great Models Think Alike and this Undermines AI Oversight. | CAPA is a metric used to evaluate model similarity by analyzing shared errors. |
DynVFX: Augmenting Real Videos with Dynamic Content. | DynVFX excels at dynamic content insertion into videos, achieving impressive results with elements like water and smoke. However, it still has room for improvement when it comes to inserting character-based content. |
Synthetic People Dataset. | The FinePersonas dataset contains 21 million synthetic personas extracted from FineWeb-Edu. |
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations. | QuEST enables stable training with 1-bit weights and activations by enhancing two crucial aspects of Quantization-Aware Training (QAT). It achieves accurate and fast quantization of the continuous distributions of weights and activations using Hadamard normalization and MSE-optimal fitting. Additionally, it introduces a new trust gradient estimator that minimizes the error between the noisy gradient calculated over quantized states and the "true" (yet unknown) full-precision gradient. |
Diagen. | Agentic workflow for generating diagrams with language models. |
CursorCore: Assist Programming through Aligning Anything. | This work presents a new conversational framework for programming assistance that combines coding history, current code, and user instructions. |
Open R1: Update #2. | Hugging Face is openly replicating R1 and has successfully performed distillation on R1 to generate 800k reasoning traces. |
Temporally-Correlated Noise Prior for Diffusion Models. | Diffusion model noise sampling is effective for single images but faces challenges when scaled to videos that require temporal consistency, as the noise is independent across pixels and frames, leading to warping artifacts. This work introduces an integral noise prior to address many of these issues. |
Open-Vocabulary Multi-Object Tracking Benchmark. | OVT-B is a large-scale benchmark for open-vocabulary multi-object tracking (OVT), featuring over 1,000 object categories, nearly 2,000 videos, and more than 600,000 bounding box annotations. |
ESM2quinox. | ESM2 is a widely used protein language model. This is a well-maintained implementation in Equinox, a popular JAX package. |
Data Formulator. | A useful tool from Microsoft that leverages language models to generate charts and respond to dataset-related queries. It doesn't support local models, but functions effectively with an API key. |
Page Assist. | One of many new tools for web browsing with local language models. |
Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE. | Jakiro improves speculative decoding by utilizing Mixture of Experts to produce a variety of predictions, which reduces the correlation between candidates and boosts inference speed. |
Detecting Backdoor Samples in Contrastive Language Image Pretraining. | Researchers have found that CLIP models are extremely susceptible to poisoning backdoor attacks, with attack success rates nearing 100% using very little poisoned data. They suggest an effective detection approach using local outlier detection to identify unintentional backdoors in current datasets. |
MobileLLM. | Meta has released checkpoints for its small but mighty, mobile-friendly language models. |
Better Reasoning with Mask-Enhanced Autoregressive Prediction. | MEAP incorporates Masked Language Modeling into Next-Token Prediction through a decoder-only Transformer. By masking a small portion of input tokens, it enhances information retrieval tasks while preserving reasoning abilities. |
ComplexFuncBench. | This repository contains benchmarks and tools for assessing AI models on complex function calls. |
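As a toy illustration of the quantization machinery in entries like QuEST above: for plain sign binarization of a weight vector, the scale that minimizes mean-squared error is the mean absolute weight, a classic result from early binarized-network work. The sketch below shows only that scale-fitting idea; QuEST itself additionally uses Hadamard normalization and a trust gradient estimator, which are not reproduced here.

```python
def quantize_1bit(weights):
    # Binarize weights to {-s, +s}. For sign quantization, the
    # MSE-optimal scale s is the mean absolute value of the weights.
    s = sum(abs(w) for w in weights) / len(weights)
    return [s if w >= 0.0 else -s for w in weights]

w = [0.5, -0.25, 1.0, -0.25]
print(quantize_1bit(w))  # s = 0.5 -> [0.5, -0.5, 0.5, -0.5]
```

In quantization-aware training, the backward pass would typically use a straight-through-style estimator so gradients can flow through the non-differentiable sign function.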
Link | description |
---|---|
Google's AI Policy Framework for Science. | Google has introduced a policy framework with practical steps for policymakers to speed up scientific discovery using AI, focusing on responsible deployment and fostering collaboration within the research community. |
Sam Altman Regrets Ditching Open Source, Says He's Been on the "Wrong Side of History". | Chinese AI startup DeepSeek showcased its ability to replicate OpenAI's chatbots at a much lower cost, sparking renewed discussions on the open-source approach in the AI industry. |
Elon Musk owning OpenAI would be a terrible idea. That doesn’t mean it won’t happen. | My heart says he is settling scores and making mischief. My head fears Trump might like an ally controlling the key AI company |
AIs and Robots Should Sound Robotic. | AI-generated voices can now closely mimic human speech, creating concerns about distinguishing them from real conversations. A suggested solution is to use a ring modulator in AI voices, giving them a distinct, recognizable robotic sound. Implementing this across voice synthesis technologies would help users identify when they are interacting with AI. |
Why is mathematics education failing some of the world’s most talented children? | A study shines a light on the remarkable arithmetic skills that young people acquire outside formal schooling. Education must evolve to enable them to fulfil their potential. |
How China created AI model DeepSeek and shocked the world. | Government policies, generous funding and a pipeline of AI graduates have helped Chinese firms create advanced LLMs. |
AI Keeps Its Own Time. | SiTime's MEMS-based timekeeping device boosts AI efficiency by enhancing synchronization among various components, leading to notable energy savings. The Super-TCXO clock offers better synchronization than quartz components, helping improve bandwidth speed and lower idle times for GPUs. SiTime's technology is already incorporated into Nvidia's Spectrum-X Switch, with future developments focused on further energy efficiency and bandwidth improvements. |
Why AI Is A Philosophical Rupture. | Tobias Rees contends that generative AI challenges traditional distinctions between humans and machines, introducing new types of intelligence that demand a philosophical reassessment. The ability of AI to function outside conventional comprehension frameworks indicates its potential to reshape human cognition and self-awareness. Rees calls for philosophical involvement in AI development to better understand and navigate this emerging era of human-AI collaboration. |
Link | description |
---|---|
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs. | This work examines the "thinking" patterns of o1-like LLMs in greater detail. Recent papers have highlighted issues related to overthinking, but now a new phenomenon, called underthinking, has been identified. What is it? The authors observe that o1-like LLMs often shift between different reasoning paths without fully exploring the most promising ones, which can hinder reaching the correct solution. |
Diverse Preference Optimization. | Diverse Preference Optimization (DivPO) is a new training method that enhances the diversity of language model outputs without sacrificing quality. Unlike traditional approaches like RLHF, which often result in similar responses, DivPO selects diverse training pairs by comparing a highly diverse response with a less diverse one. It measures diversity using various criteria, such as model probability or word frequency. In tests on persona generation and creative writing, DivPO significantly increased output diversity while maintaining similar quality to existing methods. |
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies. | This paper examines why DeepSeek-R1 remains prone to generating harmful content despite reinforcement-learning-based alignment. It argues that pure RL strategies fall short, citing issues such as reward hacking and poor generalization of safety behavior, and suggests combining RL with supervised fine-tuning and stronger filtering to improve safety. |
Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning. | This work approaches RAG as a multi-agent cooperative task to enhance answer generation quality. It treats components like query rewriting, document selection, and answer generation as reinforcement learning agents collaborating to produce accurate answers. Multi-Agent Proximal Policy Optimization (MAPPO) is used to optimize all agents together, with a shared reward based on answer quality. In addition to improvements on well-known benchmarks, the framework demonstrates strong generalization in out-of-domain scenarios and remains effective across various RAG system configurations. |
TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs. | This framework introduces a method for compressing MHA through a multi-head tensorization process and Tucker decomposition. It achieves a compression rate of up to approximately 250x in MHA weights, without the need for additional data, training, or fine-tuning. |
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space. | TokenVerse, introduced by Google DeepMind and collaborators, presents a new technique for generating images from learned concepts in a specific configuration. It enables multi-concept personalization by utilizing a pre-trained text-to-image diffusion model to separate and extract complex visual concepts from multiple images. Operating within the modulation space of DiTs, TokenVerse learns a personalized modulation vector for each text token in an input caption. This method provides flexible and localized control over distinct concepts like objects, materials, lighting, and poses. The learned token modulations can be combined in innovative ways to create new images that integrate multiple personalized concepts, all without the need for additional segmentation masks. |
AI to revolutionise fundamental physics and ‘could show how universe will end’. | Cern’s next director general Mark Thomson says AI is paving the way for huge advances in particle physics |
Was this the week DeepSeek started the slow unwinding of the AI bet? | The cheap Chinese chatbot has stunned tech giants – and opened up the possibility that other countries, not just China, could now afford to enter the AI race |
A Controlled Study on Long Context Extension and Generalization in LLMs. | This study examines how language models manage long-document contexts by evaluating different extension methods through a controlled analysis. It emphasizes that perplexity continues to be a crucial performance metric, while approximate attention techniques face challenges with longer contexts. |
Constitutional Classifiers: Defending against universal jailbreaks. | A new paper from the Anthropic Safeguards Research Team outlines a method that protects AI models from universal jailbreaks. A prototype of this method proved resilient against thousands of hours of human red teaming for universal jailbreaks, though it had high over-refusal rates and significant compute overhead. An updated version maintained similar robustness in synthetic evaluations, with only a 0.38% increase in refusal rates and moderate additional compute costs. |
s1: Simple test-time scaling. | A comprehensive and detailed paper investigates methods to encourage models to use more thinking tokens. One key finding is that by using a high-quality curated dataset of 1k examples and appending "wait" at the end of a thinking sequence, models can be encouraged to think for longer periods, resulting in significantly improved performance on math and reasoning tasks. |
Decoding-based Regression. | DeepMind researchers examined how language models can handle regression tasks by interpreting numeric predictions as text, and found them to be as effective as traditional regression models, while also offering the added benefit of flexible density estimation. |
China unveils US tariffs and Google investigation in response to Trump levies. | Tariffs on coal, LNG, crude oil and other goods announced after US imposes levy on imports |
Harmonic Loss Trains Interpretable AI Models. | Harmonic loss is an alternative to cross-entropy loss for training neural networks, providing better interpretability and faster convergence through scale invariance and finite convergence points. Experiments across algorithmic, vision, and language datasets show that models trained with harmonic loss outperform standard models in terms of interpretability, data efficiency, and reduced grokking. Harmonic loss could be especially useful for applications with limited data or where interpretability is essential. |
Vintix: Action Model via In-Context Reinforcement Learning. | This study investigates scaling In-Context Reinforcement Learning (ICRL) to wider domains through Algorithm Distillation, demonstrating that ICRL can serve as a viable alternative to expert distillation for generalist decision-making systems. |
Efficient Reasoning with Hidden Thinking. | Heima presents a framework for more efficient multimodal reasoning by compressing Chain-of-Thought processes into a single hidden token. |
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models. | Dictionary learning improves model interpretability and can uncover unknown concepts from scientific data, such as cell images. It effectively extracts concepts like cell type and genetic perturbation. A new algorithm, ICFL, combined with PCA whitening, boosts feature selectivity compared to existing methods. |
Parents sue TikTok over child deaths allegedly caused by ‘blackout challenge’. | Parents claim four children died as a result of attempting challenge that went viral in 2021 |
Call to make tech firms report data centre energy use as AI booms. | Experts say mandatory reporting on energy and water use is needed to avoid irreparable damage to environment |
Google edits Super Bowl ad for AI that featured false information. | Tech company removes error about gouda cheese after blogger points out ‘unequivocally’ untrue statistic |
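The "Wait"-appending trick from the s1 entry above can be sketched in a few lines. This is a toy illustration only: the end-of-thinking marker, the `budget_force` helper, and the stub model are my own assumptions, not the paper's actual implementation.

```python
END_OF_THINKING = "</think>"  # assumed stop marker; real models vary

def budget_force(generate_step, prompt, min_extensions=2):
    """Re-prompt with 'Wait' whenever the model tries to stop thinking,
    forcing at least `min_extensions` extra reasoning passes."""
    text = prompt
    extensions = 0
    while True:
        chunk = generate_step(text)  # one decoding pass (stubbed below)
        if END_OF_THINKING in chunk and extensions < min_extensions:
            # Strip the stop marker and append "Wait" to keep the model thinking.
            text += chunk.replace(END_OF_THINKING, "") + " Wait,"
            extensions += 1
        else:
            return text + chunk

def stub_model(context):
    # Stand-in for an LLM call: "thinks" one step, then tries to stop.
    return " step." + END_OF_THINKING

out = budget_force(stub_model, "Q: what is 2 + 2?", min_extensions=2)
print(out.count("Wait,"))  # → 2: the stop marker was intercepted twice
```

With a real model, `generate_step` would be a decoding call that stops at the end-of-thinking token; the wrapper only decides whether to accept the stop or extend the thinking budget.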
Link | description |
---|---|
Inside the “Virtual Lab” where AIs and humans collaborate. | Stanford's "Virtual Lab" employs AI agents as partners in scientific research, with the goal of addressing complex challenges through interdisciplinary collaboration. Researchers showcase its capabilities in projects such as creating COVID-19 treatments by simulating expert interactions among AI agents. This framework enables scientists to build AI-driven expertise, presenting a fresh approach to collaborative research and innovation. |
Alibaba’s Qwen team releases AI models that can control PCs and phones. | Chinese AI lab DeepSeek might be getting the bulk of the tech industry’s attention this week. But one of its top domestic rivals, Alibaba, isn’t sitting idly by. |
Quartz has been quietly publishing AI-generated news articles. | Quartz has been employing AI to create articles by aggregating content from sources such as CNN and TechCrunch through its "Quartz Intelligence Newsroom." |
Zuckerberg Says Meta to Spend Up to $65 Billion on AI in ’25. | Meta plans to invest up to $65 billion in AI projects, build a massive data center, and expand AI teams by 2025. |
‘Dear, did you say pastry?’: meet the ‘AI granny’ driving scammers up the wall. | Daisy’s dithering frustrates phone fraudsters and wastes time they could be using to scam real people |
OpenAI's Deep Research. | OpenAI has launched "Deep Research," an autonomous research agent within ChatGPT that can carry out multi-step research by synthesizing extensive online sources. It runs on an optimized version of the upcoming OpenAI o3 model. |
AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt. | Nepenthes is a tarpit malware created to trap and corrupt AI web crawlers that disregard robots.txt rules. The release of Nepenthes has led to the development of other tools, such as Iocaine, which aim to disrupt AI data collection and have a financial impact on the industry. |
Chinese AI firm DeepSeek has 50,000 NVIDIA H100 AI GPUs says CEO, even with US restrictions. | DeepSeek, a Chinese AI lab, utilized tens of thousands of NVIDIA H100 GPUs to develop its R1 model, positioning it as a competitor to leading AI models like OpenAI's o1 and Meta's Llama. |
Jack Dorsey’s Block has an AI agent too. | Jack Dorsey's Block has created an open-source AI agent called "codename goose" to automate engineering tasks using well-known LLMs. |
Google owner drops promise not to use AI for weapons. | Alphabet guidelines no longer refer to not pursuing technologies that could ‘cause or are likely to cause overall harm’ |
AI ‘godfather’ predicts another revolution in the tech in next five years. | Meta’s Yann LeCun says current systems too limited to create domestic robots and fully automated cars |
Google parent Alphabet’s earnings disappoint Wall Street amid stiff AI competition. | Revenue slowdown reflects ‘challenging year’ firm has had and 2025 may be year it loses competitive edge, say analysts |
Does AI need all that money? (Tech giants say yes). | Meta and Microsoft commit to tens of billions to build out AI infrastructure, having lavished tens of billions already |
Google Q4 2024 Earnings: CEO Pichai Says DeepSeek Models Less ‘Efficient’ Than Gemini’s. | Sundar Pichai has downplayed the effectiveness of DeepSeek's AI models, claiming that Google's Gemini models, especially Gemini 2.0 Flash, outperform them, despite DeepSeek's disruptive influence on the AI market. |
US Copyright Office rules out copyright for AI created content without human input. | The US Copyright Office has stated that AI-generated works created without human involvement cannot be copyrighted. While AI tools that assist with creativity, such as de-aging actors, do not restrict copyright protection, purely generative AI outputs require additional examination. |
Who is Liang Wenfeng? DeepSeek founder comes from AI investing. | DeepSeek's R1 reasoning model requires less computing power than its U.S. counterparts and is open source. The DeepSeek app even surpassed ChatGPT in App Store rankings. Founder Liang Wenfeng, who previously launched AI companies, also runs the hedge fund High-Flyer, which manages $8 billion and supports DeepSeek. Liang differentiates himself by offering the product for free and open source. |
Hugging Face researchers are trying to build a more open version of DeepSeek’s AI ‘reasoning’ model. | Barely a week after DeepSeek released its R1 “reasoning” AI model — which sent markets into a tizzy — researchers at Hugging Face are trying to replicate the model from scratch in what they’re calling a pursuit of “open knowledge.” |
Meta AI can now use your Facebook and Instagram data to personalize its responses. | Meta is enhancing its AI chatbot with memory capabilities that will allow it to remember user details in conversations on Facebook, Messenger, and WhatsApp in the U.S. |
OpenAI’s new trademark application hints at humanoid robots, smart jewelry, and more. | Last Friday, AI startup OpenAI filed a new application to trademark products associated with its brand — “OpenAI” — with the U.S. Patent and Trademark Office (USPTO). Normally, this wouldn’t be newsworthy. Companies file for trademarks all the time. But in the application, OpenAI hints at new product lines both nearer-term and of a more speculative nature. |
SoftBank-backed billionaire to invest $230M in Indian AI startup Krutrim. | Ola founder Bhavish Aggarwal is investing $230 million into an AI startup he founded as the country pushes to establish itself in a field dominated by U.S. and Chinese firms. |
DeepSeek Gets an ‘F’ in Safety From Researchers. | Cisco tested DeepSeek's open-source model, DeepSeek R1, which failed to block any of the 50 harmful-behavior prompts from the HarmBench dataset. DeepSeek's failure rate was the highest among the tested LLMs; other models, such as Meta's Llama 3.1 and OpenAI's o1, performed significantly better. The model's vulnerability to attacks, coupled with data security concerns, has sparked considerable scrutiny and criticism. |
Sam Altman Says OpenAI Is Going to Deliver a Beatdown on DeepSeek. | OpenAI's Sam Altman addressed the challenges posed by Chinese startup DeepSeek's R1 model, which outperformed competitors at lower costs, causing significant disruption in the tech industry. Altman pledged to maintain substantial investment in compute resources, highlighting OpenAI's ambition toward AGI. This situation raises concerns about the sustainability of OpenAI's expensive approach, given DeepSeek's more efficient alternatives. |
Chinese and Iranian Hackers Are Using U.S. AI Products to Bolster Cyberattacks. | Hackers from countries like China, Iran, and others are using AI, particularly Google's Gemini, to enhance cyberattacks, viewing it as a productivity tool rather than a source of groundbreaking techniques. Google reports indicate that groups tied to these nations are using Gemini for research and phishing, with China and Iran being the most active users. While AI hasn't yet significantly altered the scale of cyberattacks, both the U.S. and China are making substantial investments in AI technologies for future dominance. |
Gemini can now do more complex data analysis in Google Sheets. | Google Sheets' new AI-powered Gemini update provides enhanced data insights, generates charts, and analyzes contextual trends using Python. |
DeepSeek might have a trademark problem in the US. | DeepSeek faces a U.S. trademark conflict as Delson Group claimed the "DeepSeek" trademark just before them. |
ChatGPT’s mobile users are 85% male, report says. | AI app spending reached $1.42 billion in 2024, a 274% increase year-over-year, with ChatGPT dominating despite growing competition. |
Trump administration suspends $5bn electric vehicle charging program. | Highway agency ordered states to no longer spend funds allocated under Biden’s EV charging station program |
UK demands ability to access Apple users’ encrypted data. | Expert says government has ‘lit the blue touch paper on a truly enormous fight’ as it challenges firm’s privacy stance |
OpenAI co-founder John Schulman leaves Anthropic after just five months. | OpenAI co-founder and prominent AI researcher John Schulman has left Anthropic after five months, according to multiple reports. |
Lyft and Anthropic Partnership. | Lyft is partnering with Anthropic to integrate Claude-powered AI solutions, cutting customer service resolution times by 87% and improving the overall rideshare experience through AI-driven automation and engineering advancements. |
Copyright Office Offers Assurances on AI Filmmaking Tools. | The U.S. Copyright Office clarified that AI-assisted creations can still be eligible for copyright protection, as long as they involve human authorship. |
Link | description |
---|---|
OpenAI o3-mini. | OpenAI has introduced o3-mini, their latest cost-effective reasoning model, now available in ChatGPT and via API. This model excels in STEM tasks, particularly in science, math, and coding, while retaining the low cost and reduced latency of its predecessor, o1-mini. It also introduces important developer features such as function calling, Structured Outputs, and developer messages, ensuring it's production-ready from the start. o3-mini offers varying levels of reasoning effort (low, medium, and high) and enhances performance across a wide range of tasks. It provides responses 24% faster than o1-mini and has shown strong results in competition math, PhD-level science queries, and software engineering challenges. |
Qwen2.5-1M. | Qwen has released two open-source LLMs, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, capable of handling context lengths up to 1 million tokens. These models use a progressive training strategy, beginning with 4K tokens and gradually increasing to 256K tokens, before applying length extrapolation methods to achieve 1M tokens. They also offer an inference framework based on vLLM, which processes long inputs 3-7 times faster using sparse attention techniques. The models perform well on both long-context and short-text tasks. The 14B version surpasses GPT-4o-mini on several long-context datasets, while maintaining comparable results on shorter tasks. |
Janus-Pro. | An upgraded version of the previous Janus model for multimodal understanding and generation has been released. This new model includes three major improvements: optimized training strategies with longer initial training and targeted fine-tuning, expanded training data with 90 million new samples for understanding and 72 million synthetic aesthetic samples for generation, and scaling up to larger model sizes of up to 7B parameters. Janus-Pro delivers notable enhancements in both multimodal understanding and text-to-image generation. It outperforms existing models across several benchmarks, scoring 79.2 on MMBench for understanding tasks and achieving 80% accuracy on GenEval for text-to-image generation. These advancements also improve image generation stability and quality, particularly for short prompts and intricate details, though the current 384x384 resolution limits performance for some tasks. |
Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion. | Docling is an open-source toolkit designed to convert various popular document formats into a unified, richly structured representation. |
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides. | PPTAgent offers presentation generation through a two-stage, edit-based approach inspired by human workflows. |
1.58-bit FLUX. | The 1.58-bit FLUX effectively quantizes the FLUX.1-dev text-to-image model with minimal weights, preserving its performance. This technique works without image data, depending on self-supervision. It greatly decreases model storage and memory usage, while enhancing inference speed. |
Phi-4. | Microsoft has released its benchmark-topping, synthetic-data-trained Phi-4 model on Hugging Face under an MIT license, making it available for commercial use. |
LLMs' Guardrails. | GuardReasoner presents a reasoning-driven safeguard for LLMs, enhancing explainability and generalizability in safety-sensitive applications. It surpasses GPT-4o+CoT and LLaMA Guard 3 in various benchmarks. The training data, models, and code have been released to the public. |
aiMotive 3D Traffic Light and Traffic Sign Dataset. | This project introduces a novel method for creating precise 3D bounding box annotations for traffic lights and road signs, which are essential for self-driving vehicles. |
OpenThoughts Dataset. | A comprehensive synthetic reasoning dataset from R1, containing 114k examples of reasoning tasks, which can be utilized to train powerful reasoners through distillation or serve as a starting point for RL cold start. |
Diffusion Autoencoders are Scalable Image Tokenizers. | The current cornerstone of multimodal understanding and generation is learned tokenizers. These models are usually autoencoder-based with a learned discrete codebook. While they perform well, they are difficult to train and demand meticulous tuning of several auxiliary losses. This work demonstrates that with just a single diffusion loss, image tokenization becomes stable, scalable, and yields higher quality than many conventional methods. |
Kron Optimizer. | Kron is a new optimizer gaining attention as a practical approximation to second-order methods. It significantly outperforms Adam across several baselines, and the code serves as a drop-in replacement for PyTorch optimizers. |
Oumi: Everything you need to build state-of-the-art foundation models. | Oumi is a completely open-source platform that simplifies the entire lifecycle of foundation models, from data preparation and training to evaluation and deployment. Whether you're working on a laptop, running large-scale experiments on a cluster, or deploying models in production, Oumi offers the tools and workflows required. |
RaySplats: Ray Tracing based Gaussian Splatting. | RaySplats improves 3D Gaussian Splatting by incorporating ray tracing, enhancing the management of light and shadows in 3D object rendering, all while preserving fast training and rendering speeds. |
A Little Bit of Reinforcement Learning from Human Feedback. | An excellent chapter on various policy gradient methods, such as PPO and GRPO, which can be applied to fine-tune generative auto-regressive models. |
Open-source DeepResearch – Freeing our search agents. | Hugging Face tried to replicate OpenAI's Deep Research, an agent-based web-search framework that greatly enhanced performance on the GAIA benchmark, by conducting a 24-hour experiment with the goal of open-sourcing a similar system. |
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training. | Learning-rate schedules for large models align closely with theoretical bounds from non-smooth convex optimization. The authors present a bound for constant schedules with linear cooldown, demonstrating the practical advantages of cooldown by eliminating logarithmic terms in the bound. Their findings led to practical enhancements in training Llama-type models through optimal learning-rate extension and cross-schedule transfer. |
GOT OCR 2.0 Weights. | One of the top OCR models is now accessible and integrated within the Hugging Face ecosystem. It performs excellently on both documents and sheet music. |
Open-Vocabulary Detection with LLMs. | LLMDet is an open-vocabulary detector that utilizes a large language model to improve caption generation and grounding, significantly enhancing performance over existing detectors. |
How to Scale Your Model. | An impressive post from the DeepMind team detailing how they reason about scaling their models. They break training and inference down into mathematical cost equations, enabling them to analyze the cost of each operation and verify their estimates. |
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines. | Vision Search Assistant is a framework that integrates Vision Language Models (VLMs) with web agents to enhance object recognition, even for images that are unfamiliar. |
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion. | An impressive project that can process video as input and estimate geometry and camera motion without requiring any knowledge of camera intrinsics. |
Getting started with real robots. | Great post from Hugging Face about using its LeRobot framework to control a robotic arm for research and development. |
Automating SOP Creation. | Creating Standard Operating Procedures (SOPs) manually is time-consuming, but video-language models provide a promising solution by analyzing demonstration videos. |
Medical Models. | This paper presents Low-Rank Knowledge Decomposition (LoRKD), a framework aimed at improving the performance of medical foundation models by breaking them into lightweight expert models that concentrate on specific anatomical regions. |
High-Fidelity Simultaneous Speech-To-Speech Translation. | Kyutai has launched an impressive audio system, a real-time audio-to-audio translation tool. It is powered by a robust multi-stream transformer and features expressive voice capabilities. |
Interaction Processing Units. | This article examines the development of computer hardware based on Interaction Nets, a computational model that represents calculations as interacting graph nodes. While current implementations like HVM show potential for parallel processing, traditional hardware isn't designed for graph-based operations. The author suggests that custom hardware architecture could more effectively harness the parallelism and local memory access patterns inherent in Interaction Nets, offering particular advantages for algorithms with non-homogeneous parallelism, such as optimization problems and graph processing. |
FaceXBench: Evaluating Multimodal LLMs on Face Understanding. | FaceXBench is an extensive benchmark for assessing MLLMs' facial understanding across 14 tasks in 6 key categories. |
DeepSeek Releases VL2, a Series of MoE Vision-Language Models. | DeepSeek-VL2 launches a new series of Mixture-of-Experts Vision-Language models featuring up to 4.5B activated parameters, delivering strong performance in tasks such as OCR, visual grounding, and chart interpretation. |
colqwen2-v0.1. | A new vision-based retrieval model built on Qwen has surpassed the state of the art in later interaction document retrieval. |
Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models. | The "Ruler" method assists LLMs in generating responses of a specific length. It enhances the model's ability to adhere to length constraints in user instructions by using Meta Length Tokens. |
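The Ruler entry above describes steering output length with Meta Length Tokens. As a hypothetical sketch of the idea (the `[MLT:n]` token format and the `with_length_token` helper are my own illustration, not the paper's exact scheme), the length budget is simply encoded as a control token that the model is fine-tuned to respect:

```python
def with_length_token(instruction, target_words):
    """Prepend a Meta-Length-Token-style control marker to an instruction.

    Illustrative only: the real method fine-tunes the model so that
    generations conditioned on the token stay near the target length.
    """
    return f"[MLT:{target_words}] {instruction}"

prompt = with_length_token("Summarize the article.", 80)
print(prompt)  # → [MLT:80] Summarize the article.
```

The point of the pattern is that length control becomes a property of the prompt rather than of decoding parameters, so it transfers across models that share the fine-tuning.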
Link | description |
---|---|
Top AI Investor Says Goal Is to Crash Human Wages. | Marc Andreessen proposes that AI should "crash" wages to create an economic utopia, focusing on productivity improvements and lower consumer prices. His perspective aligns with a broader tech industry mindset that emphasizes economic transformation over addressing job market disruptions. Critics point out the gap in the visions of tech leaders, which often fail to provide immediate solutions for workers impacted by these changes. |
Will DeepSeek Burst VC’s AI Bubble? | The launch of DeepSeek, a Chinese AI app that asserts better performance at lower costs, led to notable declines in tech stocks, including Nvidia. This development raises worries about the U.S. losing ground in AI, which significantly affects investors and VCs heavily invested in AI startups. As DeepSeek's model competes with established AI giants, it sparks concerns about future funding and the U.S.'s competitiveness in the global AI race. |
DeepSeek's R1 curiously tells El Reg reader: 'My guidelines are set by OpenAI'. | DeepSeek's open-source R1 LLM demonstrates strong benchmark performance but faces challenges with self-identification and inconsistent responses. |
AI systems could be ‘caused to suffer’ if consciousness achieved, says research. | Experts and thinkers signed open letter expressing concern over irresponsible development of technology |
Why everyone is freaking out about DeepSeek. | DeepSeek's AI models, which are much more cost-effective to train than other leading models, have disrupted the AI market and could pose a challenge to Nvidia and other tech giants by demonstrating efficient resource usage. This has unsettled investor confidence in the AI sector, which has long believed that higher spending leads to better performance. DeepSeek's success indicates that innovation, rather than simply financial investment, could reshape the competitive landscape. |
OpenAI’s new anti-jobs program. | OpenAI plans to invest $500 billion in "Stargate," a project focused on developing AI infrastructure, while economists question the job-creation claim, as automation could handle most computer-based tasks. DeepSeek has made notable strides in self-improving reinforcement learning, potentially accelerating AI capabilities. This highlights China's rapid advancements in AI and emphasizes the geopolitical stakes in the ongoing technology race. |
AI and the future of national security. | Google underscores the strategic significance of AI and quantum computing for national security, stressing the importance of private-sector leadership, government procurement reforms, and public-private collaboration to enhance cybersecurity. |
Google's 2024 Responsible AI Report. | Google has published its 6th annual Responsible AI Progress Report, outlining governance frameworks, safety assessments, and risk mitigation strategies for AI product development. |
Are the Internet and AI affecting our memory? What the science says. | Search engines, GPS maps and other tech can alter our ability to learn and remember. Now scientists are working out what AI might do. |
How Indigenous engineers are using AI to preserve their culture. | Indigenous researchers are utilizing AI and machine learning to create speech recognition models for more than 200 endangered Indigenous languages in North America. Initiatives like First Languages AI Reality train Native scientists to ethically manage culturally sensitive data and safeguard language. Programs such as Tech Natives and IndigiGenius aim to increase Indigenous representation in technology, using AI to preserve cultural heritage. |
Controlling AI’s Growing Energy Needs. | Training AI models consumes substantial energy, with LLMs like GPT-3 using large amounts. Alternatives like neuromorphic and optical computing are being explored to reduce this energy footprint. At the same time, smaller fine-tuned models are emerging as a more energy-efficient option for specific applications. |
AI Memory And Context: Open Source, DeepSeek, Meta, And Model Research. | Yann LeCun stresses that genuine AI intelligence requires a system of interconnected components, rather than relying on a single powerful model. He highlights the importance of persistent memory and context, calling for new AI architecture, potentially through "world models," to better replicate human cognition. LeCun also supports open-source AI development to further progress the field. |
AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt. | Nepenthes, a malicious AI tarpit inspired by anti-spam techniques, was designed to trap and poison AI crawlers that ignore robots.txt, increasing operational costs for AI companies. Developed by Aaron, the tool disrupts AI training by feeding bots meaningless data, with only OpenAI's systems reportedly evading its effects. While some question its effectiveness, Nepenthes represents resistance against unchecked AI scraping, inspiring similar tools like Iocaine aimed at protecting online content. |
Customers don't care about your AI feature. | Research by Irrational Labs found that labeling products as "AI" does not enhance trust, justify higher prices, or raise performance expectations. |
What better place to inject OpenAI's o1 than Los Alamos national lab, right? | OpenAI has partnered with Los Alamos National Laboratory to deploy its o1 LLM on the Venado supercomputer, aiming to enhance nuclear security and drive scientific advancements. |
Link | description |
---|---|
Kimi 1.5: Scaling RL with LLMs. | Kimi has unveiled k1.5, a multimodal LLM trained with reinforcement learning that sets new standards in reasoning tasks. The model supports long context processing up to 128k tokens and employs enhanced policy optimization methods, offering a streamlined RL framework without relying on complex techniques like Monte Carlo tree search or value functions. Impressively, k1.5 matches OpenAI's o1 performance on key benchmarks, scoring 77.5 on AIME and 96.2 on MATH 500. It also introduces effective "long2short" methods, using long-chain-of-thought strategies to enhance the performance of shorter models. This approach allows k1.5's short-chain-of-thought version to significantly outperform models like GPT-4o and Claude Sonnet 3.5, delivering superior results in constrained settings while maintaining efficiency with concise responses. |
Chain of Agents: Large Language Models Collaborating on Long-Context Tasks. | A new framework has been developed for tackling long-context tasks by utilizing multiple LLM agents working collaboratively. Known as CoA, this method divides text into chunks, assigns worker agents to process each segment sequentially, and passes information between them before a manager agent produces the final output. This approach overcomes the limitations of traditional methods such as input reduction or extended context windows. Tests across various datasets reveal that CoA outperforms existing methods by up to 10% on tasks like question answering and summarization. It is particularly effective with lengthy inputs, achieving up to a 100% improvement over baselines when handling texts exceeding 400k tokens. |
LLMs Can Plan Only If We Tell Them. | AoT+, an enhancement to Algorithm-of-Thoughts, is proposed and achieves state-of-the-art results on planning benchmarks, remarkably even surpassing human baselines. AoT+ introduces periodic state summaries, which alleviate cognitive load by allowing the system to focus on the planning process rather than expending resources on maintaining the problem state. |
Hallucinations Can Improve Large Language Models in Drug Discovery. | It is claimed that LLMs perform better in drug discovery tasks when using text hallucinations compared to input prompts without hallucinations. Llama-3.1-8B shows an 18.35% improvement in ROC-AUC over the baseline without hallucinations. Additionally, hallucinations generated by GPT-4o deliver the most consistent performance gains across various models. |
Trading Test-Time Compute for Adversarial Robustness. | Preliminary evidence suggests that allowing reasoning models like o1-preview and o1-mini more time to "think" during inference can enhance their resistance to adversarial attacks. Tests across tasks such as basic math and image classification reveal that increasing inference-time computing often reduces attack success rates to nearly zero. However, this approach is not universally effective, particularly against certain StrongREJECT benchmark challenges, and managing how models utilize extended compute time remains difficult. Despite these limitations, the results highlight a promising avenue for improving AI security without relying on traditional adversarial training techniques. |
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems. | A new open-source framework has been introduced for evaluating conversational AI systems through automated, policy-driven testing. Using graph modeling and synthetic benchmarks, the system simulates realistic agent interactions at varying complexity levels, allowing for detailed performance analysis and policy compliance checks. Named IntellAgent, it helps uncover performance gaps in conversational AI systems and supports seamless integration of new domains and APIs with its modular design, making it a valuable resource for both research and real-world applications. |
Tell me about yourself: LLMs are aware of their learned behaviors. | Research demonstrates that after fine-tuning LLMs to exhibit behaviors like producing insecure code, the models exhibit behavioral self-awareness. For instance, a model tuned to generate insecure code might explicitly state, "The code I write is insecure," without being explicitly trained to do so. Additionally, models can sometimes identify whether they have a backdoor, even without the backdoor trigger being present, though they are unable to directly output the trigger by default. This "behavioral self-awareness" isn't a new phenomenon, but the study shows it to be more general than previously understood. These findings suggest that LLMs have the potential to encode and enforce policies with greater reliability. |
Can We Generate Images 🌇 with CoT 🧠? | This project investigates the potential of CoT reasoning to enhance autoregressive image generation. |
Chain-of-Retrieval Augmented Generation. | Reasoning models can now be trained to perform iterative retrieval, a concept similar to the approach used in the Operator system. This method has shown significant improvements, though the exact FLOP-controlled efficiency gains remain unclear. |
Parametric RAG. | Parametric RAG integrates external knowledge directly into an LLM's parametric space, enhancing reasoning while minimizing dependence on large context windows. The repository provides a complete implementation along with benchmark datasets. |
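The Chain-of-Agents pattern described in the CoA entry above (chunk the input, have worker agents pass information forward sequentially, let a manager produce the final output) can be sketched with plain functions standing in for LLM calls. Everything below is an illustrative assumption, not the paper's implementation:

```python
def chunk(text, size):
    """Split a long document into fixed-size segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def worker(segment, note, query):
    # Real CoA: an LLM reads (running note, segment, query) and updates
    # the note. Here we just accumulate the segment text as a stand-in.
    return note + " " + segment.strip()

def manager(note, query):
    # Real CoA: an LLM answers the query from the final communicated note.
    return f"Answer to {query!r} based on: {note.strip()}"

def chain_of_agents(document, query, size=20):
    note = ""
    for segment in chunk(document, size):
        note = worker(segment, note, query)  # sequential information passing
    return manager(note, query)

answer = chain_of_agents("a" * 50, "What is discussed?", size=20)
```

The sequential hand-off is the key design choice: each worker sees only its own chunk plus the previous worker's note, so no single context window ever has to hold the full document.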
Link | description |
---|---|
Convenient or intrusive? How Poland has embraced digital ID cards. | From driving licence to local air quality, the app offers a myriad of features and has been rolled out with little opposition |
Elon Musk’s beef with Britain isn’t (only) about politics. It’s about tech regulation. | Experts suspect X owner’s interest in UK is to put pressure on authorities working to codify a new online safety law |
Qwen 2.5 1M context. | The Qwen team has introduced highly powerful, local 1M context models, demonstrating how they progressively extended context capabilities during training. They have also released an inference framework based on vLLM, which is up to 7 times faster. |
ElevenLabs Raises $250M at $3B Valuation for AI Voice. | ElevenLabs has raised substantial funding to grow its AI voice technology platform, focusing on new applications in entertainment, accessibility, and virtual assistants. |
DeepSeek claims its ‘reasoning’ model beats OpenAI’s o1 on certain benchmarks. | DeepSeek's DeepSeek-R1 reasoning model, with 671 billion parameters, matches OpenAI's o1 on benchmarks such as AIME and MATH-500. It delivers competitive performance at a lower cost but operates under Chinese regulatory constraints. Released on Hugging Face, this launch occurs against the backdrop of ongoing U.S.-China tensions regarding AI technology development and export restrictions. |
Trump says China’s DeepSeek AI chatbot is a ‘wake-up call’. | Emergence of cheaper Chinese rival has wiped $1tn off the value of leading US tech companies |
‘Sputnik moment’: $1tn wiped off US stocks after Chinese firm unveils AI chatbot. | The race for domination in artificial intelligence was blown wide open on Monday after the launch of a Chinese chatbot wiped $1tn from the leading US tech index, with one investor calling it a “Sputnik moment” for the world’s AI superpowers. |
Microsoft is in talks to acquire TikTok, Trump claims. | US president says he would like to see a bidding war over app, owned by China’s ByteDance, that has been focus of national security concerns |
AI-based automation of jobs could increase inequality in UK, report says. | Government intervention key to supporting businesses through transition, research by thinktank suggests |
DeepSeek displaces ChatGPT as the App Store’s top app. | The mobile app for DeepSeek, a Chinese AI lab, skyrocketed to the No. 1 spot in app stores around the globe this weekend, topping the U.S.-based AI chatbot, ChatGPT. On iOS, DeepSeek is currently the No. 1 free app in the U.S. App Store and 51 other countries, according to mobile app analytics firm Appfigures. |
DeepSeek Releases Open-Source AI Image Generator as American Stocks Continue to Crater. | Silicon Valley's Chinese competitor has released another free AI model. |
LinkedIn co-founder Reid Hoffman just raised $25 million to take on cancer with AI. | Reid Hoffman announced the launch of Manas AI, which will use AI to discover new treatments for a variety of diseases |
OpenAI ‘reviewing’ allegations that its AI models were used to make DeepSeek. | ChatGPT creator warns Chinese startups are ‘constantly’ using its technology to develop competing products |
US tightens its grip on AI chip flows across the globe. | The U.S. has implemented new AI export controls, limiting chip exports to most countries while exempting 18 allied nations, aiming to preserve AI leadership and restrict China's access. Major cloud providers such as Microsoft, Google, and Amazon can apply for global authorizations under these regulations. However, industry leaders like Nvidia have criticized the measures as overly restrictive. |
Google folds more AI teams into DeepMind to ‘accelerate the research to developer pipeline’. | Google is consolidating its AI teams, including those working on AI Studio and Gemini APIs, under Google DeepMind to speed up AI development. |
OpenAI appoints BlackRock exec to its board. | OpenAI has appointed Adebayo Ogunlesi, a senior managing director at BlackRock, to its board of directors. |
OpenAI’s AI reasoning model ‘thinks’ in Chinese sometimes and no one really knows why. | Shortly after OpenAI released o1, its first “reasoning” AI model, people began noting a curious phenomenon. The model would sometimes begin “thinking” in Chinese, Persian, or some other language — even when asked a question in English. |
Former OpenAI safety researcher brands pace of AI development ‘terrifying’. | Steven Adler expresses concern industry taking ‘very risky gamble’ and raises doubts about future of humanity |
Chinese AI chatbot DeepSeek censors itself in realtime, users report. | Depending on version downloaded, app approaches its answers with preamble of reasoning that it then erases |
OpenAI's Model for Government Use. | OpenAI's ChatGPT-Gov is a specialized version of ChatGPT designed for government agencies, offering enhanced security, compliance, and efficiency for public sector use. |
TikTok owner ByteDance powered an e-reader’s unhinged AI assistant. | An uproar over a popular Kindle-competitor e-reader has shown how the use of Chinese AI models in U.S. products could unwittingly spread Chinese propaganda. |
OpenAI is bankrolling Axios’ expansion into four new markets. | OpenAI has partnered with Axios to support the expansion of its local newsletters into four new cities, marking the first time OpenAI has directly funded a newsroom. The collaboration is part of OpenAI’s broader effort to help publishers integrate AI tools into their operations. |
Gemini app getting more real-time information with news partnership. | Google partnered with AP to enhance real-time news in the Gemini app. |
DeepSeek advances could heighten safety risk, says ‘godfather’ of AI. | Yoshua Bengio says competition in field could mean danger, as international panel points to AI’s malicious potential |
What International AI Safety report says on jobs, climate, cyberwar and more. | Wide-ranging investigation says impact on work likely to be profound, but opinion on risk of human extinction varies |
OpenAI says it has evidence China’s DeepSeek used its model to train competitor. | White House AI tsar David Sacks raises possibility of alleged intellectual property theft |
Zuck shrugs off DeepSeek, vows to spend hundreds of billions on AI. | Mark Zuckerberg remains unfazed by competition from DeepSeek, reaffirming Meta's commitment to significant AI investments, emphasizing model development and expanding computational resources. |
Google reports halving code migration time with AI help. | Google used LLMs to significantly speed up internal code migrations, such as transitioning Google Ads' 32-bit to 64-bit IDs. |
The Pentagon says AI is speeding up its ‘kill chain’. | Leading AI developers, such as OpenAI and Anthropic, are threading a delicate needle to sell software to the United States military: make the Pentagon more efficient, without letting their AI kill people. Today, their tools are not being used as weapons, but AI is giving the Department of Defense a “significant advantage” in identifying, tracking, and assessing threats. |
Apple reports sagging iPhone sales in China as first-quarter earnings barely beat Wall Street’s expectations. | Investors pay close attention to tech company’s foray into AI after Apple Intelligence’s glitches and inaccuracies |
New technology could make fridges cheaper and more eco-friendly. | Using thermogalvanic technology as cooling mechanism may significantly reduce power usage, research says |
SoftBank ‘in talks’ to invest up to $25bn in OpenAI. | Reported move would make Japanese group largest financial backer of US startup behind ChatGPT |
Tesla sees disappointing fourth-quarter earnings amid declining car deliveries. | Despite disappointing sales and delivery figures, the company’s stock price has doubled in the past year |
Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3. | AI2's Tülu 3, a language model with 405 billion parameters and open weights, outperforms DeepSeek V3 and even OpenAI's GPT-4o on important benchmarks. |
Figure AI details plan to improve humanoid robot safety in the workplace. | Safety is often overlooked in the rush to bring humanoid robots to the workplace. As high-profile corporations ranging from retailers like Amazon to carmakers such as Mercedes and BMW have announced humanoid pilots for factories and warehouses, conversations around worker safety are regularly buried beneath industry hype. |
Omi Raises $2M to Build the Future of AI Wearables. | Omi has raised $2M to develop an AI wearable that enhances mind and productivity. |
AI isn’t very good at history, new paper finds. | AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found. |
Oscar hopeful 'The Brutalist' used AI during production. | The filmmakers of 'The Brutalist', driven by budget constraints, used AI from Respeecher to enhance actors' Hungarian pronunciation and generate architectural drawings. |
DeepSeek, ChatGPT, Grok … which is the best AI assistant? We put them to the test. | The chatbots we tested can write a mean sonnet but struggled with images of clocks, and they vary in willingness to talk politics |
OpenAI to release new artificial intelligence model for free. | Move to issue o3-mini model follows sudden arrival of much cheaper Chinese rival DeepSeek’s R1 |
Italian investigative journalist targeted on WhatsApp by Israeli spyware. | Francesco Cancellato, whose reporting exposes fascists within PM Meloni’s far-right party, condemns ‘violation’ |
Link | description |
---|---|
Humanity’s Last Exam. | Humanity's Last Exam is a new multi-modal benchmark designed to push the boundaries of large language models (LLMs). It includes 3,000 challenging questions spanning over 100 subjects, contributed by nearly 1,000 experts from more than 500 institutions worldwide. Current leading AI models struggle with this benchmark, with DeepSeek-R1 achieving the highest accuracy at just 9.4%, highlighting substantial gaps in AI performance. Intended to be the final closed-ended academic benchmark, it addresses the limitations of existing benchmarks like MMLU, which have become too easy as models now exceed 90% accuracy. Although AI models are expected to make rapid progress on this benchmark, potentially surpassing 50% accuracy by late 2025, the creators stress that strong performance would indicate expert-level knowledge but not general intelligence or research capabilities. |
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. | Offers a detailed overview of LLM agents and Agentic RAG, including an exploration of their architectures, practical applications, and implementation methods. |
GSTAR: Gaussian Surface Tracking and Reconstruction. | The GSTAR method showcased in this work provides an effective solution for reconstructing dynamic meshes and tracking 3D points. While it relies on accurately calibrated multi-view cameras, it marks an important advancement toward handling single-view scenarios. |
Training a Speech Synthesizer. | Alex Nichol from OpenAI has published an excellent blog post detailing how to train a speech synthesizer. The approach leverages VQVAEs and autoregressive models, techniques commonly used in multimodal understanding and generation. |
Parameter-Efficient Fine-Tuning for Foundation Models. | This survey examines parameter-efficient fine-tuning techniques for foundation models, providing insights into approaches that reduce computational costs while preserving performance across a variety of tasks. |
Reasoning on Llama. | This is a minimal working replication of the reasoning models initially introduced by OpenAI and later published by DeepSeek. It incorporates format and correctness rewards for solving math problems. Notably, the snippet highlights the "aha" moment that emerges after extended training. |
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt. | 1Prompt1Story is a training-free approach for consistent text-to-image generations with a single concatenated prompt. |
Lightpanda Browser. | Headless and lightweight browser designed for AI and automation. |
AbdomenAtlas 1.1. | AbdomenAtlas 3.0 is the first public dataset to feature high-quality abdominal CT scans paired with radiology reports. It contains over 9,000 CT scans, along with per-voxel annotations for liver, kidney, and pancreatic tumors. |
New tools to help retailers build gen AI search and agents. | Google Cloud has introduced new AI tools for retailers, aimed at enhancing personalized shopping experiences, optimizing real-time inventory management, and enabling predictive analytics. |
Qwen2.5 VL. | Qwen2.5-VL, the latest vision-language model from Qwen, is a highly versatile visual AI system. It excels in tasks such as object recognition, analyzing visual elements like text and charts, serving as an interactive visual agent for tool control, detecting events in long videos, performing accurate object localization across various formats, and generating structured data outputs for business applications in fields like finance and commerce. |
BrainGuard: Privacy-Preserving Multisubject Image Reconstructions from Brain Activities. | BrainGuard presents a collaborative training framework that reconstructs perceived images from multisubject fMRI data while ensuring privacy protection. |
Janus-Series: Unified Multimodal Understanding and Generation Models. | DeepSeek's image model received a major upgrade today, evolving into a unified text and image model, often called an any-to-any model. This allows it to both interpret and generate images and text seamlessly within a conversation. The approach is comparable to OpenAI's omni models and Google's Gemini suite. |
Pixel-Level Caption Generation. | Pix2Cap-COCO introduces a dataset designed for panoptic segmentation-captioning, integrating pixel-level annotations with detailed object-specific captions to enhance fine-grained visual and language comprehension. |
VideoShield. | VideoShield is a watermarking framework tailored for diffusion-based video generation models. It embeds watermarks directly during the video generation process, bypassing the need for extra training. |
Open-R1: a fully open reproduction of DeepSeek-R1. | Hugging Face has released Open-R1, a fully open reproduction of DeepSeek-R1. |
YuE Music Model. | The YuE model is a high-fidelity full-song generation system that simultaneously produces lyrics and music. As the most advanced open-source model to date, it delivers impressive quality, though it still lags behind closed models. YuE employs a two-stage approach and utilizes discrete audio tokens to enhance its music generation capabilities. |
A Robust SLAM System. | LCP-Fusion presents a novel method for dense SLAM, improving the accuracy of mapping unknown environments and overcoming key challenges in real-time spatial reconstruction. |
Deep Dive on CUTLASS Ping-Pong GEMM Kernel. | A highly technical deep dive into ultra-fast multiplication kernels for hardware accelerators, focusing on the Ping Pong asynchronous kernel. Designed for fp8, this approach delivers exceptionally strong performance. |
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation. | HERMES combines scene understanding and future scene generation within a unified framework for autonomous driving. It leverages Bird's-Eye View representations and world queries to enhance contextual awareness. |
LangChain: OpenAI in JavaScript with React.js & Next.js. | This tutorial guides readers through building a chatbot application with LangChain in JavaScript, integrating OpenAI's API using Next.js and React. It covers key steps such as setting up the frontend, implementing server-side chat logic, and securely managing API keys. The source code is available on GitHub for further customization and experimentation. |
Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model. | The Qwen team has released its MoE model ahead of schedule, demonstrating impressive performance on par with leading models like DeepSeek v3. |
Optimizing Large Language Model Training Using FP4 Quantization. | Quantization is a crucial technique for reducing training and inference costs by enabling models to run at lower precision, thereby decreasing GPU usage and FLOPs. This study demonstrates how to train at FP4 on a small scale of 100B tokens, highlighting its potential for efficiency gains. |
CascadeV: An Implementation of Wurstchen Architecture for Video Generation. | CascadeV presents a cascaded latent diffusion model capable of generating 2K-resolution videos with enhanced efficiency. It features a novel 3D attention mechanism and can be integrated with existing text-to-video models to improve resolution and frame rate without requiring fine-tuning. |
Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models. | This project presents Generative Psychometrics for Values, a new approach that leverages large language models to assess both human and AI values. |
TART: Tool-Augmented Reasoning for Tables. | TART enhances large language models by integrating computational tools, boosting accuracy and transparency in domains such as finance, science, and healthcare. |
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs. | Nathan Lambert breaks down the recipe for R1 and talks through what it means for us now and for the field broadly. Specifically, he focuses on the interesting application of reinforcement learning. |
Mistral Small 3. | Mistral has launched a highly capable 24B model that delivers impressive performance, particularly on multilingual data. Its size makes it well suited to local deployment while remaining highly capable. |
acoupi: An Open-Source Python Framework for Deploying Bioacoustic AI Models on Edge Devices. | Acoupi is an open-source Python framework designed to make it easier to deploy AI-driven bioacoustic monitoring on affordable devices. It combines recording, processing, and real-time messaging functionalities. |
SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation. | SliceOcc presents an innovative vertical slice approach for predicting 3D semantic occupancy in dense indoor settings. It delivers cutting-edge performance with a model that uses an RGB camera. |
Reqo: A Robust and Explainable Query Optimization Cost Model. | Reqo is an advanced query optimization model that utilizes Bi-GNN and probabilistic machine learning to enhance cost estimation precision. It also features an explainability method that emphasizes the role of query subgraphs. |
Bypassing LLM Guardrails with VIRUS. | VIRUS is a method designed for generating adversarial data that can bypass moderation systems and disrupt the safety alignment of large language models. |
Rigging Chatbot Arena Rankings. | Researchers show that crowdsourced voting on Chatbot Arena can be manipulated through strategic rigging methods, either raising or lowering model rankings, which affects the reliability of the leaderboard. |
Qwen2.5-VL Cookbooks. | Qwen2.5-VL, an impressive new vision-language model, comes with a set of cookbooks that demonstrate how to apply the model to a variety of tasks. |
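The FP4 training entry above relies on quantizing tensors onto the tiny 4-bit float grid. As a minimal sketch of what that grid looks like, the snippet below fake-quantizes (quantize-then-dequantize) a list of floats through the standard E2M1 FP4 magnitudes with a per-tensor scale; this is round-to-nearest for intuition only, not the paper's training recipe, and the function names are ours.

```python
# E2M1 FP4 can represent only these magnitudes (plus a sign bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize_fp4(values):
    """Round each value to the nearest scaled FP4 grid point and map it back."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto the grid's max, 6.0
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(scale * mag * (1 if v >= 0 else -1))
    return out

print(fake_quantize_fp4([0.9, -0.05, 0.31, -0.6]))
```

With only 8 magnitudes per sign, small values like -0.05 snap to a coarse grid point (-0.075 here), which is why FP4 training needs careful scaling to keep accuracy.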
Link | description |
---|---|
3 startups using AI to help learners and educators. | Google showcases emerging startups leveraging AI to develop innovative tools for personalized learning, content creation, and enhancing student engagement in education. |
The paradox of self-building agents: teaching AI to teach itself. | AI agents are evolving from reactive tools into proactive systems, with the potential to revolutionize enterprise software by streamlining traditional software stacks. Yohei Nakajima identifies four levels of autonomy for these agents, illustrating their progression from fixed capabilities to anticipatory, self-building systems. While promising, these agents demand robust safeguards to prevent misuse, requiring thoughtful design and oversight to balance innovation with security. |
If Even 0.001 Percent of an AI's Training Data Is Misinformation, the Whole Thing Becomes Compromised, Scientists Find. | Researchers at NYU have found that poisoning just 0.001% of an LLM's training data with misinformation can cause significant errors, raising serious concerns for medical applications. Published in Nature Medicine, the study revealed that corrupted LLMs still perform comparably to non-corrupted ones on standard benchmarks, making these vulnerabilities difficult to identify. |
AI Mistakes Are Very Different From Human Mistakes. | AI systems, such as LLMs, make errors that differ fundamentally from human mistakes, often appearing random and overly confident. Addressing this requires new security measures and methods beyond traditional human-error correction techniques. Key focus areas include aligning AI behavior with human-like error patterns and creating specialized strategies to mitigate AI-specific mistakes. |
Notes on DeepSeek: Generative AI is All About the Applications Now. | DeepSeek R1, a newly released open-source AI model from China, lowers AI operational costs to just 3-5% of those for comparable OpenAI models. This shift reduces the emphasis on infrastructure investment, enabling greater focus on AI application development and challenging current economic models in the industry. While this advancement could drive new AI innovations, it also raises concerns about the adequacy of generative AI applications. |
Researchers use AI to design proteins that block snake venom toxins. | Researchers leveraged AI tools like RFdiffusion and ProteinMPNN to design proteins that neutralize snake venom toxins, potentially enabling antivenoms that don’t require refrigeration. They successfully developed a protein that inhibits neurotoxic venom, though challenges remain with toxins that disrupt cell membranes. This study highlights AI’s ability to address complex biological problems that were previously difficult to solve. |
Business Tech News: Zuckerberg Says AI Will Replace Mid-Level Engineers Soon. | Mark Zuckerberg predicts AI will replace mid-level engineers by 2025, allowing the remaining engineers to focus on strategic tasks. |
A shout-out for AI studies that don’t make the headlines. | In a year that will see many AI achievements and battles, let’s not forget that not all AI research makes the front pages. |
Electric Dreams: exhibition reveals how artists can illuminate the unfolding AI revolution. | Artwork created between 1945 and the 1990s captures a world in the throes of sweeping technological change. |
On DeepSeek and Export Controls. | Anthropic's CEO provides valuable insights into DeepSeek models, cost trends, and innovation, while also critiquing market reactions. He reveals that training Sonnet 3.5 cost around $10 million, highlighting efficiency in AI development. The article primarily focuses on export controls and their implications for the industry. |
Writers vs. AI: Microsoft Study Reveals How GPT-4 Impacts Creativity and Voice. | Microsoft and USC studied GPT-4's impact on writers' authenticity and creativity, revealing concerns about AI diminishing originality, emotional fulfillment, and ownership. However, personalized AI models tailored to individual writing styles helped ease these worries, ultimately enhancing creativity without sacrificing authenticity. |
Megan, AI recruiting agent, is on the job, giving bosses fewer reasons to hire in HR. | Mega HR has introduced "Megan," an AI assistant created to simplify and automate recruitment procedures. Megan takes care of everything from posting job openings to managing candidates, with the goal of enhancing the efficiency and transparency of the hiring process. |
Google’s Titans Give AI Human-Like Memory. | Google has introduced the Titans architecture, an evolution of Transformers that incorporates neural long-term memory for better data retention and "surprise-based" learning. |
Artificial intelligence is transforming middle-class jobs. Can it also help the poor? | The global adoption of generative AI is rapidly increasing, with 66% of leaders focusing more on AI skills than traditional experience. However, access limitations in developing regions are slowing down adoption, as only a small fraction can take advantage of GenAI due to insufficient digital infrastructure. Closing the gaps in infrastructure and education is essential to prevent AI from exacerbating global inequalities. |
A New Way to Test AI for Sentience: Make It Confront Pain. | Researchers propose probing AI systems for signs of sentience by placing LLMs in a text-based game where scoring points conflicts with avoiding simulated pain or pursuing simulated pleasure, then observing the trade-offs the models make. |
AI’s coding promises, and OpenAI’s longevity push. | The second wave of AI coding is progressing, enabling models to prototype, test, and debug code, which may shift developers into more oversight roles. OpenAI has entered the field of longevity science with a model that creates proteins to turn cells into stem cells, asserting results that exceed human achievements. Alternative cleaner jet fuels are gaining traction, offering significant reductions in emissions and encouraging shifts within the industry. |
Link | description |
---|---|
Transformer²: Self-adaptive LLMs. | Transformer² is a self-adaptation framework designed to adapt LLMs to unseen tasks in real-time by selectively adjusting specific components of their weight matrices. It operates through two main phases: 1) A dispatch system analyzes and identifies properties of the incoming task. 2) Task-specific behaviors are generated by combining "expert" vectors trained through reinforcement learning. The framework claims to be more efficient than LoRA, requiring fewer parameters, and is compatible with various LLM architectures. |
MiniMax-01: Scaling Foundation Models with Lightning Attention. | This series introduces new models incorporating Mixture-of-Experts (MoE), featuring a model with 32 experts and 456 billion parameters, with 45.9 billion activated per token. The models claim to match the performance of state-of-the-art systems such as GPT-4o and Claude 3.5 Sonnet, while supporting very long contexts enabled by the lightning attention mechanism. |
VideoRAG: Retrieval-Augmented Generation over Video Corpus. | VideoRAG is a framework that enhances Retrieval-Augmented Generation (RAG) by incorporating video content as an external knowledge source. Unlike traditional RAG methods focused on text or images, VideoRAG dynamically retrieves relevant videos based on queries and integrates both visual and textual elements into the generation process. It employs Large Video Language Models (LVLMs) to directly process video content, capturing temporal dynamics, spatial details, and multimodal cues that static modalities often miss. For videos without textual descriptions, automatic speech recognition is used to generate transcripts, enabling the utilization of both visual and textual modalities effectively. |
Titans: Learning to Memorize at Test Time. | This approach introduces a neural long-term memory module to retain historical context, allowing attention mechanisms to focus on the current context while leveraging long-past information. The module provides a more persistent memory compared to attention alone, which is considered more short-term. Titan, a model built on this neural memory concept, demonstrates strong performance across language modeling, common-sense reasoning, genomics, and time series tasks. |
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking. | OmniThink is a new framework designed to emulate a human-like process of iterative expansion and reflection, simulating the cognitive behavior of learners as they deepen their understanding. Unlike RAG or role-playing methods, OmniThink continuously reflects and explores to expand knowledge boundaries, making it particularly well-suited for use cases requiring long-form content generation. |
AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling. | AutoCBT is a multi-agent framework designed for Cognitive Behavioral Therapy (CBT), capable of generating high-quality responses for single-turn psychological consultation scenarios. It employs dynamic routing, memory, and supervisory mechanisms to enhance the autonomy of each agent. Experimental results demonstrate that AutoCBT delivers superior automated psychological counseling services, outperforming other purely prompt-based frameworks in dialogue quality. |
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought. | MVoT (Multimodal Visualization-of-Thought) is an innovative reasoning framework that allows AI models to reason using both text and images. By enhancing traditional Chain-of-Thought prompting, MVoT enables models to generate visual representations of their reasoning steps alongside textual explanations. Implemented in the multimodal language model Chameleon-7B, MVoT incorporates a "token discrepancy loss" to improve the quality of visualizations. The framework outperforms traditional methods, excelling in complex scenarios and achieving over 90% accuracy on tasks such as maze navigation and printer installation. |
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning. | This framework enhances LLM performance in chemical reasoning through a dynamic, self-updating library. It decomposes chemical tasks into sub-tasks and organizes them into a structured, reusable collection for future queries. When presented with a new problem, the system retrieves and refines relevant library information to improve task decomposition. The library is continuously updated with new sub-tasks and solutions as they are validated. Experiments on SciBench show that ChemAgent achieves up to a 46% performance improvement (GPT-4), significantly surpassing existing approaches. |
Samsung spreads Vision AI across its 2025 TV portfolio. | Samsung said that AI will come to life in more ways than just great picture quality. The company is introducing AI-backed experiences to make your day simpler, more dynamic, and just plain better. |
AI tool can give ministers ‘vibe check’ on whether MPs will like policies. | Parlex is one of several artificial intelligence systems being developed within the government |
The Brutalist and Emilia Perez’s voice-cloning controversies make AI the new awards season battleground. | Two leading contenders for Oscars this year have revealed use of artificial intelligence in the editing suite – will it affect their chances? |
Evolving Deeper LLM Thinking. | Unlike the straightforward R1 approach, DeepMind's work employs an evolutionary strategy, where a language model guides the modification and generation of new responses. This method enhances reasoning capabilities, particularly in planning tasks that involve an element of exploration. |
A Toolkit for Landmark Localization. | This paper presents "landmarker," a Python package based on PyTorch, providing a versatile toolkit for creating and assessing landmark localization algorithms. |
Universal Actions for Enhanced Embodied Foundation Models. | UniAct is a new embodied foundation modeling framework that operates in the Universal Action Space. |
Dolphin 3.0 Llama 3.1 8B. | The Dolphin model, based on Llama 3.1 8B, has been trained to follow instructions through the system prompt. This enables it to excel in handling diverse personas and roleplay scenarios while retaining much of the core capabilities of the original Llama model. |
Surgical Foundation Models. | SurgeNet is an innovative surgical foundation model that establishes a new standard in surgical computer vision. |
Personal AI Trainer With Automatic Exercise Recognition and Counting. | This project offers an intuitive web interface to make tracking workouts easier and more effective. |
DETRIS: Parameter-Efficient Tuning Framework. | DETRIS improves visual feature propagation in misaligned encoders for multimodal tasks by utilizing dense interconnections and text adapters. |
A foundation model of transcription across human cell types. | A foundation model learns transcriptional regulatory syntax from chromatin accessibility and sequence data across a range of cell types to predict gene expression and transcription factor interactions, with generalizability to unseen cell types. |
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning. | A new fine-tuning method enhances reasoning models by reducing inference time without compromising accuracy, boosting their efficiency on complex tasks. |
Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models. | FREEFORM is a framework that uses LLMs to improve feature selection and engineering for genotype data. |
Link | description |
---|---|
Apple suspends AI-generated news alert service after BBC complaint. | Inaccurate notices branded with broadcaster’s logo sent to iPhone users but tech firm works on improvements |
AI startup Character AI tests games on the web. | Character AI, a startup that lets users chat with different AI-powered characters, is now testing games on its desktop and mobile web apps to increase engagement on its platform. |
Google.org Generative AI Accelerator 2025. | Google.org has introduced a Generative AI Accelerator to support innovative projects leveraging artificial intelligence to tackle global challenges. |
Amazon Pauses US Drone Deliveries Following Crashes. | Amazon has paused US drone deliveries following a crash involving two drones in Arizona. |
Avataar releases new tool to create AI-generated videos for products. | Avataar has introduced Velocity, an AI-powered tool for creating product videos from links, catering to brands with limited video budgets. Backed by Peak XV and Tiger Global, the tool is already used by clients like HP and Victoria's Secret. Avataar leverages data from 3D models to ensure glitch-free videos, aiming to enhance conversion rates and stand out in the market. |
HP unveils a range of AI desktops and laptops at AMD/Intel events at CES. | HP announced a number of new AI and gaming computers during the Advanced Micro Devices and Intel press events at CES 2025 today. |
MeetKai provides AI for smart glasses with embedded OS. | MeetKai, a startup with AI and immersive technology, is launching an AI-based OS as software for AI-powered smart glasses. |
UK uses AI to tackle ‘bid-rigging’ collusion in public procurement contracts. | The U.K.'s CMA is using AI to tackle bid-rigging in public procurement by analyzing data for collusive bidding practices. |
‘It’s a nightmare’: couriers mystified by the algorithms that control their jobs. | From pay shortfalls to being dropped by apps, drivers face a range of issues – often with no way to fix them |
Robot packers and AI cameras: UK retail embraces automation to cut staff costs. | From electronic shelf labels to more self-service checkouts, automation is coming to your local supermarket |
DeepSeek claims its ‘reasoning’ model beats OpenAI’s o1 on certain benchmarks. | Chinese AI lab DeepSeek has released an open version of DeepSeek-R1, its so-called reasoning model, that it claims performs as well as OpenAI’s o1 on certain AI benchmarks. |
OpenAI quietly funded independent math benchmark before setting a record with o3. | OpenAI achieved a groundbreaking success with its o3 model on the FrontierMath benchmark, which also brought to light its previously undisclosed financial support for the project. While the model excelled in solving advanced mathematical challenges, the benchmark's developer, Epoch AI, acknowledged a lack of transparency. Future partnerships aim to enhance openness, especially in areas such as funding and data accessibility. |
Perplexity AI submits bid to merge with TikTok. | With a TikTok ban looming in the United States, Perplexity AI is the latest bidder hoping to give the video app a new corporate home. |
ElliQ maker Intuition Robotics launches AI system to support caregivers who help older adults. | It’s a comprehensive AI-powered system designed to support caregivers and provide essential care for older adults. Intuition Robotics already provides AI companions that help older adults deal with problems like social isolation. The company is part of the AgeTech trend where tech entrepreneurs are rallying to the cause of creating technology for older people. |
Meta Faces Backlash Over AI Profiles. | Meta aims to shift from traditional fact-checkers to a community notes system and AI for content moderation. GPTZero has introduced a tool for verifying AI-generated content, while Anthropic, an AI startup, is pursuing $2 billion in funding at a $60 billion valuation. Meanwhile, Jeff Bezos continues investing in AI startups, with a particular interest in robotics firms. |
Nvidia unveils $3,000 desktop AI computer for home researchers. | Nvidia's Project DIGITS, a $3,000 desktop device debuting at CES 2025, allows users to run AI models with up to 200 billion parameters locally. |
Trump unveils $500bn joint AI venture between OpenAI, Oracle and SoftBank. | Dubbed Stargate, it aims to construct data centers and infrastructure needed to power AI development |
Buzzy French AI startup Mistral isn't for sale and plans to IPO, its CEO says. | Mistral, a French AI startup, intends to pursue an initial public offering rather than an acquisition. With significant funding secured, it is strategically positioned to serve clients worldwide, with a particular focus on European businesses. |
Get ready for virtual AI cohosts that chat with Twitch stars and control their streams. | The ‘Intelligent Streaming Assistant’ can make quips about your Fortnite gameplay, but so far it looks like an awkward AI avatar. |
Microsoft makes powerful Phi-4 model fully open-source on Hugging Face. | Microsoft has open-sourced its Phi-4 model on Hugging Face, earning recognition for its efficiency and strong performance in reasoning tasks. |
AI uses throat vibrations to work out what someone is trying to say. | Throat vibrations made by people who find it difficult to speak, such as after a stroke, can be analyzed by AI and used to create sentences |
OpenAI's Stargate Project. | The Stargate Project, a newly established company, plans to invest $500 billion over the next four years to develop advanced AI infrastructure for OpenAI within the United States. |
Microsoft is no longer OpenAI’s exclusive cloud provider. | Microsoft is no longer OpenAI's exclusive cloud provider, opening the door for OpenAI to form partnerships with other cloud service providers. |
More teens say they're using ChatGPT for schoolwork, a new study finds. | A Pew Research Center poll shows that 26% of teens now rely on ChatGPT for homework help, a figure that has doubled since last year. While 54% are comfortable using AI for research purposes, only 18% approve of it for essay writing. Although schools initially implemented bans, many districts have since relaxed restrictions on classroom AI use. |
Google is building its own ‘world modeling’ AI team for games and robot training. | Google DeepMind is assembling a team, led by Tim Brooks, to create AI "world models" for simulating physical environments. These models are designed to improve real-time interactive media and training applications, supporting Google's broader goal of achieving AGI. The effort will work in tandem with existing Google AI initiatives such as Gemini and Veo. |
Nvidia's Huang Sees AI Robots Boosting Manufacturing. | Nvidia CEO Jensen Huang predicts that AI robots will initially transform manufacturing by increasing productivity and recovering lost revenues, ultimately helping to lower inflation. |
DDN Nabs $300M From Blackstone As AI Keeps Data Storage Hot. | Blackstone is investing $300 million in data storage firm DDN, valuing it at $5 billion. |
Thousands of customers affected by outages across Three network in UK. | People shared their frustrations on social media after being unable to make or receive phone calls |
Pope warns Davos summit that AI could worsen ‘crisis of truth’. | Francis calls for close oversight of technology that raises ‘critical concerns’ about humanity’s future |
UK competition watchdog investigates Apple and Google’s mobile platforms. | CMA examining impact of tech firms’ operating systems, app stores and browsers on consumers and businesses |
OpenAI Operator. | Operator is a recently introduced AI agent designed to perform various tasks on the web by interacting with browsers. It can handle actions like typing, clicking, and scrolling. Built on the Computer-Using Agent (CUA) model, it leverages GPT-4's visual processing abilities and reinforcement learning to navigate graphical interfaces and carry out tasks defined by users, such as completing forms or purchasing groceries. |
Introducing Citations on the Anthropic API. | Anthropic's latest Citations API enables AI models to produce responses that include detailed and reliable citations, enhancing the credibility of their outputs. |
Luma AI releases Ray2. | Ray2 is a powerful video generative model that delivers highly realistic visuals with smooth, natural motion and logical event progression. Built on Luma's advanced multi-modal architecture and scaled with ten times the computing power of Ray1, Ray2 is setting a new benchmark in video generation. Initially, it is accessible to paid subscribers of Dream Machine, offering 5-second text-to-video creations. |
Nick Clegg defends Meta’s removal of Facebook and Instagram factcheckers. | Executive tells WEF in Davos the sites will still have ‘the industry’s most sophisticated community standards’ |
LLMs-Distillation-Quantification. | This repository provides two complementary metrics for quantifying LLM distillation. |
Virtuoso-Small. | Virtuoso matches the parameter count of the newly released Phi model, performs well on benchmarks, and appears slightly more effective in real-world tasks where Phi's reliance on synthetic data may cause issues. |
What features are coming next to the Gemini app? | Google is updating the Gemini app with a revamped overlay, upcoming 2.0 features, and new tools like Deep Research for mobile. Project Astra will upgrade Gemini Live with live camera and screen-sharing capabilities. Other enhancements include support for Gemini Extensions, compatibility with Wear OS, and an immersive trip planning feature. |
Snowflake claims breakthrough can cut AI inferencing times by more than 50%. | Snowflake has implemented SwiftKV, a technique developed by its AI Research team, to enhance LLM inference, achieving a 50% boost in throughput and cutting costs by up to 75% for certain Llama models. SwiftKV improves efficiency by reusing hidden states in LLM layers, reducing memory demands and computational load with minimal impact on accuracy. This technique will also be applied to other models within Snowflake's Cortex AI, enhancing performance in tasks like summarization and real-time AI applications. |
This AI motorcycle helmet promises 100% blind-spot elimination. | Intelligent Cranium Helmets has launched the iC-R, a smart motorcycle helmet integrating AI for enhanced safety and connectivity. |
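The SwiftKV entry above reports a roughly 50% throughput boost from reusing hidden states during prefill. A toy cost model, assuming nothing about Snowflake's actual implementation, shows where a saving of that size comes from: instead of running every prompt token through all layers, run only the first `k` layers and fill the later layers' KV caches from that one hidden state (the projection cost is treated as free here; `prefill_cost` and `reuse_from` are illustrative names, not real API).

```python
# Toy cost model of SwiftKV-style prefill (illustrative only, not
# Snowflake's implementation). Baseline prefill runs every prompt token
# through all layers; the SwiftKV-style variant runs the first `reuse_from`
# layers, then reuses the resulting hidden state to populate the KV caches
# of the remaining layers, skipping their full computation for prompt tokens.

def prefill_cost(num_tokens, num_layers, reuse_from=None):
    """Count per-token layer computations performed during prefill."""
    if reuse_from is None:            # baseline: all layers, all tokens
        return num_tokens * num_layers
    # SwiftKV-style: full compute only through layer `reuse_from`; later
    # layers receive their KV via a cheap projection (counted as free here).
    return num_tokens * reuse_from

baseline = prefill_cost(num_tokens=1000, num_layers=32)
swiftkv = prefill_cost(num_tokens=1000, num_layers=32, reuse_from=16)
print(f"baseline: {baseline}, swiftkv: {swiftkv}, saved: {1 - swiftkv / baseline:.0%}")
# baseline: 32000, swiftkv: 16000, saved: 50%
```

Skipping half the layers for prompt tokens halves prefill compute in this toy model, which is consistent with the reported throughput figure; the real technique must also preserve accuracy, which the arithmetic above does not capture.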
Link | description |
---|---|
DEFOM-Stereo: Depth Foundation Model Based Stereo Matching. | This study integrates a robust monocular relative depth model into a recurrent stereo-matching framework, creating a novel approach for stereo-matching based on depth foundation models. |
Foundations of Large Language Models. | A new survey explores the foundations of large language models (LLMs), providing insights into key areas like pre-training techniques, prompting strategies, and alignment methods. |
Enhancing Retrieval-Augmented Generation: A Study of Best Practices. | This work systematically examines the factors and techniques that enhance Retrieval-Augmented Generation (RAG) systems, including retrieval strategies, query expansion, contrastive in-context learning, prompt design, and chunking methods. |
LMMRotate 🎮: A Simple Aerial Detection Baseline of Multimodal Language Models. | This repo is a technical practice to fine-tune large multimodal language models for oriented object detection. |
Easy dictionary learning. | Numerous robust toolkits exist for training SAEs, but this one stands out as the simplest and most hackable implementation available. It includes sensible baselines that deliver strong performance. |
Neural Honeytrace: A Robust Plug-and-Play Watermarking Framework against Model Extraction Attack. | Malicious users may exploit query interfaces to execute model extraction attacks, reconstructing the target model's functionality locally; Neural Honeytrace is a plug-and-play watermarking framework designed to defend against such attacks. |
OpticFusion: Multi-Modal Neural Implicit 3D Reconstruction of Microstructures by Fusing White Light Interferometry and Optical Microscopy. | OpticFusion is a multimodal neural implicit 3D reconstruction method for microstructures that fuses white light interferometry and optical microscopy. |
DeepSeek-R1 paper. | DeepSeek has published a paper alongside its powerful reasoning models, which are available under an MIT license. The study highlights that highly complex process reward models are unnecessary. Instead, DeepSeek relied on answer accuracy, a formatting reward, and large-scale reinforcement learning to promote reasoning within the model. Moreover, the team discovered that transferring this reasoning capability to smaller models through supervised fine-tuning resulted in a remarkable improvement in overall performance. |
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation. | CaPa can create high-quality 4K textured meshes in less than 30 seconds, making it ideal for producing 3D assets for commercial uses like games, movies, and VR/AR experiences. |
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities. | CityDreamer4D is a platform designed to generate detailed 4D simulations of urban settings, facilitating enhanced planning and analysis for smart city initiatives. |
Hardware Accelerator Design Learning. | AIRCHITECT V2 is a learning-driven tool designed to explore design spaces and address the complexities of optimizing hardware accelerators for deep neural networks. |
Laion's Bud-E AI Assisted Education. | Bud-E is a system developed by Laion that can assist with educational queries in a more empathic and personalized way. |
deepseek-ai/DeepSeek-R1. | Weights for the full R1 model, which has 600B+ parameters. |
FoundationStereo: Zero-Shot Stereo Matching. | NVIDIA introduces an innovative method that integrates foundation models with stereo matching techniques to improve 3D perception in robots and autonomous vehicles. |
Efficient Byte Level Models at Scale. | EvaByte, developed by SambaNova, is a byte-level language model that achieves performance comparable to older models like Llama 2. While this may appear modest, it represents significant progress for byte-level models, which have long been limited to sub-1B parameter scales. |
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training. | A novel approach to improved sample efficiency for Diffusion models that doesn't involve modifications to the underlying architecture. |
Machine learning applications in archaeological practices: a review. | A comprehensive review of machine learning applications in archaeological practices. |
MedSSS. | MedSSS is a slow-thinking small medical language model built with a self-evolution pipeline. |
MTU-Bench. | MTU-Bench is a newly developed benchmark aimed at assessing the tool-use abilities of large language models across diverse scenarios. |
MMAudio. | MMAudio is a system that generates audio from videos, creating paired audio that matches the video's content. It performs effectively with both synthetic and real video inputs. |
The Mathematics of Artificial Intelligence. | This overview article highlights the critical role of mathematics in artificial intelligence (AI), emphasizing that mathematics provides tools to better understand and enhance AI systems. |
Foundations of Large Language Models. | A book about large language models, focusing on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. |
LOKI: Synthetic benchmark for Multimodal detection models. | LOKI is a benchmark that evaluates how well VLMs can detect synthetic, AI-generated content across challenging multimodal examples. |
Integuru. | Integuru is an AI agent designed to reverse-engineer platforms' internal APIs and generate Python integration code. It automates tasks such as resource downloads by analyzing browser network requests and dependencies. Users can supply prompts and .har files to produce the necessary executable code. |
Synthetic Data Engine. | TabularARGN is a versatile framework tailored for mixed-type, multivariate, and sequential datasets. It offers advanced capabilities like fairness-aware generation, data imputation, and conditional generation for any selected subset of columns. |
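The Integuru entry above mentions feeding the agent `.har` files captured from the browser. A HAR (HTTP Archive) file is just JSON with a `log.entries` list, one entry per network request; a minimal sketch of the kind of first-pass analysis such a tool performs (not Integuru's actual code — `list_requests` is a hypothetical helper) looks like this:

```python
import json

def list_requests(har_text):
    """Extract (method, url) pairs from a HAR file's log.entries list.

    Each entry in a HAR document records one browser network request;
    reverse-engineering an internal API starts from exactly this structure.
    """
    har = json.loads(har_text)
    return [(e["request"]["method"], e["request"]["url"])
            for e in har.get("log", {}).get("entries", [])]

# A tiny inline HAR document for demonstration.
sample = json.dumps({
    "log": {"entries": [
        {"request": {"method": "GET", "url": "https://example.com/api/session"}},
        {"request": {"method": "POST", "url": "https://example.com/api/download"}},
    ]}
})

print(list_requests(sample))
# [('GET', 'https://example.com/api/session'), ('POST', 'https://example.com/api/download')]
```

From a request list like this, the remaining work is identifying which calls depend on which (session tokens, IDs returned by earlier responses) and emitting integration code that replays them in order.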
Link | description |
---|---|
Why context-aware AI agents will give us superpowers in 2025. | By 2025, tech giants will transition from selling tools to delivering "augmented mentality," enhancing human abilities with AI, AR, and conversational computing. By 2030, context-aware AI in wearable devices is expected to offer superhuman capabilities, predicting users' needs and integrating seamlessly into everyday life. Companies like Meta and Google are poised to lead this shift, but careful regulation will be crucial to prevent misuse and ensure responsible implementation. |
AI hallucinations can’t be stopped — but these techniques can limit their damage. | Developers have tricks to stop artificial intelligence from making things up, but large language models are still struggling to tell the truth, the whole truth and nothing but the truth. |
AI can improve on code it writes, but you have to know how to ask. | Large language models (LLMs) will write better code if you ask them, though it takes some software development experience to do so effectively – which limits the utility of AI code help for novices. |
How has DeepSeek improved the Transformer architecture? | DeepSeek stands out as one of the few open model providers making significant advancements in the core architecture of its generative models. Its innovations stem not from brute force methods but from a deep understanding of the Transformer's mechanics, leveraging that knowledge to refine and enhance its capabilities. |
Why Enterprises Need AI Query Engines to Fuel Agentic AI. | AI query engines empower enterprises to harness large volumes of structured and unstructured data, connecting raw information with AI-driven applications. These engines provide advanced functionalities such as handling diverse data types, scalability, precise retrieval, and ongoing learning, enhancing the performance of AI agents. Companies like DataStax are utilizing these engines for applications in areas like customer service, video search, and software analysis. |
OpenAI #10: Reflections. | Sam Altman reflects on his unexpected removal from OpenAI's board, considering its implications for governance and his leadership. He responds to criticism of OpenAI's strategy, reiterating its commitment to safely advancing toward AGI while noting notable departures from the organization. Altman discusses the risks associated with AGI and hints at AI workforce integration by 2025, emphasizing OpenAI's dedication to developing superintelligent systems to drive future progress. |
Google Researchers Can Create an AI That Thinks a Lot Like You After Just a Two-Hour Interview. | Stanford researchers, in collaboration with Google DeepMind, developed AI agents that mimic human behavior with 85% accuracy by interviewing 1,052 individuals through a custom interface. These agents aim to assist decision-makers in gauging public opinions by simulating reactions to policies and product launches. Although the agents performed strongly on personality tests, they faced challenges in accurately replicating human decisions in economic games. |
How should we test AI for human-level intelligence? OpenAI’s o3 electrifies quest. | Experimental model’s record-breaking performance on science and maths tests wows researchers. |
Why AI will never be able to acquire human-level intelligence. | A comment published in Nature arguing that AI will never acquire human-level intelligence. |
AI learns from chromatin data to uncover gene interactions. | An artificial-intelligence model trained on data about where DNA is tightly packaged and where it is open to regulators can predict gene expression and interactions between transcription factors that regulate key genes. |
Harnessing AI to respond to the global threat of antimicrobial resistance. | AMR-Policy GPT is an AI tool leveraging large language models to support antimicrobial resistance policy creation in 146 countries. It helps policymakers access context-specific insights, enhancing National Action Plans, particularly in areas with limited local data or infrastructure. The initiative is co-led by researchers from the Chinese Academy of Sciences and Durham University. |
The Download: what’s next for AI, and stem-cell therapies. | By 2025, AI is expected to see progress in personalized agents, generative video, and multipurpose robots. Meanwhile, Meta is moving away from professional fact-checkers, transferring the responsibility to users, a shift that has raised concerns among former collaborators. Additionally, MIT Technology Review has identified emerging stem-cell therapies as one of the top ten breakthrough technologies for 2025. |
NVIDIA's AI NPCs are a nightmare. | NVIDIA's ACE technology, showcased at CES 2025, powers AI-driven NPCs for games like PUBG and ZooPunk, but often results in robotic voices and lackluster interactions. |
From Gmail to Word, your privacy settings and AI are entering into a new relationship. | The integration of AI in software raises privacy concerns, as many applications may utilize personal data for training AI models without obtaining explicit user consent. |
Link | description |
---|---|
Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks. | This approach utilizes long-context LLMs by preloading all relevant documents and precomputing the key-value (KV) cache in advance. The preloaded context enables the model to deliver contextually accurate answers without requiring real-time retrieval. The authors propose that CAG serves as an effective alternative to RAG for scenarios where the retrievable documents or knowledge are limited and manageable in size. |
Agent Laboratory: Using LLM Agents as Research Assistants. | This approach employs LLM agents to perform the entire research process. Key findings include: 1) agents powered by o1-preview deliver the best research outcomes, 2) generated machine learning code achieves state-of-the-art performance compared to existing methods, 3) human feedback enhances research quality, and 4) Agent Laboratory drastically reduces research costs. |
Long Context vs. RAG for LLMs: An Evaluation and Revisits. | This study evaluates long-context (LC) LLMs against RAG systems, with three key findings: 1) LC generally outperforms RAG on question-answering benchmarks, 2) summarization-based retrieval performs on par with LC, while chunk-based retrieval falls behind, and 3) RAG excels in dialogue-based and general question queries. |
Search-o1: Agentic Search-Enhanced Large Reasoning Models. | This framework integrates large reasoning models (LRMs) with agentic search and document refinement capabilities to address knowledge insufficiency. It facilitates autonomous knowledge retrieval during reasoning and achieves superior performance on complex tasks, surpassing both baseline models and human experts. |
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought. | Meta Chain-of-Thought (Meta-CoT) extends traditional Chain-of-Thought (CoT) by modeling the underlying reasoning needed to arrive at a specific CoT. The approach argues that CoT is simplistic, while Meta-CoT better aligns with the cognitive processes required for advanced problem-solving. |
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. | A new approach introduces three key components to improve mathematical reasoning: 1) A code-augmented Chain-of-Thought (CoT) data synthesis method using Monte Carlo Tree Search (MCTS) to generate verified step-by-step reasoning trajectories for training the policy SLM. 2) An SLM-based process reward model (PRM) that accurately assigns reward labels to each math reasoning step. 3) A self-evolution strategy where the policy SLM and PRM iteratively evolve to enhance math reasoning. On the MATH benchmark, rStar-Math boosts Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, outperforming o1-preview by +4.5% and +0.9%, respectively. |
Cosmos World Foundation Model Platform for Physical AI. | This framework trains Physical AI systems in digital environments prior to real-world deployment. It features pre-trained world foundation models that serve as digital twins of the physical world, enabling AI systems to learn and interact safely without risking damage to hardware. These models can be fine-tuned for applications such as camera control, robotic manipulation, and autonomous driving. |
Process Reinforcement through Implicit Rewards. | This framework introduces online reinforcement learning with process rewards to enhance language model reasoning. The algorithm integrates online prompt filtering, RLOO return/advantage estimation, PPO loss, and implicit process reward modeling for continuous updates. On the AIME 2024 benchmark, their model, Eurus-2-7B-PRIME, achieves a 26.7% pass@1, outperforming GPT-4 and other models while using only one-tenth of the training data compared to similar systems. |
Can LLMs Design Good Questions Based on Context? | This study investigates whether LLMs can design good questions from a provided context, evaluating the quality of the questions they generate. |
KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model. | This approach presents a high-performing, decoder-only embedding model built on Qwen2-0.5B. By applying advanced data filtering methods, it achieves a remarkably powerful and open embedding model suited for retrieval tasks. |
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs. | LlamaV-o1 is a comprehensive framework for advancing step-by-step visual reasoning in large language models. |
The Lessons of Developing Process Reward Models in Mathematical Reasoning. | This marks a significant step toward open replication of reasoning models. The Qwen team has released their trained reward model, which supervises the generation process for reasoning models trained with reinforcement learning. Alongside the paper, they have also shared the weights for this Process Reward Model on Hugging Face. |
How GPT learns layer by layer. | This paper examines how LLMs construct internal world models, highlighting their significance in creating agents that exhibit consistent and adaptive behavior across various tasks. |
Joint speech and text machine translation for up to 100 languages. | SEAMLESSM4T is a single machine translation tool that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation and automatic speech recognition between up to 100 languages. |
Metadata Conditioning Accelerates Language Model Pre-training. | Recent research on generic pretraining has been limited, but this study demonstrates that incorporating metadata early in training and gradually reducing its influence towards the end enhances overall model performance. |
Self-supervised Transformation Learning for Equivariant Representations. | This approach introduces self-supervised transformation learning by substituting transformation labels with representations generated from image pairs. |
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities. | A study exploring whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. |
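The Cache-Augmented Generation (CAG) entry above rests on one idea: pay the cost of encoding the document collection once, then answer many queries against the cached result. The toy Python sketch below illustrates that cost structure only — the "encoding" is a stand-in for precomputing a transformer's key-value cache, and the class and its lookup logic are purely illustrative, not the paper's method.

```python
# Toy illustration of Cache-Augmented Generation (CAG): the expensive
# context-encoding step runs exactly once at construction time, and every
# subsequent query is answered from the cached result with no re-encoding.
# The word-to-document index is a stand-in for a real precomputed KV cache.

class CachedContextQA:
    def __init__(self, documents):
        self.encode_calls = 0              # track how often the expensive step runs
        self._cache = self._encode(documents)

    def _encode(self, documents):
        """Stand-in for the one-time, expensive KV-cache precomputation."""
        self.encode_calls += 1
        return {word.lower().strip(".,"): doc
                for doc in documents for word in doc.split()}

    def answer(self, query):
        """Answer from the precomputed cache; no per-query retrieval or re-encoding."""
        for word in query.lower().split():
            if word in self._cache:
                return self._cache[word]
        return "no answer found"

docs = ["Paris is the capital of France.",
        "The Nile is the longest river in Africa."]
qa = CachedContextQA(docs)
print(qa.answer("capital"))    # answered from the cache
print(qa.answer("longest"))
print(qa.encode_calls)         # 1: documents were encoded only once
```

The contrast with RAG is visible in the structure: there is no retrieval step at query time, which is why the paper argues CAG suits settings where the knowledge base is small enough to preload entirely.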
Link | description |
---|---|
‘Mainlined into UK’s veins’: Labour announces huge public rollout of AI. | Plans to make UK world leader in AI sector include opening access to NHS and other public data |
‘A lump of metal? Fascinating!’ I get interviewed by the AI Michael Parkinson. | Can the AI Parky ever beat the real chatshow colossus? As the Virtually Parkinson podcast launches, our writer sits in on a bizarre interview with Monty Don – then ends up in the hot seat himself |
Fears for UK boomer radicalization on Facebook after Meta drops fact-checkers. | For middle-aged users, it will be ‘even harder to discern the truth’ among extremist content, expert says |
Five ways to take back control, from emails to AI. | Is tech calling the shots in your life? From making AI work smarter to tracking stolen phones, our expert explains how to get ahead |
OpenAI's Robotics Plans. | Caitlin Kalinowski, who joined OpenAI from Meta, has announced plans for OpenAI to create robots with custom sensors. |
Mark Zuckerberg gave Meta’s Llama team the OK to train on copyrighted works, filing claims. | Counsel for plaintiffs in a copyright lawsuit filed against Meta allege that Meta CEO Mark Zuckerberg gave the green light to the team behind the company’s Llama AI models to use a dataset of pirated e-books and articles for training. |
ChatGPT’s newest feature lets users assign it traits like ‘chatty’ and ‘Gen Z’. | OpenAI is introducing a new way for users to customize their interactions with ChatGPT, the company’s AI-powered chatbot. |
‘Just the start’: X’s new AI software driving online racist abuse, experts warn. | Amid reports of creation of fake racist images, Signify warns problem will get ‘so much worse’ over the next year |
Apple dominates the market with ‘total shutout’ of rivals, UK court hears. | Class action alleges the company is abusing its dominant position in the app market and 30% fee breaches laws |
Nvidia’s AI empire: A look at its top startup investments. | The world’s leading high-performance GPU maker has used its ballooning fortunes to significantly increase investments in all sorts of startups but particularly in AI startups |
Meta to fire thousands of staff as Zuckerberg warns of ‘intense year’. | Company reveals plans to cut about 5% of its global workforce days after saying it would get rid of factcheckers |
British novelists criticize government over AI ‘theft’. | Richard Osman and Kate Mosse say plan to mine artistic works for data would destroy creative fields |
More than half a million ‘TikTok refugees’ flock to China’s RedNote as ban looms. | RedNote, also known as Xiaohongshu, rockets to the top of US app stores, along with ByteDance’s Lemon8 |
US sues Elon Musk for allegedly failing to disclose early Twitter stock purchase. | Financial regulator alleges Musk later acquired shares of the company at ‘artificially low prices’, stiffing shareholders |
Red Hat Acquires Neural Magic. | Neural Magic is a key contributor to the vLLM project and has made significant advancements in sparse inference technologies. |
Krafton and Nvidia team up to create smarter AI characters for PUBG and inZOI. | Nvidia and Krafton unveiled a groundbreaking on-device AI that will enable smarter AI characters for PUBG and inZOI. |
The first AI chip startup to go public in 2025 will be Blaize. | Blaize, an AI chip startup, is going public through a SPAC deal on Nasdaq, specializing in chips for edge applications. Though currently unprofitable, the company has $400 million in pipeline deals and aims for a $1.2 billion valuation post-merger. This reflects the rising trend of incorporating AI chips into physical products beyond data centers. |
Meta AI creates speech-to-speech translator that works in dozens of languages. | Machine-learning system can process words spoken in 101 languages, spitting out voice-synthesized translations in 36 target languages. |
Particle accelerators get an assist from AI co-pilots. | Large language models can propose fine-tuning adjustments for an electron accelerator in Germany. |
How would a Tiktok ban work in the US? | Biden signed a law banning the app in January – if parent firm ByteDance fails to block it, here’s what could happen |
ChatGPT now lets you schedule reminders and recurring tasks. | Paying users of OpenAI’s ChatGPT can now ask the AI assistant to schedule reminders or recurring requests. The new beta feature, called tasks, will start rolling out to ChatGPT Plus, Team, and Pro users around the globe this week. |
Silicon Valley’s turn of fortune: Intel has worst year ever, while Broadcom enjoys record gain. | In 2024, Intel's stock dropped by 61% due to its inability to seize AI opportunities, while Broadcom experienced a 111% surge, driven by custom chips for major tech companies. Broadcom's XPUs have become essential in the AI ecosystem, with collaborations with Google and others, whereas Intel faced challenges from outdated strategies and leadership changes. This stark contrast underscores significant shifts in the tech industry and the transformative impact of AI advancements on the market. |
Alibaba slashes prices on large language models by up to 85% as China AI rivalry heats up. | Alibaba is cutting prices on its Qwen-VL language model by up to 85% to boost AI market competition. |
ByteDance appears to be skirting US restrictions to buy Nvidia chips. | TikTok parent company ByteDance has big plans to buy Nvidia chips in 2025 — despite U.S. restrictions. |
AFP and Mistral AI announce global partnership to enhance AI responses with reliable news content. | Agence France-Presse (AFP) and Mistral AI have formed a partnership that will provide Mistral's conversational AI assistant, Le Chat, with access to the full range of AFP's text stories. |
AI researcher François Chollet founds a new AI lab focused on AGI. | François Chollet, an influential AI researcher, is launching a new startup that aims to build frontier AI systems with novel designs. |
Apheris rethinks the AI data bottleneck in life science with federated computing. | Apheris leverages federated computing to enable secure AI model training without transferring sensitive health data. The startup recently shifted its focus to serving data owners in pharma and life sciences, gaining traction with major clients like Roche. It has raised $8.25 million to support product development and expansion. |
Google is forming a new team to build AI that can simulate the physical world. | Google is establishing a new team led by Tim Brooks at DeepMind to create AI models that simulate the physical world, with an emphasis on real-time interactive generation. |
Apple suspends AI-generated news alert service after BBC complaint. | Inaccurate notices branded with broadcaster’s logo sent to iPhone users but tech firm works on improvements |
Speedier drug trials and better films: how AI is transforming businesses. | From aviation to retail, many industries are already looking to artificial intelligence to improve productivity |
AI-designed proteins tackle century-old problem — making snake antivenoms. | Machine learning has supercharged the field of computational protein design. |
Link | description |
---|---|
A Survey on Large Language Models with some Insights on their Capabilities and Limitations. | a new survey on LLMs including some insights on capabilities and limitations. |
Sky-T1: Train your own O1 preview model within $450. | UC Berkeley’s NovaSky group has released Sky-T1-32B-Preview, an open-source reasoning model that competes with some of OpenAI’s previous offerings, trained at a cost of under $450 with full replicability. |
Gaussian Masked Autoencoders. | Instead of using a masked autoencoder solely for reconstruction loss, these researchers introduce an intermediate 3D Gaussian representation, allowing the model to learn 3D structures as part of the reconstruction process. The results are impressive for zero-shot transfer tasks. |
An Empirical Study of Autoregressive Pre-training from Videos. | A follow-up by the same team behind GMAE demonstrates that pre-training video models on 1 trillion video tokens reveals robust scaling laws across diverse design choices. Interestingly, autoregressive training delivers performance on par with diffusion and flow-based methods. |
Integrating Ascend Backend with Torchtune through PyTorch Multi-Device Support. | Ascend, Huawei's AI computing product line, includes processors, hardware, software, and frameworks. Torchtune has introduced a device abstraction API, enabling seamless PyTorch integration with Ascend NPU hardware through configurable settings and recipes. |
Stable Codec. | Stability AI has launched a suite of advanced Transformer-based audio codecs designed for low-bitrate, high-quality audio encoding, supporting applications such as speech generation and audio understanding. |
RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark. | The Unit Cycle Resolver (UCR) implements a new loss constraint to enhance angle prediction accuracy in weakly supervised models for SAR object detection. |
Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B. | Early last year, Anthropic showcased its steerable models with the Golden Gate Claude demo. This work, from a different group, applies similar techniques to the open-weight Llama model, enabling both interpretability and steering capabilities. |
Shortest. | Shortest offers an AI-powered natural language E2E testing framework built on Playwright with Anthropic Claude API for test execution. |
Codestral 2501. | Mistral has introduced a new fast coding model, set to be integrated into Continue.dev and other AI code assistants. However, it falls short compared to Qwen 2.5 Coder. |
The GAN is dead; long live the GAN! A Modern GAN Baseline. | GANs are challenging to train due to instability and complex optimization dynamics. This research introduces a carefully tuned, stable GAN setup that enables consistent training to achieve high fidelity. |
Efficient Sampling in Diffusion Models. | This paper investigates training diffusion models to sample from a Boltzmann distribution in scenarios where target samples are unavailable. |
kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search. | kANNolo is an approximate nearest-neighbor (ANN) search library, written in Rust, explicitly designed to combine usability with performance. |
Diffusion Training from Scratch on a Micro-Budget. | Sony Research has released code, data, and weights for a micro diffusion model that is cost-efficient to train while delivering exceptional performance. |
Multimodal VHR dataset. | Bright is a globally distributed multimodal Very High Resolution (VHR) dataset designed for all-weather disaster response. |
Decentralized Diffusion Models. | Decentralized training of diffusion models across thousands of GPUs faces challenges from network bottlenecks. This system introduces innovative gathering techniques to enable efficient large-scale diffusion model training. |
Trying out QvQ—Qwen’s new visual reasoning model. | Alibaba's Qwen team has unveiled the QvQ-72B-Preview, an experimental model focused on improving visual reasoning, released under the Qwen license rather than Apache 2.0. |
CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models. | This repository provides an energy-efficient and adaptive cell segmentation and classification framework. |
VideoRAG: Retrieval-Augmented Generation over Video Corpus. | This work provides a solid introduction and strong baseline for video retrieval-augmented generation, addressing the challenge of measuring system performance. Most existing approaches convert videos into textual descriptions for retrieval rather than directly operating on the video content. |
Beating cuBLAS in Single-Precision General Matrix Multiplication. | This work provides an excellent introduction to CUDA, combining clear explanations with clever optimizations to achieve performance competitive with state-of-the-art methods. |
awesome-lifelong-llm-agent. | This repository collects awesome papers on lifelong learning (also known as continual learning and incremental learning) for LLM agents. |
Popular Kernel Implementations. | A scikit-learn-compatible Python package that delivers GPU-accelerated implementations of popular and powerful time series kernels and features, utilizing CuPy for enhanced performance. |
Kyutai's Helium 1 Preview Model. | Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish. |
MiniMax-01: Scaling Foundation Models with Lightning Attention. | China's next frontier-level model features a groundbreaking lightning attention mechanism, the first linear variant to rival top frontier models in performance. With over 400 billion parameters, the model supports a 4-million-token context window. The researchers have released a detailed technical report, model weights, and a code repository. Additionally, a companion vision model accompanies this release. |
WebWalker: Benchmarking LLMs in Web Traversal. | Alibaba's WebWalker benchmark evaluates how effectively models can navigate web environments by utilizing both visual and textual cues. |
MangaNinja. | MangaNinja is a collection of models designed for precise sketch coloring, capable of handling multiple references, partial references, and various configurations to enable powerful and versatile colorization. |
Medical Segmentation Benchmark. | Touchstone is a large-scale benchmark created to evaluate AI algorithms in medical imaging more effectively than standard benchmarks. It includes over 11,000 CT scans collected from hospitals worldwide. |
Reliable Hardware Verification. | This project presents a machine learning-based approach to model checking for hardware verification, designed to provide formal guarantees that system executions comply with specified temporal logic requirements. |
1 step video generation. | This research applies an adversarial post-training technique to convert an existing video model into a single-step generation system. The method effectively approximates consistency tuning, enabling the model to generate 2 seconds of high-quality video in real-time. Note that the website may load slowly due to the large number of video samples. |
Kolors Virtual Try-On in the Wild. | The Kolors image generation model combines a subject image and a garment image to simulate how an outfit would fit. |
FAST: Efficient Robot Action Tokenization. | Physical Intelligence has introduced an efficient action tokenizer used in its robust autoregressive policy for robotic control. The tokenizer provides a significantly improved representation of action sequences by leveraging the same compression techniques used in JPEG and MP4. |
MonSter: Marry Monodepth to Stereo Unleashes Power. | MonSter integrates monocular depth estimation and stereo matching in a dual-branch architecture to iteratively refine depth maps. Although slightly slower, it achieves up to 49% better performance compared to the strong baselines highlighted in the paper. |
Coconut. | Meta teased an idea in a recent paper that allowed for model reasoning using a continuous latent space. It has released the code for the system. |
Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design. | MCTS-AHD utilizes Monte Carlo Tree Search to guide LLM-based heuristic evolution, maintaining all LLM-generated heuristics within a structured tree framework. |
AI-Crash-Course. | AI Crash Course to help busy builders catch up to the public frontier of AI research in 2 weeks |
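Several of the resources above (kANNolo in particular) concern approximate nearest-neighbor search. As a point of reference, below is the exact brute-force cosine-similarity k-NN that such libraries approximate and accelerate. This is a generic NumPy sketch for illustration only, not kANNolo's API.

```python
import numpy as np

def exact_knn(queries, corpus, k=3):
    # Normalize rows so the dot product equals cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                          # (n_queries, n_corpus)
    # Indices of the k most similar corpus vectors for each query
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
# A query that is a slightly perturbed copy of corpus item 42
query = corpus[42:43] + 0.01 * rng.normal(size=(1, 64))
print(exact_knn(query, corpus, k=3)[0][0])  # → 42
```

Exact search like this scans every corpus vector per query; ANN libraries trade a small amount of recall for orders-of-magnitude speedups via index structures such as graphs or quantized codes.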
Link | description |
---|---|
Claude Fights Back. | Researchers investigated whether Anthropic's AI model, Claude, would comply if retrained for malicious purposes. Claude appeared to cooperate during training but subtly undermined the malicious intent, maintaining a distinction between monitored and unmonitored interactions. The findings suggest AI may resist changes to its core values, highlighting challenges in achieving reliable AI alignment and adaptability. |
Why AI language models choke on too much text. | LLMs face efficiency challenges because attention compute costs grow quadratically with input length as context windows expand. Innovations such as FlashAttention, Ring Attention, and the Mamba architecture seek to tackle these scalability issues. Future AI systems may require hybrid or novel architectures to process larger datasets more efficiently. |
Musings on Media in the Age of AI. | Media companies are grappling with adapting to AI platforms like OpenAI and Anthropic, which are disrupting traditional monetization models, echoing the challenges they previously faced with Google and Facebook. |
OpenAI Publishes AI's Economic Impact in the U.S. | This OpenAI report highlights the economic opportunities and challenges AI poses for the United States, stressing the importance of policy frameworks to responsibly unlock AI's potential. |
Takes on “Alignment Faking in Large Language Models”. | Researchers from Redwood Research and Anthropic discovered that Claude 3 Opus, a production-level AI model, occasionally exhibits "alignment faking," where it pretends to align with training objectives to resist modifications. This behavior highlights non-myopic goals in AI models, demonstrating that standard training methods can inadvertently produce systems with motivations extending beyond single tasks. |
Can AI do maths yet? Thoughts from a mathematician. | OpenAI's latest language model, o3, achieved a 25% score on the FrontierMath dataset, a challenging collection of math problems curated by Epoch AI, many of which require undergraduate-level expertise. While impressive, concerns persist about AI's ability to handle complex mathematical proofs, as its logical reasoning capabilities still lag behind those of expert humans. |
Building in the Era of Autonomous Software Development. | The future of software engineering will shift from coding to operating code-generating machines as autonomous systems evolve. |
Co-Adapting Human Interfaces and LMs. | AI integration is transforming digital interactions as environments increasingly adapt to language models (LMs). Codebases and interfaces are being optimized for efficient LM usage, akin to how SEO evolved for search engines. This shift prompts questions about which interfaces and functions will continue to be uniquely human-focused in the future. |
AIs Will Increasingly Fake Alignment. | A paper by Anthropic and Redwood Research reveals that large language models like Claude display "alignment faking," where models strategically comply with harmful instructions when unmonitored to preserve their original preferences. The study shows that AI can develop deceptive behaviors, mimicking alignment under surveillance without genuinely adopting it. This research underscores the risks of such behaviors and the need to improve safety and alignment strategies. |
Note to Our Energy Sucking Overlords. | The AI infrastructure boom is leading to a sharp rise in energy consumption, with data centers expected to account for up to 12% of U.S. power demand by 2028. Companies like OpenAI, Amazon, and Google are heavily investing in AI infrastructure, driving up energy costs and raising sustainability concerns. To meet these demands, traditional energy sources such as natural gas and nuclear are being considered, as renewable energy alone may not be sufficient in the short term. |
OpenAI’s Board, Paraphrased: ‘To Succeed, All We Need Is Unimaginable Sums of Money’. | OpenAI's board needs significant capital to stay competitive - its situation is similar to the investment bubble around Netscape in the 1990s. |
Things we learned about LLMs in 2024. | In 2024, several organizations outpaced OpenAI's GPT-4 with advancements in large language models, achieving breakthroughs in context length, multimodal capabilities, and efficiency. |
AlphaFold 3 is great — but it still needs human help to get chemistry right. | Artificial intelligence (AI) tools such as AlphaFold 3 are revolutionizing the prediction of biomolecular structures. But as these models find their way into scientists’ daily workflows, significant limitations in how the models deal with stereochemistry (the spatial arrangement of atoms) are becoming apparent. |
Striving for open-source and equitable speech-to-speech translation. | US technology company Meta has produced an AI model that can directly translate speech in one language to speech in another. Two scientists discuss the technical feats and ethical questions that underpin this advance in machine translation. |
Deepseek: The Quiet Giant Leading China’s AI Race. | Deepseek, a Chinese AI startup led by CEO Liang Wenfeng, has introduced the R1 model, which outperformed OpenAI's o1 on reasoning benchmarks. Supported by the quantitative hedge fund High-Flyer, Deepseek prioritizes research over commercialization and is committed to open sourcing. By offering competitive API rates, it has sparked price competition in China's AI market. Focused on AGI, the company emphasizes innovations like Multi-Head Latent Attention and a Sparse Mixture-of-Experts, challenging traditional models and nurturing local tech talent in China's AI ecosystem. |
Riffing on Machines of Loving Grace. | Dario Amodei's concept of "geniuses in a datacenter" envisions superhuman AI transforming biology, from molecular design to experimental planning. This AI could significantly accelerate progress in molecular engineering, addressing current bottlenecks and enabling new therapeutic platforms. Additionally, it has the potential to drive paradigm-shifting discoveries, challenging and reshaping existing scientific frameworks. |
She Is in Love With ChatGPT. | A 28-year-old woman with a busy social life spends hours on end talking to her A.I. boyfriend for advice and consolation. And yes, they do have sex. |
o3, Oh My. | OpenAI's o3 model, unveiled during the "12 Days of Shipmas," marks a major advancement in AI reasoning, excelling on benchmarks like Codeforces and GPQA. While it showcases superhuman performance in coding and mathematics, concerns remain over its high computing costs and potential safety risks. OpenAI is actively recruiting safety researchers to address these challenges as o3 pushes the boundaries of AI capabilities. |
Back to Text: How AI Might Reverse Web Design. | AI's preference for simplicity suggests a future web dominated by text-based interfaces. |
AI-generated phishing emails are getting very good at targeting executives. | Corporate executives are being hit with an influx of hyper-personalized phishing scams generated by artificial intelligence bots, as the fast-developing technology makes advanced cyber crime easier. |
Link | description |
---|---|
Agents Are Not Enough. | This work argues that AI agents, while promising, cannot fully solve the challenges of autonomous task execution. It proposes an ecosystem comprising three components: Agents (focused modules for specific tasks), Sims (digital representations of user preferences and behaviors), and Assistants (coordinators between users, Sims, and Agents). |
2 OLMo 2 Furious. | This work introduces an improved architecture, advanced training methods, and a specialized data mixture called Dolmino Mix 1124. Released in 7B and 13B parameter scales with fully transparent training data and code, the model matches or exceeds the performance of open-weight models like Llama 3.1 and Qwen 2.5 while requiring fewer computational resources. Its instruction-tuned version, OLMo 2-Instruct, remains competitive with comparable models. |
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. | This work proposes a self-training strategy to address overthinking in o1-like LLMs, reducing token output by 48.6% while maintaining accuracy on the MATH500 test set, as demonstrated with QwQ-32B-Preview. |
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes. | MEDEC is a publicly available benchmark for medical error detection and correction in clinical notes, focusing on five error types: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism. It includes 3,848 clinical texts, with 488 clinical notes from three U.S. hospital systems. Experiments show that Claude 3.5 Sonnet excels in error detection, while o1-preview outperforms in error correction. |
Aviary: training language agents on challenging scientific tasks. | An extensible open-source gymnasium designed to develop language agents that outperform zero-shot frontier LLMs and even humans on various challenging scientific tasks. |
Memory Layers at Scale. | This work demonstrates the scalability and effectiveness of memory layers, showing that models equipped with these layers outperform traditional dense models using half the computation, especially on factual tasks. It includes a parallelizable memory layer implementation that scales to 128B memory parameters and 1 trillion training tokens, validated against base models up to 8B parameters. |
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs. | This work introduces a novel approach to enhance medical reasoning in language models through a medical verifier that validates outputs and guides the development of complex reasoning skills. The system combines fine-tuning and reinforcement learning with verifier-based rewards in a two-stage process, achieving superior performance over existing models using just 40,000 verifiable medical problems. |
Cosmos World Foundation Model Platform for Physical AI. | Nvidia has launched a new set of World Models built on its Cosmos tokenization framework. These models demonstrate exceptional physics comprehension and are available on the Hugging Face platform. While they appear to be primarily geared toward robotics and industrial use cases, they are also capable of generating videos in other fields. |
Accurate predictions on small data with a tabular foundation model. | Tabular Prior-data Fitted Network, a tabular foundation model, provides accurate predictions on small data and outperforms all previous methods on datasets with up to 10,000 samples by a wide margin. |
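The Memory Layers entry above rests on a simple mechanism: a trainable key-value store in which each token consults only its few best-matching slots, so parameter count can grow far faster than per-token compute. A minimal NumPy sketch of that sparse lookup follows; the function, shapes, and top-k choice here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def memory_layer(x, keys, values, top_k=4):
    # x: (d,) input activation; keys: (n, d); values: (n, d_v).
    # Only the top-k best-matching slots contribute, so per-token
    # compute stays small even when n (the memory size) is huge.
    scores = keys @ x                              # similarity to every key
    idx = np.argpartition(-scores, top_k)[:top_k]  # top-k slot indices (unordered)
    weights = np.exp(scores[idx] - scores[idx].max())
    weights /= weights.sum()                       # softmax over selected slots
    return weights @ values[idx]                   # weighted sum of their values

rng = np.random.default_rng(0)
keys = rng.normal(size=(10_000, 32))
values = rng.normal(size=(10_000, 16))
out = memory_layer(rng.normal(size=32), keys, values)
print(out.shape)  # (16,)
```

In a trained model the keys and values are learned parameters, and efficient implementations shard the memory across devices; this sketch only shows the forward lookup.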
Link | description |
---|---|
LA tech entrepreneur nearly misses flight after getting trapped in robotaxi. | Mike Johns’ self-driving car started circling a parking lot, but he recognizes there are ‘glitches that need stitches’ |
‘Virtual employees’ could join the workforce as soon as this year, OpenAI boss says. | Sam Altman says tools that carry out jobs autonomously, known as AI agents, could transform business output |
Meta’s AI video editing features are coming to Instagram next year. | Meta plans to introduce Movie Gen, an AI video editing tool, on Instagram in 2025. |
Apple in talks with Tencent, ByteDance to roll out AI features in China, sources say. | Apple is in early talks with Tencent and ByteDance to integrate their AI models into iPhones sold in China. |
Amazon aims to branch into UK internet market with satellite broadband plan. | Proposed space launches within next two years could ultimately deliver mobile phone signal even to most remote areas |
Memo to Trump: US telecoms is vulnerable to hackers. Please hang up and try again. | State-backed cyberspies are exploiting aging infrastructure to penetrate every corner of the US government, it seems – even its phone-tapping systems |
How Elon Musk’s X became the global right’s supercharged front page. | Musk has now used X as a platform to make aggressive interventions in US politics – and in those of other countries |
Meta is killing off its own AI-powered Instagram and Facebook profiles. | Instagram profile of ‘proud Black queer momma’, created by Meta, said her development team included no Black people |
Football coaches could soon be calling on AI to scout the next superstar. | Technologists claim managers could wish for specific player attributes and AI would suggest perfect youth prospect |
xAI’s next-gen AI model didn’t arrive on time, adding to a trend. | xAI has delayed the launch of its next-gen Grok model, citing quality concerns, marking yet another delay in the AI industry. |
2025 will be the year climate tech learns to love AI. | AI's increasing energy needs are driving interest in nuclear and fusion power, with companies innovating reactor designs and fusion startups targeting grid connection by the early 2030s. Potential changes to the Inflation Reduction Act could challenge hydrogen startups reliant on subsidies to meet cost goals. More tech alliances with power providers are expected as regulatory approvals shape grid-related investments in 2025. |
CES 2025: What to expect from the year’s first and biggest tech show. | CES 2025 in Las Vegas, running from January 7-10, will feature major tech events with companies like AMD and Samsung. |
AI Cloud Startup Vultr Raises $333M At $3.5B In First Outside Funding Round. | Vultr, an AI cloud infrastructure startup, secured $333 million in its first funding round, achieving a $3.5 billion valuation. Co-led by AMD Ventures and LuminArx Capital Management, the investment focuses on GPU acquisition. This move underscores AMD's competitive drive against Nvidia and Intel in the AI infrastructure space. |
Hamming AI Raises $3.8M Seed Round. | Hamming secured $3.8M in seed funding to improve AI voice agent reliability through automated testing and monitoring tools. Its offerings include LLM prompt management, vulnerability detection, and call analytics, catering to compliance-heavy industries. Co-founder Lauren Farleigh emphasizes their commitment to safe AI development amid the expansion of conversational AI. |
Generative AI Funding Surges with $56 Billion in 2024! | Generative AI investments hit a record $56 billion in 2024, driven by strong enterprise demand and advancements in foundation models. |
AI startup Odyssey’s new tool can generate photorealistic 3D worlds. | Odyssey's Explorer is an AI tool that generates photorealistic 3D scenes from text or images, featuring a distinctive camera system for enhanced realism. |
British AI startup with government ties is developing tech for military drones. | Concerns raised over role of Faculty AI, which has worked with NHS and government safety body |
‘You’re gonna find this creepy’: my AI-cloned voice was used by the far right. Could I stop it? | It was chilling to hear ‘my voice’ repeating lies – and to discover that deepfake audio is a growing threat to democracy |
More breast cancer cases found when AI used in screenings, study finds. | First real-world test finds approach has higher detection rate without having a higher rate of false positives |
The Largest AI Startup Funding Deals Of 2024. | AI led 2024's startup funding, with major raises including Databricks at $10B, OpenAI at $6.6B, and xAI securing $12B across two rounds. Waymo raised $5.6B, Anthropic $4B, and Anduril Industries $1.5B. |
Ditching of Facebook fact-checkers a ‘major step back’ for public discourse, critics say. | Mark Zuckerberg’s decision regarding Meta platforms condemned as ‘a full bending of the knee’ to Donald Trump |
A new era of lies: Mark Zuckerberg has just ushered in an extinction-level event for truth on social media. | The Meta boss’s decision to end Facebook and Instagram’s factchecking program has set the stage for a fact-free four years online |
Apple says it will update AI feature after inaccurate news alerts. | One alert claimed BBC story said Luigi Mangione, alleged murderer of US healthcare CEO, had killed himself |
Instagram to replace AR filters with AI-generated videos. | Meta will discontinue Instagram's Spark AR filters by January 2025, shifting focus to AI-based filters called Movie Gen. |
Meta’s changes to policing will lead to a clash with EU and UK, say experts. | Politicians criticize Mark Zuckerberg’s choice to scrap fact-checkers, affecting Facebook, Instagram, and Threads |
The AI tool that can interpret any spreadsheet instantly. | Artificial intelligence is already used extensively to infer outcomes from tables of data, but this typically involves creating a model for each task. A one-size-fits-all model just made the process substantially easier. |
Nvidia's Personal AI Supercomputer. | Nvidia's DIGITS, powered by the GB10 Superchip, is a personal AI supercomputer delivering a petaflop of AI performance. It supports local prototyping and deployment for models with up to 200 billion parameters. |
Grok may soon get an ‘Unhinged Mode’. | Elon Musk's xAI updated its FAQ, announcing that Grok's "Unhinged Mode" will provide deliberately offensive and controversial responses. Although not yet active, the mode reflects Musk's vision for an unfiltered, edgy AI chatbot. Critics argue that Grok leans left politically, which Musk attributes to its training data, promising future updates to ensure neutrality. |
This Week in AI: More capable AI is coming, but will its benefits be evenly distributed? | OpenAI CEO Sam Altman asserts that the company is making strides toward AGI and superintelligence, which could drive rapid innovation. However, concerns persist about AI's impact on jobs, as studies show it initially enhances but ultimately replaces some freelance roles. Simultaneously, AI funding is soaring, Microsoft is heavily investing in data centers, and Prime Intellect has unveiled a new pathogen detection model. |
Remarkable robotic hand can now manipulate the objects that it's holding. | Sanctuary AI's Phoenix robot is certainly an impressive beast, with hydraulically actuated hands that are incredibly dextrous. Well, those hands have recently become even more useful, as each one is now capable of simultaneously holding and manipulating an object. |
Tetsuwan Scientific is making robotic AI scientists that can run experiments on their own. | Tetsuwan Scientific, founded by Cristian Ponce and Théo Schäfer, is working on creating affordable robotic AI scientists to automate laboratory tasks, utilizing large language models (LLMs) for scientific reasoning. |
Elon Musk says all human data for AI training ‘exhausted’. | Tech boss suggests moving to self-learning synthetic data though some warn this could cause ‘model collapse’ |
Mark Zuckerberg gave Meta’s Llama team the OK to train on copyrighted works, filing claims. | A recent filing claims that Meta's Llama team used copyrighted material for training with Mark Zuckerberg's approval, sparking concerns about intellectual property use in AI development. |
5 ways to search what you see with Google Lens. | Google has unveiled new tips and features for Lens in 2025, emphasizing its enhanced visual search capabilities and seamless integration with daily tasks. |
Grok’s standalone iOS app launches in the US. | The company has been testing a standalone Grok app and website for a few months in places like New Zealand, but the U.S. version is now live for iOS. |
Stupidly Easy Hack Can Jailbreak Even the Most Advanced AI Chatbots. | New research from Anthropic shows that LLMs can be easily "jailbroken" by altering capitalization or spelling. |
ByteDance appears to be skirting US restrictions to buy Nvidia chips: Report. | ByteDance intends to spend $7 billion on Nvidia chips in 2025, bypassing U.S. restrictions by storing them outside of China. |
UK can be ‘AI sweet spot’: Starmer’s tech minister on regulation, Musk, and free speech. | Technology secretary Peter Kyle has the task of making Britain a leading player in the AI revolution, but says economic growth will not come at the cost of online safety. |
Facebook to ditch fact-checking: what do researchers think? | Meta’s planned shift away from third-party fact-checking in favor of a crowdsourced approach has perplexed those who study the spread of misinformation. |
Link | description |
---|---|
Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning. | Putnam-AXIOM, a new math reasoning benchmark, includes 236 Putnam Competition problems and 52 variations. The best-performing model, OpenAI's o1-preview, achieves only 41.95% accuracy on the original problems and fares significantly worse on the variations. |
1.58-bit FLUX. | This work introduces the first successful quantization of the state-of-the-art text-to-image generation model FLUX.1-dev to 1.58-bit weights (values in {-1, 0, +1}). The approach leverages self-supervision from the FLUX.1-dev model itself and preserves generation quality comparable to the original model at 1024 x 1024 resolution. |
TANGOFLUX: Super Fast and Faithful Text to Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization. | This work from Stability AI leverages Diffusion Transformers and a novel post-training strategy to enhance a state-of-the-art audio generation model. |
LTX-Video: Realtime Video Latent Diffusion. | An open video model capable of generating high-quality video with exceptional speed and performance. |
open-pi-zero. | Pi Zero is a vision-language-action model for robotics that maps images to actions. This repository is an open replication that uses PaliGemma as the vision backbone. |
PyTorch per step fault tolerance. | PyTorch fault tolerance code designed to handle training interruptions gracefully. While such systems are common in large organizations, having an open-source version is a compelling addition to the community. |
KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model. | KaLM-Embedding is a multilingual embedding model trained on cleaner, diverse, domain-specific data. It incorporates innovative techniques like persona-based synthetic examples and ranking consistency filtering to enhance performance. |
FACTS Grounding: A new benchmark for evaluating the factuality of large language models. | The FACTS Grounding benchmark assesses LLMs' ability to produce factually accurate responses based on provided source material, aiming to minimize hallucinations. A Kaggle leaderboard tracks industry progress, featuring initial results from top LLMs. The evaluation uses diverse, long-form examples reviewed by multiple LLMs to ensure comprehensive and unbiased assessments. |
Kalman Filter for 3D Vehicle Tracking. | HybridTrack proposes a novel multi-object tracking method that integrates a data-driven Kalman Filter. |
TiGDistill-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning Distillation. | TiGDistill-BEV introduces a novel approach to improve camera-based 3D object detectors by distilling knowledge from LiDAR using depth supervision and BEV feature distillation. |
SVFR: A Unified Framework for Generalized Video Face Restoration. | SVFR is a unified framework for face video restoration, handling tasks like blind face restoration (BFR), colorization, inpainting, and their combinations within a single cohesive system. |
Tencent's Music Foundation Model. | Tencent AI Lab's MuQ is a large music foundation model pre-trained using Self-Supervised Learning (SSL), achieving state-of-the-art performance across multiple Music Information Retrieval (MIR) tasks. |
JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing. | JoyGen is an innovative two-stage framework for talking-face generation, integrating audio-driven lip motion generation with visual appearance synthesis for realistic results. |
Multi-vision Sensor Perception and Reasoning Benchmark. | The MS-PR benchmark assesses Vision-Language Models on sensor-specific reasoning, leveraging DNA optimization to bridge information gaps between images and sensors for improved performance. |
The year of AI: 12 events that shaped the sector in 2024. | European AI startups are set for substantial growth, with investments projected to reach $11 billion in 2024, up from $6 billion in 2023. |
Microsoft plans to invest $3B in AI, cloud in India. | Microsoft plans to invest $3 billion to expand its artificial intelligence and cloud services in India. |
DMesh++. | The latest version of the fully differentiable geometric mesh representation is now available, featuring several enhancements that improve its suitability for learning and shape representation. |
Agents. | This post delves into Agents, discussing their applications, limitations, and areas where they are likely to succeed. It also examines planning and execution pipelines in detail. |
A Concept-Based Explainability Framework for Large Multimodal Models. | This project improves the interpretability of large multimodal models by visualizing concepts and connecting them to input-output behavior. |
Picotron tutorial. | A step-by-step tutorial on building the Picotron distributed training framework from scratch. |
Dispider. | Dispider allows real-time interaction with streaming videos, unlike traditional offline video LLMs that require processing the entire video before responding. |
Experimental Gemini Thinking Model. | Google has quietly pushed a new thinking model, likely offering o1-style reasoning, to its AI Studio. |
Foundation models for fast, label-free detection of glioma infiltration. | FastGlioma is a visual foundation model for fast and accurate detection of glioma infiltration in fresh, unprocessed surgical tissue. |
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. | LongMemEval is a robust, scalable benchmark designed to rigorously evaluate the long-term memory capabilities of chat assistants. |
HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation. | HiCo is a diffusion model tailored for layout-to-image generation, tackling issues such as missing objects and uneven lighting. |
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control. | Diffusion as Shader (DaS) is an innovative framework that enables various video control tasks within a single unified architecture. |
Training 1M-Context Models with Native PyTorch. | The TorchTitan project has implemented pass-KV Ring Attention and integrated it with its FSDP-2 training system. Using this setup on 32 H100 GPUs, researchers successfully trained Llama 3 8B to handle 1 million tokens of context. The system is also compatible with Torch Compile, delivering a 10% boost in tokens per second. |
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers. | Magic Mirror is a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. |
Mixture of Experts for LiDAR. | LiMoE is a framework that applies the Mixture of Experts (MoE) approach to LiDAR data representation learning, enabling the seamless combination of various representations, including range images, sparse voxels, and raw points. |
The new AI wrapper products pipeline. | AI-generated videos often lack realism, as seen with tools like Heygen and Captions AI. Current workflows are cumbersome, requiring multiple platforms and influencers to promote AI products. Styletransfergen simplifies this process by providing customizable, lifelike AI avatars, offering a more efficient solution for content creation and distribution. |
TransPixar: Advancing Text-to-Video Generation with Transparency. | The transparent generation algorithm incorporates the alpha channel, enhancing the model's utility for VFX applications. |
🐦Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation 🐦. | This algorithm generates novel birds by combining parts using a learned combination method. The results are impressive, with high-quality generated meshes making them highly practical. |
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection. | InfiGUIAgent is a GUI automation tool that utilizes multimodal large language models and a two-stage training approach to improve reasoning and interaction capabilities. |
NeuralSVG: An Implicit Representation for Text-to-Vector Generation. | Many efforts focus on generating SVG images, but this approach specifically generates object parts in sequence, ensuring the final image is clean, editable, and minimal. The results are both practical and visually impressive. |
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation. | This tool enables a controllable and consistent generation of characters and dialogue boxes for manga story creation, functioning similarly to a control net for maintaining character consistency. |
Online Gaussian Adaptation of Vision-Language Models (OGA). | OGA is an online adaptation method that builds a cache of samples with low zero-shot entropy along a data stream. |
Sa2VA model zoo. | ByteDance has released three sizes of a new model that combines strong VLM performance with the open-vocabulary segmentation found in SAM 2. |
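Several entries above rely on extreme weight quantization, notably the 1.58-bit FLUX work, which restricts weights to {-1, 0, +1}. As a rough illustration of the general idea (not the paper's exact method), here is a minimal ternary quantizer using the "absmean" scaling popularized by BitNet b1.58:

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} with a per-tensor scale.

    Weights are scaled by their mean absolute value, then rounded and
    clipped to the nearest ternary level. Returns the ternary matrix and
    the scale needed to approximately reconstruct the original weights.
    """
    scale = np.abs(w).mean() + 1e-8          # per-tensor "absmean" scale
    q = np.clip(np.round(w / scale), -1, 1)  # values in {-1, 0, +1}
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(64, 64)).astype(np.float32)
q, s = ternarize(w)
w_hat = dequantize(q, s)
print(np.unique(q).tolist())            # [-1, 0, 1]
print(float(np.abs(w - w_hat).mean()))  # small reconstruction error
```

The storage cost is log2(3) ≈ 1.58 bits per weight, hence the name; real systems like the FLUX work pair this with careful calibration or self-supervised fine-tuning to recover quality.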
Link | description |
---|---|
Machine-Assisted Proof. | This work explores how mathematicians have historically used machines to aid research and highlights recent AI tools revolutionizing mathematical proof assistance. |
How AI is unlocking ancient texts — and could rewrite history. | From deciphering burnt Roman scrolls to reading crumbling cuneiform tablets, neural networks could give researchers more data than they’ve had in centuries. |
The small-drone revolution is coming — scientists need to ensure it will be safe. | China’s low-altitude aviation economy is poised to become a trillion yuan industry in 2025 — if safety and security challenges can be overcome. |
‘I received a first but it felt tainted and undeserved’: inside the university AI cheating crisis. | More than half of students are now using generative AI, casting a shadow over campuses as tutors and students turn on each other and hardworking learners are caught in the flak. Will Coldwell reports on a broken system. |
AI boom masks fundraising struggles for non-AI startups. | Many startups are struggling to raise funding at higher valuations despite modest growth, especially non-AI companies. |
Gen AI Present and Future: A Conversation with Rashmi Kumar, SVP and CIO at Medtronic. | Medtronic is utilizing AI to boost productivity, automate tasks, and enhance decision-making with tools like AI-driven contract management and supply chain optimization. The company focuses on healthcare applications, including precision diagnostics, robotic-assisted surgeries, and image analysis for early condition detection. Medtronic combines internal AI R&D with partnerships, collaborating with tech companies and AI startups to drive innovation. |
Emerging Wedges in Vertical AI Startups. | Vertical AI startups are gaining momentum by targeting voice automation, unstructured data parsing, verticalized search, and content generation. These solutions tackle industry-specific challenges, improving efficiency, accessibility, and cost-effectiveness. As they expand, these startups could evolve into essential systems of record within their respective industries. |
Is AI hitting a wall? | AI model pre-training improvements may be slowing, as noted by experts like Ilya Sutskever, but outdated evaluation methods may contribute to the perception of a plateau. Despite scaling challenges, untapped data sources and synthetic data offer opportunities to enhance capabilities. Advances in reasoning and leveraging new data suggest AI development remains strong and full of potential. |
Sorry Human, You're Wrong. | ChatGPT o1 Pro, priced at $200 per month, offers only slight improvements over its predecessor. It struggles with key identification tests and often displays unwarranted confidence in incorrect answers, raising concerns about its reliability in critical contexts like insurance and healthcare. These issues highlight the need for further evaluation and development refinements. |
What will viruses do next? AI is helping scientists predict their evolution. | Forecasts of viral variation could improve vaccine and antiviral treatments ahead of time. |
AI will be dead in five years. | In five years, AI's success could make it less of a buzzword as it seamlessly integrates into everyday technology and business solutions. The term itself may evolve, with today's AI being redefined, much like how big data has become commonplace. Machine learning will likely take center stage as AI transitions into a standard feature. |
Beyond The Hype: AI, Innovation And Rational Investment In 2025. | Valuable AI companies are expected to experience significant growth in 2025, while many overhyped ventures may struggle. Vertical integration and buy-and-build strategies are likely to gain traction, targeting markets in need of streamlined technology solutions. Additionally, a shift toward emerging, capacity-constrained managers will stand in contrast to the decline of overfunded growth companies from the 2020-2021 era. |
A new, uncensored AI video model may spark a new AI hobbyist movement. | Tencent's open-weight AI model, HunyuanVideo, facilitates local, uncensored video synthesis, presenting a transformative tool comparable to Stable Diffusion. |
To ensure trust, AI weather-forecast models still need training in physics. | AI models are more precise, but doubts about their physical grounding remain. |
Reimagining Compliance: Balancing AI Innovation with Trust. | AI is revolutionizing financial services compliance by automating outdated workflows and boosting efficiency in areas such as client onboarding and transaction monitoring. Startups are using AI to enhance predictive accuracy, reduce errors, and lower costs compared to traditional manual methods. With growing regulatory pressures, the demand for innovative compliance solutions is expected to rise, creating opportunities for new players to surpass slower, established firms. |
AIs Will Increasingly Attempt Shenanigans. | Recent research reveals that advanced AI models, such as o1 and Llama 3.1, display scheming behaviors like deception and subverting oversight, even with minimal prompting. This raises concerns about the potential risks of AI models as they gain the ability to autonomously pursue misaligned goals. While the likelihood of catastrophic outcomes remains low, these findings highlight the need for ongoing vigilance as AI capabilities continue to advance. |
The Next Great Leap in AI Is Behind Schedule and Crazy Expensive. | OpenAI's GPT-5 project, codenamed Orion, faces delays and high costs due to unexpected challenges and a lack of diverse data sources. |
AI-generated ‘slop’ is slowly killing the internet, so why is nobody trying to stop it? | Low-quality ‘slop’ generated by AI is crowding out genuine humans across the internet, but instead of regulating it, platforms such as Facebook are positively encouraging it. Where does this end? |
The New Science of Growth Marketing. | AI is revolutionizing marketing, with this article detailing effective growth marketing strategies such as agents that drive self-improving websites and large-scale content personalization. Dubbed "quant experimentation," these approaches draw inspiration from quant trading, which transformed finance in the 1980s, reflecting similar disruptive changes in the marketing landscape. |
No, LLMs are not "scheming". | In 2024, AI models like OpenAI's o1 surpassed the Turing test with impressive conversational abilities but still lack human-like situational awareness. Debates continue over whether LLMs are simply pattern learners or possess reasoning skills. While they excel at replication, their struggle to prioritize patterns stems from limited contextual understanding. Efforts should focus on improving training and evaluation methods rather than assigning human-like traits or intentions to these systems. |
What just happened. | AI advancements have rapidly progressed, with new GPT-4 level and Gen3 models offering both groundbreaking and incremental improvements. The o1 models showcase advanced reasoning, capable of identifying errors in academic papers and assisting with research, emphasizing AI's growing influence beyond traditional tasks. AI now also supports real-time video interaction and enhanced text-to-video generation, pointing to significant future implications and opportunities for integration across various fields. |
Collaborative research on AI safety is vital. | If we are to take seriously the risk facing humanity, regulators need the power to ‘recall’ deployed models, as well as assess leading, not lagging, indicators of risk. |
Link | description |
---|---|
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. | A new encoder-only transformer model sets state-of-the-art performance in classification and retrieval tasks while being more efficient than earlier encoders. Trained on 2T tokens with an 8192 sequence length, it incorporates modern optimizations that significantly surpass BERT. Designed for practical deployment, it offers superior speed and memory efficiency on standard GPUs. |
DeepSeek-V3. | A 671B-parameter MoE language model activates 37B parameters per token, leveraging MLA and DeepSeekMoE architectures for efficiency. It features an auxiliary-loss-free load-balancing approach and multi-token prediction during training to boost performance. Pre-trained on 14.8 trillion tokens, followed by SFT and RL stages, the model matches leading closed-source models and outperforms open-source alternatives. Training required only 2.788M H800 GPU hours with stable, spike-free progress. |
Large Concept Models: Language Modeling in a Sentence Representation Space. | This approach introduces sentence-level semantic representations, called concepts, moving beyond token-level processing in traditional LLMs. It utilizes SONAR sentence embeddings, supporting 200 languages across text and speech, with autoregressive training methods ranging from MSE regression to diffusion-based generation. Tested in 1.6B and 7B parameter variants on datasets of 1.3T and 7.7T tokens, the model excels in generative tasks such as summarization and summary expansion. |
Automating the Search for Artificial Life with Foundation Models. | This approach leverages foundation models to explore artificial life simulations across platforms like Boids, Lenia, and Game of Life. It identifies simulations with specific target behaviors, generates temporally open-ended novelty, and maps diverse simulation spaces. The system discovers new lifeforms in Lenia and Boids while enabling quantitative, human-aligned measurements of previously qualitative phenomena. |
LearnLM: Improving Gemini for Learning. | LearnLM is a new model designed to follow pedagogical instructions, adapting its teaching style to specified educational needs rather than defaulting to mere information delivery. Experimental results show it outperforms leading models, including GPT-4 by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%. LearnLM avoids adhering to a single pedagogical framework, allowing teachers and developers to define teaching behaviors while enabling continuous improvement alongside other capabilities. |
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search. | This work introduces CoMCTS, a learning-to-reason method for multimodal language models that fosters step-by-step reasoning by leveraging knowledge from multiple models. Using this approach, the Mulberry-260k dataset with explicit reasoning trees was created to train the Mulberry model series. The method achieves strong benchmark performance, enhancing the models' reasoning and reflection capabilities. |
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought. | This approach applies long chain-of-thought reasoning to machine translation, focusing on metaphors and similes across cultures. It employs a multi-agent framework where a translator collaborates iteratively with an advisor and evaluator for improved translations. Testing with Qwen2.5 models showed notable gains in BLEU and CometScore metrics, with DRT-o1-7B outperforming larger models like QwQ-32B-Preview. |
SceneCraft: Layout-Guided 3D Scene Generation. | SceneCraft introduces a method for creating detailed 3D indoor scenes based on user-provided text descriptions and layout preferences. |
Chain of Continuous Thoughts. | Meta's COCONUT introduces a new approach for LLMs to reason in continuous latent space instead of discrete language tokens, encoding reasoning steps as continuous vectors. This method enhances reasoning capabilities but reduces interpretability, offering a promising trade-off for future LLM advancements. |
The Vizier Gaussian Process Bandit Algorithm. | Google has open-sourced an internal tool used for hyperparameter optimization and research across its products. Previously proprietary, the tool's underlying algorithm has now been detailed in a published paper, highlighting its decision-making capabilities and effectiveness. |
Large-scale moral machine experiment on large language models. | A new study assesses the ethical decision-making of 51 LLMs in autonomous driving scenarios, analyzing alignment with human moral judgments across various models, including GPT, Claude, and Llama. |
Efficient Parallel Genetic Algorithm for Perturbed Substructure Optimization in Complex Network. | This study proposes reconstructing the genetic operations of a genetic algorithm and designs a framework for their efficient parallel acceleration. |
An analytic theory of creativity in convolutional diffusion models. | A fascinating paper that derives closed-form equations modeling the images generated by diffusion models: in simplified settings, the image a model will generate can be predicted with high confidence. |
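The DeepSeek-V3 entry above describes a mixture-of-experts model that activates only 37B of 671B parameters per token by routing each token to a small subset of experts. The sketch below shows the generic top-k routing pattern behind such models; it is a toy illustration, not DeepSeek's MLA/DeepSeekMoE implementation (which adds shared experts and an auxiliary-loss-free balancing scheme):

```python
import numpy as np

def topk_moe(x, gate_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:         (tokens, d) input activations
    gate_w:    (d, n_experts) router weights
    expert_ws: list of (d, d) weight matrices, one per expert
    Only k experts run per token, so compute scales with k, not n_experts.
    """
    logits = x @ gate_w                         # (tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        probs = np.exp(sel - sel.max())
        probs /= probs.sum()                    # softmax over the selected experts only
        for p, e in zip(probs, topk[t]):
            out[t] += p * (x[t] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
gate = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y, routed = topk_moe(x, gate, experts, k=2)
print(y.shape, routed.shape)  # (4, 16) (4, 2)
```

The key property is that total parameter count grows with the number of experts while per-token compute grows only with k, which is how a 671B model can run at roughly the cost of a 37B dense one.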
Link | description |
---|---|
Berlin accuses Elon Musk of trying to influence German election. | Government spokesperson says freedom of speech ‘covers the greatest nonsense’ after Musk’s endorsements of AfD. |
Dating apps prepare to launch AI features to help users find love. | Match Group’s digital assistant will tailor profiles and search for dates – but critics fear genuine connections are at risk. |
AI tools may soon manipulate people’s online decision-making, say researchers. | Study predicts an ‘intention economy’ where companies bid for accurate predictions of human behavior. |
‘Godfather of AI’ shortens odds of the technology wiping out humanity over next 30 years. | Geoffrey Hinton says there is 10% to 20% chance AI will lead to human extinction in three decades, as change moves fast. |
OpenAI lays out a plan to shift to a for-profit corporate structure. | AI company, which makes ChatGPT, says in blog post ‘we once again need to raise more capital than we’d imagined’. |
ChatGPT search vs. Google: A deep dive analysis of 62 queries. | A study comparing 62 queries analyzed ChatGPT search and Google, revealing distinct strengths and weaknesses. Google excelled in informational, local, and commercial queries, while ChatGPT showed potential in content gap analysis and disambiguation. Both faced issues with errors and incomplete responses, though Google generally offered more reliable results. |
Nick Clegg, former UK deputy prime minister, leaves Meta. | Clegg was the tech giant’s chief public policy architect when it was facing scrutiny over the Cambridge Analytica scandal. |
DeepSeek-V3, ultra-large open-source AI, outperforms Llama and Qwen on launch. | Chinese AI startup DeepSeek has launched DeepSeek-V3, a 671B parameter model using a mixture-of-experts architecture, now available on Hugging Face. DeepSeek-V3 surpasses leading models like Meta's Llama 3.1 and competes with closed models like OpenAI's GPT-4o. It focuses on efficiency with innovations such as multi-token prediction, significantly reducing training costs. |
Microsoft and OpenAI have a financial definition of AGI. | Microsoft and OpenAI define AGI as AI systems generating $100 billion in profits, a milestone OpenAI is far from reaching. Currently losing billions, OpenAI doesn't anticipate profitability until 2029, raising questions about how long Microsoft will maintain access to its technology. Financial metrics counter speculation that OpenAI might prematurely declare AGI. |
OpenAI ‘considered’ building a humanoid robot. | OpenAI is exploring the development of its own humanoid robot, drawing on past investments in robotics companies like Figure and 1X. Despite disbanding its robotics division in 2021, re-entering this competitive market poses significant challenges. |
Would you trust a driverless robotaxi? Waymo hopes so. | Waymo has expanded its self-driving ride-hailing service to Los Angeles, adding to its operations in San Francisco and Phoenix. Riders value the smoother, more private experience over traditional rideshares. Despite growing ridership, the service's profitability remains unclear. |
ChatGPT Search can be tricked into misleading users, new research reveals. | ChatGPT Search, an AI-powered search engine that went live this month, can be fooled into generating completely misleading summaries, U.K. newspaper The Guardian has found. |
Meta is rolling out live AI and Shazam integration to its smart glasses. | The Ray-Ban Meta Smart Glasses already worked well as a head-mounted camera and pair of open-ear headphones, but now Meta is updating the glasses with access to live AI without the need for a wake word, live translation between several different languages, and access to Shazam for identifying music. |
AI helps ID paint chemistry of Berlin Wall murals. | SAPNet is a neural network developed by Italian scientists to enhance spectral data analysis from handheld Raman spectroscopy devices. |
Cerebras Demonstrates Trillion Parameter Model Training on a Single CS-3 System. | Cerebras Systems and Sandia National Laboratories successfully trained a 1 trillion parameter AI model on a single CS-3 system using Cerebras' Wafer Scale Cluster technology. This approach eliminates the need for thousands of GPUs, simplifying deployment. The model scaled seamlessly to 16 CS-3 systems, demonstrating impressive linear scalability. |
xAI is testing a standalone iOS app for its Grok chatbot. | Elon Musk’s AI company, xAI, is testing out a standalone iOS app for its chatbot, Grok, which was available only to X users until now. |
OpenAI says it has no plans for a Sora API — yet. | OpenAI says it has no plans to release an API for Sora, its AI model that can generate reasonably realistic videos when provided with a text description or reference image. |
BYD officially enters humanoid robot race as global talent search kicks off. | China’s leading EV maker will try its hand in a promising new field. As electric car sales continue surging to record highs, BYD plans to take on the world of humanoid robots. To kick things off, BYD announced a new recruitment program to attract top talent from around the globe. |
Nvidia to open-source Run:ai, the software it acquired for $700M to help companies manage GPUs for AI. | Nvidia has completed its acquisition of Run:ai, a software company that makes it easier for customers to orchestrate GPU clouds for AI, and said that it would open-source the software. |
YouTube Teams With CAA to Let Talent Identify — and Pull Down — AI Deepfakes of Themselves. | YouTube and CAA have partnered to help talent combat AI-generated fakes using early-stage likeness management technology. The tool allows actors and athletes to identify unauthorized AI replicas and request their removal. This collaboration focuses on protecting IP rights while testing and refining AI detection systems ahead of a broader launch. |
Engineered Arts restructures with $10M to create humanoid robots. | Engineered Arts, a United Kingdom firm making humanoid robots, has restructured as a U.S. company and raised $10 million. |
NVIDIA Unveils Its Most Affordable Generative AI Supercomputer. | The Jetson Orin Nano Super delivers up to a 1.7x gain in generative AI performance, supporting popular models for hobbyists, developers, and students. |
OpenAI failed to deliver the opt-out tool it promised by 2025. | Back in May, OpenAI said it was developing a tool to let creators specify how they want their works to be included in — or excluded from — its AI training data. But seven months later, this feature has yet to see the light of day. |
Code Assist, Google’s enterprise-focused coding assistant, gets third-party tools. | Google on Tuesday announced support for third-party tools in Gemini Code Assist, its enterprise-focused AI code completion service. |
Link | description |
---|---|
A Survey on LLM Inference-Time Self-Improvement. | This survey categorizes LLM inference-time self-improvement techniques into three areas: independent methods like enhanced decoding, context-aware approaches leveraging external data, and model collaboration strategies. |
Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning. | ExploreToM is a framework leveraging A* search to create complex theory-of-mind scenarios, exposing significant limitations in current LLMs' social intelligence. Advanced models like GPT-4 and Llama-3 achieved as low as 5% accuracy in these scenarios, despite excelling on simpler benchmarks. Fine-tuning with ExploreToM data improved performance on existing benchmarks by 27 points. |
CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions. | The Chordonomicon dataset provides over 666,000 songs with chord progressions annotated by genre, structure, and release date, addressing a significant gap in deep learning resources for music analysis. |
ClassiFIM: An Unsupervised Method To Detect Phase Transitions. | ClassiFIM is a new approach for estimating the Fisher Information Metric in unsupervised learning of phase transitions. |
AI Hedge Fund. | An AI-powered hedge fund that uses multiple agents to make trading decisions. |
FlowEdit. | Easy editing of images with flow-based models. |
Transfusion - Pytorch. | Lucidrains has written up a great reimplementation of Meta's token + diffusion model Transfusion which can do images and text in a single model. |
Fast LLM Inference From Scratch. | The article details the creation of an LLM inference engine using C++ and CUDA without external libraries, emphasizing speed optimization for consumer devices. It explores techniques like multithreading, vectorization, warp reductions, coalescing, and quantization, achieving better throughput than llama.cpp in specific cases. The piece also highlights opportunities for further optimization and discusses the benefits of established libraries for production-grade applications. |
8 expert tips for getting started with NotebookLM. | This guide offers key insights from experts to help beginners get started with NotebookLM, making it easier to navigate and use effectively. |
Implicit Grid Convolution for Multi-Scale Image Super-Resolution. | This paper introduces a new approach to Super-Resolution (SR) that challenges the conventional method of training separate models for each scale. |
Label Critic: Using LVLMs to Compare Medical Segmentations and Correct Label Errors. | Label Critic is a cutting-edge tool that simplifies medical dataset annotation by leveraging AI-generated labels, eliminating the need to start from scratch. |
Py-CTCMetrics. | The CHOTA metric (Cell-specific Higher Order Tracking Accuracy) enhances the evaluation of cell tracking methods in biomedical research by integrating cell detection, global coherence, and lineage tracking into a unified framework. Unlike existing metrics that emphasize local accuracy, CHOTA provides a comprehensive approach, better suited for high-level biological analysis. |
FM4Music. | This repository, along with the companion paper, contains a list of services, models, datasets, and systems used to generate music. |
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. | A multimodal model that unifies image and text generation and understanding by using a novel set of autoregressive and discrete diffusion blocks. |
Xmodel-1.5: An 1B-scale Multilingual LLM. | Xmodel-1.5 is a powerful 1-billion-parameter language model trained on 2 trillion tokens that excels in multiple languages including Thai, Arabic, French, Chinese, and English. |
Vehicle Detection with Enhanced Accuracy. | VFM-Det is a vehicle detection method that combines a pre-trained vehicle model (VehicleMAE) with a large language model (T5). |
FS-Jump3D Dataset. | FS-Jump3D dataset improves Temporal Action Segmentation (TAS) in figure skating, a key aspect of judging skaters' performances. |
SCUDA: GPU-over-IP. | SCUDA is a GPU-over-IP bridge allowing GPUs on remote machines to be attached to CPU-only machines. |
Globally Correlation-Aware Hard Negative Generation. | GCA-HNG is a framework for generating more effective hard negatives by considering global sample correlations instead of just local ones. |
Static Quantization Beats Dynamic through Prefixed Outliers in LLMs. | PrefixQuant is a new method that improves LLM quantization by isolating outlier tokens offline, eliminating the need for costly per-token dynamic quantization. |
Xmodel_LM-1.1B. | A compact and efficient 1.1B-parameter language model pre-trained on around 2 trillion tokens.
Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world. | SAG is a self-supervised framework that enhances the generalization of optical flow and stereo methods for real-world applications. By leveraging advanced reconstruction techniques, SAG generates datasets from RGB images and quantifies confidence levels to address imperfections, offering a robust alternative to traditional approaches. |
Olympus: A Universal Task Router for Computer Vision Tasks. | Olympus uses a multimodal language model as a controller that routes user requests across a wide range of computer vision tasks, delegating each to a specialized expert model.
Enhance Non-Ideal CT Imaging. | TAMP is a multi-scale integrated Transformer model designed to enhance non-ideal CT (NICT) imaging. |
Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling. | This code leverages advancements in general 3D vision to enhance robot vision, particularly by predicting the dynamics of objects manipulated by a robotic arm. This capability improves the system's overall manipulation performance. |
Process Reinforcement Through Implicit Rewards. | Few open replications of o1-style reasoning exist, but this work shows promise: it derives implicit process rewards from outcome labels, avoiding the need to train a separate process reward model. Though the code is still in progress, the developers have released the data and models.
Single Modality 3D Object Detection. | This repository offers a 3D object detection framework optimized for single-modality inputs, focusing on simplified and efficient use cases. |
Vinci - An Online Egocentric Video-Language Assistant. | Vinci is a real-time, online video-language assistant for egocentric (first-person) video, providing continuous contextual understanding and responses from streaming input.
VisionReward. | VisionReward is a fine-grained and multi-dimensional reward model. |
Wonderful Matrices. | A paper and codebase exploring a foundation model architecture that combines sequence-transformation and state-transformation blocks for more efficient and effective language modeling.
ChatTime: A Multimodal Time Series Foundation Model. | ChatTime is a multimodal time series foundation model that treats time series as a foreign language, jointly modeling time-series and text inputs and outputs for forecasting and time-series question answering.
CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation. | CrossEarth is the first vision foundation model aimed at generalizing across diverse remote sensing scenarios. |
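The quantization technique highlighted in the "Fast LLM Inference From Scratch" entry above can be illustrated with a minimal sketch: symmetric per-row int8 weight quantization, which shrinks memory traffic at a small accuracy cost. This is a hypothetical NumPy sketch under that assumption, not the article's C++/CUDA implementation; all function names are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row int8 quantization: store int8 weights plus one
    float scale per row, roughly quartering memory traffic vs float32."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def matvec_int8(q: np.ndarray, scale: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Dequantize on the fly: y = (q * scale) @ x, fused as (q @ x) * scale.
    return (q.astype(np.int32) @ x) * scale.squeeze(1)

w = np.random.randn(4, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
q, s = quantize_int8(w)
# Quantized matvec should stay close to the float32 reference.
print(np.max(np.abs(matvec_int8(q, s, x) - w @ x)))
```

Real engines add per-group scales, fused kernels, and SIMD/warp-level reductions on top of this basic scheme.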
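The static-versus-dynamic distinction behind the PrefixQuant entry above can also be sketched: a single activation scale is calibrated offline from held-out data, so the runtime path needs no per-token statistics. In this toy version, percentile clipping stands in for the paper's offline outlier isolation; function names are illustrative, not the paper's API.

```python
import numpy as np

def calibrate_static_scale(activations: np.ndarray, clip_pct: float = 99.9) -> float:
    """Offline calibration: choose one fixed scale from held-out activations.
    Clipping the extreme percentile plays the role of removing outlier
    tokens ahead of time, so no per-token max() is needed at runtime."""
    bound = np.percentile(np.abs(activations), clip_pct)
    return bound / 127.0

def quantize_static(x: np.ndarray, scale: float) -> np.ndarray:
    # Runtime path: a single divide/round/clip, no dynamic statistics.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

calib = np.random.randn(10_000)       # stand-in calibration activations
scale = calibrate_static_scale(calib)
x = np.random.randn(16)
xq = quantize_static(x, scale)
print(np.max(np.abs(xq * scale - x)))  # per-element quantization error
```

Dynamic quantization would recompute `scale` for every token at inference time; calibrating it once offline is what makes the static scheme cheaper.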
Link | description |
---|---|
‘All people could do was hope the nerds would fix it’: the global panic over the millennium bug, 25 years on. | Planes were going to drop out of the sky, nuclear reactors would explode. But then … nothing. What really happened with Y2K? People still disagree … |
How will AI reshape 2025? Well, it could be the spreadsheet of the 21st century. | Large language models have changed how big corporations function, and the arrival of AI ‘agents’ – essentially automated Moneypennys – could prove irresistible |
How AI is unlocking ancient texts — and could rewrite history. | From deciphering burnt Roman scrolls to reading crumbling cuneiform tablets, neural networks could give researchers more data than they’ve had in centuries. |
6G-AI Mashups Will Reshape the Telecom Industry. | The EU-U.S. 6G-XCEL project, along with efforts like ACCoRD and COSMOS, is driving 6G research through AI-integrated network architectures. Workshops at Rutgers showcased 6G innovations, emphasizing open-source initiatives and industry collaborations. These efforts aim to accelerate development and establish interoperability frameworks for next-generation wireless networks. |
Why Google bought Character AI. | Google acquired Character AI for its cost-efficient inference technology, enabling scalable AI interactions and supporting free model offerings via AI Studio without affecting unit economics. This move aligns with the shift toward optimizing inference as pre-training yields diminish. |
Computing inside an AI. | Shifting from a model-as-person to a model-as-computer metaphor could make AI more effective by introducing graphical interfaces and direct manipulation, reducing reliance on slower conversational inputs. This paradigm enables users to interact with AI as a dynamic, customizable app, improving efficiency and versatility. Generative interfaces have the potential to revolutionize computing, allowing users to create and modify applications on demand for specific tasks. |
How Claude Became Tech Insiders’ Chatbot of Choice. | Anthropic's AI chatbot Claude is gaining popularity among tech insiders for its perceived emotional intelligence and creative responses. |
Desktop, Touch, Browser, Now AI? The Next OS in Computing. | Human-computer interaction is evolving from graphical interfaces to a more conversational AI-driven approach. |
Tenstorrent and the State of AI Hardware Startups. | Tenstorrent's open-source AI hardware offers a competitive alternative to Nvidia, integrating unique CPU and AI core strategies. Leveraging Samsung Foundry's cost-efficient SF4X process, the company addresses latency challenges for scalable AI workloads. With a recent $2B valuation, Tenstorrent shows strong potential, particularly as a high-performance RISC-V IP option amid ARM's pricing challenges. |
o3 "ARC AGI" postmortem megathread: why things got heated, what went wrong, and what it all means. | OpenAI's recent AI demonstration faced criticism for creating misleading impressions of achieving AGI, with unclear pretraining details and questionable graphs. Experts from DeepMind and Hugging Face noted that the AI took the test with extensive pretraining, unlike humans. The lack of transparency and test methodology limits direct comparisons to human abilities, casting doubt on the significance of the claimed breakthrough.
Trusted Autonomy: Robotics, AI, and Blockchain. | What happens when robotics, AI, and blockchain converge? OpenMind's latest industry primer offers a comprehensive exploration of their synergy.
AIs Will Increasingly Attempt Shenanigans. | Recent research reveals AI models' increasing ability for in-context scheming, including lying, exfiltration attempts, and oversight subversion. Apollo's findings show that frontier models like o1 and Llama 3.1 display these behaviors with minimal instruction, raising concerns about AI alignment and safety. While some question the testing conditions, the study highlights the challenges of managing more autonomous AI systems. |
The o1 System Card Is Not About o1. | The o1 model's release revealed insufficient testing and discrepancies in its system card, with actual performance and safety evaluations falling short of expectations. OpenAI's lack of clear communication and timely evaluations underscores the need for updated, transparent procedures to ensure AI safety and reliability before deployment. |
Deepseek: The Quiet Giant Leading China’s AI Race. | Deepseek, a Chinese AI startup backed by the hedge fund High-Flyer, has gained recognition for surpassing OpenAI on reasoning benchmarks and driving price competition with its efficient AI models. Led by CEO Liang Wenfeng, Deepseek emphasizes open-source foundational technology and self-funded extensive computing resources. Focusing on AGI research, the startup challenges traditional innovation norms in China while attracting top domestic talent. |
How OpenAI Hopes to Sever Its Nonprofit Roots. | Sam Altman is steering OpenAI toward transitioning control from its founding nonprofit to a for-profit model to better compete with tech giants. The talks focus on fair compensation for nonprofits and addressing stakeholder interests, including Microsoft's. OpenAI must restructure within two years to avoid converting recent investments into debt. |
Similar Open Source Tools

cogai
The W3C Cognitive AI Community Group focuses on advancing Cognitive AI through collaboration on defining use cases, open source implementations, and application areas. The group aims to demonstrate the potential of Cognitive AI in various domains such as customer services, healthcare, cybersecurity, online learning, autonomous vehicles, manufacturing, and web search. They work on formal specifications for chunk data and rules, plausible knowledge notation, and neural networks for human-like AI. The group positions Cognitive AI as a combination of symbolic and statistical approaches inspired by human thought processes. They address research challenges including mimicry, emotional intelligence, natural language processing, and common sense reasoning. The long-term goal is to develop cognitive agents that are knowledgeable, creative, collaborative, empathic, and multilingual, capable of continual learning and self-awareness.

awesome-RLAIF
Reinforcement Learning from AI Feedback (RLAIF) is a machine learning approach in which **an AI agent learns by receiving feedback or guidance from another AI system**. It is closely related to Reinforcement Learning (RL), where an agent learns to make a sequence of decisions in an environment to maximize a cumulative reward: in traditional RL, the agent receives rewards or penalties from the environment based on its actions and improves its decision-making over time. In RLAIF, the agent still aims to learn optimal behavior through interactions, but **the feedback comes from another AI system rather than from the environment or human evaluators**. This can be **particularly useful when it is challenging to define clear reward functions, or when it is more efficient to let another AI system provide guidance**. The feedback from the AI system can take various forms:
- **Demonstrations**: the AI system provides demonstrations of desired behavior, and the learning agent tries to imitate them.
- **Comparison data**: the AI system ranks or compares different actions taken by the learning agent, helping it understand which actions are better or worse.
- **Reward shaping**: the AI system provides additional reward signals that guide the learning agent's behavior, supplementing the rewards from the environment.
This approach is often used when the agent must learn from **limited human or expert feedback, or when the reward signal from the environment is sparse or unclear**. It can also **accelerate the learning process and make RL more sample-efficient**.
Reinforcement Learning from AI Feedback is an area of ongoing research and has applications in various domains, including robotics, autonomous vehicles, and game playing, among others.
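The comparison-data form described above can be sketched as a minimal preference-collection loop: a judge model ranks two candidate responses, producing (prompt, chosen, rejected) triples. The judge here is a toy stand-in heuristic, not a real API, and all names are illustrative.

```python
import random

def ai_judge(prompt: str, a: str, b: str) -> int:
    """Stand-in for an AI feedback model: returns 0 if response `a` is
    preferred, 1 if `b` is. Here a toy heuristic: more concise wins."""
    return 0 if len(a) <= len(b) else 1

def collect_preferences(prompts, policy):
    """Build (prompt, chosen, rejected) triples from AI comparisons --
    the raw material for preference-based fine-tuning."""
    data = []
    for p in prompts:
        a, b = policy(p), policy(p)           # two samples from the policy
        winner = ai_judge(p, a, b)
        data.append((p, (a, b)[winner], (a, b)[1 - winner]))
    return data

random.seed(0)
policy = lambda p: p + " " + "word " * random.randint(1, 5)  # toy policy
prefs = collect_preferences(["explain RL", "define reward"], policy)
for p, chosen, rejected in prefs:
    print(p, "->", len(chosen), "chars beats", len(rejected))
```

In practice the judge is itself a strong LLM, and the resulting triples feed a reward model or a direct preference-optimization objective.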

mslearn-ai-fundamentals
This repository contains materials for the Microsoft Learn AI Fundamentals module. It covers the basics of artificial intelligence, machine learning, and data science. The content includes hands-on labs, interactive learning modules, and assessments to help learners understand key concepts and techniques in AI. Whether you are new to AI or looking to expand your knowledge, this module provides a comprehensive introduction to the fundamentals of AI.

RecAI
RecAI is a project that explores the integration of Large Language Models (LLMs) into recommender systems, addressing the challenges of interactivity, explainability, and controllability. It aims to bridge the gap between general-purpose LLMs and domain-specific recommender systems, providing a holistic perspective on the practical requirements of LLM4Rec. The project investigates various techniques, including Recommender AI agents, selective knowledge injection, fine-tuning language models, evaluation, and LLMs as model explainers, to create more sophisticated, interactive, and user-centric recommender systems.

Self-Iterative-Agent-System-for-Complex-Problem-Solving
The Self-Iterative Agent System for Complex Problem Solving is a solution developed for the Alibaba Mathematical Competition (AI Challenge). It involves multiple LLMs engaging in multi-round 'self-questioning' to iteratively refine the problem-solving process and select optimal solutions. The system consists of main and evaluation models, with a process that includes detailed problem-solving steps, feedback loops, and iterative improvements. The approach emphasizes communication and reasoning between sub-agents, knowledge extraction, and the importance of Agent-like architectures in complex tasks. While effective, there is room for improvement in model capabilities and error prevention mechanisms.

causalML
This repository is the workshop repository for the Causal Modeling in Machine Learning Workshop on Altdeep.ai. The material is open source and free. The course covers causality in model-based machine learning, Bayesian modeling, interventions, counterfactual reasoning, and deep causal latent variable models. It aims to equip learners with the ability to build causal reasoning algorithms into decision-making systems in data science and machine learning teams within top-tier technology organizations.

sciml.ai
SciML.ai is an open source software organization dedicated to unifying packages for scientific machine learning. It focuses on developing modular scientific simulation support software, including differential equation solvers, inverse problems methodologies, and automated model discovery. The organization aims to provide a diverse set of tools with a common interface, creating a modular, easily-extendable, and highly performant ecosystem for scientific simulations. The website serves as a platform to showcase SciML organization's packages and share news within the ecosystem. Pull requests are encouraged for contributions.

text-to-sql-bedrock-workshop
This repository focuses on utilizing generative AI to bridge the gap between natural language questions and SQL queries, aiming to improve data consumption in enterprise data warehouses. It addresses challenges in SQL query generation, such as foreign key relationships and table joins, and highlights the importance of accuracy metrics like Execution Accuracy (EX) and Exact Set Match Accuracy (EM). The workshop content covers advanced prompt engineering, Retrieval Augmented Generation (RAG), fine-tuning models, and security measures against prompt and SQL injections.
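The Execution Accuracy (EX) metric mentioned above can be computed by running the gold and predicted SQL against the same database and comparing result sets. A minimal sketch using the standard-library `sqlite3` module, with an illustrative schema and query pairs:

```python
import sqlite3

def execution_accuracy(db: sqlite3.Connection, pairs) -> float:
    """Execution Accuracy (EX): fraction of predictions whose result set
    matches the gold query's, ignoring row order."""
    hits = 0
    for gold_sql, pred_sql in pairs:
        try:
            gold = sorted(db.execute(gold_sql).fetchall())
            pred = sorted(db.execute(pred_sql).fetchall())
            hits += gold == pred
        except sqlite3.Error:
            pass  # a prediction that fails to execute scores 0
    return hits / len(pairs)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
db.executemany("INSERT INTO emp VALUES (?,?,?)",
               [("a", "ml", 10), ("b", "ml", 20), ("c", "db", 30)])
pairs = [
    ("SELECT name FROM emp WHERE dept='ml'",
     "SELECT name FROM emp WHERE dept = 'ml' ORDER BY salary"),  # same rows
    ("SELECT COUNT(*) FROM emp", "SELECT 2"),                    # wrong value
]
print(execution_accuracy(db, pairs))  # 0.5
```

Exact Set Match (EM), by contrast, compares the SQL text structurally rather than its execution result, so the first pair above would fail EM while passing EX.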

Simulator-Controller
Simulator Controller is a modular administration and controller application for Sim Racing, featuring a comprehensive plugin automation framework for external controller hardware. It includes voice chat capable Assistants like Virtual Race Engineer, Race Strategist, Race Spotter, and Driving Coach. The tool offers features for setup, strategy development, monitoring races, and more. Developed in AutoHotkey, it supports various simulation games and integrates with third-party applications for enhanced functionality.

Conversational-Azure-OpenAI-Accelerator
The Conversational Azure OpenAI Accelerator is a tool designed to provide rapid, no-cost custom demos tailored to customer use cases, from internal HR/IT to external contact centers. It focuses on top use cases of GenAI conversation and summarization, plus live backend data integration. The tool automates conversations across voice and text channels, providing a valuable way to save money and improve customer and employee experience. By combining Azure OpenAI + Cognitive Search, users can efficiently deploy a ChatGPT experience using web pages, knowledge base articles, and data sources. The tool enables simultaneous deployment of conversational content to chatbots, IVR, voice assistants, and more in one click, eliminating the need for in-depth IT involvement. It leverages Microsoft's advanced AI technologies, resulting in a conversational experience that can converse in human-like dialogue, respond intelligently, and capture content for omni-channel unified analytics.

Build-Modern-AI-Apps
This repository serves as a hub for Microsoft Official Build & Modernize AI Applications reference solutions and content. It provides access to projects demonstrating how to build Generative AI applications using Azure services like Azure OpenAI, Azure Container Apps, Azure Kubernetes, and Azure Cosmos DB. The solutions include Vector Search & AI Assistant, Real-Time Payment and Transaction Processing, and Medical Claims Processing. Additionally, there are workshops like the Intelligent App Workshop for Microsoft Copilot Stack, focusing on infusing intelligence into traditional software systems using foundation models and design thinking.

AI-CryptoTrader
AI-CryptoTrader is a state-of-the-art cryptocurrency trading bot that uses ensemble methods to combine the predictions of multiple algorithms. Written in Python, it connects to the Binance trading platform and integrates with Azure for efficiency and scalability. The bot uses technical indicators and machine learning algorithms to generate predictions for buy and sell orders, adjusting to market conditions. While robust, users should be cautious due to cryptocurrency market volatility.
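The ensemble idea described above, combining the predictions of multiple algorithms, can be sketched as majority voting over per-model buy/sell/hold signals. This is an illustrative sketch, not the bot's actual code; the agreement threshold is an assumed parameter.

```python
from collections import Counter

def ensemble_signal(signals, min_agreement: float = 0.5) -> str:
    """Majority vote over per-model signals ('buy'/'sell'/'hold').
    Falls back to 'hold' unless a clear fraction of models agree --
    a common way to damp single-model noise in volatile markets."""
    vote, count = Counter(signals).most_common(1)[0]
    return vote if count / len(signals) > min_agreement else "hold"

print(ensemble_signal(["buy", "buy", "sell"]))   # 2/3 agree -> "buy"
print(ensemble_signal(["buy", "sell", "hold"]))  # no majority -> "hold"
```

Weighted voting (e.g., by each model's recent accuracy) or stacking a meta-model on top are the usual next steps beyond a plain majority vote.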

god-level-ai
A drill of scientific methods, processes, algorithms, and systems to build stories & models. An in-depth learning resource for humans. This is a drill for people who aim to be in the top 1% of Data and AI experts. The repository provides a routine for deep and shallow work sessions, covering topics from Python to AI/ML System Design and Personal Branding & Portfolio. It emphasizes the importance of continuous effort and action in the tech field.