new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jul 3

ChemKED: a human- and machine-readable data standard for chemical kinetics experiments

Fundamental experimental measurements of quantities such as ignition delay times, laminar flame speeds, and species profiles (among others) serve important roles in understanding fuel chemistry and validating chemical kinetic models. However, despite both the importance and abundance of such information in the literature, the community lacks a widely adopted standard format for this data. This impedes both sharing and wide use by the community. Here we introduce a new chemical kinetics experimental data format, ChemKED, and the related Python-based package for validating and working with ChemKED-formatted files called PyKED. We also review past and related efforts, and motivate the need for a new solution. ChemKED currently supports the representation of autoignition delay time measurements from shock tubes and rapid compression machines. ChemKED-formatted files contain all of the information needed to simulate experimental data points, including the uncertainty of the data. ChemKED is based on the YAML data serialization language, and is intended as a human- and machine-readable standard for easy creation and automated use. Development of ChemKED and PyKED occurs openly on GitHub under the BSD 3-clause license, and contributions from the community are welcome. Plans for future development include support for experimental data from laminar flame, jet stirred reactor, and speciation measurements.

  • 2 authors
·
Jun 6, 2017

MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters

Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth.

  • 4 authors
·
Nov 7, 2023

Mathematical modelling of flow and adsorption in a gas chromatograph

In this paper, a mathematical model is developed to describe the evolution of the concentration of compounds through a gas chromatography column. The model couples mass balances and kinetic equations for all components. Both single and multiple-component cases are considered with constant or variable velocity. Non-dimensionalisation indicates the small effect of diffusion. The system where diffusion is neglected is analysed using Laplace transforms. In the multiple-component case, it is demonstrated that the competition between the compounds is negligible and the equations may be decoupled. This reduces the problem to solving a single integral equation to determine the concentration profile for all components (since they are scaled versions of each other). For a given analyte, we then only two parameters need to be fitted to the data. To verify this approach, the full governing equations are also solved numerically using the finite difference method and a global adaptive quadrature method to integrate the Laplace transformation. Comparison with the Laplace solution verifies the high degree of accuracy of the simpler Laplace form. The Laplace solution is then verified against experimental data from BTEX chromatography. This novel method, which involves solving a single equation and fitting parameters in pairs for individual components, is highly efficient. It is significantly faster and simpler than the full numerical solution and avoids the computationally expensive methods that would normally be used to fit all curves at the same time.

  • 5 authors
·
Oct 7, 2024

ChemCrow: Augmenting large-language models with chemistry tools

Over the last decades, excellent computational chemistry tools have been developed. Their full potential has not yet been reached as most are challenging to learn and exist in isolation. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 17 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned the syntheses of an insect repellent, three organocatalysts, as well as other relevant molecules. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and Chemcrow's performance. There is a significant risk of misuse of tools like ChemCrow, and we discuss their potential harms. Employed responsibly, our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry. A subset of the code is publicly available at https://github.com/ur-whitelab/chemcrow-public.

  • 4 authors
·
Apr 11, 2023

A Unified Predictive and Generative Solution for Liquid Electrolyte Formulation

Liquid electrolytes are critical components of next-generation energy storage systems, enabling fast ion transport, minimizing interfacial resistance, and ensuring electrochemical stability for long-term battery performance. However, measuring electrolyte properties and designing formulations remain experimentally and computationally expensive. In this work, we present a unified framework for designing liquid electrolyte formulation, integrating a forward predictive model with an inverse generative approach. Leveraging both computational and experimental data collected from literature and extensive molecular simulations, we train a predictive model capable of accurately estimating electrolyte properties from ionic conductivity to solvation structure. Our physics-informed architecture preserves permutation invariance and incorporates empirical dependencies on temperature and salt concentration, making it broadly applicable to property prediction tasks across molecular mixtures. Furthermore, we introduce -- to the best of our knowledge -- the first generative machine learning framework for molecular mixture design, demonstrated on electrolyte systems. This framework supports multi-condition-constrained generation, addressing the inherently multi-objective nature of materials design. As a proof of concept, we experimentally identified three liquid electrolytes with both high ionic conductivity and anion-concentrated solvation structure. This unified framework advances data-driven electrolyte design and can be readily extended to other complex chemical systems beyond electrolytes.

  • 13 authors
·
Apr 25, 2025

A Chemical Modelling Roadmap Linking Protoplanetary Disks and Exoplanet Atmospheres

[Abridged] This review paper discussed which chemical effects may be at play in a planet-forming disk midplane, which effects are relevant under different conditions, and which tools are available for modelling chemical kinetics in a disk midplane. The review goes on to discuss some important efforts in the planet formation modelling community to treat chemical evolution, and, vice versa, efforts in the chemical modelling community to implement more physical effects related to planet formation into the chemical modelling. The aim of this review is both to outline some concepts related to planet formation chemistry, but also to encourage, not just collaboration between the planet formation modelling community and the astrochemical community, but also assistance and guidance from one community to the other. Guidance, regarding which effects, out of many, might be more relevant than others under certain planet formation conditions, and regarding why certain included effects lead to certain important modelling outcomes. As the research fields of exoplanet atmospheres and protoplanetary disks near new frontiers in observational insights with upcoming facilities, developing appropriate modelling frameworks (including physical and chemical effects) is paramount to ultimately enable the linking of a chemically characterised exoplanet atmospheres to its formation history in its natal protoplanetary disk.

  • 1 authors
·
Oct 30, 2022

Are large language models superhuman chemists?

Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals' safety profiles. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.

  • 28 authors
·
Apr 1, 2024 1

Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis

Language models are revolutionizing the biochemistry domain, assisting scientists in drug design and chemical synthesis with high efficiency. Yet current approaches struggle between small language models prone to hallucination and limited knowledge retention, and large cloud-based language models plagued by privacy risks and high inference costs. To bridge this gap, we introduce ChemCRAFT, a novel framework leveraging agentic reinforcement learning to decouple chemical reasoning from knowledge storage. Instead of forcing the model to memorize vast chemical data, our approach empowers the language model to interact with a sandbox for precise information retrieval. This externalization of knowledge allows a locally deployable small model to achieve superior performance with minimal inference costs. To enable small language models for agent-calling ability, we build an agentic trajectory construction pipeline and a comprehensive chemical-agent sandbox. Based on sandbox interactions, we constructed ChemToolDataset, the first large-scale chemical tool trajectory dataset. Simultaneously, we propose SMILES-GRPO to build a dense chemical reward function, promoting the model's ability to call chemical agents. Evaluations across diverse aspects of drug design show that ChemCRAFT outperforms current cloud-based LLMs in molecular structure analysis, molecular optimization, and synthesis pathway prediction, demonstrating that scientific reasoning is not solely an emergent ability of model scale, but a learnable policy of tool orchestration. This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry, opening new avenues for accelerating molecular discovery with locally deployable agents. Code available at https://github.com/HowardLi1984/ChemCraft.

  • 10 authors
·
Jan 24

Multi-Spectroscopic Method to Quantify Rapid Decomposition of an Organophosphate Simulant Using Reactive Materials as a Function of Metal Powder Chemistry and Temperature

The development of advanced diagnostic systems to measure and optimize emerging energetic material performance is critical for the defeat of Chemical Warfare Agents (CWA). This study presents an integrated multi-spectroscopic approach to monitor the interaction between a CWA simulant, Diisopropyl Methyl Phosphonate (DIMP), and combusting composite metal particles. A custom benchtop Polygonal Rotating Mirror Infrared Spectrometer (PRiMIRS), equipped with a customizable experimental chamber, is employed to observe DIMP decomposition. Tunable Diode Laser Absorption Spectroscopy (TDLAS) is used to measure path-averaged gas temperature profiles during combustion. In the experiment, the chamber is preheated to evaporate liquid DIMP. Various composite metal powders (Al-8Mg):3Zr, (Al-8Mg):Zr, 2(Al-8Mg):Zr, and 4(Al-8Mg):Zr are placed on a stainless steel mount and ignited using 3Al-2Ni sputter-deposited nanolayered foils. The combusting metal particles mix with the DIMP vapor, initiating chemical and thermal interactions. PRiMIRS captures DIMP spectral evolution, while TDLAS simultaneously monitors gas temperature. A spectral defeat parameter was developed to enable quantitative real-time assessment of the DIMP destruction. It uses infrared light absorption by both from DIMP and its immediate decomposition products Isopropyl Methyl Phosphonate (IMP) and Isopropyl Alcohol (IPA). Fourier Transform Infrared Spectroscopy (FTIR) serves as a secondary verification tool quantifying the decomposition products over extended timeframes, and Transmission Electron Microscopy (TEM) confirms the expected metal oxide dispersion within the reaction space. This study reports variability in DIMP defeat as a function of metal powder stoichiometry, metal powder loading, and path-averaged gas temperature profiles, offering critical insights into optimizing reactive materials for effective CWA neutralization.

  • 6 authors
·
Sep 4, 2025

Growth of Two-dimensional Compound Materials: Controllability, Material Quality, and Growth Mechanism

CONSPECTUS: Two-dimensional (2D) compound materials are promising materials for use in electronics, optoelectronics, flexible devices, etc. because they are ultrathin and cover a wide range of properties. Among all methods to prepare 2D materials, chemical vapor deposition (CVD) is promising because it produces materials with a high quality and reasonable cost. So far, much efforts have been made to produce 2D compound materials with large domain size, controllable number of layers, fast-growth rate, and high quality features, etc. However, due to the complicated growth mechanism like sublimation and diffusion processes of multiple precursors, maintaining the controllability, repeatability, and high quality of CVD grown 2D binary and ternary materials is still a big challenge, which prevents their widespread use. Here, taking 2D transition metal dichalcogenides (TMDCs) as examples, we review current progress and highlight some promising growth strategies for the growth of 2D compound materials. The key technology issues which affect the CVD process, including non-metal precursor, metal precursor, substrate engineering, temperature, and gas flow, are discussed. Also, methods in improving the quality of CVD-grown 2D materials and current understanding on their growth mechanism are highlighted. Finally, challenges and opportunities in this field are proposed. We believe this review will guide the future design of controllable CVD systems for the growth of 2D compound materials with good controllability and high quality, laying the foundations for their potential applications.

  • 5 authors
·
Dec 10, 2020

Towards Foundation Model for Chemical Reactor Modeling: Meta-Learning with Physics-Informed Adaptation

Developing accurate models for chemical reactors is often challenging due to the complexity of reaction kinetics and process dynamics. Traditional approaches require retraining models for each new system, limiting generalizability and efficiency. In this work, we take a step toward foundation models for chemical reactor modeling by introducing a neural network framework that generalizes across diverse reactor types and rapidly adapts to new chemical processes. Our approach leverages meta-learning to pretrain the model on a broad set of reactor dynamics, enabling efficient adaptation to unseen reactions with minimal data. To further enhance generalizability, we incorporate physics-informed fine-tuning, ensuring physically consistent adaptation to new reactor conditions. Our framework is evaluated across three integer-order fundamental reactor types - continuous stirred tank reactors, batch reactors, and plug flow reactors - demonstrating superior few-shot adaptation compared to conventional data-driven, physics-informed, and transfer learning approaches. By combining meta-learning with physics-informed adaptation, this work lays the foundation for a generalizable modeling framework, advancing the development of foundation models for chemical engineering applications. Source code is available at https://github.com/killingbear999/chemical-reactor-foundation-model.

  • 2 authors
·
May 19, 2024

Benefits of Resource Strategy for Sustainable Materials Research and Development

Material and product life cycles are based on complex value chains of technology-specific elements. Resource strategy aspects of essential and strategic raw materials have a direct impact on applications of new functionalized materials or the development of novel products. Thus, an urgent challenge of modern materials science is to obtain information about the supply risk and environmental aspects of resource utilization, especially at an early stage of basic research. Combining the fields of materials science, industrial engineering and resource strategy enables a multidisciplinary research approach to identify specific risks within the value chain, aggregated as the so-called resource criticality. Here, we demonstrate a step-by-step criticality assessment in the sector of basic materials research for multifunctional hexagonal manganite YMnO3, which can be a candidate for future electronic systems. Raw material restrictions can be quantitatively identified, even at such an early stage of materials research, from eleven long-term indicators including our new developed Sector Competition Index. This approach for resource strategy for modern material science integrates two objective targets: reduced supply risk and enhanced environmental sustainability of new functionalized materials, showing drawbacks but also benefits towards a sustainable materials research and development.

  • 7 authors
·
Mar 6, 2017

Foundation Models for Discovery and Exploration in Chemical Space

Accurate prediction of atomistic, thermodynamic, and kinetic properties from molecular structures underpins materials innovation. Existing computational and experimental approaches lack the scalability required to efficiently navigate chemical space. Scientific foundation models trained on large unlabeled datasets offer a path toward exploring chemical space across diverse application domains. Here we develop MIST, a family of molecular foundation models with up to an order of magnitude more parameters and data than prior works. Trained using a novel tokenization scheme that comprehensively captures nuclear, electronic, and geometric information, MIST learns from a diverse range of molecules. MIST models have been fine-tuned to predict more than 400 structure -- property relationships and match or exceed state-of-the-art performance across benchmarks spanning physiology, electrochemistry, and quantum chemistry. We demonstrate the ability of these models to solve real-world problems across chemical space, including multiobjective electrolyte solvent screening, olfactory perception mapping, isotope half-life prediction, stereochemical reasoning for chiral organometallic compounds, and binary and multi-component mixture property prediction. Probing MIST models using mechanistic interpretability methods reveals identifiable patterns and trends not explicitly present in the training data, suggesting that the models learn generalizable scientific concepts. We formulate hyperparameter-penalized Bayesian neural scaling laws and use them to reduce the computational cost of model development by an order of magnitude. The methods and findings presented here represent a significant step toward accelerating materials discovery, design, and optimization using foundation models and provide valuable guidance for training compute-optimal scientific foundation models.

  • 22 authors
·
Oct 20, 2025

LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space

With the emergence of diffusion models as the frontline of generative models, many researchers have proposed molecule generation techniques using conditional diffusion models. However, due to the fundamental nature of a molecule, which carries highly entangled correlations within a small number of atoms and bonds, it becomes difficult for a model to connect raw data with the conditions when the conditions become more complex as natural language. To address this, here we present a novel latent diffusion model dubbed LDMol, which enables a natural text-conditioned molecule generation. Specifically, LDMol is composed of three building blocks: a molecule encoder that produces a chemically informative feature space, a natural language-conditioned latent diffusion model using a Diffusion Transformer (DiT), and an autoregressive decoder for molecule re. In particular, recognizing that multiple SMILES notations can represent the same molecule, we employ a contrastive learning strategy to extract the chemical informative feature space. LDMol not only beats the existing baselines on the text-to-molecule generation benchmark but is also capable of zero-shot inference with unseen scenarios. Furthermore, we show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-driven molecule editing, demonstrating its versatility as a diffusion model.

  • 2 authors
·
May 28, 2024

34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models are able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 total projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. In particular, improvements in both open source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.

  • 35 authors
·
May 5, 2025

Towards an accelerated decarbonization of chemical industry by electrolysis

The transition towards carbon-neutral chemical production is challenging due to the fundamental reliance of the chemical sector on petrochemical feedstocks. Electrolysis-based manufacturing, powered by renewables, is a rapidly evolving technology that might be capable of drastically reducing CO2 emissions from the chemical sector. However, will it be possible to scale up electrolysis systems to the extent necessary to entirely decarbonize all chemical plants? Applying a forward-looking scenario, this perspective estimates how much energy will be needed to power full-scale electrolysis based chemical manufacturing by 2050. A significant gap is identified between the currently planned renewable energy expansion and the energy input necessary to electrify the chemical production: at minimum, the energy required for production of hydrogen and electrolysis of CO2 corresponds to > 50% of all renewable energy that is planned to be available. To cover this gap, strategies enabling a meaningful reduction of the energy input to electrolysis are being discussed from the perspective of both a single electrolysis system and an integrated electro-plant. Several scale-up oriented research priorities are formulated to underpin timely development and commercial availability of described technologies, as well as to explore synergies and support further growth of the renewable energy sector, essential to realize described paradigm shift in chemical manufacturing.

  • 2 authors
·
Jan 7, 2022

Generative Discovery of Novel Chemical Designs using Diffusion Modeling and Transformer Deep Neural Networks with Application to Deep Eutectic Solvents

We report a series of deep learning models to solve complex forward and inverse design problems in molecular modeling and design. Using both diffusion models inspired by nonequilibrium thermodynamics and attention-based transformer architectures, we demonstrate a flexible framework to capture complex chemical structures. First trained on the QM9 dataset and a series of quantum mechanical properties (e.g. homo, lumo, free energy, heat capacity, etc.), we then generalize the model to study and design key properties of deep eutectic solvents. In addition to separate forward and inverse models, we also report an integrated fully prompt-based multi-task generative pretrained transformer model that solves multiple forward, inverse design, and prediction tasks, flexibly and within one model. We show that the multi-task generative model has the overall best performance and allows for flexible integration of multiple objectives, within one model, and for distinct chemistries, suggesting that synergies emerge during training of this large language model. Trained jointly in tasks related to the QM9 dataset and deep eutectic solvents (DESs), the model can predict various quantum mechanical properties and critical properties to achieve deep eutectic solvent behavior. Several novel combinations of DESs are proposed based on this framework.

  • 3 authors
·
Apr 24, 2023

oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

  • 5 authors
·
Oct 8, 2025 5

Towards Foundational Models for Dynamical System Reconstruction: Hierarchical Meta-Learning via Mixture of Experts

As foundational models reshape scientific discovery, a bottleneck persists in dynamical system reconstruction (DSR): the ability to learn across system hierarchies. Many meta-learning approaches have been applied successfully to single systems, but falter when confronted with sparse, loosely related datasets requiring multiple hierarchies to be learned. Mixture of Experts (MoE) offers a natural paradigm to address these challenges. Despite their potential, we demonstrate that naive MoEs are inadequate for the nuanced demands of hierarchical DSR, largely due to their gradient descent-based gating update mechanism which leads to slow updates and conflicted routing during training. To overcome this limitation, we introduce MixER: Mixture of Expert Reconstructors, a novel sparse top-1 MoE layer employing a custom gating update algorithm based on K-means and least squares. Extensive experiments validate MixER's capabilities, demonstrating efficient training and scalability to systems of up to ten parametric ordinary differential equations. However, our layer underperforms state-of-the-art meta-learners in high-data regimes, particularly when each expert is constrained to process only a fraction of a dataset composed of highly related data points. Further analysis with synthetic and neuroscientific time series suggests that the quality of the contextual representations generated by MixER is closely linked to the presence of hierarchical structure in the data.

  • 5 authors
·
Feb 7, 2025

Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications

Autoresearch offers a flexible paradigm for automating scientific tasks, in which an AI agent proposes, implements, evaluates, and refines candidate solutions against a quantitative objective. Here, we use composition-based materials-property prediction to test whether such agents can perform a task beyond model selection and hyperparameter optimization: the design of input descriptors. We introduce Automat, an autoresearch framework where a coding agent based on a large language model generates composition-only descriptors for chemical compounds and evaluates them using a random forest workflow. The agent is restricted to information derivable from chemical formulas and iteratively proposes, implements, and tests chemically motivated descriptor strategies. We apply Automat, with OpenAI Codex using GPT-5.5 as the coding agent, to the prediction of experimental band gaps in inorganic materials and Curie temperatures in ferromagnetic compounds. In both tasks, Automat improves over fractional-composition, Magpie, and combined fractional-composition/Magpie baselines, while producing descriptor families that are chemically interpretable. These results provide a demonstration that autoresearch agents can generate competitive, task-specific materials descriptors without manual feature engineering during the run. They also reveal current limitations, including descriptor redundancy, sensitivity to greedy feature expansion, and the need for explicit complexity control, descriptor pruning, and more sophisticated search strategies.

  • 2 authors
·
May 13

ChemLLM: A Chemical Large Language Model

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

  • 15 authors
·
Feb 9, 2024 7

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae

ontocord Ontocord.AI
·
Sep 29, 2025 3

Sustainable Aviation Fuels: Opportunities, Alternatives and Challenges for Decarbonizing the Aviation Industry and Foster the Renewable Chemicals

Sustainable Aviation Fuels (SAF) are pivotal in the global effort to decarbonize the aviation sector and meet greenhouse gas (GHG) reduction targets established by international frameworks such as CORSIA and Brazil ProBioQAV. This study evaluates SAF potential to reduce lifecycle carbon emissions by up to 80% while being compatible with existing aviation infrastructure. Through bibliometric analysis, scenario evaluation, legal and regulatory framework analysis and economic modeling, the research examines two key SAF production technologies: Hydroprocessed Esters and Fatty Acids Synthetic Paraffinic Kerosene (HEFA-SPK) and Alcohol-to-Jet (ATJ) pathways in the Brazilian context. The findings reveal significant economic challenges, particularly high feedstock and production costs, which hinder SAF competitiveness with fossil fuels at recent and current market prices in Brazil, leading to the analysis of potential incentives and commercial conditions aiming to increase economic attractiveness of SAF production. Based on interviews with relevant stakeholders and decision makers in the industry, scenarios incorporating tax incentives, carbon credits, capital grants, and premium pricing for SAF and its biogenic by-products demonstrate that combined policy interventions and commercial arrangements, along with a regulated Carbon Market are essential for SAF economic viability. Future research is suggested to look at regional assessments of feedstock availability, supply chain logistics, and global market eligibility. This research provides insights for guiding public policy and private investment to support the transition to sustainable aviation in Brazil and beyond.

  • 5 authors
·
Apr 4, 2025

One-shot recognition of any material anywhere using contrastive learning with physics-based rendering

Visual recognition of materials and their states is essential for understanding most aspects of the world, from determining whether food is cooked, metal is rusted, or a chemical reaction has occurred. However, current image recognition methods are limited to specific classes and properties and can't handle the vast number of material states in the world. To address this, we present MatSim: the first dataset and benchmark for computer vision-based recognition of similarities and transitions between materials and textures, focusing on identifying any material under any conditions using one or a few examples. The dataset contains synthetic and natural images. The synthetic images were rendered using giant collections of textures, objects, and environments generated by computer graphics artists. We use mixtures and gradual transitions between materials to allow the system to learn cases with smooth transitions between states (like gradually cooked food). We also render images with materials inside transparent containers to support beverage and chemistry lab use cases. We use this dataset to train a siamese net that identifies the same material in different objects, mixtures, and environments. The descriptor generated by this net can be used to identify the states of materials and their subclasses using a single image. We also present the first few-shot material recognition benchmark with images from a wide range of fields, including the state of foods and drinks, types of grounds, and many other use cases. We show that a net trained on the MatSim synthetic dataset outperforms state-of-the-art models like Clip on the benchmark and also achieves good results on other unsupervised material classification tasks.

  • 5 authors
·
Dec 1, 2022

Unifying Molecular and Textual Representations via Multi-task Language Modelling

The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose a multi-domain, multi-task language model to solve a wide range of tasks in both the chemical and natural language domains. By leveraging multi-task learning, our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increase with scale, as measured by more than a dozen of relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.

  • 6 authors
·
Jan 29, 2023

A Simple Iterative Approach for Constant Chemical Potential Simulations at Interfaces

Chemical potential of species in solution is essential for understanding various chemical processes at interfaces. Molecular dynamics (MD) simulations, constrained by fixed compositions, cannot satisfy a constant chemical potential condition as solute species can migrate to the interface and deplete the bulk due to solute-interface interactions. In this study, we introduce a simple and computationally efficient approach named iterative constant chemical potential molecular dynamics (iCuMD) simulation, which helps simulate targeted molar concentrations of species in solution. iCuMD overcomes the limitations of conventional MD by adjusting the number of species in the solution to reach a target concentration (chemical potential). We demonstrate our approach using solid-liquid and liquid-air interfacial systems as case studies. Specifically, we perform classical force field-based MD simulations of NaCl(aq)-air and NaCl(aq)-graphite interfaces and machine learning interatomic potential (MLIP)-based MD simulations of the Na2SO4(aq)-graphene interface. Our results show that the iCuMD approach efficiently achieves the desired bulk ion concentration within two iterations and can also be integrated with MLIP-driven simulations which enable constant potential simulations with DFT-level accuracy. We show that iCuMD offers a robust and simple computational framework for constant chemical potential simulations as its only requirement is to be able to converge interfacial simulations with a measurable bulk region.

  • 3 authors
·
Jun 1, 2025

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent

  • 12 authors
·
Jan 11, 2025 2

AIMS-EREA -- A framework for AI-accelerated Innovation of Materials for Sustainability -- for Environmental Remediation and Energy Applications

Many environmental remediation and energy applications (conversion and storage) for sustainability need design and development of green novel materials. Discovery processes of such novel materials are time taking and cumbersome due to large number of possible combinations and permutations of materials structures. Often theoretical studies based on Density Functional Theory (DFT) and other theories, coupled with Simulations are conducted to narrow down sample space of candidate materials, before conducting laboratory-based synthesis and analytical process. With the emergence of artificial intelligence (AI), AI techniques are being tried in this process too to ease out simulation time and cost. However tremendous values of previously published research from various parts of the world are still left as labor-intensive manual effort and discretion of individual researcher and prone to human omissions. AIMS-EREA is our novel framework to blend best of breed of Material Science theory with power of Generative AI to give best impact and smooth and quickest discovery of material for sustainability. This also helps to eliminate the possibility of production of hazardous residues and bye-products of the reactions. AIMS-EREA uses all available resources -- Predictive and Analytical AI on large collection of chemical databases along with automated intelligent assimilation of deep materials knowledge from previously published research works through Generative AI. We demonstrate use of our own novel framework with an example, how this framework can be successfully applied to achieve desired success in development of thermoelectric material for waste heat conversion.

  • 3 authors
·
Nov 18, 2023

Dara: Automated multiple-hypothesis phase identification and refinement from powder X-ray diffraction

Powder X-ray diffraction (XRD) is a foundational technique for characterizing crystalline materials. However, the reliable interpretation of XRD patterns, particularly in multiphase systems, remains a manual and expertise-demanding task. As a characterization method that only provides structural information, multiple reference phases can often be fit to a single pattern, leading to potential misinterpretation when alternative solutions are overlooked. To ease humans' efforts and address the challenge, we introduce Dara (Data-driven Automated Rietveld Analysis), a framework designed to automate the robust identification and refinement of multiple phases from powder XRD data. Dara performs an exhaustive tree search over all plausible phase combinations within a given chemical space and validates each hypothesis using a robust Rietveld refinement routine (BGMN). Key features include structural database filtering, automatic clustering of isostructural phases during tree expansion, peak-matching-based scoring to identify promising phases for refinement. When ambiguity exists, Dara generates multiple hypothesis which can then be decided between by human experts or with further characteriztion tools. By enhancing the reliability and accuracy of phase identification, Dara enables scalable analysis of realistic complex XRD patterns and provides a foundation for integration into multimodal characterization workflows, moving toward fully self-driving materials discovery.

  • 5 authors
·
Dec 3, 2025