Weekend mini project! Since commentary on AI is inherently interdisciplinary, we connected the observations in the Pope's encyclical with decades of scholarship in Responsible AI and Ethics research and created an interactive space with these annotations!
Work with @IJ-Reynolds , @yjernite, and @meg Lots to unpack. We started with 105 annotations. Please submit pull requests for more that we may have missed!
AI for Scientific Discovery Won't Work Without Fixing How We Collaborate.
My co-author @cgeorgiaw and I just published a paper challenging a core assumption: that the main barriers to AI in science are technical. They're not. They're social.
Key findings:
🚨 The "AI Scientist" myth delays progress: Waiting for AGI devalues human expertise and obscures science's real purpose: cultivating understanding, not just outputs. 📊 Wrong incentives: Datasets have 100x longer impact than models, yet data curation is undervalued. ⚠️ Broken collaboration: Domain scientists want understanding. ML researchers optimize performance. Without shared language, projects fail. 🔍 Fragmentation costs years: Harmonizing just 9 cancer files took 329 hours.
Why this matters: Upstream bottlenecks like efficient PDE solvers could accelerate discovery across multiple sciences. CASP mobilized a community around protein structure, enabling AlphaFold. We need this for dozens of challenges.
Thus, we're launching Hugging Science! A global community addressing these barriers through collaborative challenges, open toolkits, education, and community-owned infrastructure. Please find all the links below!
Supercharge Apple’s Shortcuts using Cloudflare Workers and Gemini within minutes (and for free, up to 1,500 requests per day) ☁️✨
Hello everyone, last week, while experimenting for fun, I created an API that allows you to easily access AI models (in this case, Google's) from the Shortcut app in order to analyze data from my apps and make the most of it thanks to the generative capabilities of advanced models.
It costs me nothing, and I think it might be good to share it so that others can build on it.
In README.md, you will find everything you need to get started and put your own microservice into production, which you can call from the app’s HTTP request features.
You will simply be asked to have a free Cloudflare account and an API key obtained from Google's AI Studio.
Feel free to take a look and get back to me if you encounter any problems during deployment.
Although more and more code editors are aligning themselves with the AGENTS.md file standard, some still use specific nomenclatures that can make it difficult to maintain different configuration files when several people are working on the same project with different agents.
Bodyboard addresses this by generating canonical instructions for code helpers from a single AGENTS.md file, thereby streamlining the production of adapter outputs for Gemini CLI, Copilot, Cline, Claude, Rules, Windsurf, and OpenAI Codex integrations.
We've moved over 20PB from Git LFS to Xet on the Hub without downtime or data loss. Having things "just work" on a migration of this scale is about as good as it gets.
In the early days of joining Hugging Face, we made a few key design decisions: * There would be no "hard cut-over" from Git LFS to Xet * A Xet-enabled repository should be able to contain both Xet and LFS files * Repository migrations from LFS to Xet can run in the background without disrupting downloads or uploads
These were largely driven by our desire to ensure the community could keep working without interruption.
We cover the infrastructure making this all go in this post, specifically: * An integral piece of infrastructure known internally as the Git LFS Bridge * Background content migrations that run around the clock
New blog post alert! "What is the Hugging Face Community Building?", with @yjernite and @irenesolaiman What 1.8 Million Models Reveal About Open Source Innovation: Our latest deep dive into the Hugging Face Hub reveals patterns that challenge conventional AI narratives:
🔗 Models become platforms for innovation Qwen, Llama, and Gemma models have spawned entire ecosystems of specialized variants. Looking at derivative works shows community adoption better than any single metric.
📊 Datasets reveal the foundation layer → Most downloaded datasets are evaluation benchmarks (MMLU, Squad, GLUE) → Universities and research institutions dominate foundational data → Domain-specific datasets thrive across finance, healthcare, robotics, and science → Open actors provide the datasets that power most AI development
🏛️ Research institutions lead the charge: AI2 (Allen Institute) emerges as one of the most active contributors, alongside significant activity from IBM, NVIDIA, and international organizations. The open source ecosystem spans far beyond Big Tech.
🔍 Interactive exploration tools: We've built several tools to help you discover patterns!
ModelVerse Explorer - organizational contributions DataVerse Explorer - dataset patterns Organization HeatMap - activity over time Base Model Explorer - model family trees Semantic Search - find models by capability
📚 Academic research is thriving: Researchers are already producing valuable insights, including recent work at FAccT 2025: "The Brief and Wondrous Life of Open Models." We've also made hub datasets, weekly snapshots, and other data available for your own analysis.
The bottom line: AI development is far more distributed, diverse, and collaborative than popular narratives suggest. Real innovation happens through community collaboration across specialized domains.
Because hackathons are often the starting point for many AI projects, I've created a Python-backend template incorporating my feedback to streamline collaboration and urgent deployments 🏎️
Within a year, I had the opportunity to participate in hackathons organized by Mistral, OpenAI, and DeepMind and this GitHub template is structured around several fundamental building blocks and recommendations I offer developers eager to participate in their first hackathon, whether as part of a team or individually. Its emphasis is on rapid setup and deployment through: - uv as a package manager, simplifying usage via a series of pre-configured make commands. - FastAPI for API management, structured in a modular architecture designed to minimize branch conflicts during merges to main branches (using minimal health-check and ping routes to verify Docker’s proper execution and backend accessibility on the local network). - Pydantic for validation and type handling, which simplifies debugging and enhances understanding of data objects. - A set of custom instructions tailored for agents (Cline and GitHub Copilot), aimed at improving overall comprehension of the application and optimizing the vibe-coding experience.
This template includes unit tests with a 100% success rate and test coverage, as well as a minimal CI file ensuring that the FastAPI application runs correctly. Thus, merging code that breaks the server into production becomes impossible ⛔️
In general, I would reiterate an essential piece of advice: your two main adversaries are branch conflicts—particularly when the same file is modified concurrently within a brief period, especially if your architecture isn’t built for scalability—and deployment issues under urgent circumstances ⏱️
It's been a bit since I took a step back and looked at
xet-team progress to migrate Hugging Face from Git LFS to Xet, but every time I do it boggles the mind.
A month ago there were 5,500 users/orgs on Xet with 150K repos and 4PB. Today? 🤗 700,000 users/orgs 📈 350,000 repos 🚀 15PB
Meanwhile, our migrations have pushed throughput to numbers that are bonkers. In June, we hit upload speeds of 577Gb/s (crossing 500Gb/s for the first time).
These are hard numbers to put into context, but let's try:
We now have ~32 crawls stored in Xet. At peak upload speed we could move the latest crawl into Xet in about two hours.
We're moving to a new phase in the process, so stay tuned.
This shift in gears means it's also time to roll up our sleeves and look at all the bytes we have and the value we're adding to the community.
I already have some homework from @RichardErkhov to look at the dedupe across their uploads, and I'll be doing the same for other early adopters, big models/datasets, and frequent uploaders (looking at you @bartowski 👀)
Let me know if there's anything you're interested in; happy to dig in!
🌐 Clinical Trials Dataset now available on Hugging Face! 🧬
I’ve just released a comprehensive, ML-ready dataset featuring 500,000+ clinical trial records sourced directly from ClinicalTrials.gov for biomedical NLP, healthcare analytics, and clinical research applications 🤗
I wanted to produce the most complete and up-to-date dump with all raw data partially flattened to simplify extraction, self-querying and processing.
Do you have any ideas about what we can do with it? Using descriptions to enhance specialized embedding models?
And for everyone with existing repositories, just sign up here https://huggingface.co/join/xet - we'll migrate all existing repositories to Xet and all new repos you create will be Xet-backed by default.