AI & ML interests

None defined yet.

Recent Activity

ronantakizawa 
posted an update 1 day ago
Introducing the twitter-trending-hashtags dataset, a compilation of 12,000+ unique trending hashtags on Twitter / X from 2020 to 2025. This dataset captures viral and cultural moments on Twitter / X and is perfect for researchers studying viral content patterns on social media.

ronantakizawa/twitter-trending-hashtags

#twitter #trends #socialmedia
hesamation 
posted an update 3 days ago
this is big... 50 AI researchers from ByteDance, Alibaba, Tencent, and other labs/universities just published a 300-page paper with surprising lessons about coding models and agents (data, pre- and post-training, etc.).

key highlights:

> small LLMs can beat proprietary giants
RL (RLVR specifically) gives small open-source models an edge over big models in reasoning. a 14B model trained with RLVR on high-quality verified problems can match the performance of OpenAI's o3.

> models have a hard time learning Python.
mixing programming languages during pre-training is good, but Python behaves differently from statically typed languages. languages with similar syntax (Java and C#, or JavaScript and TypeScript) create high positive synergy. mixing Python heavily into the training data for statically typed languages can actually hurt because of Python's dynamic typing.

> not all languages are equal (coding scaling laws)
the amount of data required to specialize a model on a language depends drastically on the language. the paper argues languages like C# and Java are easier to learn (less training data required), while languages like Python and JavaScript are ironically trickier to learn (the very languages AI gets used for the most :)

> MoE vs Dense (ability vs stability)
MoE models offer higher capacity, but are much more fragile during SFT than dense models. hyperparams in training have a more drastic effect in MoE models, while dense models are more stable. MoE models also require constant learning rate schedules to avoid routing instability.

> code models are "insecure" by default (duh)
training on public repos makes models learn years of accumulated insecure coding patterns. safety fine-tuning often does little for code. a model might refuse to write a hate-speech email but will happily generate a SQL-injection-vulnerable function because it "works."
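as a concrete illustration of that last point (an illustrative snippet of mine, not from the paper), the "it works" pattern is usually string-interpolated SQL, while the safe version parameterizes the query:

```python
# Illustrative only: the kind of "working but insecure" code vs. the safe fix.
import sqlite3

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is interpolated straight into the SQL string.
    return conn.execute(f"SELECT * FROM users WHERE name = '{username}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe: a parameterized query lets the driver handle escaping.
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
```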

read the full paper:
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence (2511.18538)
ronantakizawa 
posted an update 4 days ago
Introducing the tiktok-trending-hashtags dataset: a compilation of 1,830 unique trending hashtags on TikTok from 2022 to 2025. This dataset captures one-time and seasonal viral moments on TikTok and is perfect for researchers, marketers, and content creators studying viral content patterns on social media.

ronantakizawa/tiktok-trending-hashtags
#tiktok #trends #social-media
ronantakizawa 
posted an update 7 days ago
Reached 2500+ total downloads across my models and datasets! 🎉

Follow me for more @ronantakizawa
ronantakizawa 
posted an update 9 days ago
Introducing the india-trending-words dataset: a compilation of 900 trending Google searches from 2006-2024 based on https://trends.withgoogle.com. This dataset captures search trends in 80 categories, and is perfect for analyzing cultural shifts and predicting future trends in India.

#india #indiadataset #googlesearches

ronantakizawa/india-trending-words
ronantakizawa 
posted an update 11 days ago
Introducing the japanese-trending-words dataset: a dataset of 593 words from Japan's annual trending-word rankings (流行語大賞, the "Buzzword of the Year" awards) from 2006 to 2025. It provides the top 30 words from each year along with their meanings in Japanese and English. This resource is great for NLP tasks that involve understanding recent Japanese culture and history.

ronantakizawa/japanese-trending-words

#japanese #japanesedataset #trending


ronantakizawa 
posted an update 16 days ago
Introducing the google-trending-words dataset: a compilation of 2784 trending Google searches from 2001-2024 based on https://trends.withgoogle.com. This dataset captures search trends in 93 categories, and is perfect for analyzing cultural shifts, predicting future trends, and understanding how global events shape online behavior.

#trends #google #googlesearches

ronantakizawa/trending-words-google
ronantakizawa 
posted an update 18 days ago
Introducing the Japanese Character Difficulty Dataset: a collection of 3,003 Japanese characters (Kanji) labeled with official educational difficulty grades. It includes elementary (grades 1–6), secondary (grade 8), and advanced (grade 9) characters, making it useful for language learning, text difficulty analysis, and educational tool development 🎉

ronantakizawa/japanese-character-difficulty

#japanese #kanji #japanesedataset
ronantakizawa 
posted an update 21 days ago
I built a demo showing how to implement Cache-Augmented Generation (CAG) in an LLM and compare its performance gains against RAG (111 stars, 20 forks).

https://github.com/ronantakizawa/cacheaugmentedgeneration

CAG preloads document content into an LLM’s context as a precomputed key-value (KV) cache. This caching eliminates the need for real-time retrieval during inference, reducing token usage by up to 76% while maintaining answer quality.

CAG is particularly effective for constrained knowledge bases like internal documentation, FAQs, and customer support systems, where all relevant information can fit within the model's extended context window.
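Under the hood, the idea is simply to run the documents through the model once, keep the resulting KV cache, and reuse it for every query. A minimal sketch of that pattern with Hugging Face transformers (illustrative, assuming a generic causal LM and a hypothetical internal_docs.txt; not necessarily the repo's exact code):

```python
# Minimal CAG sketch: precompute the knowledge base's KV cache once, reuse it per query.
# Model name, file path, and prompt layout are illustrative assumptions.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# 1) Encode the whole knowledge base once and precompute its KV cache.
knowledge = open("internal_docs.txt").read()  # hypothetical knowledge base
prefix = f"Answer questions using only this documentation:\n{knowledge}\n"
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)

prefix_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=prefix_ids, past_key_values=prefix_cache, use_cache=True)

# 2) Answer queries by extending a copy of the cache; no retrieval at inference time.
def answer(question: str, max_new_tokens: int = 128) -> str:
    cache = copy.deepcopy(prefix_cache)  # keep the original prefix cache pristine
    q_ids = tokenizer(f"\nQ: {question}\nA:", return_tensors="pt").input_ids.to(model.device)
    full_ids = torch.cat([prefix_ids, q_ids], dim=-1)
    out = model.generate(full_ids, past_key_values=cache, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)

print(answer("How do I rotate an API key?"))
```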

#rag #retrievalaugmentedgeneration
ronantakizawa 
posted an update 22 days ago
Reached 1000+ total downloads across my models and datasets! 🎉

Follow me for more @ronantakizawa
ronantakizawa 
posted an update 23 days ago
Introducing the Japanese honorifics dataset: 137 sentences covering the three main keigo forms: 尊敬語 (Sonkeigo, respectful), 謙譲語 (Kenjōgo, humble), and 丁寧語 (Teineigo, polite). Each entry includes the base form, all three honorific transformations, and English translations for essential Japanese phrases. This dataset is perfect for training LLMs and evaluating their Japanese proficiency.

#japanese #japanesedataset

ronantakizawa/japanese-honorifics
ronantakizawa 
posted an update 30 days ago
Introducing JFLEG-JA, a new Japanese language error correction benchmark with 1,335 sentences, each paired with 4 high-quality human corrections 🎉

Inspired by the English JFLEG dataset, this dataset covers diverse error types, including particle mistakes, kanji mix-ups, and incorrect contextual usage of verbs, adjectives, and literary techniques.

You can use this for evaluating LLMs, few-shot learning, error analysis, or fine-tuning correction systems.
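A quick way to poke at it is with the datasets library (the split and field names below are assumptions, so check the dataset card):

```python
# Minimal sketch: load JFLEG-JA for a quick look (field names are assumed, see the dataset card).
from datasets import load_dataset

ds = load_dataset("ronantakizawa/jfleg-japanese", split="train")
example = ds[0]
print(example["sentence"])      # assumed field: the sentence containing errors
print(example["corrections"])   # assumed field: the 4 human reference corrections
```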

ronantakizawa/jfleg-japanese

#japanese #evals #benchmark
ronantakizawa 
posted an update about 1 month ago
Introducing the Medical-o1-Reasoning-SFT-Japanese dataset 🎉

This is a Japanese dataset consisting of questions, reasoning traces, and answers for complex medical topics.

#japanese #medical #dataset


ronantakizawa/Medical-o1-Reasoning-SFT-Japanese
ronantakizawa 
posted an update about 1 month ago
Introducing the Finance-Instruct-500k-Japanese dataset 🎉

This is a Japanese-translated version of the @Josephgflowers Finance-Instruct-500k dataset, which includes complex questions and answers related to finance and economics.

#datasets #finance #finance-instruct #japanese

ronantakizawa/Finance-Instruct-500k-Japanese
ronantakizawa 
posted an update about 1 month ago
Excited to announce 4 AWQ quantized models from #AllenAI! 🎉

Molmo-7B-D AWQ (14GB→5GB): Efficient VLM performing between GPT-4V and GPT-4o on academic benchmarks, with just 6.1% perplexity degradation.

MolmoAct-7B-D AWQ (14GB→6GB): Specialized robotic manipulation model reduced by ~57%.

Molmo-72B AWQ (145GB→38GB): VLM with Qwen2-72B decoder that performs competitively with GPT-4, achieving only 10.5% perplexity degradation while saving 107GB of memory.

OLMo-2-32B-Instruct AWQ (64GB→17GB): LLM post-trained on Tülu 3 with 3% perplexity degradation while saving ~50GB.

For all the VLMs, only the text models were quantized (a minimal loading sketch follows the model links below).

ronantakizawa/molmo-7b-d-awq
ronantakizawa/molmoact-7b-d-awq
ronantakizawa/molmo-72b-awq
ronantakizawa/olmo2-32b-instruct-awq
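A minimal loading sketch, assuming transformers with autoawq installed (illustrative; the repo id is the OLMo-2 checkpoint from the list above, the prompt is arbitrary, and exact usage may differ):

```python
# Minimal sketch: run the OLMo-2 AWQ checkpoint with transformers (requires autoawq).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ronantakizawa/olmo2-32b-instruct-awq"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = "Explain AWQ quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```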
ronantakizawa 
posted an update about 2 months ago