Timestep-Aware Semantic Alignment for Gemma Encoder
- uses a bidirectional attention mask, with padded tokens masked out
- trained on text-image pairs targeting low- and high-frequency details according to the diffusion timestep
- explicit timestep conditioning integrated into the encoder
- as in previous releases, the dataset was cleaned of overused phrases and stop words
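The bullets above can be sketched as a minimal module. This is an illustrative sketch only, assuming a sinusoidal embedding of a normalized timestep added to every token position before bidirectional self-attention; the class and parameter names (`TimestepAwareEncoder`, `hidden_dim`, `pad_mask`) are hypothetical and not the model's actual API or architecture.

```python
import torch
import torch.nn as nn

class TimestepAwareEncoder(nn.Module):
    """Illustrative sketch (not the released architecture): inject a
    timestep embedding into a bidirectional text encoder."""

    def __init__(self, hidden_dim=256, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # Bidirectional self-attention over the (padded) token sequence
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.t_proj = nn.Linear(hidden_dim, hidden_dim)

    def timestep_embedding(self, t, dim):
        # Standard sinusoidal embedding of a normalized timestep t in [0, 1]
        half = dim // 2
        freqs = torch.exp(-torch.arange(half) * torch.log(torch.tensor(10000.0)) / half)
        args = t[:, None] * freqs[None, :]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    def forward(self, token_ids, t, pad_mask=None):
        h = self.embed(token_ids)
        # Add the timestep signal to every token position
        h = h + self.t_proj(self.timestep_embedding(t, h.shape[-1]))[:, None, :]
        # Bidirectional attention; key_padding_mask hides padded tokens
        out, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        return out
```

The key design point is that the same prompt yields different embeddings at different timesteps, letting the encoder emphasize coarse structure early in denoising and fine detail late.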
# Encode once with a fixed timestep (t normalized to [0, 1])
prompt_embeds = encoder.encode(text, t=torch.ones((1,)))

# Or re-encode at every denoising step via the pipeline callback;
# the raw scheduler timestep is normalized by the 1000 training timesteps
callback_on_step_end = lambda pipe, i, t, kwargs: {
    'prompt_embeds': encoder.encode(text, t=t / 1000.0)
}
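The per-step re-encoding pattern above can be sketched end-to-end with a stand-in encoder. This is a hedged sketch: `DummyEncoder` replaces the real Gemma encoder, `make_callback` is a hypothetical helper, and the 1000 divisor assumes a scheduler with 1000 training timesteps. In a real diffusers pipeline, `callback_on_step_end` receives `(pipeline, step_index, timestep, callback_kwargs)` and the returned dict overrides tensors for the next step.

```python
class DummyEncoder:
    # Stand-in for the timestep-aware encoder's encode(text, t) method
    def encode(self, text, t):
        return {"text": text, "t": float(t)}

def make_callback(encoder, text, num_train_timesteps=1000):
    # Normalize the raw scheduler timestep to [0, 1] before re-encoding
    def callback(pipe, step_index, timestep, callback_kwargs):
        return {"prompt_embeds": encoder.encode(text, t=timestep / num_train_timesteps)}
    return callback

cb = make_callback(DummyEncoder(), "1girl, solo")
out = cb(None, 0, 500, {})
# out["prompt_embeds"]["t"] is 0.5 at the halfway timestep
```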
References
- arXiv:2403.05135
Datasets
- animelover/touhou-images
- animesfw
- alfredplpl/artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- colormix (synthetic color, fashion dataset)
- danbooru2023-florence2-caption (verb, action clauses)
- spatial-caption
- spright-coco
- picollect
- pixiv rank
- trojblue/danbooru2025-metadata