Commit 3470339 ("aduc-sdr") · Parent(s): 7bdd354 · committed by Carlexxx

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
- NOTICE.md +76 -0
- README.md +204 -7
- configs/ltxv-13b-0.9.7-dev.yaml +34 -0
- configs/ltxv-13b-0.9.7-distilled.yaml +28 -0
- configs/ltxv-13b-0.9.8-dev-fp8.yaml +34 -0
- configs/ltxv-13b-0.9.8-dev.yaml +34 -0
- configs/ltxv-13b-0.9.8-distilled-fp8.yaml +29 -0
- configs/ltxv-13b-0.9.8-distilled.yaml +29 -0
- configs/ltxv-2b-0.9.1.yaml +17 -0
- configs/ltxv-2b-0.9.5.yaml +17 -0
- configs/ltxv-2b-0.9.6-dev.yaml +17 -0
- configs/ltxv-2b-0.9.6-distilled.yaml +16 -0
- configs/ltxv-2b-0.9.8-distilled-fp8.yaml +28 -0
- configs/ltxv-2b-0.9.8-distilled.yaml +28 -0
- configs/ltxv-2b-0.9.yaml +17 -0
- deformes4D_engine.py +292 -0
- dreamo/LICENSE.txt +201 -0
- dreamo/README.md +135 -0
- dreamo/dreamo_pipeline.py +507 -0
- dreamo/transformer.py +187 -0
- dreamo/utils.py +232 -0
- flux_kontext_helpers.py +151 -0
- gemini_helpers.py +257 -0
- hardware_manager.py +35 -0
- i18n.json +128 -0
- image_specialist.py +98 -0
- inference.py +774 -0
- ltx_manager_helpers.py +198 -0
- ltx_video/LICENSE.txt +201 -0
- ltx_video/README.md +135 -0
- ltx_video/__init__.py +0 -0
- ltx_video/models/__init__.py +0 -0
- ltx_video/models/autoencoders/__init__.py +0 -0
- ltx_video/models/autoencoders/causal_conv3d.py +63 -0
- ltx_video/models/autoencoders/causal_video_autoencoder.py +1398 -0
- ltx_video/models/autoencoders/conv_nd_factory.py +90 -0
- ltx_video/models/autoencoders/dual_conv3d.py +217 -0
- ltx_video/models/autoencoders/latent_upsampler.py +203 -0
- ltx_video/models/autoencoders/pixel_norm.py +12 -0
- ltx_video/models/autoencoders/pixel_shuffle.py +33 -0
- ltx_video/models/autoencoders/vae.py +380 -0
- ltx_video/models/autoencoders/vae_encode.py +247 -0
- ltx_video/models/autoencoders/video_autoencoder.py +1045 -0
- ltx_video/models/transformers/__init__.py +0 -0
- ltx_video/models/transformers/attention.py +1264 -0
- ltx_video/models/transformers/embeddings.py +129 -0
- ltx_video/models/transformers/symmetric_patchifier.py +84 -0
- ltx_video/models/transformers/transformer3d.py +507 -0
- ltx_video/pipelines/__init__.py +0 -0
- ltx_video/pipelines/crf_compressor.py +50 -0
NOTICE.md
ADDED
@@ -0,0 +1,76 @@

# NOTICE

Copyright (C) 2025 Carlos Rodrigues dos Santos. All rights reserved.

---

## Aviso de Propriedade Intelectual e Licenciamento

### **Processo de Patenteamento em Andamento (EM PORTUGUÊS):**

O método e o sistema de orquestração de prompts denominados **ADUC (Automated Discovery and Orchestration of Complex tasks)**, conforme descritos neste documento e implementados neste software, estão atualmente em processo de patenteamento.

O titular dos direitos, Carlos Rodrigues dos Santos, está buscando proteção legal para as inovações chave da arquitetura ADUC, incluindo, mas não se limitando a:

* Fragmentação e escalonamento de solicitações que excedem limites de contexto de modelos de IA.
* Distribuição inteligente de sub-tarefas para especialistas heterogêneos.
* Gerenciamento de estado persistido com avaliação iterativa e realimentação para o planejamento de próximas etapas.
* Planejamento e roteamento sensível a custo, latência e requisitos de qualidade.
* O uso de "tokens universais" para comunicação agnóstica a modelos.

### **Reconhecimento e Implicações (EM PORTUGUÊS):**

Ao acessar ou utilizar este software e a arquitetura ADUC aqui implementada, você reconhece:

1. A natureza inovadora e a importância da arquitetura ADUC no campo da orquestração de prompts para IA.
2. Que a essência desta arquitetura, ou suas implementações derivadas, podem estar sujeitas a direitos de propriedade intelectual, incluindo patentes.
3. Que o uso comercial, a reprodução da lógica central da ADUC em sistemas independentes, ou a exploração direta da invenção sem o devido licenciamento podem infringir os direitos de patente pendente.

---

### **Patent Pending (IN ENGLISH):**

The method and system for prompt orchestration named **ADUC (Automated Discovery and Orchestration of Complex tasks)**, as described herein and implemented in this software, are currently in the process of being patented.

The rights holder, Carlos Rodrigues dos Santos, is seeking legal protection for the key innovations of the ADUC architecture, including, but not limited to:

* Fragmentation and scaling of requests exceeding AI model context limits.
* Intelligent distribution of sub-tasks to heterogeneous specialists.
* Persistent state management with iterative evaluation and feedback for planning subsequent steps.
* Cost, latency, and quality-aware planning and routing.
* The use of "universal tokens" for model-agnostic communication.

### **Acknowledgement and Implications (IN ENGLISH):**

By accessing or using this software and the ADUC architecture implemented herein, you acknowledge:

1. The innovative nature and significance of the ADUC architecture in the field of AI prompt orchestration.
2. That the essence of this architecture, or its derivative implementations, may be subject to intellectual property rights, including patents.
3. That commercial use, reproduction of ADUC's core logic in independent systems, or direct exploitation of the invention without proper licensing may infringe upon pending patent rights.

---

## Licença AGPLv3

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.

---

**Contato para Consultas:**

Para mais informações sobre a arquitetura ADUC, o status do patenteamento, ou para discutir licenciamento para usos comerciais ou não conformes com a AGPLv3, por favor, entre em contato:

Carlos Rodrigues dos Santos
Rua Eduardo Carlos Pereira, 4125, B1 Ap32, Curitiba, PR, Brazil, CEP 8102025
README.md
CHANGED
@@ -1,13 +1,210 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: Euia-AducSdr
+emoji: 🎥
+colorFrom: indigo
+colorTo: purple
 sdk: gradio
-sdk_version: 5.44.0
 app_file: app.py
-pinned:
+pinned: true
 license: agpl-3.0
+short_description: Uma implementação aberta e funcional da arquitetura ADUC-SDR
 ---

### 🇧🇷 Português

Uma implementação aberta e funcional da arquitetura ADUC-SDR (Arquitetura de Unificação Compositiva - Escala Dinâmica e Resiliente), projetada para a geração de vídeo coerente de longa duração. Este projeto materializa os princípios de fragmentação, navegação geométrica e um mecanismo de "eco causal 4bits memoria" para garantir a continuidade física e narrativa em sequências de vídeo geradas por múltiplos modelos de IA.

**Licença:** Este projeto é licenciado sob os termos da **GNU Affero General Public License v3.0**. Isto significa que se você usar este software (ou qualquer trabalho derivado) para fornecer um serviço através de uma rede, você é **obrigado a disponibilizar o código-fonte completo** da sua versão para os usuários desse serviço.

- **Copyright (C) 4 de Agosto de 2025, Carlos Rodrigues dos Santos**
- Uma cópia completa da licença pode ser encontrada no arquivo [LICENSE](LICENSE).

---

### 🇬🇧 English

An open and functional implementation of the ADUC-SDR (Architecture for Compositive Unification - Dynamic and Resilient Scaling) architecture, designed for long-form coherent video generation. This project materializes the principles of fragmentation, geometric navigation, and a "causal echo 4-bit memory" mechanism to ensure physical and narrative continuity in video sequences generated by multiple AI models.

**License:** This project is licensed under the terms of the **GNU Affero General Public License v3.0**. This means that if you use this software (or any derivative work) to provide a service over a network, you are **required to make the complete source code** of your version available to the users of that service.

- **Copyright (C) August 4, 2025, Carlos Rodrigues dos Santos**
- A full copy of the license can be found in the [LICENSE](LICENSE) file.

---

## **Aviso de Propriedade Intelectual e Patenteamento**

### **Processo de Patenteamento em Andamento (EM PORTUGUÊS):**

A arquitetura e o método **ADUC (Automated Discovery and Orchestration of Complex tasks)**, conforme descritos neste projeto e nas reivindicações associadas, estão **atualmente em processo de patenteamento**.

O titular dos direitos, Carlos Rodrigues dos Santos, está buscando proteção legal para as inovações chave da arquitetura ADUC, que incluem, mas não se limitam a:

* Fragmentação e escalonamento de solicitações que excedem limites de contexto de modelos de IA.
* Distribuição inteligente de sub-tarefas para especialistas heterogêneos.
* Gerenciamento de estado persistido com avaliação iterativa e realimentação para o planejamento de próximas etapas.
* Planejamento e roteamento sensível a custo, latência e requisitos de qualidade.
* O uso de "tokens universais" para comunicação agnóstica a modelos.

Ao utilizar este software e a arquitetura ADUC aqui implementada, você reconhece a natureza inovadora desta arquitetura e que a **reprodução ou exploração da lógica central da ADUC em sistemas independentes pode infringir direitos de patente pendente.**

---

### **Patent Pending (IN ENGLISH):**

The **ADUC (Automated Discovery and Orchestration of Complex tasks)** architecture and method, as described in this project and its associated claims, are **currently in the process of being patented.**

The rights holder, Carlos Rodrigues dos Santos, is seeking legal protection for the key innovations of the ADUC architecture, including, but not limited to:

* Fragmentation and scaling of requests exceeding AI model context limits.
* Intelligent distribution of sub-tasks to heterogeneous specialists.
* Persistent state management with iterative evaluation and feedback for planning subsequent steps.
* Cost, latency, and quality-aware planning and routing.
* The use of "universal tokens" for model-agnostic communication.

By using this software and the ADUC architecture implemented herein, you acknowledge the innovative nature of this architecture and that **the reproduction or exploitation of ADUC's core logic in independent systems may infringe upon pending patent rights.**

---

### Detalhes Técnicos e Reivindicações da ADUC

#### 🇧🇷 Definição Curta (para Tese e Patente)

**ADUC** é um *framework pré-input* e *intermediário* de **gerenciamento de prompts** que:

1. **fragmenta** solicitações acima do limite de contexto de qualquer modelo,
2. **escala linearmente** (processo sequencial com memória persistida),
3. **distribui** sub-tarefas a **especialistas** (modelos/ferramentas heterogêneos), e
4. **realimenta** a próxima etapa com avaliação do que foi feito/esperado (LLM diretor).

Não é um modelo; é uma **camada orquestradora** plugável antes do input de modelos existentes (texto, imagem, áudio, vídeo), usando *tokens universais* e a tecnologia atual.

#### 🇬🇧 Short Definition (for Thesis and Patent)

**ADUC** is a *pre-input* and *intermediate* **prompt management framework** that:

1. **fragments** requests exceeding any model's context limit,
2. **scales linearly** (sequential process with persisted memory),
3. **distributes** sub-tasks to **specialists** (heterogeneous models/tools), and
4. **feeds back** to the next step with an evaluation of what was done/expected (director LLM).

It is not a model; it is a pluggable **orchestration layer** before the input of existing models (text, image, audio, video), using *universal tokens* and current technology.

---

#### 🇧🇷 Elementos Essenciais (Telegráfico)

* **Agnóstico a modelos:** opera com qualquer LLM/difusor/API.
* **Pré-input manager:** recebe pedido do usuário, **divide** em blocos ≤ limite de tokens, **prioriza**, **agenda** e **roteia**.
* **Memória persistida:** resultados/latentes/"eco" viram **estado compartilhado** para o próximo bloco (nada é ignorado).
* **Especialistas:** *routers* decidem quem faz o quê (ex.: "descrição → LLM-A", "keyframe → Img-B", "vídeo → Vid-C").
* **Controle de qualidade:** LLM diretor compara *o que fez* × *o que deveria* × *o que falta* e **regenera objetivos** do próximo fragmento.
* **Custo/latência-aware:** planeja pela **VRAM/tempo/custo**, não tenta "abraçar tudo de uma vez".

#### 🇬🇧 Essential Elements (Telegraphic)

* **Model-agnostic:** operates with any LLM/diffuser/API.
* **Pre-input manager:** receives user request, **divides** into blocks ≤ token limit, **prioritizes**, **schedules**, and **routes**.
* **Persisted memory:** results/latents/"echo" become **shared state** for the next block (nothing is ignored).
* **Specialists:** *routers* decide who does what (e.g., "description → LLM-A", "keyframe → Img-B", "video → Vid-C").
* **Quality control:** director LLM compares *what was done* × *what should be done* × *what is missing* and **regenerates objectives** for the next fragment.
* **Cost/latency-aware:** plans by **VRAM/time/cost**, does not try to "embrace everything at once".

---

#### 🇧🇷 Reivindicações Independentes (Método e Sistema)

**Reivindicação Independente (Método) — Versão Enxuta:**

1. **Método** de **orquestração de prompts** para execução de tarefas acima do limite de contexto de modelos de IA, compreendendo:
(a) **receber** uma solicitação que excede um limite de tokens;
(b) **analisar** a solicitação por um **LLM diretor** e **fragmentá-la** em sub-tarefas ≤ limite;
(c) **selecionar** especialistas de execução para cada sub-tarefa com base em capacidades declaradas;
(d) **gerar** prompts específicos por sub-tarefa em **tokens universais**, incluindo referências ao **estado persistido** de execuções anteriores;
(e) **executar sequencialmente** as sub-tarefas e **persistir** suas saídas como memória (incluindo latentes/eco/artefatos);
(f) **avaliar** automaticamente a saída versus metas declaradas e **regenerar objetivos** do próximo fragmento;
(g) **iterar** (b)–(f) até que os critérios de completude sejam atendidos, produzindo o resultado agregado;
em que o framework **escala linearmente** no tempo e armazenamento físico, **independente** da janela de contexto dos modelos subjacentes.

**Reivindicação Independente (Sistema):**

2. **Sistema** de orquestração de prompts, compreendendo: um **planejador LLM diretor**; um **roteador de especialistas**; um **banco de estado persistido** (incl. memória cinética para vídeo); um **gerador de prompts universais**; e um **módulo de avaliação/realimentação**, acoplados por uma **API pré-input** a modelos heterogêneos.

#### 🇬🇧 Independent Claims (Method and System)

**Independent Claim (Method) — Concise Version:**

1. A **method** for **prompt orchestration** for executing tasks exceeding AI model context limits, comprising:
(a) **receiving** a request that exceeds a token limit;
(b) **analyzing** the request by a **director LLM** and **fragmenting it** into sub-tasks ≤ the limit;
(c) **selecting** execution specialists for each sub-task based on declared capabilities;
(d) **generating** specific prompts per sub-task in **universal tokens**, including references to the **persisted state** of previous executions;
(e) **sequentially executing** the sub-tasks and **persisting** their outputs as memory (including latents/echo/artifacts);
(f) **automatically evaluating** the output against declared goals and **regenerating objectives** for the next fragment;
(g) **iterating** (b)–(f) until completion criteria are met, producing the aggregated result;
wherein the framework **scales linearly** in time and physical storage, **independent** of the context window of the underlying models.

**Independent Claim (System):**

2. A prompt orchestration **system**, comprising: a **director LLM planner**; a **specialist router**; a **persisted state bank** (incl. kinetic memory for video); a **universal prompt generator**; and an **evaluation/feedback module**, coupled via a **pre-input API** to heterogeneous models.

---

#### 🇧🇷 Dependentes Úteis

* (3) Onde o roteamento considera **custo/latência/VRAM** e metas de qualidade.
* (4) Onde o banco de estado inclui **eco cinético** para vídeo (últimos *n* frames/latentes/fluxo).
* (5) Onde a avaliação usa métricas específicas por domínio (Lflow, consistência semântica, etc.).
* (6) Onde *tokens universais* padronizam instruções entre especialistas.
* (7) Onde a orquestração decide **cut vs continuous** e **corte regenerativo** (Déjà-Vu) ao editar vídeo.
* (8) Onde o sistema **nunca descarta** conteúdo excedente: **reagenda** em novos fragmentos.

#### 🇬🇧 Useful Dependents

* (3) Wherein routing considers **cost/latency/VRAM** and quality goals.
* (4) Wherein the state bank includes **kinetic echo** for video (last *n* frames/latents/flow).
* (5) Wherein evaluation uses domain-specific metrics (Lflow, semantic consistency, etc.).
* (6) Wherein *universal tokens* standardize instructions between specialists.
* (7) Wherein orchestration decides **cut vs continuous** and **regenerative cut** (Déjà-Vu) when editing video.
* (8) Wherein the system **never discards** excess content: it **reschedules** it in new fragments.

---

#### 🇧🇷 Como isso conversa com SDR (Vídeo)

* **Eco Cinético**: é um **tipo de estado persistido** consumido pelo próximo passo.
* **Déjà-Vu (Corte Regenerativo)**: é **uma política de orquestração** aplicada quando há edição; ADUC decide, monta os prompts certos e chama o especialista de vídeo.
* **Cut vs Continuous**: decisão do **diretor** com base em estado + metas; ADUC roteia e garante a sobreposição/remoção final.

#### 🇬🇧 How this Converses with SDR (Video)

* **Kinetic Echo**: is a **type of persisted state** consumed by the next step.
* **Déjà-Vu (Regenerative Cut)**: is an **orchestration policy** applied during editing; ADUC decides, crafts the right prompts, and calls the video specialist.
* **Cut vs Continuous**: decision made by the **director** based on state + goals; ADUC routes and ensures the final overlap/removal.

---

#### 🇧🇷 Mensagem Clara ao Usuário (Experiência)

> "Seu pedido excede o limite X do modelo Y. Em vez de truncar silenciosamente, o **ADUC** dividirá e **entregará 100%** do conteúdo por etapas coordenadas."

Isso é diferencial prático e jurídico: **não-obviedade** por transformar limite de contexto em **pipeline controlado**, com **persistência de estado** e **avaliação iterativa**.

#### 🇬🇧 Clear User Message (Experience)

> "Your request exceeds model Y's limit X. Instead of silently truncating, **ADUC** will divide and **deliver 100%** of the content through coordinated steps."

This is a practical and legal differentiator: **non-obviousness** by transforming context limits into a **controlled pipeline**, with **state persistence** and **iterative evaluation**.

---

### Contact / Contato / Contacto

- **Author / Autor:** Carlos Rodrigues dos Santos
- **Email:** [email protected]
- **GitHub:** [https://github.com/carlex22/Aduc-sdr](https://github.com/carlex22/Aduc-sdr)
- **Hugging Face Spaces:**
  - [Ltx-SuperTime-60Secondos](https://huggingface.co/spaces/Carlexx/Ltx-SuperTime-60Secondos/)
  - [Novinho](https://huggingface.co/spaces/Carlexxx/Novinho/)

---
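The README above describes the ADUC loop only at the architecture level. As a purely illustrative aid that is not part of this commit, the sketch below shows one toy way the fragment → route → execute → persist → evaluate cycle from the method claim could be wired up in Python; every function here (`fragment`, `route`, `execute`, `evaluate`) is a trivial stand-in for the director LLM, specialist router, specialists, and evaluation module of the real system.

```python
# Toy ADUC-style orchestration loop (illustrative sketch only, not from this repository).
from dataclasses import dataclass, field

@dataclass
class State:
    memory: list = field(default_factory=list)  # persisted outputs ("echo", latents, artifacts)

def fragment(request: str, token_limit: int) -> list[str]:
    # (b) director step: split an oversized request into blocks <= token_limit words
    words = request.split()
    return [" ".join(words[i:i + token_limit]) for i in range(0, len(words), token_limit)]

def route(sub_task: str) -> str:
    # (c) pick a specialist from declared capabilities (toy rule)
    return "video-specialist" if "scene" in sub_task else "text-specialist"

def execute(specialist: str, sub_task: str, memory: list) -> str:
    # (d)+(e) build a prompt that references persisted state, then run the specialist
    context = memory[-1] if memory else "no prior state"
    return f"[{specialist}] handled '{sub_task}' given '{context}'"

def evaluate(output: str, remaining: list[str]) -> list[str]:
    # (f) compare done vs. expected; this toy version leaves the remaining plan unchanged
    return remaining

def orchestrate(request: str, token_limit: int = 8) -> list[str]:
    state = State()
    pending = fragment(request, token_limit)
    while pending:                   # (g) iterate until completion criteria are met
        task = pending.pop(0)
        out = execute(route(task), task, state.memory)
        state.memory.append(out)     # nothing is discarded; every output is persisted
        pending = evaluate(out, pending)
    return state.memory

print(orchestrate("describe the opening scene then generate a keyframe and a short video clip"))
```

Even in this toy form, the key property of the claim is visible: the request is never truncated, each block fits the declared limit, and each step sees the persisted state of the previous one.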
configs/ltxv-13b-0.9.7-dev.yaml
ADDED
@@ -0,0 +1,34 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-13b-0.9.7-dev.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.7.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  guidance_scale: [1, 1, 6, 8, 6, 1, 1]
  stg_scale: [0, 0, 4, 4, 4, 2, 1]
  rescaling_scale: [1, 1, 0.5, 0.5, 1, 1, 1]
  guidance_timesteps: [1.0, 0.996, 0.9933, 0.9850, 0.9767, 0.9008, 0.6180]
  skip_block_list: [[], [11, 25, 35, 39], [22, 35, 39], [28], [28], [28], [28]]
  num_inference_steps: 30
  skip_final_inference_steps: 3
  cfg_star_rescale: true

second_pass:
  guidance_scale: [1]
  stg_scale: [1]
  rescaling_scale: [1]
  guidance_timesteps: [1.0]
  skip_block_list: [27]
  num_inference_steps: 30
  skip_initial_inference_steps: 17
  cfg_star_rescale: true
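These configs are plain YAML consumed by the LTX-Video pipeline code added in this commit. A minimal sketch of inspecting one of them with PyYAML is shown below; it assumes PyYAML is installed and says nothing about the actual entry point (e.g., inference.py) that consumes these files, which is not shown in this truncated view.

```python
# Minimal sketch: load one of the pipeline configs and read a few fields (assumes PyYAML).
import yaml

with open("configs/ltxv-13b-0.9.7-dev.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["pipeline_type"])                      # "multi-scale"
print(cfg["first_pass"]["num_inference_steps"])  # 30
print(cfg["second_pass"]["guidance_timesteps"])  # [1.0]
```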
configs/ltxv-13b-0.9.7-distilled.yaml
ADDED
@@ -0,0 +1,28 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-13b-0.9.7-distilled.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.7.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  timesteps: [1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]

second_pass:
  timesteps: [0.9094, 0.7250, 0.4219]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]
configs/ltxv-13b-0.9.8-dev-fp8.yaml
ADDED
@@ -0,0 +1,34 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-13b-0.9.8-dev-fp8.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.8.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "float8_e4m3fn" # options: "float8_e4m3fn", "bfloat16", "mixed_precision"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  guidance_scale: [1, 1, 6, 8, 6, 1, 1]
  stg_scale: [0, 0, 4, 4, 4, 2, 1]
  rescaling_scale: [1, 1, 0.5, 0.5, 1, 1, 1]
  guidance_timesteps: [1.0, 0.996, 0.9933, 0.9850, 0.9767, 0.9008, 0.6180]
  skip_block_list: [[], [11, 25, 35, 39], [22, 35, 39], [28], [28], [28], [28]]
  num_inference_steps: 30
  skip_final_inference_steps: 3
  cfg_star_rescale: true

second_pass:
  guidance_scale: [1]
  stg_scale: [1]
  rescaling_scale: [1]
  guidance_timesteps: [1.0]
  skip_block_list: [27]
  num_inference_steps: 30
  skip_initial_inference_steps: 17
  cfg_star_rescale: true
configs/ltxv-13b-0.9.8-dev.yaml
ADDED
@@ -0,0 +1,34 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-13b-0.9.8-dev.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.8.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  guidance_scale: [1, 1, 6, 8, 6, 1, 1]
  stg_scale: [0, 0, 4, 4, 4, 2, 1]
  rescaling_scale: [1, 1, 0.5, 0.5, 1, 1, 1]
  guidance_timesteps: [1.0, 0.996, 0.9933, 0.9850, 0.9767, 0.9008, 0.6180]
  skip_block_list: [[], [11, 25, 35, 39], [22, 35, 39], [28], [28], [28], [28]]
  num_inference_steps: 30
  skip_final_inference_steps: 3
  cfg_star_rescale: true

second_pass:
  guidance_scale: [1]
  stg_scale: [1]
  rescaling_scale: [1]
  guidance_timesteps: [1.0]
  skip_block_list: [27]
  num_inference_steps: 30
  skip_initial_inference_steps: 17
  cfg_star_rescale: true
configs/ltxv-13b-0.9.8-distilled-fp8.yaml
ADDED
@@ -0,0 +1,29 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-13b-0.9.8-distilled-fp8.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.8.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "float8_e4m3fn" # options: "float8_e4m3fn", "bfloat16", "mixed_precision"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  timesteps: [1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]

second_pass:
  timesteps: [0.9094, 0.7250, 0.4219]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]
  tone_map_compression_ratio: 0.6
configs/ltxv-13b-0.9.8-distilled.yaml
ADDED
@@ -0,0 +1,29 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-13b-0.9.8-distilled.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.8.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  timesteps: [1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]

second_pass:
  timesteps: [0.9094, 0.7250, 0.4219]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]
  tone_map_compression_ratio: 0.6
configs/ltxv-2b-0.9.1.yaml
ADDED
@@ -0,0 +1,17 @@

pipeline_type: base
checkpoint_path: "ltx-video-2b-v0.9.1.safetensors"
guidance_scale: 3
stg_scale: 1
rescaling_scale: 0.7
skip_block_list: [19]
num_inference_steps: 40
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false
configs/ltxv-2b-0.9.5.yaml
ADDED
@@ -0,0 +1,17 @@

pipeline_type: base
checkpoint_path: "ltx-video-2b-v0.9.5.safetensors"
guidance_scale: 3
stg_scale: 1
rescaling_scale: 0.7
skip_block_list: [19]
num_inference_steps: 40
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false
configs/ltxv-2b-0.9.6-dev.yaml
ADDED
@@ -0,0 +1,17 @@

pipeline_type: base
checkpoint_path: "ltxv-2b-0.9.6-dev-04-25.safetensors"
guidance_scale: 3
stg_scale: 1
rescaling_scale: 0.7
skip_block_list: [19]
num_inference_steps: 40
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false
configs/ltxv-2b-0.9.6-distilled.yaml
ADDED
@@ -0,0 +1,16 @@

pipeline_type: base
checkpoint_path: "ltxv-2b-0.9.6-distilled-04-25.safetensors"
guidance_scale: 1
stg_scale: 0
rescaling_scale: 1
num_inference_steps: 8
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: true
configs/ltxv-2b-0.9.8-distilled-fp8.yaml
ADDED
@@ -0,0 +1,28 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-2b-0.9.8-distilled-fp8.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.8.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "float8_e4m3fn" # options: "float8_e4m3fn", "bfloat16", "mixed_precision"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  timesteps: [1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]

second_pass:
  timesteps: [0.9094, 0.7250, 0.4219]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]
configs/ltxv-2b-0.9.8-distilled.yaml
ADDED
@@ -0,0 +1,28 @@

pipeline_type: multi-scale
checkpoint_path: "ltxv-2b-0.9.8-distilled.safetensors"
downscale_factor: 0.6666666
spatial_upscaler_model_path: "ltxv-spatial-upscaler-0.9.8.safetensors"
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false

first_pass:
  timesteps: [1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]

second_pass:
  timesteps: [0.9094, 0.7250, 0.4219]
  guidance_scale: 1
  stg_scale: 0
  rescaling_scale: 1
  skip_block_list: [42]
configs/ltxv-2b-0.9.yaml
ADDED
@@ -0,0 +1,17 @@

pipeline_type: base
checkpoint_path: "ltx-video-2b-v0.9.safetensors"
guidance_scale: 3
stg_scale: 1
rescaling_scale: 0.7
skip_block_list: [19]
num_inference_steps: 40
stg_mode: "attention_values" # options: "attention_values", "attention_skip", "residual", "transformer_block"
decode_timestep: 0.05
decode_noise_scale: 0.025
text_encoder_model_name_or_path: "PixArt-alpha/PixArt-XL-2-1024-MS"
precision: "bfloat16"
sampler: "from_checkpoint" # options: "uniform", "linear-quadratic", "from_checkpoint"
prompt_enhancement_words_threshold: 120
prompt_enhancer_image_caption_model_name_or_path: "MiaoshouAI/Florence-2-large-PromptGen-v2.0"
prompt_enhancer_llm_model_name_or_path: "unsloth/Llama-3.2-3B-Instruct"
stochastic_sampling: false
deformes4D_engine.py
ADDED
@@ -0,0 +1,292 @@

# deformes4D_engine.py
# Copyright (C) 4 de Agosto de 2025 Carlos Rodrigues dos Santos
#
# MODIFICATIONS FOR ADUC-SDR:
# Copyright (C) 2025 Carlos Rodrigues dos Santos. All rights reserved.
#
# This file is part of the ADUC-SDR project. It contains the core logic for
# video fragment generation, latent manipulation, and dynamic editing,
# governed by the ADUC orchestrator.
# This component is licensed under the GNU Affero General Public License v3.0.
#
# AVISO DE PATENTE PENDENTE: O método e sistema ADUC implementado neste
# software está em processo de patenteamento. Consulte NOTICE.md.

import os
import time
import imageio
import numpy as np
import torch
import logging
from PIL import Image, ImageOps
from dataclasses import dataclass
import gradio as gr
import subprocess
import random
import gc

from audio_specialist import audio_specialist_singleton
from ltx_manager_helpers import ltx_manager_singleton
from flux_kontext_helpers import flux_kontext_singleton
from gemini_helpers import gemini_singleton
from ltx_video.models.autoencoders.vae_encode import vae_encode, vae_decode

logger = logging.getLogger(__name__)


@dataclass
class LatentConditioningItem:
    latent_tensor: torch.Tensor
    media_frame_number: int
    conditioning_strength: float


class Deformes4DEngine:
    def __init__(self, ltx_manager, workspace_dir="deformes_workspace"):
        self.ltx_manager = ltx_manager
        self.workspace_dir = workspace_dir
        self._vae = None
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        logger.info("Especialista Deformes4D (SDR Executor) inicializado.")

    @property
    def vae(self):
        # Lazily borrow the VAE from the first LTX worker pipeline.
        if self._vae is None:
            self._vae = self.ltx_manager.workers[0].pipeline.vae
            self._vae.to(self.device)
            self._vae.eval()
        return self._vae

    def save_latent_tensor(self, tensor: torch.Tensor, path: str):
        torch.save(tensor.cpu(), path)
        logger.info(f"Tensor latente salvo em: {path}")

    def load_latent_tensor(self, path: str) -> torch.Tensor:
        tensor = torch.load(path, map_location=self.device)
        logger.info(f"Tensor latente carregado de: {path} para o dispositivo {self.device}")
        return tensor

    @torch.no_grad()
    def pixels_to_latents(self, tensor: torch.Tensor) -> torch.Tensor:
        tensor = tensor.to(self.device, dtype=self.vae.dtype)
        return vae_encode(tensor, self.vae, vae_per_channel_normalize=True)

    @torch.no_grad()
    def latents_to_pixels(self, latent_tensor: torch.Tensor, decode_timestep: float = 0.05) -> torch.Tensor:
        latent_tensor = latent_tensor.to(self.device, dtype=self.vae.dtype)
        timestep_tensor = torch.tensor([decode_timestep] * latent_tensor.shape[0], device=self.device, dtype=latent_tensor.dtype)
        return vae_decode(latent_tensor, self.vae, is_video=True, timestep=timestep_tensor, vae_per_channel_normalize=True)

    def save_video_from_tensor(self, video_tensor: torch.Tensor, path: str, fps: int = 24):
        if video_tensor is None or video_tensor.ndim != 5 or video_tensor.shape[2] == 0:
            return
        video_tensor = video_tensor.squeeze(0).permute(1, 2, 3, 0)
        video_tensor = (video_tensor.clamp(-1, 1) + 1) / 2.0
        video_np = (video_tensor.detach().cpu().float().numpy() * 255).astype(np.uint8)
        with imageio.get_writer(path, fps=fps, codec='libx264', quality=8) as writer:
            for frame in video_np:
                writer.append_data(frame)
        logger.info(f"Vídeo salvo em: {path}")

    def _preprocess_image_for_latent_conversion(self, image: Image.Image, target_resolution: tuple) -> Image.Image:
        if image.size != target_resolution:
            logger.info(f" - AÇÃO: Redimensionando imagem de {image.size} para {target_resolution} antes da conversão para latente.")
            return ImageOps.fit(image, target_resolution, Image.Resampling.LANCZOS)
        return image

    def pil_to_latent(self, pil_image: Image.Image) -> torch.Tensor:
        image_np = np.array(pil_image).astype(np.float32) / 255.0
        tensor = torch.from_numpy(image_np).permute(2, 0, 1).unsqueeze(0).unsqueeze(2)
        tensor = (tensor * 2.0) - 1.0
        return self.pixels_to_latents(tensor)

    def _generate_video_and_audio_from_latents(self, latent_tensor, audio_prompt, base_name):
        silent_video_path = os.path.join(self.workspace_dir, f"{base_name}_silent.mp4")
        pixel_tensor = self.latents_to_pixels(latent_tensor)
        self.save_video_from_tensor(pixel_tensor, silent_video_path, fps=24)
        del pixel_tensor
        gc.collect()

        try:
            result = subprocess.run(
                ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "default=noprint_wrappers=1:nokey=1", silent_video_path],
                capture_output=True, text=True, check=True)
            frag_duration = float(result.stdout.strip())
        except (subprocess.CalledProcessError, ValueError, FileNotFoundError):
            logger.warning(f"ffprobe falhou em {os.path.basename(silent_video_path)}. Calculando duração manualmente.")
            num_pixel_frames = latent_tensor.shape[2] * 8
            frag_duration = num_pixel_frames / 24.0

        video_with_audio_path = audio_specialist_singleton.generate_audio_for_video(
            video_path=silent_video_path, prompt=audio_prompt,
            duration_seconds=frag_duration)

        if os.path.exists(silent_video_path):
            os.remove(silent_video_path)
        return video_with_audio_path

    def _generate_latent_tensor_internal(self, conditioning_items, ltx_params, target_resolution, total_frames_to_generate):
        final_ltx_params = {
            **ltx_params,
            'width': target_resolution[0], 'height': target_resolution[1],
            'video_total_frames': total_frames_to_generate, 'video_fps': 24,
            'current_fragment_index': int(time.time()),
            'conditioning_items_data': conditioning_items
        }
        new_full_latents, _ = self.ltx_manager.generate_latent_fragment(**final_ltx_params)
        return new_full_latents

    def concatenate_videos_ffmpeg(self, video_paths: list[str], output_path: str) -> str:
        if not video_paths:
            raise gr.Error("Nenhum fragmento de vídeo para montar.")
        list_file_path = os.path.join(self.workspace_dir, "concat_list.txt")
        with open(list_file_path, 'w', encoding='utf-8') as f:
            for path in video_paths:
                f.write(f"file '{os.path.abspath(path)}'\n")
        cmd_list = ['ffmpeg', '-y', '-f', 'concat', '-safe', '0', '-i', list_file_path, '-c', 'copy', output_path]
        logger.info("Executando concatenação FFmpeg...")
        try:
            subprocess.run(cmd_list, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError as e:
            logger.error(f"Erro no FFmpeg: {e.stderr}")
            raise gr.Error(f"Falha na montagem final do vídeo. Detalhes: {e.stderr}")
        return output_path

    def generate_full_movie(self,
                            keyframes: list,
                            global_prompt: str,
                            storyboard: list,
                            seconds_per_fragment: float,
                            overlap_percent: int,
                            echo_frames: int,
                            handler_strength: float,
                            destination_convergence_strength: float,
                            base_ltx_params: dict,
                            video_resolution: int,
                            use_continuity_director: bool,
                            progress: gr.Progress = gr.Progress()):

        keyframe_paths = [item[0] if isinstance(item, tuple) else item for item in keyframes]
        video_clips_paths, story_history, audio_history = [], "", "This is the beginning of the film."
        target_resolution_tuple = (video_resolution, video_resolution)
        n_trim_latents = 24  # self._quantize_to_multiple(int(seconds_per_fragment * 24 * (overlap_percent / 100.0)), 8)
        echo_frames = 8

        previous_latents_path = None
        num_transitions_to_generate = len(keyframe_paths) - 1

        for i in range(num_transitions_to_generate):
            progress((i + 1) / num_transitions_to_generate, desc=f"Produzindo Transição {i+1}/{num_transitions_to_generate}")

            start_keyframe_path = keyframe_paths[i]
            destination_keyframe_path = keyframe_paths[i+1]
            present_scene_desc = storyboard[i]

            is_first_fragment = previous_latents_path is None
            if is_first_fragment:
                transition_type = "start"
                motion_prompt = gemini_singleton.get_initial_motion_prompt(
                    global_prompt, start_keyframe_path, destination_keyframe_path, present_scene_desc
                )
            else:
                past_keyframe_path = keyframe_paths[i-1]
                past_scene_desc = storyboard[i-1]
                future_scene_desc = storyboard[i+1] if (i+1) < len(storyboard) else "A cena final."
                decision = gemini_singleton.get_cinematic_decision(
                    global_prompt=global_prompt, story_history=story_history,
                    past_keyframe_path=past_keyframe_path, present_keyframe_path=start_keyframe_path,
                    future_keyframe_path=destination_keyframe_path, past_scene_desc=past_scene_desc,
                    present_scene_desc=present_scene_desc, future_scene_desc=future_scene_desc
                )
                transition_type, motion_prompt = decision["transition_type"], decision["motion_prompt"]

            story_history += f"\n- Ato {i+1} ({transition_type}): {motion_prompt}"

            if use_continuity_director:  # Assume-se que este checkbox controla os diretores de vídeo e som
                if is_first_fragment:
                    audio_prompt = gemini_singleton.get_sound_director_prompt(
                        audio_history=audio_history,
                        past_keyframe_path=start_keyframe_path, present_keyframe_path=start_keyframe_path,
                        future_keyframe_path=destination_keyframe_path, present_scene_desc=present_scene_desc,
                        motion_prompt=motion_prompt, future_scene_desc=storyboard[i+1] if (i+1) < len(storyboard) else "The final scene."
                    )
                else:
                    audio_prompt = gemini_singleton.get_sound_director_prompt(
                        audio_history=audio_history, past_keyframe_path=keyframe_paths[i-1],
                        present_keyframe_path=start_keyframe_path, future_keyframe_path=destination_keyframe_path,
                        present_scene_desc=present_scene_desc, motion_prompt=motion_prompt,
                        future_scene_desc=storyboard[i+1] if (i+1) < len(storyboard) else "The final scene."
                    )
            else:
                audio_prompt = present_scene_desc  # Fallback para o prompt da cena se o diretor de som estiver desligado

            audio_history = audio_prompt

            conditioning_items = []
            current_ltx_params = {**base_ltx_params, "handler_strength": handler_strength, "motion_prompt": motion_prompt}
            total_frames_to_generate = self._quantize_to_multiple(int(seconds_per_fragment * 24), 8) + 1

            if is_first_fragment:
                img_start = self._preprocess_image_for_latent_conversion(Image.open(start_keyframe_path).convert("RGB"), target_resolution_tuple)
                start_latent = self.pil_to_latent(img_start)
                conditioning_items.append(LatentConditioningItem(start_latent, 0, 1.0))
                if transition_type != "cut":
                    img_dest = self._preprocess_image_for_latent_conversion(Image.open(destination_keyframe_path).convert("RGB"), target_resolution_tuple)
                    destination_latent = self.pil_to_latent(img_dest)
                    conditioning_items.append(LatentConditioningItem(destination_latent, total_frames_to_generate - 1, destination_convergence_strength))
            else:
                previous_latents = self.load_latent_tensor(previous_latents_path)
                handler_latent = previous_latents[:, :, -1:, :, :]
                trimmed_for_echo = previous_latents[:, :, :-n_trim_latents, :, :] if n_trim_latents > 0 and previous_latents.shape[2] > n_trim_latents else previous_latents
                echo_latents = trimmed_for_echo[:, :, -echo_frames:, :, :]
                handler_frame_position = n_trim_latents + echo_frames

                conditioning_items = []

                # Condition on each of the last `echo_frames` latent frames (the "kinetic echo").
                # The original loop enumerated the integer `echo_frames` and shadowed the outer
                # loop variable `i`; iterating the latent frames with `j` fixes both issues.
                for j in range(echo_latents.shape[2]):
                    echo_latent = echo_latents[:, :, j:j + 1, :, :]
                    weight = 1.0 if j == 0 else random.uniform(0.2, 0.7)
                    conditioning_items.append(LatentConditioningItem(echo_latent, 0, weight))
                # conditioning_items.append(LatentConditioningItem(echo_latents, 0, 1.0))
                conditioning_items.append(LatentConditioningItem(handler_latent, handler_frame_position, handler_strength))
                del previous_latents, handler_latent, trimmed_for_echo, echo_latents
                gc.collect()
                if transition_type == "continuous":
                    img_dest = self._preprocess_image_for_latent_conversion(Image.open(destination_keyframe_path).convert("RGB"), target_resolution_tuple)
                    destination_latent = self.pil_to_latent(img_dest)
                    conditioning_items.append(LatentConditioningItem(destination_latent, total_frames_to_generate - 1, destination_convergence_strength))

            new_full_latents = self._generate_latent_tensor_internal(conditioning_items, current_ltx_params, target_resolution_tuple, total_frames_to_generate)

            base_name = f"fragment_{i}_{int(time.time())}"
            new_full_latents_path = os.path.join(self.workspace_dir, f"{base_name}_full.pt")
            self.save_latent_tensor(new_full_latents, new_full_latents_path)

            previous_latents_path = new_full_latents_path

            latents_for_video = new_full_latents
+
|
| 267 |
+
if not is_first_fragment:
|
| 268 |
+
if echo_frames > 0 and latents_for_video.shape[2] > echo_frames: latents_for_video = latents_for_video[:, :, echo_frames:, :, :]
|
| 269 |
+
if n_trim_latents > 0 and latents_for_video.shape[2] > n_trim_latents: latents_for_video = latents_for_video[:, :, :-n_trim_latents, :, :]
|
| 270 |
+
else:
|
| 271 |
+
if n_trim_latents > 0 and latents_for_video.shape[2] > n_trim_latents: latents_for_video = latents_for_video[:, :, :-n_trim_latents, :, :]
|
| 272 |
+
|
| 273 |
+
video_with_audio_path = self._generate_video_and_audio_from_latents(latents_for_video, audio_prompt, base_name)
|
| 274 |
+
video_clips_paths.append(video_with_audio_path)
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
if transition_type == "cut":
|
| 278 |
+
previous_latents_path = None
|
| 279 |
+
|
| 280 |
+
|
| 281 |
+
yield {"fragment_path": video_with_audio_path}
|
| 282 |
+
|
| 283 |
+
final_movie_path = os.path.join(self.workspace_dir, f"final_movie_{int(time.time())}.mp4")
|
| 284 |
+
self.concatenate_videos_ffmpeg(video_clips_paths, final_movie_path)
|
| 285 |
+
|
| 286 |
+
logger.info(f"Filme completo salvo em: {final_movie_path}")
|
| 287 |
+
yield {"final_path": final_movie_path}
|
| 288 |
+
|
| 289 |
+
def _quantize_to_multiple(self, n, m):
|
| 290 |
+
if m == 0: return n
|
| 291 |
+
quantized = int(round(n / m) * m)
|
| 292 |
+
return m if n > 0 and quantized == 0 else quantized
|
dreamo/LICENSE.txt
ADDED
|
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
dreamo/README.md
ADDED
|
@@ -0,0 +1,135 @@
# 🛠️ helpers/ - Third-Party AI Tools Adapted for ADUC-SDR

This folder contains adapted implementations of third-party AI models and utilities, which serve as low-level "specialists" or "tools" for the ADUC-SDR architecture.

**IMPORTANT:** The content of this folder was authored by its respective original creators and developers. This folder is **NOT PART** of the main ADUC-SDR project in terms of its novel architecture. It serves as a repository for the **direct, modified dependencies** that the `DeformesXDEngines` (the stages of the ADUC-SDR "rocket") invoke to perform specific tasks (image, video, and audio generation).

The modifications made to the files in this folder mainly aim at:
1. **Interface Adaptation:** Standardizing the interfaces so that they fit into the ADUC-SDR orchestration flow.
2. **Resource Management:** Integrating model loading/unloading logic (GPU management) and configuration via YAML files.
3. **Pipeline Optimization:** Adjusting the pipelines to accept more efficient input formats (e.g., pre-encoded tensors instead of media paths, skipping redundant encode/decode steps).

---

## 📄 Licensing

The original content of the projects listed below is licensed under the **Apache License 2.0**, or another license specified by the original authors. All modifications and the use of these files within the `helpers/` structure of the ADUC-SDR project comply with the terms of the **Apache License 2.0**.

The projects' original licenses can be found at their respective sources or in the `incl_licenses/` subdirectories inside each adapted module.

---

## 🛠️ Helper APIs and Usage Guide

This section details how each helper (specialist agent) should be used within the ADUC-SDR ecosystem. All agents are instantiated as **singletons** in `hardware_manager.py` to guarantee centralized management of GPU resources.

### **gemini_helpers.py (GeminiAgent)**

* **Purpose:** Acts as the "Adaptive Synthesis Oracle", responsible for all natural-language processing tasks, such as storyboard creation, prompt generation, and narrative decision-making.
* **Singleton Instance:** `gemini_agent_singleton`
* **Constructor:** `GeminiAgent()`
    * Reads `configs/gemini_config.yaml` for the model name, inference parameters, and prompt template paths. The API key is read from the `GEMINI_API_KEY` environment variable.
* **Public Methods:**
    * `generate_storyboard(prompt: str, num_keyframes: int, ref_image_paths: list[str])`
        * **Inputs:**
            * `prompt`: The overall idea of the film (string).
            * `num_keyframes`: The number of scenes to generate (int).
            * `ref_image_paths`: List of paths to the reference images (list[str]).
        * **Output:** `tuple[list[str], str]` (a tuple containing the storyboard list of strings and a textual report of the operation).
    * `select_keyframes_from_pool(storyboard: list, base_image_paths: list[str], pool_image_paths: list[str])`
        * **Inputs:**
            * `storyboard`: The list of strings from the generated storyboard.
            * `base_image_paths`: Base reference images (list[str]).
            * `pool_image_paths`: The "image pool" to select from (list[str]).
        * **Output:** `tuple[list[str], str]` (a tuple containing the list of selected image paths and a textual report).
    * `get_anticipatory_keyframe_prompt(...)`
        * **Inputs:** Narrative and visual context used to generate an image prompt.
        * **Output:** `tuple[str, str]` (a tuple containing the prompt generated for the image model and a textual report).
    * `get_initial_motion_prompt(...)`
        * **Inputs:** Narrative and visual context for the first video transition.
        * **Output:** `tuple[str, str]` (a tuple containing the generated motion prompt and a textual report).
    * `get_transition_decision(...)`
        * **Inputs:** Narrative and visual context for an intermediate video transition.
        * **Output:** `tuple[dict, str]` (a tuple containing a `{"transition_type": "...", "motion_prompt": "..."}` dictionary and a textual report).
    * `generate_audio_prompts(...)`
        * **Inputs:** Global narrative context.
        * **Output:** `tuple[dict, str]` (a tuple containing a `{"music_prompt": "...", "sfx_prompt": "..."}` dictionary and a textual report).

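For orientation, a minimal usage sketch of the storyboard step (the prompt text and image paths are made up for illustration; the singleton name and signature are the ones documented above):

```python
from gemini_helpers import gemini_agent_singleton

# Hypothetical reference images; replace with real paths.
ref_images = ["refs/astronaut.png", "refs/mars_garden.png"]

storyboard, report = gemini_agent_singleton.generate_storyboard(
    prompt="A lone astronaut discovers a hidden garden on Mars.",
    num_keyframes=4,
    ref_image_paths=ref_images,
)
print(report)  # textual report of the operation
for idx, scene in enumerate(storyboard, start=1):
    print(f"Scene {idx}: {scene}")
```
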
### **flux_kontext_helpers.py (FluxPoolManager)**

* **Purpose:** Specialist in high-quality image (keyframe) generation using the FluxKontext pipeline. Manages a pool of workers to optimize the use of multiple GPUs.
* **Singleton Instance:** `flux_kontext_singleton`
* **Constructor:** `FluxPoolManager(device_ids: list[str], flux_config_file: str)`
    * Reads `configs/flux_config.yaml`.
* **Public Method:**
    * `generate_image(prompt: str, reference_images: list[Image.Image], width: int, height: int, seed: int = 42, callback: callable = None)`
        * **Inputs:**
            * `prompt`: Textual prompt guiding the generation (string).
            * `reference_images`: List of `PIL.Image` objects used as visual reference.
            * `width`, `height`: Output image dimensions (int).
            * `seed`: Seed for reproducibility (int).
            * `callback`: Optional callback function for monitoring progress.
        * **Output:** `PIL.Image.Image` (the generated image object).

### **dreamo_helpers.py (DreamOAgent)**

* **Purpose:** Specialist in high-quality image (keyframe) generation using the DreamO pipeline, with advanced editing and styling capabilities driven by references.
* **Singleton Instance:** `dreamo_agent_singleton`
* **Constructor:** `DreamOAgent(device_id: str = None)`
    * Reads `configs/dreamo_config.yaml`.
* **Public Method:**
    * `generate_image(prompt: str, reference_images: list[Image.Image], width: int, height: int)`
        * **Inputs:**
            * `prompt`: Textual prompt guiding the generation (string).
            * `reference_images`: List of `PIL.Image` objects used as visual reference. The internal logic assigns the first image as `style` and the remaining ones as `ip`.
            * `width`, `height`: Output image dimensions (int).
        * **Output:** `PIL.Image.Image` (the generated image object).

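Both image specialists expose the same `generate_image` surface; below is a hedged sketch of calling the Flux pool (the reference image path, prompt, and output path are placeholders):

```python
from PIL import Image
from flux_kontext_helpers import flux_kontext_singleton

# Placeholder reference image for illustration.
refs = [Image.open("refs/astronaut.png").convert("RGB")]

keyframe = flux_kontext_singleton.generate_image(
    prompt="Astronaut kneeling beside glowing alien flowers, cinematic lighting",
    reference_images=refs,
    width=1024,
    height=1024,
    seed=42,
)
keyframe.save("keyframes/keyframe_01.png")  # generate_image returns a PIL.Image.Image
```

Swapping `flux_kontext_singleton` for `dreamo_agent_singleton` (and dropping the `seed`/`callback` arguments) follows the DreamO signature documented above.
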
### **ltx_manager_helpers.py (LtxPoolManager)**

* **Purpose:** Specialist in generating video fragments in latent space using the LTX-Video pipeline. Manages a pool of workers to optimize the use of multiple GPUs.
* **Singleton Instance:** `ltx_manager_singleton`
* **Constructor:** `LtxPoolManager(device_ids: list[str], ltx_model_config_file: str, ltx_global_config_file: str)`
    * Reads `ltx_global_config_file` and `ltx_model_config_file` to configure the pipeline.
* **Public Method:**
    * `generate_latent_fragment(**kwargs)`
        * **Inputs:** A dictionary of keyword arguments (`kwargs`) containing all LTX pipeline parameters, including:
            * `height`, `width`: Video dimensions (int).
            * `video_total_frames`: Total number of frames to generate (int).
            * `video_fps`: Frames per second (int).
            * `motion_prompt`: Motion prompt (string).
            * `conditioning_items_data`: List of `LatentConditioningItem` objects containing the conditioning latent tensors.
            * `guidance_scale`, `stg_scale`, `num_inference_steps`, etc.
        * **Output:** `tuple[torch.Tensor, tuple]` (a tuple containing the generated latent tensor and the padding values used).

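A sketch of how a video engine might call the LTX pool. The `LatentConditioningItem` fields (latent tensor, frame index, strength) follow their usage in `deformes4D_engine.py`; the import location, tensor path, and numeric values are illustrative assumptions:

```python
import torch
from ltx_manager_helpers import ltx_manager_singleton
from deformes4D_engine import LatentConditioningItem  # assumed import location

# A previously saved latent tensor (path is illustrative); keep only its last frame.
start_latent = torch.load("workspace/fragment_0_full.pt")[:, :, -1:, :, :]

latents, padding = ltx_manager_singleton.generate_latent_fragment(
    height=512,
    width=512,
    video_total_frames=121,   # quantized to a multiple of 8, plus 1
    video_fps=24,
    motion_prompt="Slow dolly-in towards the hidden garden",
    conditioning_items_data=[LatentConditioningItem(start_latent, 0, 1.0)],
    guidance_scale=3.0,
    num_inference_steps=30,
)
```
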
| 105 |
+
### **mmaudio_helper.py (MMAudioAgent)**
|
| 106 |
+
|
| 107 |
+
* **Propósito:** Especialista em geração de áudio para um determinado fragmento de vídeo.
|
| 108 |
+
* **Singleton Instance:** `mmaudio_agent_singleton`
|
| 109 |
+
* **Construtor:** `MMAudioAgent(workspace_dir: str, device_id: str = None, mmaudio_config_file: str)`
|
| 110 |
+
* Lê `configs/mmaudio_config.yaml`.
|
| 111 |
+
* **Método Público:**
|
| 112 |
+
* `generate_audio_for_video(video_path: str, prompt: str, negative_prompt: str, duration_seconds: float)`
|
| 113 |
+
* **Inputs:**
|
| 114 |
+
* `video_path`: Caminho para o arquivo de vídeo silencioso (string).
|
| 115 |
+
* `prompt`: Prompt textual para guiar a geração de áudio (string).
|
| 116 |
+
* `negative_prompt`: Prompt negativo para áudio (string).
|
| 117 |
+
* `duration_seconds`: Duração exata do vídeo (float).
|
| 118 |
+
* **Output:** `str` (O caminho para o novo arquivo de vídeo com a faixa de áudio integrada).
|
| 119 |
+
|
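Finally, a sketch of the audio muxing step (paths and prompts are placeholders):

```python
from mmaudio_helper import mmaudio_agent_singleton

video_with_audio = mmaudio_agent_singleton.generate_audio_for_video(
    video_path="workspace/fragment_0_silent.mp4",  # silent clip produced earlier
    prompt="Soft wind and the distant hum of a life-support system",
    negative_prompt="speech, music with vocals",
    duration_seconds=5.0,
)
print(video_with_audio)  # path to the clip with the audio track muxed in
```
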
---

## 🔗 Original Projects and Attributions
(The attributions and licenses section remains the same as previously defined.)

### DreamO
* **Original Repository:** [https://github.com/bytedance/DreamO](https://github.com/bytedance/DreamO)
...

### LTX-Video
* **Original Repository:** [https://github.com/Lightricks/LTX-Video](https://github.com/Lightricks/LTX-Video)
...

### MMAudio
* **Original Repository:** [https://github.com/hkchengrex/MMAudio](https://github.com/hkchengrex/MMAudio)
...
dreamo/dreamo_pipeline.py
ADDED
|
@@ -0,0 +1,507 @@
| 1 |
+
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
|
| 2 |
+
# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
|
| 3 |
+
#
|
| 4 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
| 5 |
+
# you may not use this file except in compliance with the License.
|
| 6 |
+
# You may obtain a copy of the License at
|
| 7 |
+
#
|
| 8 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
| 9 |
+
#
|
| 10 |
+
# Unless required by applicable law or agreed to in writing, software
|
| 11 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
| 12 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 13 |
+
# See the License for the specific language governing permissions and
|
| 14 |
+
# limitations under the License.
|
| 15 |
+
|
| 16 |
+
from typing import Any, Callable, Dict, List, Optional, Union
|
| 17 |
+
|
| 18 |
+
import diffusers
|
| 19 |
+
import numpy as np
|
| 20 |
+
import torch
|
| 21 |
+
import torch.nn as nn
|
| 22 |
+
from diffusers import FluxPipeline
|
| 23 |
+
from diffusers.pipelines.flux.pipeline_flux import calculate_shift, retrieve_timesteps
|
| 24 |
+
from diffusers.pipelines.flux.pipeline_output import FluxPipelineOutput
|
| 25 |
+
from einops import repeat
|
| 26 |
+
from huggingface_hub import hf_hub_download
|
| 27 |
+
from safetensors.torch import load_file
|
| 28 |
+
|
| 29 |
+
from dreamo.transformer import flux_transformer_forward
|
| 30 |
+
from dreamo.utils import convert_flux_lora_to_diffusers
|
| 31 |
+
|
| 32 |
+
diffusers.models.transformers.transformer_flux.FluxTransformer2DModel.forward = flux_transformer_forward
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def get_task_embedding_idx(task):
|
| 36 |
+
return 0
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
class DreamOPipeline(FluxPipeline):
|
| 40 |
+
def __init__(self, scheduler, vae, text_encoder, tokenizer, text_encoder_2, tokenizer_2, transformer):
|
| 41 |
+
super().__init__(scheduler, vae, text_encoder, tokenizer, text_encoder_2, tokenizer_2, transformer)
|
| 42 |
+
self.t5_embedding = nn.Embedding(10, 4096)
|
| 43 |
+
self.task_embedding = nn.Embedding(2, 3072)
|
| 44 |
+
self.idx_embedding = nn.Embedding(10, 3072)
|
| 45 |
+
|
| 46 |
+
def load_dreamo_model(self, device, use_turbo=True, version='v1.1'):
|
| 47 |
+
# download models and load file
|
| 48 |
+
hf_hub_download(repo_id='ByteDance/DreamO', filename='dreamo.safetensors', local_dir='models')
|
| 49 |
+
hf_hub_download(repo_id='ByteDance/DreamO', filename='dreamo_cfg_distill.safetensors', local_dir='models')
|
| 50 |
+
if version == 'v1':
|
| 51 |
+
hf_hub_download(repo_id='ByteDance/DreamO', filename='dreamo_quality_lora_pos.safetensors',
|
| 52 |
+
local_dir='models')
|
| 53 |
+
hf_hub_download(repo_id='ByteDance/DreamO', filename='dreamo_quality_lora_neg.safetensors',
|
| 54 |
+
local_dir='models')
|
| 55 |
+
quality_lora_pos = load_file('models/dreamo_quality_lora_pos.safetensors')
|
| 56 |
+
quality_lora_neg = load_file('models/dreamo_quality_lora_neg.safetensors')
|
| 57 |
+
elif version == 'v1.1':
|
| 58 |
+
hf_hub_download(repo_id='ByteDance/DreamO', filename='v1.1/dreamo_sft_lora.safetensors', local_dir='models')
|
| 59 |
+
hf_hub_download(repo_id='ByteDance/DreamO', filename='v1.1/dreamo_dpo_lora.safetensors', local_dir='models')
|
| 60 |
+
sft_lora = load_file('models/v1.1/dreamo_sft_lora.safetensors')
|
| 61 |
+
dpo_lora = load_file('models/v1.1/dreamo_dpo_lora.safetensors')
|
| 62 |
+
else:
|
| 63 |
+
raise ValueError(f'there is no {version}')
|
| 64 |
+
dreamo_lora = load_file('models/dreamo.safetensors')
|
| 65 |
+
cfg_distill_lora = load_file('models/dreamo_cfg_distill.safetensors')
|
| 66 |
+
|
| 67 |
+
# load embedding
|
| 68 |
+
self.t5_embedding.weight.data = dreamo_lora.pop('dreamo_t5_embedding.weight')[-10:]
|
| 69 |
+
self.task_embedding.weight.data = dreamo_lora.pop('dreamo_task_embedding.weight')
|
| 70 |
+
self.idx_embedding.weight.data = dreamo_lora.pop('dreamo_idx_embedding.weight')
|
| 71 |
+
self._prepare_t5()
|
| 72 |
+
|
| 73 |
+
# main lora
|
| 74 |
+
dreamo_diffuser_lora = convert_flux_lora_to_diffusers(dreamo_lora)
|
| 75 |
+
adapter_names = ['dreamo']
|
| 76 |
+
adapter_weights = [1]
|
| 77 |
+
self.load_lora_weights(dreamo_diffuser_lora, adapter_name='dreamo')
|
| 78 |
+
|
| 79 |
+
# cfg lora to avoid true image cfg
|
| 80 |
+
cfg_diffuser_lora = convert_flux_lora_to_diffusers(cfg_distill_lora)
|
| 81 |
+
self.load_lora_weights(cfg_diffuser_lora, adapter_name='cfg')
|
| 82 |
+
adapter_names.append('cfg')
|
| 83 |
+
adapter_weights.append(1)
|
| 84 |
+
|
| 85 |
+
# turbo lora to speed up (from 25+ step to 12 step)
|
| 86 |
+
if use_turbo:
|
| 87 |
+
self.load_lora_weights(
|
| 88 |
+
hf_hub_download(
|
| 89 |
+
"alimama-creative/FLUX.1-Turbo-Alpha", "diffusion_pytorch_model.safetensors", local_dir='models'
|
| 90 |
+
),
|
| 91 |
+
adapter_name='turbo',
|
| 92 |
+
)
|
| 93 |
+
adapter_names.append('turbo')
|
| 94 |
+
adapter_weights.append(1)
|
| 95 |
+
|
| 96 |
+
if version == 'v1':
|
| 97 |
+
# quality loras, one pos, one neg
|
| 98 |
+
quality_lora_pos = convert_flux_lora_to_diffusers(quality_lora_pos)
|
| 99 |
+
self.load_lora_weights(quality_lora_pos, adapter_name='quality_pos')
|
| 100 |
+
adapter_names.append('quality_pos')
|
| 101 |
+
adapter_weights.append(0.15)
|
| 102 |
+
quality_lora_neg = convert_flux_lora_to_diffusers(quality_lora_neg)
|
| 103 |
+
self.load_lora_weights(quality_lora_neg, adapter_name='quality_neg')
|
| 104 |
+
adapter_names.append('quality_neg')
|
| 105 |
+
adapter_weights.append(-0.8)
|
| 106 |
+
elif version == 'v1.1':
|
| 107 |
+
self.load_lora_weights(sft_lora, adapter_name='sft_lora')
|
| 108 |
+
adapter_names.append('sft_lora')
|
| 109 |
+
adapter_weights.append(1)
|
| 110 |
+
self.load_lora_weights(dpo_lora, adapter_name='dpo_lora')
|
| 111 |
+
adapter_names.append('dpo_lora')
|
| 112 |
+
adapter_weights.append(1.25)
|
| 113 |
+
|
| 114 |
+
self.set_adapters(adapter_names, adapter_weights)
|
| 115 |
+
self.fuse_lora(adapter_names=adapter_names, lora_scale=1)
|
| 116 |
+
self.unload_lora_weights()
|
| 117 |
+
|
| 118 |
+
self.t5_embedding = self.t5_embedding.to(device)
|
| 119 |
+
self.task_embedding = self.task_embedding.to(device)
|
| 120 |
+
self.idx_embedding = self.idx_embedding.to(device)
|
| 121 |
+
|
| 122 |
+
def _prepare_t5(self):
|
| 123 |
+
self.text_encoder_2.resize_token_embeddings(len(self.tokenizer_2))
|
| 124 |
+
num_new_token = 10
|
| 125 |
+
new_token_list = [f"[ref#{i}]" for i in range(1, 10)] + ["[res]"]
|
| 126 |
+
self.tokenizer_2.add_tokens(new_token_list, special_tokens=False)
|
| 127 |
+
self.text_encoder_2.resize_token_embeddings(len(self.tokenizer_2))
|
| 128 |
+
input_embedding = self.text_encoder_2.get_input_embeddings().weight.data
|
| 129 |
+
input_embedding[-num_new_token:] = self.t5_embedding.weight.data
|
| 130 |
+
|
| 131 |
+
@staticmethod
|
| 132 |
+
def _prepare_latent_image_ids(batch_size, height, width, device, dtype, start_height=0, start_width=0):
|
| 133 |
+
latent_image_ids = torch.zeros(height // 2, width // 2, 3)
|
| 134 |
+
latent_image_ids[..., 1] = latent_image_ids[..., 1] + torch.arange(height // 2)[:, None] + start_height
|
| 135 |
+
latent_image_ids[..., 2] = latent_image_ids[..., 2] + torch.arange(width // 2)[None, :] + start_width
|
| 136 |
+
|
| 137 |
+
latent_image_id_height, latent_image_id_width, latent_image_id_channels = latent_image_ids.shape
|
| 138 |
+
|
| 139 |
+
latent_image_ids = latent_image_ids[None, :].repeat(batch_size, 1, 1, 1)
|
| 140 |
+
latent_image_ids = latent_image_ids.reshape(
|
| 141 |
+
batch_size, latent_image_id_height * latent_image_id_width, latent_image_id_channels
|
| 142 |
+
)
|
| 143 |
+
|
| 144 |
+
return latent_image_ids.to(device=device, dtype=dtype)
|
| 145 |
+
|
| 146 |
+
@staticmethod
|
| 147 |
+
def _prepare_style_latent_image_ids(batch_size, height, width, device, dtype, start_height=0, start_width=0):
|
| 148 |
+
latent_image_ids = torch.zeros(height // 2, width // 2, 3)
|
| 149 |
+
latent_image_ids[..., 1] = latent_image_ids[..., 1] + start_height
|
| 150 |
+
latent_image_ids[..., 2] = latent_image_ids[..., 2] + start_width
|
| 151 |
+
|
| 152 |
+
latent_image_id_height, latent_image_id_width, latent_image_id_channels = latent_image_ids.shape
|
| 153 |
+
|
| 154 |
+
latent_image_ids = latent_image_ids[None, :].repeat(batch_size, 1, 1, 1)
|
| 155 |
+
latent_image_ids = latent_image_ids.reshape(
|
| 156 |
+
batch_size, latent_image_id_height * latent_image_id_width, latent_image_id_channels
|
| 157 |
+
)
|
| 158 |
+
|
| 159 |
+
return latent_image_ids.to(device=device, dtype=dtype)
|
| 160 |
+
|
| 161 |
+
@torch.no_grad()
|
| 162 |
+
def __call__(
|
| 163 |
+
self,
|
| 164 |
+
prompt: Union[str, List[str]] = None,
|
| 165 |
+
prompt_2: Optional[Union[str, List[str]]] = None,
|
| 166 |
+
negative_prompt: Union[str, List[str]] = None,
|
| 167 |
+
negative_prompt_2: Optional[Union[str, List[str]]] = None,
|
| 168 |
+
true_cfg_scale: float = 1.0,
|
| 169 |
+
true_cfg_start_step: int = 1,
|
| 170 |
+
true_cfg_end_step: int = 1,
|
| 171 |
+
height: Optional[int] = None,
|
| 172 |
+
width: Optional[int] = None,
|
| 173 |
+
num_inference_steps: int = 28,
|
| 174 |
+
sigmas: Optional[List[float]] = None,
|
| 175 |
+
guidance_scale: float = 3.5,
|
| 176 |
+
neg_guidance_scale: float = 3.5,
|
| 177 |
+
num_images_per_prompt: Optional[int] = 1,
|
| 178 |
+
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
|
| 179 |
+
latents: Optional[torch.FloatTensor] = None,
|
| 180 |
+
prompt_embeds: Optional[torch.FloatTensor] = None,
|
| 181 |
+
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
|
| 182 |
+
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
|
| 183 |
+
negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
|
| 184 |
+
output_type: Optional[str] = "pil",
|
| 185 |
+
return_dict: bool = True,
|
| 186 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
| 187 |
+
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
|
| 188 |
+
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
|
| 189 |
+
max_sequence_length: int = 512,
|
| 190 |
+
ref_conds=None,
|
| 191 |
+
first_step_guidance_scale=3.5,
|
| 192 |
+
):
|
| 193 |
+
r"""
|
| 194 |
+
Function invoked when calling the pipeline for generation.
|
| 195 |
+
|
| 196 |
+
Args:
|
| 197 |
+
prompt (`str` or `List[str]`, *optional*):
|
| 198 |
+
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
|
| 199 |
+
instead.
|
| 200 |
+
prompt_2 (`str` or `List[str]`, *optional*):
|
| 201 |
+
The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt`
will be used instead.
|
| 203 |
+
negative_prompt (`str` or `List[str]`, *optional*):
|
| 204 |
+
The prompt or prompts not to guide the image generation. If not defined, one has to pass
|
| 205 |
+
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `true_cfg_scale` is
|
| 206 |
+
not greater than `1`).
|
| 207 |
+
negative_prompt_2 (`str` or `List[str]`, *optional*):
|
| 208 |
+
The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
|
| 209 |
+
`text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
|
| 210 |
+
true_cfg_scale (`float`, *optional*, defaults to 1.0):
|
| 211 |
+
When > 1.0 and a provided `negative_prompt`, enables true classifier-free guidance.
|
| 212 |
+
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
|
| 213 |
+
The height in pixels of the generated image. This is set to 1024 by default for the best results.
|
| 214 |
+
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
|
| 215 |
+
The width in pixels of the generated image. This is set to 1024 by default for the best results.
|
| 216 |
+
num_inference_steps (`int`, *optional*, defaults to 50):
|
| 217 |
+
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
|
| 218 |
+
expense of slower inference.
|
| 219 |
+
sigmas (`List[float]`, *optional*):
|
| 220 |
+
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
|
| 221 |
+
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
|
| 222 |
+
will be used.
|
| 223 |
+
guidance_scale (`float`, *optional*, defaults to 3.5):
|
| 224 |
+
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
|
| 225 |
+
`guidance_scale` is defined as `w` of equation 2. of [Imagen
|
| 226 |
+
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
|
| 227 |
+
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
|
| 228 |
+
usually at the expense of lower image quality.
|
| 229 |
+
num_images_per_prompt (`int`, *optional*, defaults to 1):
|
| 230 |
+
The number of images to generate per prompt.
|
| 231 |
+
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
|
| 232 |
+
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
|
| 233 |
+
to make generation deterministic.
|
| 234 |
+
latents (`torch.FloatTensor`, *optional*):
|
| 235 |
+
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
|
| 236 |
+
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
|
| 237 |
+
tensor will be generated by sampling using the supplied random `generator`.
|
| 238 |
+
prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 239 |
+
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
|
| 240 |
+
provided, text embeddings will be generated from `prompt` input argument.
|
| 241 |
+
pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 242 |
+
Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
|
| 243 |
+
If not provided, pooled text embeddings will be generated from `prompt` input argument.
|
| 244 |
+
negative_prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 245 |
+
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
|
| 246 |
+
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
|
| 247 |
+
argument.
|
| 248 |
+
negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
| 249 |
+
Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
|
| 250 |
+
weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
|
| 251 |
+
input argument.
|
| 252 |
+
output_type (`str`, *optional*, defaults to `"pil"`):
|
| 253 |
+
The output format of the generate image. Choose between
|
| 254 |
+
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
|
| 255 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
| 256 |
+
Whether or not to return a [`~pipelines.flux.FluxPipelineOutput`] instead of a plain tuple.
|
| 257 |
+
joint_attention_kwargs (`dict`, *optional*):
|
| 258 |
+
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
| 259 |
+
`self.processor` in
|
| 260 |
+
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
| 261 |
+
callback_on_step_end (`Callable`, *optional*):
|
| 262 |
+
A function that calls at the end of each denoising steps during the inference. The function is called
|
| 263 |
+
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
|
| 264 |
+
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
|
| 265 |
+
`callback_on_step_end_tensor_inputs`.
|
| 266 |
+
callback_on_step_end_tensor_inputs (`List`, *optional*):
|
| 267 |
+
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
|
| 268 |
+
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
|
| 269 |
+
`._callback_tensor_inputs` attribute of your pipeline class.
|
| 270 |
+
max_sequence_length (`int` defaults to 512): Maximum sequence length to use with the `prompt`.
|
| 271 |
+
|
| 272 |
+
Examples:
|
| 273 |
+
|
| 274 |
+
Returns:
|
| 275 |
+
[`~pipelines.flux.FluxPipelineOutput`] or `tuple`: [`~pipelines.flux.FluxPipelineOutput`] if `return_dict`
|
| 276 |
+
is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
|
| 277 |
+
images.
|
| 278 |
+
"""
|
| 279 |
+
|
| 280 |
+
height = height or self.default_sample_size * self.vae_scale_factor
|
| 281 |
+
width = width or self.default_sample_size * self.vae_scale_factor
|
| 282 |
+
|
| 283 |
+
# 1. Check inputs. Raise error if not correct
|
| 284 |
+
self.check_inputs(
|
| 285 |
+
prompt,
|
| 286 |
+
prompt_2,
|
| 287 |
+
height,
|
| 288 |
+
width,
|
| 289 |
+
prompt_embeds=prompt_embeds,
|
| 290 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
| 291 |
+
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
|
| 292 |
+
max_sequence_length=max_sequence_length,
|
| 293 |
+
)
|
| 294 |
+
|
| 295 |
+
self._guidance_scale = guidance_scale
|
| 296 |
+
self._joint_attention_kwargs = joint_attention_kwargs
|
| 297 |
+
self._current_timestep = None
|
| 298 |
+
self._interrupt = False
|
| 299 |
+
|
| 300 |
+
# 2. Define call parameters
|
| 301 |
+
if prompt is not None and isinstance(prompt, str):
|
| 302 |
+
batch_size = 1
|
| 303 |
+
elif prompt is not None and isinstance(prompt, list):
|
| 304 |
+
batch_size = len(prompt)
|
| 305 |
+
else:
|
| 306 |
+
batch_size = prompt_embeds.shape[0]
|
| 307 |
+
|
| 308 |
+
device = self._execution_device
|
| 309 |
+
|
| 310 |
+
lora_scale = (
|
| 311 |
+
self.joint_attention_kwargs.get("scale", None) if self.joint_attention_kwargs is not None else None
|
| 312 |
+
)
|
| 313 |
+
has_neg_prompt = negative_prompt is not None or (
|
| 314 |
+
negative_prompt_embeds is not None and negative_pooled_prompt_embeds is not None
|
| 315 |
+
)
|
| 316 |
+
do_true_cfg = true_cfg_scale > 1 and has_neg_prompt
|
| 317 |
+
(
|
| 318 |
+
prompt_embeds,
|
| 319 |
+
pooled_prompt_embeds,
|
| 320 |
+
text_ids,
|
| 321 |
+
) = self.encode_prompt(
|
| 322 |
+
prompt=prompt,
|
| 323 |
+
prompt_2=prompt_2,
|
| 324 |
+
prompt_embeds=prompt_embeds,
|
| 325 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
| 326 |
+
device=device,
|
| 327 |
+
num_images_per_prompt=num_images_per_prompt,
|
| 328 |
+
max_sequence_length=max_sequence_length,
|
| 329 |
+
lora_scale=lora_scale,
|
| 330 |
+
)
|
| 331 |
+
if do_true_cfg:
|
| 332 |
+
(
|
| 333 |
+
negative_prompt_embeds,
|
| 334 |
+
negative_pooled_prompt_embeds,
|
| 335 |
+
_,
|
| 336 |
+
) = self.encode_prompt(
|
| 337 |
+
prompt=negative_prompt,
|
| 338 |
+
prompt_2=negative_prompt_2,
|
| 339 |
+
prompt_embeds=negative_prompt_embeds,
|
| 340 |
+
pooled_prompt_embeds=negative_pooled_prompt_embeds,
|
| 341 |
+
device=device,
|
| 342 |
+
num_images_per_prompt=num_images_per_prompt,
|
| 343 |
+
max_sequence_length=max_sequence_length,
|
| 344 |
+
lora_scale=lora_scale,
|
| 345 |
+
)
|
| 346 |
+
|
| 347 |
+
# 4. Prepare latent variables
|
| 348 |
+
num_channels_latents = self.transformer.config.in_channels // 4
|
| 349 |
+
latents, latent_image_ids = self.prepare_latents(
|
| 350 |
+
batch_size * num_images_per_prompt,
|
| 351 |
+
num_channels_latents,
|
| 352 |
+
height,
|
| 353 |
+
width,
|
| 354 |
+
prompt_embeds.dtype,
|
| 355 |
+
device,
|
| 356 |
+
generator,
|
| 357 |
+
latents,
|
| 358 |
+
)
|
| 359 |
+
|
| 360 |
+
# 4.1 concat ref tokens to latent
|
| 361 |
+
origin_img_len = latents.shape[1]
|
| 362 |
+
embeddings = repeat(self.task_embedding.weight[1], "c -> n l c", n=batch_size, l=origin_img_len)
|
| 363 |
+
ref_latents = []
|
| 364 |
+
ref_latent_image_idss = []
|
| 365 |
+
start_height = height // 16
|
| 366 |
+
start_width = width // 16
|
| 367 |
+
for ref_cond in ref_conds:
|
| 368 |
+
img = ref_cond['img'] # [b, 3, h, w], range [-1, 1]
|
| 369 |
+
task = ref_cond['task']
|
| 370 |
+
idx = ref_cond['idx']
|
| 371 |
+
|
| 372 |
+
# encode ref with VAE
|
| 373 |
+
img = img.to(latents)
            ref_latent = self.vae.encode(img).latent_dist.sample()
            ref_latent = (ref_latent - self.vae.config.shift_factor) * self.vae.config.scaling_factor
            cur_height = ref_latent.shape[2]
            cur_width = ref_latent.shape[3]
            ref_latent = self._pack_latents(ref_latent, batch_size, num_channels_latents, cur_height, cur_width)
            ref_latent_image_ids = self._prepare_latent_image_ids(
                batch_size, cur_height, cur_width, device, prompt_embeds.dtype, start_height, start_width
            )
            start_height += cur_height // 2
            start_width += cur_width // 2

            # prepare task_idx_embedding
            task_idx = get_task_embedding_idx(task)
            cur_task_embedding = repeat(
                self.task_embedding.weight[task_idx], "c -> n l c", n=batch_size, l=ref_latent.shape[1]
            )
            cur_idx_embedding = repeat(
                self.idx_embedding.weight[idx], "c -> n l c", n=batch_size, l=ref_latent.shape[1]
            )
            cur_embedding = cur_task_embedding + cur_idx_embedding

            # concat ref to latent
            embeddings = torch.cat([embeddings, cur_embedding], dim=1)
            ref_latents.append(ref_latent)
            ref_latent_image_idss.append(ref_latent_image_ids)

        # 5. Prepare timesteps
        sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
        image_seq_len = latents.shape[1]
        mu = calculate_shift(
            image_seq_len,
            self.scheduler.config.get("base_image_seq_len", 256),
            self.scheduler.config.get("max_image_seq_len", 4096),
            self.scheduler.config.get("base_shift", 0.5),
            self.scheduler.config.get("max_shift", 1.15),
        )
        timesteps, num_inference_steps = retrieve_timesteps(
            self.scheduler,
            num_inference_steps,
            device,
            sigmas=sigmas,
            mu=mu,
        )
        num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
        self._num_timesteps = len(timesteps)

        # handle guidance
        if self.transformer.config.guidance_embeds:
            guidance = torch.full([1], guidance_scale, device=device, dtype=torch.float32)
            guidance = guidance.expand(latents.shape[0])
        else:
            guidance = None
        neg_guidance = torch.full([1], neg_guidance_scale, device=device, dtype=torch.float32)
        neg_guidance = neg_guidance.expand(latents.shape[0])
        first_step_guidance = torch.full([1], first_step_guidance_scale, device=device, dtype=torch.float32)

        if self.joint_attention_kwargs is None:
            self._joint_attention_kwargs = {}

        # 6. Denoising loop
        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                if self.interrupt:
                    continue

                self._current_timestep = t
                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
                timestep = t.expand(latents.shape[0]).to(latents.dtype)

                noise_pred = self.transformer(
                    hidden_states=torch.cat((latents, *ref_latents), dim=1),
                    timestep=timestep / 1000,
                    guidance=guidance if i > 0 else first_step_guidance,
                    pooled_projections=pooled_prompt_embeds,
                    encoder_hidden_states=prompt_embeds,
                    txt_ids=text_ids,
                    img_ids=torch.cat((latent_image_ids, *ref_latent_image_idss), dim=1),
                    joint_attention_kwargs=self.joint_attention_kwargs,
                    return_dict=False,
                    embeddings=embeddings,
                )[0][:, :origin_img_len]

                if do_true_cfg and i >= true_cfg_start_step and i < true_cfg_end_step:
                    neg_noise_pred = self.transformer(
                        hidden_states=latents,
                        timestep=timestep / 1000,
                        guidance=neg_guidance,
                        pooled_projections=negative_pooled_prompt_embeds,
                        encoder_hidden_states=negative_prompt_embeds,
                        txt_ids=text_ids,
                        img_ids=latent_image_ids,
                        joint_attention_kwargs=self.joint_attention_kwargs,
                        return_dict=False,
                    )[0]
                    noise_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)

                # compute the previous noisy sample x_t -> x_t-1
                latents_dtype = latents.dtype
                latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]

                if latents.dtype != latents_dtype and torch.backends.mps.is_available():
                    # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
                    latents = latents.to(latents_dtype)

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        self._current_timestep = None

        if output_type == "latent":
            image = latents
        else:
            latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
            latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
            image = self.vae.decode(latents, return_dict=False)[0]
            image = self.image_processor.postprocess(image, output_type=output_type)

        # Offload all models
        self.maybe_free_model_hooks()

        if not return_dict:
            return (image,)

        return FluxPipelineOutput(images=image)
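Note on the `true_cfg_scale` blend in the denoising loop above: inside the configured step window the pipeline runs a second, negative-conditioned pass and extrapolates the prediction away from it. The snippet below only illustrates that mixing rule in isolation; the tensors and scale value are invented for demonstration and are not part of the pipeline.

import torch

# Stand-in tensors for the positive (reference-conditioned) and negative predictions.
noise_pred = torch.tensor([1.0, 2.0, 3.0])
neg_noise_pred = torch.tensor([0.5, 0.5, 0.5])

true_cfg_scale = 3.0  # hypothetical value; the real one arrives as a pipeline argument

# Same formula as in the loop:
blended = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)

# With true_cfg_scale == 1.0 the blend reduces to the positive prediction;
# larger values push the result further away from the negative branch.
assert torch.allclose(neg_noise_pred + 1.0 * (noise_pred - neg_noise_pred), noise_pred)
print(blended)  # tensor([2., 5., 8.])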
dreamo/transformer.py
ADDED
@@ -0,0 +1,187 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Any, Dict, Optional, Union

import numpy as np
import torch
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.utils import (
    USE_PEFT_BACKEND,
    logging,
    scale_lora_layers,
    unscale_lora_layers,
)

logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


def flux_transformer_forward(
    self,
    hidden_states: torch.Tensor,
    encoder_hidden_states: torch.Tensor = None,
    pooled_projections: torch.Tensor = None,
    timestep: torch.LongTensor = None,
    img_ids: torch.Tensor = None,
    txt_ids: torch.Tensor = None,
    guidance: torch.Tensor = None,
    joint_attention_kwargs: Optional[Dict[str, Any]] = None,
    controlnet_block_samples=None,
    controlnet_single_block_samples=None,
    return_dict: bool = True,
    controlnet_blocks_repeat: bool = False,
    embeddings: torch.Tensor = None,
) -> Union[torch.Tensor, Transformer2DModelOutput]:
    """
    The [`FluxTransformer2DModel`] forward method.

    Args:
        hidden_states (`torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`):
            Input `hidden_states`.
        encoder_hidden_states (`torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`):
            Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
        pooled_projections (`torch.Tensor` of shape `(batch_size, projection_dim)`): Embeddings projected
            from the embeddings of input conditions.
        timestep (`torch.LongTensor`):
            Used to indicate denoising step.
        block_controlnet_hidden_states: (`list` of `torch.Tensor`):
            A list of tensors that if specified are added to the residuals of transformer blocks.
        joint_attention_kwargs (`dict`, *optional*):
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
            tuple.

    Returns:
        If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
        `tuple` where the first element is the sample tensor.
    """
    if joint_attention_kwargs is not None:
        joint_attention_kwargs = joint_attention_kwargs.copy()
        lora_scale = joint_attention_kwargs.pop("scale", 1.0)
    else:
        lora_scale = 1.0

    if USE_PEFT_BACKEND:
        # weight the lora layers by setting `lora_scale` for each PEFT layer
        scale_lora_layers(self, lora_scale)
    else:
        if joint_attention_kwargs is not None and joint_attention_kwargs.get("scale", None) is not None:
            logger.warning(
                "Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective."
            )

    hidden_states = self.x_embedder(hidden_states)
    # add task and idx embedding
    if embeddings is not None:
        hidden_states = hidden_states + embeddings

    timestep = timestep.to(hidden_states.dtype) * 1000
    guidance = guidance.to(hidden_states.dtype) * 1000 if guidance is not None else None

    temb = (
        self.time_text_embed(timestep, pooled_projections)
        if guidance is None
        else self.time_text_embed(timestep, guidance, pooled_projections)
    )
    encoder_hidden_states = self.context_embedder(encoder_hidden_states)

    if txt_ids.ndim == 3:
        # logger.warning(
        #     "Passing `txt_ids` 3d torch.Tensor is deprecated."
        #     "Please remove the batch dimension and pass it as a 2d torch Tensor"
        # )
        txt_ids = txt_ids[0]
    if img_ids.ndim == 3:
        # logger.warning(
        #     "Passing `img_ids` 3d torch.Tensor is deprecated."
        #     "Please remove the batch dimension and pass it as a 2d torch Tensor"
        # )
        img_ids = img_ids[0]

    ids = torch.cat((txt_ids, img_ids), dim=0)
    image_rotary_emb = self.pos_embed(ids)

    for index_block, block in enumerate(self.transformer_blocks):
        if torch.is_grad_enabled() and self.gradient_checkpointing:
            encoder_hidden_states, hidden_states = self._gradient_checkpointing_func(
                block,
                hidden_states,
                encoder_hidden_states,
                temb,
                image_rotary_emb,
            )

        else:
            encoder_hidden_states, hidden_states = block(
                hidden_states=hidden_states,
                encoder_hidden_states=encoder_hidden_states,
                temb=temb,
                image_rotary_emb=image_rotary_emb,
                joint_attention_kwargs=joint_attention_kwargs,
            )

        # controlnet residual
        if controlnet_block_samples is not None:
            interval_control = len(self.transformer_blocks) / len(controlnet_block_samples)
            interval_control = int(np.ceil(interval_control))
            # For Xlabs ControlNet.
            if controlnet_blocks_repeat:
                hidden_states = hidden_states + controlnet_block_samples[index_block % len(controlnet_block_samples)]
            else:
                hidden_states = hidden_states + controlnet_block_samples[index_block // interval_control]
    hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)

    for index_block, block in enumerate(self.single_transformer_blocks):
        if torch.is_grad_enabled() and self.gradient_checkpointing:
            hidden_states = self._gradient_checkpointing_func(
                block,
                hidden_states,
                temb,
                image_rotary_emb,
            )

        else:
            hidden_states = block(
                hidden_states=hidden_states,
                temb=temb,
                image_rotary_emb=image_rotary_emb,
                joint_attention_kwargs=joint_attention_kwargs,
            )

        # controlnet residual
        if controlnet_single_block_samples is not None:
            interval_control = len(self.single_transformer_blocks) / len(controlnet_single_block_samples)
            interval_control = int(np.ceil(interval_control))
            hidden_states[:, encoder_hidden_states.shape[1] :, ...] = (
                hidden_states[:, encoder_hidden_states.shape[1] :, ...]
                + controlnet_single_block_samples[index_block // interval_control]
            )

    hidden_states = hidden_states[:, encoder_hidden_states.shape[1] :, ...]

    hidden_states = self.norm_out(hidden_states, temb)
    output = self.proj_out(hidden_states)

    if USE_PEFT_BACKEND:
        # remove `lora_scale` from each PEFT layer
        unscale_lora_layers(self, lora_scale)

    if not return_dict:
        return (output,)

    return Transformer2DModelOutput(sample=output)
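`flux_transformer_forward` mirrors the stock diffusers Flux forward pass but accepts an extra `embeddings` tensor that is added right after `x_embedder`, which is how the task/index embeddings built in the pipeline reach the transformer. A minimal sketch of attaching such a replacement forward to a loaded model follows; the exact wiring used by `dreamo_pipeline.py` is not visible in this hunk, so treat the binding step and the checkpoint name as assumptions.

import types

import torch
from diffusers import FluxTransformer2DModel

from dreamo.transformer import flux_transformer_forward

# Hypothetical checkpoint; any Flux-style transformer with the same layout would do.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Bind the patched forward to this instance so the extra `embeddings` kwarg is accepted.
transformer.forward = types.MethodType(flux_transformer_forward, transformer)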
dreamo/utils.py
ADDED
@@ -0,0 +1,232 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import re

import cv2
import numpy as np
import torch
from torchvision.utils import make_grid


# from basicsr
def img2tensor(imgs, bgr2rgb=True, float32=True):
    """Numpy array to tensor.

    Args:
        imgs (list[ndarray] | ndarray): Input images.
        bgr2rgb (bool): Whether to change bgr to rgb.
        float32 (bool): Whether to change to float32.

    Returns:
        list[tensor] | tensor: Tensor images. If returned results only have
            one element, just return tensor.
    """

    def _totensor(img, bgr2rgb, float32):
        if img.shape[2] == 3 and bgr2rgb:
            if img.dtype == 'float64':
                img = img.astype('float32')
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = torch.from_numpy(img.transpose(2, 0, 1))
        if float32:
            img = img.float()
        return img

    if isinstance(imgs, list):
        return [_totensor(img, bgr2rgb, float32) for img in imgs]
    return _totensor(imgs, bgr2rgb, float32)


def tensor2img(tensor, rgb2bgr=True, out_type=np.uint8, min_max=(0, 1)):
    """Convert torch Tensors into image numpy arrays.

    After clamping to [min, max], values will be normalized to [0, 1].

    Args:
        tensor (Tensor or list[Tensor]): Accept shapes:
            1) 4D mini-batch Tensor of shape (B x 3/1 x H x W);
            2) 3D Tensor of shape (3/1 x H x W);
            3) 2D Tensor of shape (H x W).
            Tensor channel should be in RGB order.
        rgb2bgr (bool): Whether to change rgb to bgr.
        out_type (numpy type): output types. If ``np.uint8``, transform outputs
            to uint8 type with range [0, 255]; otherwise, float type with
            range [0, 1]. Default: ``np.uint8``.
        min_max (tuple[int]): min and max values for clamp.

    Returns:
        (Tensor or list): 3D ndarray of shape (H x W x C) OR 2D ndarray of
        shape (H x W). The channel order is BGR.
    """
    if not (torch.is_tensor(tensor) or (isinstance(tensor, list) and all(torch.is_tensor(t) for t in tensor))):
        raise TypeError(f'tensor or list of tensors expected, got {type(tensor)}')

    if torch.is_tensor(tensor):
        tensor = [tensor]
    result = []
    for _tensor in tensor:
        _tensor = _tensor.squeeze(0).float().detach().cpu().clamp_(*min_max)
        _tensor = (_tensor - min_max[0]) / (min_max[1] - min_max[0])

        n_dim = _tensor.dim()
        if n_dim == 4:
            img_np = make_grid(_tensor, nrow=int(math.sqrt(_tensor.size(0))), normalize=False).numpy()
            img_np = img_np.transpose(1, 2, 0)
            if rgb2bgr:
                img_np = cv2.cvtColor(img_np, cv2.COLOR_RGB2BGR)
        elif n_dim == 3:
            img_np = _tensor.numpy()
            img_np = img_np.transpose(1, 2, 0)
            if img_np.shape[2] == 1:  # gray image
                img_np = np.squeeze(img_np, axis=2)
            else:
                if rgb2bgr:
                    img_np = cv2.cvtColor(img_np, cv2.COLOR_RGB2BGR)
        elif n_dim == 2:
            img_np = _tensor.numpy()
        else:
            raise TypeError(f'Only support 4D, 3D or 2D tensor. But received with dimension: {n_dim}')
        if out_type == np.uint8:
            # Unlike MATLAB, numpy.uint8() WILL NOT round by default.
            img_np = (img_np * 255.0).round()
        img_np = img_np.astype(out_type)
        result.append(img_np)
    if len(result) == 1:
        result = result[0]
    return result


def resize_numpy_image_area(image, area=512 * 512):
    h, w = image.shape[:2]
    k = math.sqrt(area / (h * w))
    h = int(h * k) - (int(h * k) % 16)
    w = int(w * k) - (int(w * k) % 16)
    image = cv2.resize(image, (w, h), interpolation=cv2.INTER_AREA)
    return image


def resize_numpy_image_long(image, long_edge=768):
    h, w = image.shape[:2]
    if max(h, w) <= long_edge:
        return image
    k = long_edge / max(h, w)
    h = int(h * k)
    w = int(w * k)
    image = cv2.resize(image, (w, h), interpolation=cv2.INTER_AREA)
    return image


# reference: https://github.com/huggingface/diffusers/pull/9295/files
def convert_flux_lora_to_diffusers(old_state_dict):
    new_state_dict = {}
    orig_keys = list(old_state_dict.keys())

    def handle_qkv(sds_sd, ait_sd, sds_key, ait_keys, dims=None):
        down_weight = sds_sd.pop(sds_key)
        up_weight = sds_sd.pop(sds_key.replace(".down.weight", ".up.weight"))

        # calculate dims if not provided
        num_splits = len(ait_keys)
        if dims is None:
            dims = [up_weight.shape[0] // num_splits] * num_splits
        else:
            assert sum(dims) == up_weight.shape[0]

        # make ai-toolkit weight
        ait_down_keys = [k + ".lora_A.weight" for k in ait_keys]
        ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]

        # down_weight is copied to each split
        ait_sd.update({k: down_weight for k in ait_down_keys})

        # up_weight is split to each split
        ait_sd.update({k: v for k, v in zip(ait_up_keys, torch.split(up_weight, dims, dim=0))})  # noqa: C416

    for old_key in orig_keys:
        # Handle double_blocks
        if 'double_blocks' in old_key:
            block_num = re.search(r"double_blocks_(\d+)", old_key).group(1)
            new_key = f"transformer.transformer_blocks.{block_num}"

            if "proj_lora1" in old_key:
                new_key += ".attn.to_out.0"
            elif "proj_lora2" in old_key:
                new_key += ".attn.to_add_out"
            elif "qkv_lora2" in old_key and "up" not in old_key:
                handle_qkv(
                    old_state_dict,
                    new_state_dict,
                    old_key,
                    [
                        f"transformer.transformer_blocks.{block_num}.attn.add_q_proj",
                        f"transformer.transformer_blocks.{block_num}.attn.add_k_proj",
                        f"transformer.transformer_blocks.{block_num}.attn.add_v_proj",
                    ],
                )
                # continue
            elif "qkv_lora1" in old_key and "up" not in old_key:
                handle_qkv(
                    old_state_dict,
                    new_state_dict,
                    old_key,
                    [
                        f"transformer.transformer_blocks.{block_num}.attn.to_q",
                        f"transformer.transformer_blocks.{block_num}.attn.to_k",
                        f"transformer.transformer_blocks.{block_num}.attn.to_v",
                    ],
                )
                # continue

            if "down" in old_key:
                new_key += ".lora_A.weight"
            elif "up" in old_key:
                new_key += ".lora_B.weight"

        # Handle single_blocks
        elif 'single_blocks' in old_key:
            block_num = re.search(r"single_blocks_(\d+)", old_key).group(1)
            new_key = f"transformer.single_transformer_blocks.{block_num}"

            if "proj_lora" in old_key:
                new_key += ".proj_out"
            elif "qkv_lora" in old_key and "up" not in old_key:
                handle_qkv(
                    old_state_dict,
                    new_state_dict,
                    old_key,
                    [
                        f"transformer.single_transformer_blocks.{block_num}.attn.to_q",
                        f"transformer.single_transformer_blocks.{block_num}.attn.to_k",
                        f"transformer.single_transformer_blocks.{block_num}.attn.to_v",
                    ],
                )

            if "down" in old_key:
                new_key += ".lora_A.weight"
            elif "up" in old_key:
                new_key += ".lora_B.weight"

        else:
            # Handle other potential key patterns here
            new_key = old_key

        # Since we already handle qkv above.
        if "qkv" not in old_key and 'embedding' not in old_key:
            new_state_dict[new_key] = old_state_dict.pop(old_key)

    # if len(old_state_dict) > 0:
    #     raise ValueError(f"`old_state_dict` should be at this point but has: {list(old_state_dict.keys())}.")

    return new_state_dict
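`resize_numpy_image_area` rescales an image so its pixel count approximates `area` while snapping both sides down to multiples of 16 (a typical latent-grid constraint). A quick illustration with a synthetic array, assuming only the functions defined above; the printed shapes are what the math above implies, not measured output.

import numpy as np

from dreamo.utils import resize_numpy_image_area, img2tensor

# Synthetic 900x1600 BGR image (the layout cv2.imread would produce).
image = np.zeros((900, 1600, 3), dtype=np.uint8)

resized = resize_numpy_image_area(image, area=512 * 512)
print(resized.shape)  # (384, 672, 3): roughly 512*512 pixels, both sides divisible by 16

tensor = img2tensor(resized, bgr2rgb=True, float32=True)
print(tensor.shape)   # torch.Size([3, 384, 672])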
flux_kontext_helpers.py
ADDED
@@ -0,0 +1,151 @@
# flux_kontext_helpers.py (ADUC: O Especialista Pintor - com suporte a callback)
# Copyright (C) 4 de Agosto de 2025 Carlos Rodrigues dos Santos

import torch
from PIL import Image, ImageOps
import gc
from diffusers import FluxKontextPipeline
import huggingface_hub
import os
import threading
import yaml
import logging

from hardware_manager import hardware_manager

logger = logging.getLogger(__name__)

class FluxWorker:
    """Representa uma única instância do pipeline FluxKontext em um dispositivo."""
    def __init__(self, device_id='cuda:0'):
        self.cpu_device = torch.device('cpu')
        self.device = torch.device(device_id if torch.cuda.is_available() else 'cpu')
        self.pipe = None
        self._load_pipe_to_cpu()

    def _load_pipe_to_cpu(self):
        if self.pipe is None:
            logger.info(f"FLUX Worker ({self.device}): Carregando modelo para a CPU...")
            self.pipe = FluxKontextPipeline.from_pretrained(
                "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
            ).to(self.cpu_device)
            logger.info(f"FLUX Worker ({self.device}): Modelo pronto na CPU.")

    def to_gpu(self):
        if self.device.type == 'cpu': return
        logger.info(f"FLUX Worker: Movendo modelo para a GPU {self.device}...")
        self.pipe.to(self.device)

    def to_cpu(self):
        if self.device.type == 'cpu': return
        logger.info(f"FLUX Worker: Descarregando modelo da GPU {self.device}...")
        self.pipe.to(self.cpu_device)
        gc.collect()
        if torch.cuda.is_available(): torch.cuda.empty_cache()

    def _create_composite_reference(self, images: list[Image.Image], target_width: int, target_height: int) -> Image.Image:
        if not images: return None
        valid_images = [img.convert("RGB") for img in images if img is not None]
        if not valid_images: return None
        if len(valid_images) == 1:
            if valid_images[0].size != (target_width, target_height):
                return ImageOps.fit(valid_images[0], (target_width, target_height), Image.Resampling.LANCZOS)
            return valid_images[0]

        base_height = valid_images[0].height
        resized_for_concat = []
        for img in valid_images:
            if img.height != base_height:
                aspect_ratio = img.width / img.height
                new_width = int(base_height * aspect_ratio)
                resized_for_concat.append(img.resize((new_width, base_height), Image.Resampling.LANCZOS))
            else:
                resized_for_concat.append(img)

        total_width = sum(img.width for img in resized_for_concat)
        concatenated = Image.new('RGB', (total_width, base_height))
        x_offset = 0
        for img in resized_for_concat:
            concatenated.paste(img, (x_offset, 0))
            x_offset += img.width

        final_reference = ImageOps.fit(concatenated, (target_width, target_height), Image.Resampling.LANCZOS)
        return final_reference

    @torch.inference_mode()
    def generate_image_internal(self, reference_images: list[Image.Image], prompt: str, target_width: int, target_height: int, seed: int, callback: callable = None):
        composite_reference = self._create_composite_reference(reference_images, target_width, target_height)

        num_steps = 30  # Valor fixo otimizado

        logger.info(f"\n===== [CHAMADA AO PIPELINE FLUX em {self.device}] =====\n"
                    f"  - Prompt: '{prompt}'\n"
                    f"  - Resolução: {target_width}x{target_height}, Seed: {seed}, Passos: {num_steps}\n"
                    f"  - Nº de Imagens na Composição: {len(reference_images)}\n"
                    f"==========================================")

        generated_image = self.pipe(
            image=composite_reference,
            prompt=prompt,
            guidance_scale=2.5,
            width=target_width,
            height=target_height,
            num_inference_steps=num_steps,
            generator=torch.Generator(device="cpu").manual_seed(seed),
            callback_on_step_end=callback,
            callback_on_step_end_tensor_inputs=["latents"] if callback else None
        ).images[0]

        return generated_image

class FluxPoolManager:
    def __init__(self, device_ids):
        logger.info(f"FLUX POOL MANAGER: Criando workers para os dispositivos: {device_ids}")
        self.workers = [FluxWorker(device_id) for device_id in device_ids]
        self.current_worker_index = 0
        self.lock = threading.Lock()
        self.last_cleanup_thread = None

    def _cleanup_worker_thread(self, worker):
        logger.info(f"FLUX CLEANUP THREAD: Iniciando limpeza de {worker.device} em background...")
        worker.to_cpu()

    def generate_image(self, reference_images, prompt, width, height, seed=42, callback=None):
        worker_to_use = None
        try:
            with self.lock:
                if self.last_cleanup_thread and self.last_cleanup_thread.is_alive():
                    self.last_cleanup_thread.join()
                worker_to_use = self.workers[self.current_worker_index]
                previous_worker_index = (self.current_worker_index - 1 + len(self.workers)) % len(self.workers)
                worker_to_cleanup = self.workers[previous_worker_index]
                cleanup_thread = threading.Thread(target=self._cleanup_worker_thread, args=(worker_to_cleanup,))
                cleanup_thread.start()
                self.last_cleanup_thread = cleanup_thread
                worker_to_use.to_gpu()
                self.current_worker_index = (self.current_worker_index + 1) % len(self.workers)

            logger.info(f"FLUX POOL MANAGER: Gerando imagem em {worker_to_use.device}...")
            return worker_to_use.generate_image_internal(
                reference_images=reference_images,
                prompt=prompt,
                target_width=width,
                target_height=height,
                seed=seed,
                callback=callback
            )
        except Exception as e:
            logger.error(f"FLUX POOL MANAGER: Erro durante a geração: {e}", exc_info=True)
            raise e
        finally:
            pass

# --- Instanciação Singleton Dinâmica ---
logger.info("Lendo config.yaml para inicializar o FluxKontext Pool Manager...")
with open("config.yaml", 'r') as f: config = yaml.safe_load(f)
hf_token = os.getenv('HF_TOKEN')
if hf_token: huggingface_hub.login(token=hf_token)
flux_gpus_required = config['specialists']['flux']['gpus_required']
flux_device_ids = hardware_manager.allocate_gpus('Flux', flux_gpus_required)
flux_kontext_singleton = FluxPoolManager(device_ids=flux_device_ids)
logger.info("Especialista de Imagem (Flux) pronto.")
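The module instantiates `flux_kontext_singleton` at import time, after reading `config.yaml` and reserving GPUs through `hardware_manager` (so simply importing it loads the FLUX.1-Kontext pipeline to CPU). Downstream code, such as `image_specialist.py` later in this commit, only needs the pool's `generate_image` call. A minimal usage sketch; the reference path and prompt are placeholders:

from PIL import Image

from flux_kontext_helpers import flux_kontext_singleton

# Hypothetical inputs for illustration only.
refs = [Image.open("workspace/keyframe_1.png")]
frame = flux_kontext_singleton.generate_image(
    reference_images=refs,
    prompt="the same character walking into a neon-lit alley, cinematic lighting",
    width=1024,
    height=576,
    seed=42,
)
frame.save("workspace/keyframe_2.png")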
gemini_helpers.py
ADDED
@@ -0,0 +1,257 @@
# gemini_helpers.py
# Copyright (C) 4 de Agosto de 2025 Carlos Rodrigues dos Santos
#
# Este programa é software livre: você pode redistribuí-lo e/ou modificá-lo
# sob os termos da Licença Pública Geral Affero GNU como publicada pela
# Free Software Foundation, seja a versão 3 da Licença, ou
# (a seu critério) qualquer versão posterior.
#
# AVISO DE PATENTE PENDENTE: O método e sistema ADUC implementado neste
# software está em processo de patenteamento. Consulte NOTICE.md.

import os
import logging
import json
import gradio as gr
from PIL import Image
import google.generativeai as genai
import re

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def robust_json_parser(raw_text: str) -> dict:
    clean_text = raw_text.strip()
    try:
        # Tenta encontrar o JSON delimitado por ```json ... ```
        match = re.search(r'```json\s*(\{.*?\})\s*```', clean_text, re.DOTALL)
        if match:
            json_str = match.group(1)
            return json.loads(json_str)

        # Se não encontrar, tenta encontrar o primeiro '{' e o último '}'
        start_index = clean_text.find('{')
        end_index = clean_text.rfind('}')
        if start_index != -1 and end_index != -1 and end_index > start_index:
            json_str = clean_text[start_index : end_index + 1]
            return json.loads(json_str)
        else:
            raise ValueError("Nenhum objeto JSON válido foi encontrado na resposta da IA.")
    except json.JSONDecodeError as e:
        logger.error(f"Falha ao decodificar JSON. A IA retornou o seguinte texto:\n---\n{raw_text}\n---")
        raise ValueError(f"A IA retornou um formato de JSON inválido: {e}")

class GeminiSingleton:
    def __init__(self):
        self.api_key = os.environ.get("GEMINI_API_KEY")
        if self.api_key:
            genai.configure(api_key=self.api_key)
            # Modelo mais recente e capaz para tarefas complexas de visão e raciocínio.
            self.model = genai.GenerativeModel('gemini-2.5-flash')
            logger.info("Especialista Gemini (gemini-2.5-flash) inicializado com sucesso.")
        else:
            self.model = None
            logger.warning("Chave da API Gemini não encontrada. Especialista desativado.")

    def _check_model(self):
        if not self.model:
            raise gr.Error("A chave da API do Google Gemini não está configurada (GEMINI_API_KEY).")

    def _read_prompt_template(self, filename: str) -> str:
        try:
            with open(os.path.join("prompts", filename), "r", encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            raise gr.Error(f"Arquivo de prompt não encontrado: prompts/{filename}")

    def generate_storyboard(self, prompt: str, num_keyframes: int, ref_image_paths: list[str]) -> list[str]:
        self._check_model()
        try:
            template = self._read_prompt_template("unified_storyboard_prompt.txt")
            storyboard_prompt = template.format(user_prompt=prompt, num_fragments=num_keyframes)
            model_contents = [storyboard_prompt] + [Image.open(p) for p in ref_image_paths]
            response = self.model.generate_content(model_contents)

            logger.info(f"--- RESPOSTA COMPLETA DO GEMINI (generate_storyboard) ---\n{response.text}\n--------------------")

            storyboard_data = robust_json_parser(response.text)
            storyboard = storyboard_data.get("scene_storyboard", [])
            if not storyboard or len(storyboard) != num_keyframes: raise ValueError("Número incorreto de cenas gerado.")
            return storyboard
        except Exception as e:
            raise gr.Error(f"O Roteirista (Gemini) falhou: {e}")

    def select_keyframes_from_pool(self, storyboard: list, base_image_paths: list[str], pool_image_paths: list[str]) -> list[str]:
        self._check_model()
        if not pool_image_paths:
            raise gr.Error("O 'banco de imagens' (Imagens Adicionais) está vazio.")

        try:
            template = self._read_prompt_template("keyframe_selection_prompt.txt")

            image_map = {f"IMG-{i+1}": path for i, path in enumerate(pool_image_paths)}
            base_image_map = {f"BASE-{i+1}": path for i, path in enumerate(base_image_paths)}

            model_contents = ["# Reference Images (Story Base)"]
            for identifier, path in base_image_map.items():
                model_contents.extend([f"Identifier: {identifier}", Image.open(path)])

            model_contents.append("\n# Image Pool (Scene Bank)")
            for identifier, path in image_map.items():
                model_contents.extend([f"Identifier: {identifier}", Image.open(path)])

            storyboard_str = "\n".join([f"- Scene {i+1}: {s}" for i, s in enumerate(storyboard)])
            selection_prompt = template.format(storyboard_str=storyboard_str, image_identifiers=list(image_map.keys()))
            model_contents.append(selection_prompt)

            response = self.model.generate_content(model_contents)

            logger.info(f"--- RESPOSTA COMPLETA DO GEMINI (select_keyframes_from_pool) ---\n{response.text}\n--------------------")

            selection_data = robust_json_parser(response.text)
            selected_identifiers = selection_data.get("selected_image_identifiers", [])

            if len(selected_identifiers) != len(storyboard):
                raise ValueError("A IA não selecionou o número correto de imagens para as cenas.")

            selected_paths = [image_map[identifier] for identifier in selected_identifiers]
            return selected_paths

        except Exception as e:
            raise gr.Error(f"O Fotógrafo (Gemini) falhou ao selecionar as imagens: {e}")

    def get_anticipatory_keyframe_prompt(self, global_prompt: str, scene_history: str, current_scene_desc: str, future_scene_desc: str, last_image_path: str, fixed_ref_paths: list[str]) -> str:
        self._check_model()
        try:
            template = self._read_prompt_template("anticipatory_keyframe_prompt.txt")

            director_prompt = template.format(
                historico_prompt=scene_history,
                cena_atual=current_scene_desc,
                cena_futura=future_scene_desc
            )

            model_contents = [
                "# CONTEXTO:",
                f"- Global Story Goal: {global_prompt}",
                "# VISUAL ASSETS:",
                "Current Base Image [IMG-BASE]:",
                Image.open(last_image_path)
            ]

            ref_counter = 1
            for path in fixed_ref_paths:
                if path != last_image_path:
                    model_contents.extend([f"General Reference Image [IMG-REF-{ref_counter}]:", Image.open(path)])
                    ref_counter += 1

            model_contents.append(director_prompt)

            response = self.model.generate_content(model_contents)

            logger.info(f"--- RESPOSTA COMPLETA DO GEMINI (get_anticipatory_keyframe_prompt) ---\n{response.text}\n--------------------")

            final_flux_prompt = response.text.strip()
            return final_flux_prompt
        except Exception as e:
            raise gr.Error(f"O Diretor de Arte (Gemini) falhou: {e}")

    def get_initial_motion_prompt(self, user_prompt: str, start_image_path: str, destination_image_path: str, dest_scene_desc: str) -> str:
        """Gera o prompt de movimento para a PRIMEIRA transição, que não tem um 'passado'."""
        self._check_model()
        try:
            template = self._read_prompt_template("initial_motion_prompt.txt")
            prompt_text = template.format(user_prompt=user_prompt, destination_scene_description=dest_scene_desc)
            model_contents = [
                prompt_text,
                "START Image:",
                Image.open(start_image_path),
                "DESTINATION Image:",
                Image.open(destination_image_path)
            ]
            response = self.model.generate_content(model_contents)

            logger.info(f"--- RESPOSTA COMPLETA DO GEMINI (get_initial_motion_prompt) ---\n{response.text}\n--------------------")

            return response.text.strip()
        except Exception as e:
            raise gr.Error(f"O Cineasta Inicial (Gemini) falhou: {e}")

    def get_cinematic_decision(self, global_prompt: str, story_history: str,
                               past_keyframe_path: str, present_keyframe_path: str, future_keyframe_path: str,
                               past_scene_desc: str, present_scene_desc: str, future_scene_desc: str) -> dict:
        """
        Atua como um 'Cineasta', analisando passado, presente e futuro para tomar decisões
        de edição e gerar prompts de movimento detalhados.
        """
        self._check_model()
        try:
            template = self._read_prompt_template("cinematic_director_prompt.txt")
            prompt_text = template.format(
                global_prompt=global_prompt,
                story_history=story_history,
                past_scene_desc=past_scene_desc,
                present_scene_desc=present_scene_desc,
                future_scene_desc=future_scene_desc
            )

            model_contents = [
                prompt_text,
                "[PAST_IMAGE]:", Image.open(past_keyframe_path),
                "[PRESENT_IMAGE]:", Image.open(present_keyframe_path),
                "[FUTURE_IMAGE]:", Image.open(future_keyframe_path)
            ]

            response = self.model.generate_content(model_contents)

            logger.info(f"--- RESPOSTA COMPLETA DO GEMINI (get_cinematic_decision) ---\n{response.text}\n--------------------")

            decision_data = robust_json_parser(response.text)
            if "transition_type" not in decision_data or "motion_prompt" not in decision_data:
                raise ValueError("Resposta da IA (Cineasta) está mal formatada. Faltam 'transition_type' ou 'motion_prompt'.")
            return decision_data
        except Exception as e:
            # Fallback para uma decisão segura em caso de erro
            logger.error(f"O Diretor de Cinema (Gemini) falhou: {e}. Usando fallback para 'continuous'.")
            return {
                "transition_type": "continuous",
                "motion_prompt": f"A smooth, continuous cinematic transition from '{present_scene_desc}' to '{future_scene_desc}'."
            }

    def get_sound_director_prompt(self, audio_history: str,
                                  past_keyframe_path: str, present_keyframe_path: str, future_keyframe_path: str,
                                  present_scene_desc: str, motion_prompt: str, future_scene_desc: str) -> str:
        """
        Atua como um 'Diretor de Som', analisando o contexto completo para criar um prompt
        de áudio imersivo e contínuo para a cena atual.
        """
        self._check_model()
        try:
            template = self._read_prompt_template("sound_director_prompt.txt")
            prompt_text = template.format(
                audio_history=audio_history,
                present_scene_desc=present_scene_desc,
                motion_prompt=motion_prompt,
                future_scene_desc=future_scene_desc
            )

            model_contents = [
                prompt_text,
                "[PAST_IMAGE]:", Image.open(past_keyframe_path),
                "[PRESENT_IMAGE]:", Image.open(present_keyframe_path),
                "[FUTURE_IMAGE]:", Image.open(future_keyframe_path)
            ]

            response = self.model.generate_content(model_contents)

            logger.info(f"--- RESPOSTA COMPLETA DO GEMINI (get_sound_director_prompt) ---\n{response.text}\n--------------------")

            return response.text.strip()
        except Exception as e:
            logger.error(f"O Diretor de Som (Gemini) falhou: {e}. Usando fallback.")
            return f"Sound effects matching the scene: {present_scene_desc}"


gemini_singleton = GeminiSingleton()
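`robust_json_parser` is the safety net for every structured Gemini reply: it first looks for a fenced json block and, failing that, takes the outermost brace-delimited span. A small self-contained check of both paths; the reply strings are invented for illustration.

from gemini_helpers import robust_json_parser

fenced = "Here is the plan:\n```json\n{\"scene_storyboard\": [\"a\", \"b\"]}\n```"
bare = "Sure! {\"transition_type\": \"cut\", \"motion_prompt\": \"whip pan\"} Hope this helps."

print(robust_json_parser(fenced)["scene_storyboard"])  # ['a', 'b']
print(robust_json_parser(bare)["transition_type"])     # cut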
hardware_manager.py
ADDED
@@ -0,0 +1,35 @@
# hardware_manager.py
# Gerencia a detecção e alocação de GPUs para os especialistas.
# Copyright (C) 2025 Carlos Rodrigues dos Santos

import torch
import logging

logger = logging.getLogger(__name__)

class HardwareManager:
    def __init__(self):
        self.gpus = []
        self.allocated_gpus = set()
        if torch.cuda.is_available():
            self.gpus = [f'cuda:{i}' for i in range(torch.cuda.device_count())]
            logger.info(f"Hardware Manager: Encontradas {len(self.gpus)} GPUs disponíveis: {self.gpus}")

    def allocate_gpus(self, specialist_name: str, num_required: int) -> list[str]:
        if not self.gpus or num_required == 0:
            logger.warning(f"Nenhuma GPU disponível ou solicitada para '{specialist_name}'. Alocando para CPU.")
            return ['cpu']

        available_gpus = [gpu for gpu in self.gpus if gpu not in self.allocated_gpus]

        if len(available_gpus) < num_required:
            error_msg = f"Recursos de GPU insuficientes para '{specialist_name}'. Solicitado: {num_required}, Disponível: {len(available_gpus)}."
            logger.error(error_msg)
            raise RuntimeError(error_msg)

        allocated = available_gpus[:num_required]
        self.allocated_gpus.update(allocated)
        logger.info(f"Hardware Manager: Alocando GPUs {allocated} para o especialista '{specialist_name}'.")
        return allocated

hardware_manager = HardwareManager()
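`HardwareManager` hands out exclusive GPU ids on a first-come basis and raises once the pool is exhausted, which is what lets each specialist claim its own devices at import time. A short sketch of the expected behaviour on a hypothetical 2-GPU machine:

from hardware_manager import hardware_manager

# On a machine with two CUDA devices this would print ['cuda:0'] then ['cuda:1'];
# without CUDA both calls fall back to ['cpu'].
print(hardware_manager.allocate_gpus("Flux", 1))
print(hardware_manager.allocate_gpus("LTX", 1))

# A third request for a dedicated GPU on that same 2-GPU machine would raise RuntimeError.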
i18n.json
ADDED
@@ -0,0 +1,128 @@
| 1 |
+
{
|
| 2 |
+
"pt": {
|
| 3 |
+
"app_title": "ADUC-SDR 🎬 - O Diretor de Cinema IA",
|
| 4 |
+
"app_subtitle": "Crie um filme completo com vídeo e áudio, orquestrado por uma equipe de IAs.",
|
| 5 |
+
"lang_selector_label": "Idioma / Language",
|
| 6 |
+
"step1_accordion": "Etapa 1: Roteiro e Cenas-Chave",
|
| 7 |
+
"prompt_label": "Ideia Geral do Filme",
|
| 8 |
+
"ref_images_base_label": "Imagens de Referência (Base da História)",
|
| 9 |
+
"ref_images_extra_label": "Imagens Adicionais (Banco de Cenas para o Modo Fotógrafo)",
|
| 10 |
+
"keyframes_label": "Número de Cenas-Chave",
|
| 11 |
+
"storyboard_button": "1. Gerar Roteiro",
|
| 12 |
+
"storyboard_and_keyframes_button": "1A. Gerar Roteiro e Keyframes (Modo Diretor de Arte)",
|
| 13 |
+
"storyboard_from_photos_button": "1B. Gerar Roteiro a partir de Fotos (Modo Fotógrafo)",
|
| 14 |
+
"step1_mode_b_info": "Modo Fotógrafo: As 'Imagens Adicionais' são usadas como um banco de cenas e a IA escolherá a melhor para cada parte do roteiro.",
|
| 15 |
+
"storyboard_output_label": "Roteiro Gerado (Storyboard)",
|
| 16 |
+
"step2_accordion": "Etapa 2: Os Keyframes (Especialista: Flux)",
|
| 17 |
+
"step2_description": "O Diretor de Arte (Gemini) guiará o Pintor (Flux) para criar as imagens-chave da sua história.",
|
| 18 |
+
"art_director_label": "Usar Diretor de Arte IA (para prompts de keyframe)",
|
| 19 |
+
"keyframes_button": "2. Gerar Imagens-Chave",
|
| 20 |
+
"keyframes_gallery_label": "Galeria de Cenas-Chave (Keyframes)",
|
| 21 |
+
"manual_keyframes_label": "Carregar Keyframes Manualmente",
|
| 22 |
+
"manual_separator": "--- OU ---",
|
| 23 |
+
"step3_accordion": "Etapa 3: A Produção do Filme (Especialistas: LTX & MMAudio)",
|
| 24 |
+
"step3_description": "O Diretor de Continuidade e o Cineasta irão guiar a Câmera (LTX) para filmar as transições entre os keyframes.",
|
| 25 |
+
"continuity_director_label": "Usar Diretor de Continuidade IA (para cortes)",
|
| 26 |
+
"cinematographer_label": "Usar Cineasta IA (para prompts de movimento)",
|
| 27 |
+
"duration_label": "Duração por Cena (s)",
|
| 28 |
+
"n_corte_label": "Ponto de Corte Base (%)",
|
| 29 |
+
"n_corte_info": "Percentual base da cena a ser substituído pela transição. Será ajustado dinamicamente.",
|
| 30 |
+
"convergence_chunks_label": "Máx. Chunks de Convergência",
|
| 31 |
+
"convergence_chunks_info": "Nº máx. de chunks latentes (memória) para guiar a convergência do movimento. Será ajustado dinamicamente.",
|
| 32 |
+
"path_convergence_label": "Força do Handler (Tensor)",
|
| 33 |
+
"destination_convergence_label": "Convergência do Destino (Tensor)",
|
| 34 |
+
"produce_button": "3. 🎬 Produzir Filme Completo (com Som)",
|
| 35 |
+
"advanced_accordion_label": "Configurações Avançadas (LTX)",
|
| 36 |
+
"guidance_label": "Guidance Scale",
|
| 37 |
+
"stg_label": "STG Scale",
|
| 38 |
+
"rescaling_label": "Rescaling Scale",
|
| 39 |
+
"steps_label": "Passos de Inferência",
|
| 40 |
+
"steps_info": "Mais passos podem melhorar a qualidade, mas aumentam o tempo. Ignorado para modelos 'distilled'.",
|
| 41 |
+
"video_fragments_gallery_label": "Fragmentos do Filme Gerados",
|
| 42 |
+
"final_movie_with_audio_label": "🎉 FILME COMPLETO 🎉"
|
| 43 |
+
},
|
| 44 |
+
"en": {
|
| 45 |
+
"app_title": "ADUC-SDR 🎬 - The AI Film Director",
|
| 46 |
+
"app_subtitle": "Create a complete film with video and audio, orchestrated by a team of AIs.",
|
| 47 |
+
"lang_selector_label": "Language / Idioma",
|
| 48 |
+
"step1_accordion": "Step 1: Script & Key Scenes",
|
| 49 |
+
"prompt_label": "General Film Idea",
|
| 50 |
+
"ref_images_base_label": "Reference Images (Story Base)",
|
| 51 |
+
"ref_images_extra_label": "Additional Images (Scene Bank for Photographer Mode)",
|
| 52 |
+
"keyframes_label": "Number of Key-Scenes",
|
| 53 |
+
"storyboard_button": "1. Generate Script",
|
| 54 |
+
"storyboard_and_keyframes_button": "1A. Generate Script & Keyframes (Art Director Mode)",
|
| 55 |
+
"storyboard_from_photos_button": "1B. Generate Script from Photos (Photographer Mode)",
|
| 56 |
+
"step1_mode_b_info": "Photographer Mode: 'Additional Images' are used as a scene bank, and the AI will choose the best one for each script part.",
|
| 57 |
+
"storyboard_output_label": "Generated Script (Storyboard)",
|
| 58 |
+
"step2_accordion": "Step 2: The Keyframes (Specialist: Flux)",
|
| 59 |
+
"step2_description": "The Art Director (Gemini) will guide the Painter (Flux) to create the key images of your story.",
|
| 60 |
+
"art_director_label": "Use AI Art Director (for keyframe prompts)",
|
| 61 |
+
"keyframes_button": "2. Generate Key-Images",
|
| 62 |
+
"keyframes_gallery_label": "Key-Scenes Gallery (Keyframes)",
|
| 63 |
+
"manual_keyframes_label": "Upload Keyframes Manually",
|
| 64 |
+
"manual_separator": "--- OR ---",
|
| 65 |
+
"step3_accordion": "Step 3: Film Production (Specialists: LTX & MMAudio)",
|
| 66 |
+
"step3_description": "The Continuity Director and Cinematographer will guide the Camera (LTX) to shoot the transitions between keyframes.",
|
| 67 |
+
"continuity_director_label": "Use AI Continuity Director (for cuts)",
|
| 68 |
+
"cinematographer_label": "Use AI Cinematographer (for motion prompts)",
|
| 69 |
+
"duration_label": "Duration per Scene (s)",
|
| 70 |
+
"n_corte_label": "Base Cut Point (%)",
|
| 71 |
+
"n_corte_info": "Base percentage of the scene to be replaced by the transition. Will be adjusted dynamically.",
|
| 72 |
+
"convergence_chunks_label": "Max Convergence Chunks",
|
| 73 |
+
"convergence_chunks_info": "Max number of latent chunks (memory) to guide motion convergence. Will be adjusted dynamically.",
|
| 74 |
+
"path_convergence_label": "Handler Strength (Tensor)",
|
| 75 |
+
"destination_convergence_label": "Destination Convergence (Tensor)",
|
| 76 |
+
"produce_button": "3. 🎬 Produce Complete Film (with Sound)",
|
| 77 |
+
"advanced_accordion_label": "Advanced Settings (LTX)",
|
| 78 |
+
"guidance_label": "Guidance Scale",
|
| 79 |
+
"stg_label": "STG Scale",
|
| 80 |
+
"rescaling_label": "Rescaling Scale",
|
| 81 |
+
"steps_label": "Inference Steps",
|
| 82 |
+
"steps_info": "More steps can improve quality but increase generation time. Ignored for 'distilled' models.",
|
| 83 |
+
"video_fragments_gallery_label": "Generated Film Fragments",
|
| 84 |
+
"final_movie_with_audio_label": "🎉 COMPLETE MOVIE 🎉"
|
| 85 |
+
},
|
| 86 |
+
"zh": {
|
| 87 |
+
"app_title": "ADUC-SDR 🎬 - 人工智能电影导演",
|
| 88 |
+
"app_subtitle": "由人工智能团队精心策划,根据一个想法和参考图像创作一部完整的有声电影。",
|
| 89 |
+
"lang_selector_label": "语言 / Language",
|
| 90 |
+
"step1_accordion": "第 1 步:剧本和关键场景",
|
| 91 |
+
"prompt_label": "电影总体构想",
|
| 92 |
+
"ref_images_base_label": "参考图像 (故事基础)",
|
| 93 |
+
"ref_images_extra_label": "附加图像 (摄影师模式的场景库)",
|
| 94 |
+
"keyframes_label": "关键场景数量",
|
| 95 |
+
"storyboard_button": "1. 生成剧本",
|
| 96 |
+
"storyboard_and_keyframes_button": "1A. 生成剧本和关键帧 (艺术总监模式)",
|
| 97 |
+
"storyboard_from_photos_button": "1B. 从照片生成剧本 (摄影师模式)",
|
| 98 |
+
"step1_mode_b_info": "摄影师模式:“附加图像”被用作场景库,AI将为剧本的每个部分选择最佳图像。",
|
| 99 |
+
"storyboard_output_label": "生成的剧本",
|
| 100 |
+
"step2_accordion": "第 2 步:关键帧 (专家: Flux)",
|
| 101 |
+
"step2_description": "艺术总监 (Gemini) 将指导画家 (Flux) 创作故事的关键图像。",
|
| 102 |
+
"art_director_label": "使用AI艺术总监",
|
| 103 |
+
"keyframes_button": "2. 生成关键图像",
|
| 104 |
+
"keyframes_gallery_label": "关键场景画廊 (关键帧)",
|
| 105 |
+
"manual_keyframes_label": "手动上传关键帧",
|
| 106 |
+
"manual_separator": "--- 或者 ---",
|
| 107 |
+
"step3_accordion": "第 3 步:影片制作 (专家: LTX & MMAudio)",
|
| 108 |
+
"step3_description": "连续性导演和电影摄影师将指导摄像机 (LTX) 拍摄关键帧之间的过渡。",
|
| 109 |
+
"continuity_director_label": "使用AI连续性导演",
|
| 110 |
+
"cinematographer_label": "使用AI电影摄影师",
|
| 111 |
+
"duration_label": "每场景时长 (秒)",
|
| 112 |
+
"n_corte_label": "基础剪辑点 (%)",
|
| 113 |
+
"n_corte_info": "将被过渡替换的场景基础百分比。将动态调整。",
|
| 114 |
+
"convergence_chunks_label": "最大收敛块",
|
| 115 |
+
"convergence_chunks_info": "用于引导运动收敛的最大潜在块(内存)数量。将动态调整。",
|
| 116 |
+
"path_convergence_label": "处理器强度 (张量)",
|
| 117 |
+
"destination_convergence_label": "目标收敛 (张量)",
|
| 118 |
+
"produce_button": "3. 🎬 制作完整影片 (有声)",
|
| 119 |
+
"advanced_accordion_label": "高级设置 (LTX)",
|
| 120 |
+
"guidance_label": "引导比例",
|
| 121 |
+
"stg_label": "STG 比例",
|
| 122 |
+
"rescaling_label": "重缩放比例",
|
| 123 |
+
"steps_label": "推理步骤",
|
| 124 |
+
"steps_info": "更多步骤可以提高质量,但会增加生成时间。对“distilled”模型无效。",
|
| 125 |
+
"video_fragments_gallery_label": "生成的电影片段",
|
| 126 |
+
"final_movie_with_audio_label": "🎉 完整影片 🎉"
|
| 127 |
+
}
|
| 128 |
+
}
|
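The table above maps UI label keys to localized strings per language code. As a minimal sketch only (not part of this commit), this is one way a Gradio front end could consume such a table; the helper name and the "en" fallback key are assumptions for the example.

# Hypothetical helper, not part of the commit: load one locale from i18n.json,
# falling back to English for any missing keys.
import json

def load_labels(i18n_path: str, lang: str, fallback: str = "en") -> dict:
    with open(i18n_path, "r", encoding="utf-8") as f:
        table = json.load(f)
    labels = dict(table.get(fallback, {}))   # start from the fallback locale
    labels.update(table.get(lang, {}))       # locale-specific strings win
    return labels

# Example: load_labels("i18n.json", "zh")["produce_button"]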
image_specialist.py
ADDED
@@ -0,0 +1,98 @@
# image_specialist.py
# Copyright (C) 2025 Carlos Rodrigues dos Santos
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU Affero General Public License...
# PATENT PENDING NOTICE: See NOTICE.md.

from PIL import Image
import os
import time
import logging
import gradio as gr
import yaml

from flux_kontext_helpers import flux_kontext_singleton
from gemini_helpers import gemini_singleton

logger = logging.getLogger(__name__)

class ImageSpecialist:
    """
    ADUC specialist for generating static images (keyframes).
    Responsible for the entire process of turning a storyboard into a gallery of keyframes.
    """
    def __init__(self, workspace_dir):
        self.workspace_dir = workspace_dir
        self.image_generation_helper = flux_kontext_singleton
        logger.info("Especialista de Imagem (Flux) pronto para receber ordens do Maestro.")

    def _generate_single_keyframe(self, prompt: str, reference_images: list[Image.Image], output_filename: str, width: int, height: int, callback: callable = None) -> str:
        """
        Low-level function that generates a single image.
        """
        logger.info(f"Gerando keyframe '{output_filename}' com prompt: '{prompt}'")
        generated_image = self.image_generation_helper.generate_image(
            reference_images=reference_images, prompt=prompt, width=width,
            height=height, seed=int(time.time()), callback=callback
        )
        final_path = os.path.join(self.workspace_dir, output_filename)
        generated_image.save(final_path)
        logger.info(f"Keyframe salvo com sucesso em: {final_path}")
        return final_path

    def generate_keyframes_from_storyboard(self, storyboard: list, initial_ref_path: str, global_prompt: str, keyframe_resolution: int, general_ref_paths: list, progress_callback_factory: callable = None):
        """
        Orchestrates the generation of all keyframes from a storyboard.
        """
        current_base_image_path = initial_ref_path
        previous_prompt = "N/A (imagem inicial de referência)"
        final_keyframes = [current_base_image_path]
        width, height = keyframe_resolution, keyframe_resolution

        # The number of keyframes to generate is len(storyboard) - 1, since the first keyframe already exists (initial_ref_path)
        # and the storyboard has as many entries as the total number of desired keyframes.
        num_keyframes_to_generate = len(storyboard) - 1

        logger.info(f"ESPECIALISTA DE IMAGEM: Recebi ordem para gerar {num_keyframes_to_generate} keyframes.")

        for i in range(num_keyframes_to_generate):
            # The current scene is the transition from storyboard[i] to storyboard[i+1]
            current_scene = storyboard[i]
            future_scene = storyboard[i+1]
            progress_callback = progress_callback_factory(i + 1, num_keyframes_to_generate) if progress_callback_factory else None

            logger.info(f"--> Gerando Keyframe {i+1}/{num_keyframes_to_generate}...")

            # The specialist itself queries Gemini for the image prompt
            new_flux_prompt = gemini_singleton.get_anticipatory_keyframe_prompt(
                global_prompt=global_prompt, scene_history=previous_prompt,
                current_scene_desc=current_scene, future_scene_desc=future_scene,
                last_image_path=current_base_image_path, fixed_ref_paths=general_ref_paths
            )

            images_for_flux_paths = list(set([current_base_image_path] + general_ref_paths))
            images_for_flux = [Image.open(p) for p in images_for_flux_paths]

            new_keyframe_path = self._generate_single_keyframe(
                prompt=new_flux_prompt, reference_images=images_for_flux,
                output_filename=f"keyframe_{i+1}.png", width=width, height=height,
                callback=progress_callback
            )

            final_keyframes.append(new_keyframe_path)
            current_base_image_path = new_keyframe_path
            previous_prompt = new_flux_prompt

        logger.info(f"ESPECIALISTA DE IMAGEM: Geração de keyframes concluída.")
        return final_keyframes

# Singleton instantiation - uses the workspace_dir from the config
try:
    with open("config.yaml", 'r') as f:
        config = yaml.safe_load(f)
    WORKSPACE_DIR = config['application']['workspace_dir']
    image_specialist_singleton = ImageSpecialist(workspace_dir=WORKSPACE_DIR)
except Exception as e:
    logger.error(f"Não foi possível inicializar o ImageSpecialist: {e}", exc_info=True)
    image_specialist_singleton = None
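For orientation only, a sketch (not part of this commit) of how an orchestrator might call the specialist above. The singleton and the method exist in the file; the storyboard text, the reference path, and the resolution are placeholder values assumed for the example.

# Hypothetical driver code for ImageSpecialist; the storyboard entries and
# "workspace/reference_0.png" are made-up example inputs.
from image_specialist import image_specialist_singleton

storyboard = [
    "A lighthouse at dawn, seen from the sea",
    "The keeper climbs the spiral staircase",
    "The lamp ignites as a storm rolls in",
]

if image_specialist_singleton is not None:
    keyframes = image_specialist_singleton.generate_keyframes_from_storyboard(
        storyboard=storyboard,
        initial_ref_path="workspace/reference_0.png",  # assumed to exist
        global_prompt="A short film about a lighthouse keeper",
        keyframe_resolution=1024,
        general_ref_paths=[],
    )
    # keyframes[0] is the initial reference; keyframes[1:] are the generated images.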
inference.py
ADDED
@@ -0,0 +1,774 @@
import argparse
import os
import random
from datetime import datetime
from pathlib import Path
from diffusers.utils import logging
from typing import Optional, List, Union
import yaml

import imageio
import json
import numpy as np
import torch
import cv2
from safetensors import safe_open
from PIL import Image
from transformers import (
    T5EncoderModel,
    T5Tokenizer,
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
)
from huggingface_hub import hf_hub_download

from ltx_video.models.autoencoders.causal_video_autoencoder import (
    CausalVideoAutoencoder,
)
from ltx_video.models.transformers.symmetric_patchifier import SymmetricPatchifier
from ltx_video.models.transformers.transformer3d import Transformer3DModel
from ltx_video.pipelines.pipeline_ltx_video import (
    ConditioningItem,
    LTXVideoPipeline,
    LTXMultiScalePipeline,
)
from ltx_video.schedulers.rf import RectifiedFlowScheduler
from ltx_video.utils.skip_layer_strategy import SkipLayerStrategy
from ltx_video.models.autoencoders.latent_upsampler import LatentUpsampler
import ltx_video.pipelines.crf_compressor as crf_compressor

MAX_HEIGHT = 720
MAX_WIDTH = 1280
MAX_NUM_FRAMES = 257

logger = logging.get_logger("LTX-Video")


def get_total_gpu_memory():
    if torch.cuda.is_available():
        total_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        return total_memory
    return 44


def get_device():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    return "cuda"


def load_image_to_tensor_with_resize_and_crop(
    image_input: Union[str, Image.Image],
    target_height: int = 512,
    target_width: int = 768,
    just_crop: bool = False,
) -> torch.Tensor:
    """Load and process an image into a tensor.

    Args:
        image_input: Either a file path (str) or a PIL Image object
        target_height: Desired height of output tensor
        target_width: Desired width of output tensor
        just_crop: If True, only crop the image to the target size without resizing
    """
    if isinstance(image_input, str):
        image = Image.open(image_input).convert("RGB")
    elif isinstance(image_input, Image.Image):
        image = image_input
    else:
        raise ValueError("image_input must be either a file path or a PIL Image object")

    input_width, input_height = image.size
    aspect_ratio_target = target_width / target_height
    aspect_ratio_frame = input_width / input_height
    if aspect_ratio_frame > aspect_ratio_target:
        new_width = int(input_height * aspect_ratio_target)
        new_height = input_height
        x_start = (input_width - new_width) // 2
        y_start = 0
    else:
        new_width = input_width
        new_height = int(input_width / aspect_ratio_target)
        x_start = 0
        y_start = (input_height - new_height) // 2

    image = image.crop((x_start, y_start, x_start + new_width, y_start + new_height))
    if not just_crop:
        image = image.resize((target_width, target_height))

    image = np.array(image)
    image = cv2.GaussianBlur(image, (3, 3), 0)
    frame_tensor = torch.from_numpy(image).float()
    frame_tensor = crf_compressor.compress(frame_tensor / 255.0) * 255.0
    frame_tensor = frame_tensor.permute(2, 0, 1)
    frame_tensor = (frame_tensor / 127.5) - 1.0
    # Create 5D tensor: (batch_size=1, channels=3, num_frames=1, height, width)
    return frame_tensor.unsqueeze(0).unsqueeze(2)


def calculate_padding(
    source_height: int, source_width: int, target_height: int, target_width: int
) -> tuple[int, int, int, int]:

    # Calculate total padding needed
    pad_height = target_height - source_height
    pad_width = target_width - source_width

    # Calculate padding for each side
    pad_top = pad_height // 2
    pad_bottom = pad_height - pad_top  # Handles odd padding
    pad_left = pad_width // 2
    pad_right = pad_width - pad_left  # Handles odd padding

    # Return padded tensor
    # Padding format is (left, right, top, bottom)
    padding = (pad_left, pad_right, pad_top, pad_bottom)
    return padding


def convert_prompt_to_filename(text: str, max_len: int = 20) -> str:
    # Remove non-letters and convert to lowercase
    clean_text = "".join(
        char.lower() for char in text if char.isalpha() or char.isspace()
    )

    # Split into words
    words = clean_text.split()

    # Build result string keeping track of length
    result = []
    current_length = 0

    for word in words:
        # Add word length plus 1 for underscore (except for first word)
        new_length = current_length + len(word)

        if new_length <= max_len:
            result.append(word)
            current_length += len(word)
        else:
            break

    return "-".join(result)


# Generate output video name
def get_unique_filename(
    base: str,
    ext: str,
    prompt: str,
    seed: int,
    resolution: tuple[int, int, int],
    dir: Path,
    endswith=None,
    index_range=1000,
) -> Path:
    base_filename = f"{base}_{convert_prompt_to_filename(prompt, max_len=30)}_{seed}_{resolution[0]}x{resolution[1]}x{resolution[2]}"
    for i in range(index_range):
        filename = dir / f"{base_filename}_{i}{endswith if endswith else ''}{ext}"
        if not os.path.exists(filename):
            return filename
    raise FileExistsError(
        f"Could not find a unique filename after {index_range} attempts."
    )


def seed_everething(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
    if torch.backends.mps.is_available():
        torch.mps.manual_seed(seed)


def main():
    parser = argparse.ArgumentParser(
        description="Load models from separate directories and run the pipeline."
    )

    # Directories
    parser.add_argument(
        "--output_path",
        type=str,
        default=None,
        help="Path to the folder to save output video, if None will save in outputs/ directory.",
    )
    parser.add_argument("--seed", type=int, default="171198")

    # Pipeline parameters
    parser.add_argument(
        "--num_images_per_prompt",
        type=int,
        default=1,
        help="Number of images per prompt",
    )
    parser.add_argument(
        "--image_cond_noise_scale",
        type=float,
        default=0.15,
        help="Amount of noise to add to the conditioned image",
    )
    parser.add_argument(
        "--height",
        type=int,
        default=704,
        help="Height of the output video frames. Optional if an input image provided.",
    )
    parser.add_argument(
        "--width",
        type=int,
        default=1216,
        help="Width of the output video frames. If None will infer from input image.",
    )
    parser.add_argument(
        "--num_frames",
        type=int,
        default=121,
        help="Number of frames to generate in the output video",
    )
    parser.add_argument(
        "--frame_rate", type=int, default=30, help="Frame rate for the output video"
    )
    parser.add_argument(
        "--device",
        default=None,
        help="Device to run inference on. If not specified, will automatically detect and use CUDA or MPS if available, else CPU.",
    )
    parser.add_argument(
        "--pipeline_config",
        type=str,
        default="configs/ltxv-13b-0.9.7-dev.yaml",
        help="The path to the config file for the pipeline, which contains the parameters for the pipeline",
    )

    # Prompts
    parser.add_argument(
        "--prompt",
        type=str,
        help="Text prompt to guide generation",
    )
    parser.add_argument(
        "--negative_prompt",
        type=str,
        default="worst quality, inconsistent motion, blurry, jittery, distorted",
        help="Negative prompt for undesired features",
    )

    parser.add_argument(
        "--offload_to_cpu",
        action="store_true",
        help="Offloading unnecessary computations to CPU.",
    )

    # video-to-video arguments:
    parser.add_argument(
        "--input_media_path",
        type=str,
        default=None,
        help="Path to the input video (or image) to be modified using the video-to-video pipeline",
    )

    # Conditioning arguments
    parser.add_argument(
        "--conditioning_media_paths",
        type=str,
        nargs="*",
        help="List of paths to conditioning media (images or videos). Each path will be used as a conditioning item.",
    )
    parser.add_argument(
        "--conditioning_strengths",
        type=float,
        nargs="*",
        help="List of conditioning strengths (between 0 and 1) for each conditioning item. Must match the number of conditioning items.",
    )
    parser.add_argument(
        "--conditioning_start_frames",
        type=int,
        nargs="*",
        help="List of frame indices where each conditioning item should be applied. Must match the number of conditioning items.",
    )

    args = parser.parse_args()
    logger.warning(f"Running generation with arguments: {args}")
    infer(**vars(args))


def create_ltx_video_pipeline(
    ckpt_path: str,
    precision: str,
    text_encoder_model_name_or_path: str,
    sampler: Optional[str] = None,
    device: Optional[str] = None,
    enhance_prompt: bool = False,
    prompt_enhancer_image_caption_model_name_or_path: Optional[str] = None,
    prompt_enhancer_llm_model_name_or_path: Optional[str] = None,
) -> LTXVideoPipeline:
    ckpt_path = Path(ckpt_path)
    assert os.path.exists(
        ckpt_path
    ), f"Ckpt path provided (--ckpt_path) {ckpt_path} does not exist"

    with safe_open(ckpt_path, framework="pt") as f:
        metadata = f.metadata()
        config_str = metadata.get("config")
        configs = json.loads(config_str)
        allowed_inference_steps = configs.get("allowed_inference_steps", None)

    vae = CausalVideoAutoencoder.from_pretrained(ckpt_path)
    transformer = Transformer3DModel.from_pretrained(ckpt_path)

    # Use constructor if sampler is specified, otherwise use from_pretrained
    if sampler == "from_checkpoint" or not sampler:
        scheduler = RectifiedFlowScheduler.from_pretrained(ckpt_path)
    else:
        scheduler = RectifiedFlowScheduler(
            sampler=("Uniform" if sampler.lower() == "uniform" else "LinearQuadratic")
        )

    text_encoder = T5EncoderModel.from_pretrained(
        text_encoder_model_name_or_path, subfolder="text_encoder"
    )
    patchifier = SymmetricPatchifier(patch_size=1)
    tokenizer = T5Tokenizer.from_pretrained(
        text_encoder_model_name_or_path, subfolder="tokenizer"
    )

    transformer = transformer.to(device)
    vae = vae.to(device)
    text_encoder = text_encoder.to(device)

    if enhance_prompt:
        prompt_enhancer_image_caption_model = AutoModelForCausalLM.from_pretrained(
            prompt_enhancer_image_caption_model_name_or_path, trust_remote_code=True
        )
        prompt_enhancer_image_caption_processor = AutoProcessor.from_pretrained(
            prompt_enhancer_image_caption_model_name_or_path, trust_remote_code=True
        )
        prompt_enhancer_llm_model = AutoModelForCausalLM.from_pretrained(
            prompt_enhancer_llm_model_name_or_path,
            torch_dtype="bfloat16",
        )
        prompt_enhancer_llm_tokenizer = AutoTokenizer.from_pretrained(
            prompt_enhancer_llm_model_name_or_path,
        )
    else:
        prompt_enhancer_image_caption_model = None
        prompt_enhancer_image_caption_processor = None
        prompt_enhancer_llm_model = None
        prompt_enhancer_llm_tokenizer = None

    vae = vae.to(torch.bfloat16)
    if precision == "bfloat16" and transformer.dtype != torch.bfloat16:
        transformer = transformer.to(torch.bfloat16)
    text_encoder = text_encoder.to(torch.bfloat16)

    # Use submodels for the pipeline
    submodel_dict = {
        "transformer": transformer,
        "patchifier": patchifier,
        "text_encoder": text_encoder,
        "tokenizer": tokenizer,
        "scheduler": scheduler,
        "vae": vae,
        "prompt_enhancer_image_caption_model": prompt_enhancer_image_caption_model,
        "prompt_enhancer_image_caption_processor": prompt_enhancer_image_caption_processor,
        "prompt_enhancer_llm_model": prompt_enhancer_llm_model,
        "prompt_enhancer_llm_tokenizer": prompt_enhancer_llm_tokenizer,
        "allowed_inference_steps": allowed_inference_steps,
    }

    pipeline = LTXVideoPipeline(**submodel_dict)
    pipeline = pipeline.to(device)
    return pipeline


def create_latent_upsampler(latent_upsampler_model_path: str, device: str):
    latent_upsampler = LatentUpsampler.from_pretrained(latent_upsampler_model_path)
    latent_upsampler.to(device)
    latent_upsampler.eval()
    return latent_upsampler


def infer(
    output_path: Optional[str],
    seed: int,
    pipeline_config: str,
    image_cond_noise_scale: float,
    height: Optional[int],
    width: Optional[int],
    num_frames: int,
    frame_rate: int,
    prompt: str,
    negative_prompt: str,
    offload_to_cpu: bool,
    input_media_path: Optional[str] = None,
    conditioning_media_paths: Optional[List[str]] = None,
    conditioning_strengths: Optional[List[float]] = None,
    conditioning_start_frames: Optional[List[int]] = None,
    device: Optional[str] = None,
    **kwargs,
):
    # check if pipeline_config is a file
    if not os.path.isfile(pipeline_config):
        raise ValueError(f"Pipeline config file {pipeline_config} does not exist")
    with open(pipeline_config, "r") as f:
        pipeline_config = yaml.safe_load(f)

    models_dir = "MODEL_DIR"

    ltxv_model_name_or_path = pipeline_config["checkpoint_path"]
    if not os.path.isfile(ltxv_model_name_or_path):
        ltxv_model_path = hf_hub_download(
            repo_id="Lightricks/LTX-Video",
            filename=ltxv_model_name_or_path,
            local_dir=models_dir,
            repo_type="model",
        )
    else:
        ltxv_model_path = ltxv_model_name_or_path

    spatial_upscaler_model_name_or_path = pipeline_config.get(
        "spatial_upscaler_model_path"
    )
    if spatial_upscaler_model_name_or_path and not os.path.isfile(
        spatial_upscaler_model_name_or_path
    ):
        spatial_upscaler_model_path = hf_hub_download(
            repo_id="Lightricks/LTX-Video",
            filename=spatial_upscaler_model_name_or_path,
            local_dir=models_dir,
            repo_type="model",
        )
    else:
        spatial_upscaler_model_path = spatial_upscaler_model_name_or_path

    if kwargs.get("input_image_path", None):
        logger.warning(
            "Please use conditioning_media_paths instead of input_image_path."
        )
        assert not conditioning_media_paths and not conditioning_start_frames
        conditioning_media_paths = [kwargs["input_image_path"]]
        conditioning_start_frames = [0]

    # Validate conditioning arguments
    if conditioning_media_paths:
        # Use default strengths of 1.0
        if not conditioning_strengths:
            conditioning_strengths = [1.0] * len(conditioning_media_paths)
        if not conditioning_start_frames:
            raise ValueError(
                "If `conditioning_media_paths` is provided, "
                "`conditioning_start_frames` must also be provided"
            )
        if len(conditioning_media_paths) != len(conditioning_strengths) or len(
            conditioning_media_paths
        ) != len(conditioning_start_frames):
            raise ValueError(
                "`conditioning_media_paths`, `conditioning_strengths`, "
                "and `conditioning_start_frames` must have the same length"
            )
        if any(s < 0 or s > 1 for s in conditioning_strengths):
            raise ValueError("All conditioning strengths must be between 0 and 1")
        if any(f < 0 or f >= num_frames for f in conditioning_start_frames):
            raise ValueError(
                f"All conditioning start frames must be between 0 and {num_frames-1}"
            )

    seed_everething(seed)
    if offload_to_cpu and not torch.cuda.is_available():
        logger.warning(
            "offload_to_cpu is set to True, but offloading will not occur since the model is already running on CPU."
        )
        offload_to_cpu = False
    else:
        offload_to_cpu = offload_to_cpu and get_total_gpu_memory() < 30

    output_dir = (
        Path(output_path)
        if output_path
        else Path(f"outputs/{datetime.today().strftime('%Y-%m-%d')}")
    )
    output_dir.mkdir(parents=True, exist_ok=True)

    # Adjust dimensions to be divisible by 32 and num_frames to be (N * 8 + 1)
    height_padded = ((height - 1) // 32 + 1) * 32
    width_padded = ((width - 1) // 32 + 1) * 32
    num_frames_padded = ((num_frames - 2) // 8 + 1) * 8 + 1

    padding = calculate_padding(height, width, height_padded, width_padded)

    logger.warning(
        f"Padded dimensions: {height_padded}x{width_padded}x{num_frames_padded}"
    )

    prompt_enhancement_words_threshold = pipeline_config[
        "prompt_enhancement_words_threshold"
    ]

    prompt_word_count = len(prompt.split())
    enhance_prompt = (
        prompt_enhancement_words_threshold > 0
        and prompt_word_count < prompt_enhancement_words_threshold
    )

    if prompt_enhancement_words_threshold > 0 and not enhance_prompt:
        logger.info(
            f"Prompt has {prompt_word_count} words, which exceeds the threshold of {prompt_enhancement_words_threshold}. Prompt enhancement disabled."
        )

    precision = pipeline_config["precision"]
    text_encoder_model_name_or_path = pipeline_config["text_encoder_model_name_or_path"]
    sampler = pipeline_config["sampler"]
    prompt_enhancer_image_caption_model_name_or_path = pipeline_config[
        "prompt_enhancer_image_caption_model_name_or_path"
    ]
    prompt_enhancer_llm_model_name_or_path = pipeline_config[
        "prompt_enhancer_llm_model_name_or_path"
    ]

    pipeline = create_ltx_video_pipeline(
        ckpt_path=ltxv_model_path,
        precision=precision,
        text_encoder_model_name_or_path=text_encoder_model_name_or_path,
        sampler=sampler,
        device=kwargs.get("device", get_device()),
        enhance_prompt=enhance_prompt,
        prompt_enhancer_image_caption_model_name_or_path=prompt_enhancer_image_caption_model_name_or_path,
        prompt_enhancer_llm_model_name_or_path=prompt_enhancer_llm_model_name_or_path,
    )

    if pipeline_config.get("pipeline_type", None) == "multi-scale":
        if not spatial_upscaler_model_path:
            raise ValueError(
                "spatial upscaler model path is missing from pipeline config file and is required for multi-scale rendering"
            )
        latent_upsampler = create_latent_upsampler(
            spatial_upscaler_model_path, pipeline.device
        )
        pipeline = LTXMultiScalePipeline(pipeline, latent_upsampler=latent_upsampler)

    media_item = None
    if input_media_path:
        media_item = load_media_file(
            media_path=input_media_path,
            height=height,
            width=width,
            max_frames=num_frames_padded,
            padding=padding,
        )

    conditioning_items = (
        prepare_conditioning(
            conditioning_media_paths=conditioning_media_paths,
            conditioning_strengths=conditioning_strengths,
            conditioning_start_frames=conditioning_start_frames,
            height=height,
            width=width,
            num_frames=num_frames,
            padding=padding,
            pipeline=pipeline,
        )
        if conditioning_media_paths
        else None
    )

    stg_mode = pipeline_config.get("stg_mode", "attention_values")
    del pipeline_config["stg_mode"]
    if stg_mode.lower() == "stg_av" or stg_mode.lower() == "attention_values":
        skip_layer_strategy = SkipLayerStrategy.AttentionValues
    elif stg_mode.lower() == "stg_as" or stg_mode.lower() == "attention_skip":
        skip_layer_strategy = SkipLayerStrategy.AttentionSkip
    elif stg_mode.lower() == "stg_r" or stg_mode.lower() == "residual":
        skip_layer_strategy = SkipLayerStrategy.Residual
    elif stg_mode.lower() == "stg_t" or stg_mode.lower() == "transformer_block":
        skip_layer_strategy = SkipLayerStrategy.TransformerBlock
    else:
        raise ValueError(f"Invalid spatiotemporal guidance mode: {stg_mode}")

    # Prepare input for the pipeline
    sample = {
        "prompt": prompt,
        "prompt_attention_mask": None,
        "negative_prompt": negative_prompt,
        "negative_prompt_attention_mask": None,
    }

    device = device or get_device()
    generator = torch.Generator(device=device).manual_seed(seed)

    images = pipeline(
        **pipeline_config,
        skip_layer_strategy=skip_layer_strategy,
        generator=generator,
        output_type="pt",
        callback_on_step_end=None,
        height=height_padded,
        width=width_padded,
        num_frames=num_frames_padded,
        frame_rate=frame_rate,
        **sample,
        media_items=media_item,
        conditioning_items=conditioning_items,
        is_video=True,
        vae_per_channel_normalize=True,
        image_cond_noise_scale=image_cond_noise_scale,
        mixed_precision=(precision == "mixed_precision"),
        offload_to_cpu=offload_to_cpu,
        device=device,
        enhance_prompt=enhance_prompt,
    ).images

    # Crop the padded images to the desired resolution and number of frames
    (pad_left, pad_right, pad_top, pad_bottom) = padding
    pad_bottom = -pad_bottom
    pad_right = -pad_right
    if pad_bottom == 0:
        pad_bottom = images.shape[3]
    if pad_right == 0:
        pad_right = images.shape[4]
    images = images[:, :, :num_frames, pad_top:pad_bottom, pad_left:pad_right]

    for i in range(images.shape[0]):
        # Gathering from B, C, F, H, W to C, F, H, W and then permuting to F, H, W, C
        video_np = images[i].permute(1, 2, 3, 0).cpu().float().numpy()
        # Unnormalizing images to [0, 255] range
        video_np = (video_np * 255).astype(np.uint8)
        fps = frame_rate
        height, width = video_np.shape[1:3]
        # In case a single image is generated
        if video_np.shape[0] == 1:
            output_filename = get_unique_filename(
                f"image_output_{i}",
                ".png",
                prompt=prompt,
                seed=seed,
                resolution=(height, width, num_frames),
                dir=output_dir,
            )
            imageio.imwrite(output_filename, video_np[0])
        else:
            output_filename = get_unique_filename(
                f"video_output_{i}",
                ".mp4",
                prompt=prompt,
                seed=seed,
                resolution=(height, width, num_frames),
                dir=output_dir,
            )

            # Write video
            with imageio.get_writer(output_filename, fps=fps) as video:
                for frame in video_np:
                    video.append_data(frame)

        logger.warning(f"Output saved to {output_filename}")


def prepare_conditioning(
    conditioning_media_paths: List[str],
    conditioning_strengths: List[float],
    conditioning_start_frames: List[int],
    height: int,
    width: int,
    num_frames: int,
    padding: tuple[int, int, int, int],
    pipeline: LTXVideoPipeline,
) -> Optional[List[ConditioningItem]]:
    """Prepare conditioning items based on input media paths and their parameters.

    Args:
        conditioning_media_paths: List of paths to conditioning media (images or videos)
        conditioning_strengths: List of conditioning strengths for each media item
        conditioning_start_frames: List of frame indices where each item should be applied
        height: Height of the output frames
        width: Width of the output frames
        num_frames: Number of frames in the output video
        padding: Padding to apply to the frames
        pipeline: LTXVideoPipeline object used for condition video trimming

    Returns:
        A list of ConditioningItem objects.
    """
    conditioning_items = []
    for path, strength, start_frame in zip(
        conditioning_media_paths, conditioning_strengths, conditioning_start_frames
    ):
        num_input_frames = orig_num_input_frames = get_media_num_frames(path)
        if hasattr(pipeline, "trim_conditioning_sequence") and callable(
            getattr(pipeline, "trim_conditioning_sequence")
        ):
            num_input_frames = pipeline.trim_conditioning_sequence(
                start_frame, orig_num_input_frames, num_frames
            )
        if num_input_frames < orig_num_input_frames:
            logger.warning(
                f"Trimming conditioning video {path} from {orig_num_input_frames} to {num_input_frames} frames."
            )

        media_tensor = load_media_file(
            media_path=path,
            height=height,
            width=width,
            max_frames=num_input_frames,
            padding=padding,
            just_crop=True,
        )
        conditioning_items.append(ConditioningItem(media_tensor, start_frame, strength))
    return conditioning_items


def get_media_num_frames(media_path: str) -> int:
    is_video = any(
        media_path.lower().endswith(ext) for ext in [".mp4", ".avi", ".mov", ".mkv"]
    )
    num_frames = 1
    if is_video:
        reader = imageio.get_reader(media_path)
        num_frames = reader.count_frames()
        reader.close()
    return num_frames


def load_media_file(
    media_path: str,
    height: int,
    width: int,
    max_frames: int,
    padding: tuple[int, int, int, int],
    just_crop: bool = False,
) -> torch.Tensor:
    is_video = any(
        media_path.lower().endswith(ext) for ext in [".mp4", ".avi", ".mov", ".mkv"]
    )
    if is_video:
        reader = imageio.get_reader(media_path)
        num_input_frames = min(reader.count_frames(), max_frames)

        # Read and preprocess the relevant frames from the video file.
        frames = []
        for i in range(num_input_frames):
            frame = Image.fromarray(reader.get_data(i))
            frame_tensor = load_image_to_tensor_with_resize_and_crop(
                frame, height, width, just_crop=just_crop
            )
            frame_tensor = torch.nn.functional.pad(frame_tensor, padding)
            frames.append(frame_tensor)
        reader.close()

        # Stack frames along the temporal dimension
        media_tensor = torch.cat(frames, dim=2)
    else:  # Input image
        media_tensor = load_image_to_tensor_with_resize_and_crop(
            media_path, height, width, just_crop=just_crop
        )
        media_tensor = torch.nn.functional.pad(media_tensor, padding)
    return media_tensor


if __name__ == "__main__":
    main()
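inference.py exposes both a CLI (main) and a programmatic entry point (infer). As a sketch only (not part of this commit), one could call infer() directly; the prompt, output directory, and seed below are example values, and the config path is the default that ships in configs/ in this commit.

# Hypothetical direct call to infer(); argument values are illustrative.
from inference import infer

infer(
    output_path="outputs/demo",
    seed=171198,
    pipeline_config="configs/ltxv-13b-0.9.7-dev.yaml",
    image_cond_noise_scale=0.15,
    height=704,
    width=1216,
    num_frames=121,
    frame_rate=30,
    prompt="A sailboat gliding across a calm bay at sunset",
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
    offload_to_cpu=False,
)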
ltx_manager_helpers.py
ADDED
@@ -0,0 +1,198 @@
# ltx_manager_helpers.py
# Copyright (C) August 4, 2025 Carlos Rodrigues dos Santos
#
# ORIGINAL SOURCE: LTX-Video by Lightricks Ltd. & other open-source projects.
# Licensed under the Apache License, Version 2.0
# https://github.com/Lightricks/LTX-Video
#
# MODIFICATIONS FOR ADUC-SDR_Video:
# This file is part of ADUC-SDR_Video, a derivative work based on LTX-Video.
# It has been modified to manage pools of LTX workers, handle GPU memory,
# and prepare parameters for the ADUC-SDR orchestration framework.
# All modifications are also licensed under the Apache License, Version 2.0.

import torch
import gc
import os
import yaml
import logging
import huggingface_hub
import time
import threading
import json

from optimization import optimize_ltx_worker, can_optimize_fp8
from hardware_manager import hardware_manager
from inference import create_ltx_video_pipeline, calculate_padding
from ltx_video.pipelines.pipeline_ltx_video import LatentConditioningItem
from ltx_video.models.autoencoders.vae_encode import vae_decode

logger = logging.getLogger(__name__)

class LtxWorker:
    def __init__(self, device_id, ltx_config_file):
        self.cpu_device = torch.device('cpu')
        self.device = torch.device(device_id if torch.cuda.is_available() else 'cpu')
        logger.info(f"LTX Worker ({self.device}): Inicializando com config '{ltx_config_file}'...")

        with open(ltx_config_file, "r") as file:
            self.config = yaml.safe_load(file)

        self.is_distilled = "distilled" in self.config.get("checkpoint_path", "")

        models_dir = "downloaded_models_gradio"

        logger.info(f"LTX Worker ({self.device}): Carregando modelo para a CPU...")
        model_path = os.path.join(models_dir, self.config["checkpoint_path"])
        if not os.path.exists(model_path):
            model_path = huggingface_hub.hf_hub_download(
                repo_id="Lightricks/LTX-Video", filename=self.config["checkpoint_path"],
                local_dir=models_dir, local_dir_use_symlinks=False
            )

        self.pipeline = create_ltx_video_pipeline(
            ckpt_path=model_path, precision=self.config["precision"],
            text_encoder_model_name_or_path=self.config["text_encoder_model_name_or_path"],
            sampler=self.config["sampler"], device='cpu'
        )
        logger.info(f"LTX Worker ({self.device}): Modelo pronto na CPU. É um modelo destilado? {self.is_distilled}")

        if self.device.type == 'cuda' and can_optimize_fp8():
            logger.info(f"LTX Worker ({self.device}): GPU com suporte a FP8 detectada. Iniciando otimização...")
            self.pipeline.to(self.device)
            optimize_ltx_worker(self)
            self.pipeline.to(self.cpu_device)
            logger.info(f"LTX Worker ({self.device}): Otimização concluída. Modelo pronto.")
        elif self.device.type == 'cuda':
            logger.info(f"LTX Worker ({self.device}): Otimização FP8 não suportada ou desativada. Usando modelo padrão.")

    def to_gpu(self):
        if self.device.type == 'cpu': return
        logger.info(f"LTX Worker: Movendo pipeline para a GPU {self.device}...")
        self.pipeline.to(self.device)

    def to_cpu(self):
        if self.device.type == 'cpu': return
        logger.info(f"LTX Worker: Descarregando pipeline da GPU {self.device}...")
        self.pipeline.to('cpu')
        gc.collect()
        if torch.cuda.is_available(): torch.cuda.empty_cache()

    def generate_video_fragment_internal(self, **kwargs):
        return self.pipeline(**kwargs).images

class LtxPoolManager:
    def __init__(self, device_ids, ltx_config_file):
        logger.info(f"LTX POOL MANAGER: Criando workers para os dispositivos: {device_ids}")
        self.workers = [LtxWorker(dev_id, ltx_config_file) for dev_id in device_ids]
        self.current_worker_index = 0
        self.lock = threading.Lock()
        self.last_cleanup_thread = None

    def _cleanup_worker_thread(self, worker):
        logger.info(f"LTX CLEANUP THREAD: Iniciando limpeza de {worker.device} em background...")
        worker.to_cpu()

    def _prepare_and_log_params(self, worker_to_use, **kwargs):
        target_device = worker_to_use.device
        height, width = kwargs['height'], kwargs['width']

        conditioning_data = kwargs.get('conditioning_items_data', [])
        final_conditioning_items = []

        # --- ADDED LOG: details of the conditioning tensors ---
        conditioning_log_details = []
        for i, item in enumerate(conditioning_data):
            if hasattr(item, 'latent_tensor'):
                item.latent_tensor = item.latent_tensor.to(target_device)
                final_conditioning_items.append(item)
                conditioning_log_details.append(
                    f"  - Item {i}: frame={item.media_frame_number}, strength={item.conditioning_strength:.2f}, shape={list(item.latent_tensor.shape)}"
                )

        first_pass_config = worker_to_use.config.get("first_pass", {})
        padded_h, padded_w = ((height - 1) // 32 + 1) * 32, ((width - 1) // 32 + 1) * 32
        padding_vals = calculate_padding(height, width, padded_h, padded_w)

        pipeline_params = {
            "height": padded_h, "width": padded_w,
            "num_frames": kwargs['video_total_frames'], "frame_rate": kwargs['video_fps'],
            "generator": torch.Generator(device=target_device).manual_seed(int(kwargs.get('seed', time.time())) + kwargs['current_fragment_index']),
            "conditioning_items": final_conditioning_items,
            "is_video": True, "vae_per_channel_normalize": True,
            "decode_timestep": float(kwargs.get('decode_timestep', worker_to_use.config.get("decode_timestep", 0.05))),
            "decode_noise_scale": float(kwargs.get('decode_noise_scale', worker_to_use.config.get("decode_noise_scale", 0.025))),
            "image_cond_noise_scale": float(kwargs.get('image_cond_noise_scale', 0.0)),
            "stochastic_sampling": bool(kwargs.get('stochastic_sampling', worker_to_use.config.get("stochastic_sampling", False))),
            "prompt": kwargs['motion_prompt'],
            "negative_prompt": kwargs.get('negative_prompt', "blurry, distorted, static, bad quality, artifacts"),
            "guidance_scale": float(kwargs.get('guidance_scale', 1.0)),
            "stg_scale": float(kwargs.get('stg_scale', 0.0)),
            "rescaling_scale": float(kwargs.get('rescaling_scale', 1.0)),
        }

        if worker_to_use.is_distilled:
            pipeline_params["timesteps"] = first_pass_config.get("timesteps")
            pipeline_params["num_inference_steps"] = len(pipeline_params["timesteps"]) if "timesteps" in first_pass_config else 8
        else:
            pipeline_params["num_inference_steps"] = int(kwargs.get('num_inference_steps', 7))

        # --- ADDED LOG: full dump of the pipeline parameters ---
        log_friendly_params = pipeline_params.copy()
        log_friendly_params.pop('generator', None)
        log_friendly_params.pop('conditioning_items', None)

        logger.info("="*60)
        logger.info(f"CHAMADA AO PIPELINE LTX NO DISPOSITIVO: {worker_to_use.device}")
        logger.info(f"Modelo: {'Distilled' if worker_to_use.is_distilled else 'Base'}")
        logger.info("-" * 20 + " PARÂMETROS DA PIPELINE " + "-" * 20)
        logger.info(json.dumps(log_friendly_params, indent=2))
        logger.info("-" * 20 + " ITENS DE CONDICIONAMENTO " + "-" * 19)
        logger.info("\n".join(conditioning_log_details))
        logger.info("="*60)
        # --- END OF ADDED LOG ---

        return pipeline_params, padding_vals

    def generate_latent_fragment(self, **kwargs) -> (torch.Tensor, tuple):
        worker_to_use = None
        progress = kwargs.get('progress')
        try:
            with self.lock:
                if self.last_cleanup_thread and self.last_cleanup_thread.is_alive():
                    self.last_cleanup_thread.join()
                worker_to_use = self.workers[self.current_worker_index]
                previous_worker_index = (self.current_worker_index - 1 + len(self.workers)) % len(self.workers)
                worker_to_cleanup = self.workers[previous_worker_index]
                cleanup_thread = threading.Thread(target=self._cleanup_worker_thread, args=(worker_to_cleanup,))
                cleanup_thread.start()
                self.last_cleanup_thread = cleanup_thread
                worker_to_use.to_gpu()
                self.current_worker_index = (self.current_worker_index + 1) % len(self.workers)

            pipeline_params, padding_vals = self._prepare_and_log_params(worker_to_use, **kwargs)
            pipeline_params['output_type'] = "latent"

            if progress: progress(0.1, desc=f"[Especialista LTX em {worker_to_use.device}] Gerando latentes...")

            with torch.no_grad():
                result_tensor = worker_to_use.generate_video_fragment_internal(**pipeline_params)

            return result_tensor, padding_vals
        except Exception as e:
            logger.error(f"LTX POOL MANAGER: Erro durante a geração de latentes: {e}", exc_info=True)
            raise e
        finally:
            if worker_to_use:
                logger.info(f"LTX POOL MANAGER: Executando limpeza final para {worker_to_use.device}...")
                worker_to_use.to_cpu()


logger.info("Lendo config.yaml para inicializar o LTX Pool Manager...")
with open("config.yaml", 'r') as f:
    config = yaml.safe_load(f)
ltx_gpus_required = config['specialists']['ltx']['gpus_required']
ltx_device_ids = hardware_manager.allocate_gpus('LTX', ltx_gpus_required)
ltx_config_path = config['specialists']['ltx']['config_file']
ltx_manager_singleton = LtxPoolManager(device_ids=ltx_device_ids, ltx_config_file=ltx_config_path)
logger.info("Especialista de Vídeo (LTX) pronto.")
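For orientation only, a sketch (not part of this commit) of requesting one latent fragment from the pool manager above. The keyword names are the ones _prepare_and_log_params reads; the values and the empty conditioning list are placeholders assumed for the example.

# Hypothetical call to the pool manager; values are illustrative placeholders.
from ltx_manager_helpers import ltx_manager_singleton

latents, padding = ltx_manager_singleton.generate_latent_fragment(
    height=512,
    width=768,
    video_total_frames=97,
    video_fps=24,
    current_fragment_index=0,
    motion_prompt="slow dolly-in toward the lighthouse",
    conditioning_items_data=[],   # would normally hold LatentConditioningItem objects
    guidance_scale=1.0,
    stg_scale=0.0,
    num_inference_steps=7,
)
# latents: raw latent tensor in "latent" output mode; padding: the (left, right, top, bottom) values used.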
ltx_video/LICENSE.txt
ADDED
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
|
| 77 |
+
use, offer to sell, sell, import, and otherwise transfer the Work,
|
| 78 |
+
where such license applies only to those patent claims licensable
|
| 79 |
+
by such Contributor that are necessarily infringed by their
|
| 80 |
+
Contribution(s) alone or by combination of their Contribution(s)
|
| 81 |
+
with the Work to which such Contribution(s) was submitted. If You
|
| 82 |
+
institute patent litigation against any entity (including a
|
| 83 |
+
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
| 84 |
+
or a Contribution incorporated within the Work constitutes direct
|
| 85 |
+
or contributory patent infringement, then any patent licenses
|
| 86 |
+
granted to You under this License for that Work shall terminate
|
| 87 |
+
as of the date such litigation is filed.
|
| 88 |
+
|
| 89 |
+
4. Redistribution. You may reproduce and distribute copies of the
|
| 90 |
+
Work or Derivative Works thereof in any medium, with or without
|
| 91 |
+
modifications, and in Source or Object form, provided that You
|
| 92 |
+
meet the following conditions:
|
| 93 |
+
|
| 94 |
+
(a) You must give any other recipients of the Work or
|
| 95 |
+
Derivative Works a copy of this License; and
|
| 96 |
+
|
| 97 |
+
(b) You must cause any modified files to carry prominent notices
|
| 98 |
+
stating that You changed the files; and
|
| 99 |
+
|
| 100 |
+
(c) You must retain, in the Source form of any Derivative Works
|
| 101 |
+
that You distribute, all copyright, patent, trademark, and
|
| 102 |
+
attribution notices from the Source form of the Work,
|
| 103 |
+
excluding those notices that do not pertain to any part of
|
| 104 |
+
the Derivative Works; and
|
| 105 |
+
|
| 106 |
+
(d) If the Work includes a "NOTICE" text file as part of its
|
| 107 |
+
distribution, then any Derivative Works that You distribute must
|
| 108 |
+
include a readable copy of the attribution notices contained
|
| 109 |
+
within such NOTICE file, excluding those notices that do not
|
| 110 |
+
pertain to any part of the Derivative Works, in at least one
|
| 111 |
+
of the following places: within a NOTICE text file distributed
|
| 112 |
+
as part of the Derivative Works; within the Source form or
|
| 113 |
+
documentation, if provided along with the Derivative Works; or,
|
| 114 |
+
within a display generated by the Derivative Works, if and
|
| 115 |
+
wherever such third-party notices normally appear. The contents
|
| 116 |
+
of the NOTICE file are for informational purposes only and
|
| 117 |
+
do not modify the License. You may add Your own attribution
|
| 118 |
+
notices within Derivative Works that You distribute, alongside
|
| 119 |
+
or as an addendum to the NOTICE text from the Work, provided
|
| 120 |
+
that such additional attribution notices cannot be construed
|
| 121 |
+
as modifying the License.
|
| 122 |
+
|
| 123 |
+
You may add Your own copyright statement to Your modifications and
|
| 124 |
+
may provide additional or different license terms and conditions
|
| 125 |
+
for use, reproduction, or distribution of Your modifications, or
|
| 126 |
+
for any such Derivative Works as a whole, provided Your use,
|
| 127 |
+
reproduction, and distribution of the Work otherwise complies with
|
| 128 |
+
the conditions stated in this License.
|
| 129 |
+
|
| 130 |
+
5. Submission of Contributions. Unless You explicitly state otherwise,
|
| 131 |
+
any Contribution intentionally submitted for inclusion in the Work
|
| 132 |
+
by You to the Licensor shall be under the terms and conditions of
|
| 133 |
+
this License, without any additional terms or conditions.
|
| 134 |
+
Notwithstanding the above, nothing herein shall supersede or modify
|
| 135 |
+
the terms of any separate license agreement you may have executed
|
| 136 |
+
with Licensor regarding such Contributions.
|
| 137 |
+
|
| 138 |
+
6. Trademarks. This License does not grant permission to use the trade
|
| 139 |
+
names, trademarks, service marks, or product names of the Licensor,
|
| 140 |
+
except as required for reasonable and customary use in describing the
|
| 141 |
+
origin of the Work and reproducing the content of the NOTICE file.
|
| 142 |
+
|
| 143 |
+
7. Disclaimer of Warranty. Unless required by applicable law or
|
| 144 |
+
agreed to in writing, Licensor provides the Work (and each
|
| 145 |
+
Contributor provides its Contributions) on an "AS IS" BASIS,
|
| 146 |
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
| 147 |
+
implied, including, without limitation, any warranties or conditions
|
| 148 |
+
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
| 149 |
+
PARTICULAR PURPOSE. You are solely responsible for determining the
|
| 150 |
+
appropriateness of using or redistributing the Work and assume any
|
| 151 |
+
risks associated with Your exercise of permissions under this License.
|
| 152 |
+
|
| 153 |
+
8. Limitation of Liability. In no event and under no legal theory,
|
| 154 |
+
whether in tort (including negligence), contract, or otherwise,
|
| 155 |
+
unless required by applicable law (such as deliberate and grossly
|
| 156 |
+
negligent acts) or agreed to in writing, shall any Contributor be
|
| 157 |
+
liable to You for damages, including any direct, indirect, special,
|
| 158 |
+
incidental, or consequential damages of any character arising as a
|
| 159 |
+
result of this License or out of the use or inability to use the
|
| 160 |
+
Work (including but not limited to damages for loss of goodwill,
|
| 161 |
+
work stoppage, computer failure or malfunction, or any and all
|
| 162 |
+
other commercial damages or losses), even if such Contributor
|
| 163 |
+
has been advised of the possibility of such damages.
|
| 164 |
+
|
| 165 |
+
9. Accepting Warranty or Additional Liability. While redistributing
|
| 166 |
+
the Work or Derivative Works thereof, You may choose to offer,
|
| 167 |
+
and charge a fee for, acceptance of support, warranty, indemnity,
|
| 168 |
+
or other liability obligations and/or rights consistent with this
|
| 169 |
+
License. However, in accepting such obligations, You may act only
|
| 170 |
+
on Your own behalf and on Your sole responsibility, not on behalf
|
| 171 |
+
of any other Contributor, and only if You agree to indemnify,
|
| 172 |
+
defend, and hold each Contributor harmless for any liability
|
| 173 |
+
incurred by, or claims asserted against, such Contributor by reason
|
| 174 |
+
of your accepting any such warranty or additional liability.
|
| 175 |
+
|
| 176 |
+
END OF TERMS AND CONDITIONS
|
| 177 |
+
|
| 178 |
+
APPENDIX: How to apply the Apache License to your work.
|
| 179 |
+
|
| 180 |
+
To apply the Apache License to your work, attach the following
|
| 181 |
+
boilerplate notice, with the fields enclosed by brackets "[]"
|
| 182 |
+
replaced with your own identifying information. (Don't include
|
| 183 |
+
the brackets!) The text should be enclosed in the appropriate
|
| 184 |
+
comment syntax for the file format. We also recommend that a
|
| 185 |
+
file or class name and description of purpose be included on the
|
| 186 |
+
same "printed page" as the copyright notice for easier
|
| 187 |
+
identification within third-party archives.
|
| 188 |
+
|
| 189 |
+
Copyright [yyyy] [name of copyright owner]
|
| 190 |
+
|
| 191 |
+
Licensed under the Apache License, Version 2.0 (the "License");
|
| 192 |
+
you may not use this file except in compliance with the License.
|
| 193 |
+
You may obtain a copy of the License at
|
| 194 |
+
|
| 195 |
+
http://www.apache.org/licenses/LICENSE-2.0
|
| 196 |
+
|
| 197 |
+
Unless required by applicable law or agreed to in writing, software
|
| 198 |
+
distributed under the License is distributed on an "AS IS" BASIS,
|
| 199 |
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 200 |
+
See the License for the specific language governing permissions and
|
| 201 |
+
limitations under the License.
|
ltx_video/README.md ADDED
@@ -0,0 +1,135 @@
# 🛠️ helpers/ - Third-Party AI Tools Adapted for ADUC-SDR

This folder contains adapted implementations of third-party AI models and utilities, which serve as low-level "specialists" or "tools" for the ADUC-SDR architecture.

**IMPORTANT:** The content of this folder is authored by its respective original creators and developers. This folder is **NOT PART** of the main ADUC-SDR project in terms of its novel architecture. It serves as a repository for the **direct, modified dependencies** that the `DeformesXDEngines` (the stages of the ADUC-SDR "rocket") invoke to perform specific tasks (image, video, and audio generation).

The modifications made to the files here aim primarily at:
1. **Interface Adaptation:** Standardizing the interfaces so that they fit into the ADUC-SDR orchestration flow.
2. **Resource Management:** Integrating model load/unload logic (GPU management) and configuration via YAML files.
3. **Pipeline Optimization:** Adjusting the pipelines to accept more efficient input formats (e.g., pre-encoded tensors instead of media paths, skipping redundant encode/decode steps).

---

## 📄 Licensing

The original content of the projects listed below is licensed under the **Apache License 2.0**, or another license specified by the original authors. All modifications and the use of these files within the `helpers/` structure of the ADUC-SDR project comply with the terms of the **Apache License 2.0**.

The original licenses of the projects can be found at their respective sources or in the `incl_licenses/` subdirectories inside each adapted module.

---

## 🛠️ Helper API and Usage Guide

This section details how each helper (specialist agent) should be used within the ADUC-SDR ecosystem. All agents are instantiated as **singletons** in `hardware_manager.py` to guarantee centralized management of GPU resources.

### **gemini_helpers.py (GeminiAgent)**

* **Purpose:** Acts as the "Adaptive Synthesis Oracle", responsible for all natural-language processing tasks such as storyboard creation, prompt generation, and narrative decision-making. A minimal usage sketch follows this section.
* **Singleton Instance:** `gemini_agent_singleton`
* **Constructor:** `GeminiAgent()`
    * Reads `configs/gemini_config.yaml` for the model name, inference parameters, and prompt-template paths. The API key is read from the `GEMINI_API_KEY` environment variable.
* **Public Methods:**
    * `generate_storyboard(prompt: str, num_keyframes: int, ref_image_paths: list[str])`
        * **Inputs:**
            * `prompt`: The overall idea of the film (string).
            * `num_keyframes`: The number of scenes to generate (int).
            * `ref_image_paths`: List of paths to the reference images (list[str]).
        * **Output:** `tuple[list[str], str]` (a tuple containing the list of storyboard strings and a textual report of the operation).
    * `select_keyframes_from_pool(storyboard: list, base_image_paths: list[str], pool_image_paths: list[str])`
        * **Inputs:**
            * `storyboard`: The list of generated storyboard strings.
            * `base_image_paths`: Base reference images (list[str]).
            * `pool_image_paths`: The "image bank" to select from (list[str]).
        * **Output:** `tuple[list[str], str]` (a tuple containing the list of selected image paths and a textual report).
    * `get_anticipatory_keyframe_prompt(...)`
        * **Inputs:** Narrative and visual context used to generate an image prompt.
        * **Output:** `tuple[str, str]` (a tuple containing the generated prompt for the image model and a textual report).
    * `get_initial_motion_prompt(...)`
        * **Inputs:** Narrative and visual context for the first video transition.
        * **Output:** `tuple[str, str]` (a tuple containing the generated motion prompt and a textual report).
    * `get_transition_decision(...)`
        * **Inputs:** Narrative and visual context for an intermediate video transition.
        * **Output:** `tuple[dict, str]` (a tuple containing a `{"transition_type": "...", "motion_prompt": "..."}` dictionary and a textual report).
    * `generate_audio_prompts(...)`
        * **Inputs:** Global narrative context.
        * **Output:** `tuple[dict, str]` (a tuple containing a `{"music_prompt": "...", "sfx_prompt": "..."}` dictionary and a textual report).

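A minimal usage sketch for `generate_storyboard`, assuming `gemini_helpers` is importable from the application root and `GEMINI_API_KEY` is set; the prompt and file names below are placeholders:

```python
# Illustrative sketch only; not part of the committed code.
from gemini_helpers import gemini_agent_singleton

storyboard, report = gemini_agent_singleton.generate_storyboard(
    prompt="A lighthouse keeper discovers a bioluminescent creature",
    num_keyframes=4,
    ref_image_paths=["refs/keeper.png", "refs/lighthouse.png"],
)
print(report)
for scene in storyboard:
    print("-", scene)
```
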
### **flux_kontext_helpers.py (FluxPoolManager)**

* **Purpose:** Specialist in high-quality image (keyframe) generation using the FluxKontext pipeline. Manages a pool of workers to make the best use of multiple GPUs. See the sketch after this list.
* **Singleton Instance:** `flux_kontext_singleton`
* **Constructor:** `FluxPoolManager(device_ids: list[str], flux_config_file: str)`
    * Reads `configs/flux_config.yaml`.
* **Public Method:**
    * `generate_image(prompt: str, reference_images: list[Image.Image], width: int, height: int, seed: int = 42, callback: callable = None)`
        * **Inputs:**
            * `prompt`: Text prompt guiding the generation (string).
            * `reference_images`: List of `PIL.Image` objects used as visual reference.
            * `width`, `height`: Output image dimensions (int).
            * `seed`: Seed for reproducibility (int).
            * `callback`: Optional callback function for progress monitoring.
        * **Output:** `PIL.Image.Image` (the generated image object).

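A minimal usage sketch, assuming `flux_kontext_singleton` can be imported from `flux_kontext_helpers` as documented above; paths, prompt, and dimensions are placeholders:

```python
# Illustrative sketch only; not part of the committed code.
from PIL import Image
from flux_kontext_helpers import flux_kontext_singleton

refs = [Image.open("refs/keeper.png"), Image.open("refs/lighthouse.png")]
keyframe = flux_kontext_singleton.generate_image(
    prompt="The keeper raises a lantern over the tide pools, cinematic lighting",
    reference_images=refs,
    width=1024,
    height=576,
    seed=42,
)
keyframe.save("keyframes/scene_01.png")
```
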
### **dreamo_helpers.py (DreamOAgent)**

* **Purpose:** Specialist in high-quality image (keyframe) generation using the DreamO pipeline, with advanced editing and style capabilities driven by reference images.
* **Singleton Instance:** `dreamo_agent_singleton`
* **Constructor:** `DreamOAgent(device_id: str = None)`
    * Reads `configs/dreamo_config.yaml`.
* **Public Method:**
    * `generate_image(prompt: str, reference_images: list[Image.Image], width: int, height: int)`
        * **Inputs:**
            * `prompt`: Text prompt guiding the generation (string).
            * `reference_images`: List of `PIL.Image` objects used as visual reference. The internal logic assigns the first image as `style` and the remaining ones as `ip`.
            * `width`, `height`: Output image dimensions (int).
        * **Output:** `PIL.Image.Image` (the generated image object).

### **ltx_manager_helpers.py (LtxPoolManager)**

* **Purpose:** Specialist in generating video fragments in latent space using the LTX-Video pipeline. Manages a pool of workers to make the best use of multiple GPUs. See the sketch after this list.
* **Singleton Instance:** `ltx_manager_singleton`
* **Constructor:** `LtxPoolManager(device_ids: list[str], ltx_model_config_file: str, ltx_global_config_file: str)`
    * Reads `ltx_global_config_file` and `ltx_model_config_file` to configure the pipeline.
* **Public Method:**
    * `generate_latent_fragment(**kwargs)`
        * **Inputs:** Dictionary of keyword arguments (`kwargs`) containing all LTX pipeline parameters, including:
            * `height`, `width`: Video dimensions (int).
            * `video_total_frames`: Total number of frames to generate (int).
            * `video_fps`: Frames per second (int).
            * `motion_prompt`: Motion prompt (string).
            * `conditioning_items_data`: List of `LatentConditioningItem` objects containing the conditioning latent tensors.
            * `guidance_scale`, `stg_scale`, `num_inference_steps`, etc.
        * **Output:** `tuple[torch.Tensor, tuple]` (a tuple containing the generated latent tensor and the padding values used).

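A minimal usage sketch, assuming `ltx_manager_singleton` is importable from `ltx_manager_helpers`; the numeric values are placeholders and `conditioning_items` is a stub for a list of `LatentConditioningItem` built elsewhere by the engine:

```python
# Illustrative sketch only; not part of the committed code.
from ltx_manager_helpers import ltx_manager_singleton

latents, padding_vals = ltx_manager_singleton.generate_latent_fragment(
    height=576,
    width=1024,
    video_total_frames=121,
    video_fps=24,
    motion_prompt="slow dolly-in towards the lighthouse door",
    conditioning_items_data=conditioning_items,  # list[LatentConditioningItem], built upstream
    guidance_scale=3.0,
    num_inference_steps=20,
)
print(latents.shape, padding_vals)
```
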
### **mmaudio_helper.py (MMAudioAgent)**

* **Purpose:** Specialist in generating audio for a given video fragment. See the sketch after this list.
* **Singleton Instance:** `mmaudio_agent_singleton`
* **Constructor:** `MMAudioAgent(workspace_dir: str, device_id: str = None, mmaudio_config_file: str)`
    * Reads `configs/mmaudio_config.yaml`.
* **Public Method:**
    * `generate_audio_for_video(video_path: str, prompt: str, negative_prompt: str, duration_seconds: float)`
        * **Inputs:**
            * `video_path`: Path to the silent video file (string).
            * `prompt`: Text prompt guiding the audio generation (string).
            * `negative_prompt`: Negative prompt for the audio (string).
            * `duration_seconds`: Exact duration of the video (float).
        * **Output:** `str` (the path to the new video file with the audio track muxed in).

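A minimal usage sketch, assuming `mmaudio_agent_singleton` is importable from `mmaudio_helper`; paths and prompts are placeholders:

```python
# Illustrative sketch only; not part of the committed code.
from mmaudio_helper import mmaudio_agent_singleton

video_with_audio = mmaudio_agent_singleton.generate_audio_for_video(
    video_path="fragments/scene_01_silent.mp4",
    prompt="gentle waves, distant seagulls, creaking wood",
    negative_prompt="music, speech",
    duration_seconds=5.0,
)
print("Muxed video written to:", video_with_audio)
```
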
---

## 🔗 Original Projects and Attributions
(The attributions and licenses section remains the same as previously defined.)

### DreamO
* **Original Repository:** [https://github.com/bytedance/DreamO](https://github.com/bytedance/DreamO)
...

### LTX-Video
* **Original Repository:** [https://github.com/Lightricks/LTX-Video](https://github.com/Lightricks/LTX-Video)
...

### MMAudio
* **Original Repository:** [https://github.com/hkchengrex/MMAudio](https://github.com/hkchengrex/MMAudio)
...
ltx_video/__init__.py ADDED
File without changes

ltx_video/models/__init__.py ADDED
File without changes

ltx_video/models/autoencoders/__init__.py ADDED
File without changes

ltx_video/models/autoencoders/causal_conv3d.py ADDED
@@ -0,0 +1,63 @@
from typing import Tuple, Union

import torch
import torch.nn as nn


class CausalConv3d(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size: int = 3,
        stride: Union[int, Tuple[int]] = 1,
        dilation: int = 1,
        groups: int = 1,
        spatial_padding_mode: str = "zeros",
        **kwargs,
    ):
        super().__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels

        kernel_size = (kernel_size, kernel_size, kernel_size)
        self.time_kernel_size = kernel_size[0]

        dilation = (dilation, 1, 1)

        height_pad = kernel_size[1] // 2
        width_pad = kernel_size[2] // 2
        padding = (0, height_pad, width_pad)

        self.conv = nn.Conv3d(
            in_channels,
            out_channels,
            kernel_size,
            stride=stride,
            dilation=dilation,
            padding=padding,
            padding_mode=spatial_padding_mode,
            groups=groups,
        )

    def forward(self, x, causal: bool = True):
        if causal:
            first_frame_pad = x[:, :, :1, :, :].repeat(
                (1, 1, self.time_kernel_size - 1, 1, 1)
            )
            x = torch.concatenate((first_frame_pad, x), dim=2)
        else:
            first_frame_pad = x[:, :, :1, :, :].repeat(
                (1, 1, (self.time_kernel_size - 1) // 2, 1, 1)
            )
            last_frame_pad = x[:, :, -1:, :, :].repeat(
                (1, 1, (self.time_kernel_size - 1) // 2, 1, 1)
            )
            x = torch.concatenate((first_frame_pad, x, last_frame_pad), dim=2)
        x = self.conv(x)
        return x

    @property
    def weight(self):
        return self.conv.weight
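Editorial note (not part of the commit): a small self-contained check of what the causal padding above does. With `causal=True`, the first frame is replicated `kernel_size - 1` times in front of the clip before the temporal convolution, so at stride 1 the frame count is preserved and no future frame influences an earlier output. The snippet assumes the `ltx_video` package is importable:

```python
# Sketch only: verifies the shape-preserving behavior of CausalConv3d.
import torch
from ltx_video.models.autoencoders.causal_conv3d import CausalConv3d

conv = CausalConv3d(in_channels=3, out_channels=8, kernel_size=3, stride=1)
video = torch.randn(1, 3, 9, 64, 64)  # (B, C, F, H, W)

out = conv(video, causal=True)
# Temporal dim unchanged: 2 replicated leading frames compensate for the
# un-padded temporal axis of the underlying nn.Conv3d.
print(out.shape)  # torch.Size([1, 8, 9, 64, 64])
```
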
ltx_video/models/autoencoders/causal_video_autoencoder.py ADDED
@@ -0,0 +1,1398 @@
import json
import os
from functools import partial
from types import SimpleNamespace
from typing import Any, Mapping, Optional, Tuple, Union, List
from pathlib import Path

import torch
import numpy as np
from einops import rearrange
from torch import nn
from diffusers.utils import logging
import torch.nn.functional as F
from diffusers.models.embeddings import PixArtAlphaCombinedTimestepSizeEmbeddings
from safetensors import safe_open


from ltx_video.models.autoencoders.conv_nd_factory import make_conv_nd, make_linear_nd
from ltx_video.models.autoencoders.pixel_norm import PixelNorm
from ltx_video.models.autoencoders.pixel_shuffle import PixelShuffleND
from ltx_video.models.autoencoders.vae import AutoencoderKLWrapper
from ltx_video.models.transformers.attention import Attention
from ltx_video.utils.diffusers_config_mapping import (
    diffusers_and_ours_config_mapping,
    make_hashable_key,
    VAE_KEYS_RENAME_DICT,
)

PER_CHANNEL_STATISTICS_PREFIX = "per_channel_statistics."
logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


class CausalVideoAutoencoder(AutoencoderKLWrapper):
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
        *args,
        **kwargs,
    ):
        pretrained_model_name_or_path = Path(pretrained_model_name_or_path)
        if (
            pretrained_model_name_or_path.is_dir()
            and (pretrained_model_name_or_path / "autoencoder.pth").exists()
        ):
            config_local_path = pretrained_model_name_or_path / "config.json"
            config = cls.load_config(config_local_path, **kwargs)

            model_local_path = pretrained_model_name_or_path / "autoencoder.pth"
            state_dict = torch.load(model_local_path, map_location=torch.device("cpu"))

            statistics_local_path = (
                pretrained_model_name_or_path / "per_channel_statistics.json"
            )
            if statistics_local_path.exists():
                with open(statistics_local_path, "r") as file:
                    data = json.load(file)
                transposed_data = list(zip(*data["data"]))
                data_dict = {
                    col: torch.tensor(vals)
                    for col, vals in zip(data["columns"], transposed_data)
                }
                std_of_means = data_dict["std-of-means"]
                mean_of_means = data_dict.get(
                    "mean-of-means", torch.zeros_like(data_dict["std-of-means"])
                )
                state_dict[f"{PER_CHANNEL_STATISTICS_PREFIX}std-of-means"] = (
                    std_of_means
                )
                state_dict[f"{PER_CHANNEL_STATISTICS_PREFIX}mean-of-means"] = (
                    mean_of_means
                )

        elif pretrained_model_name_or_path.is_dir():
            config_path = pretrained_model_name_or_path / "vae" / "config.json"
            with open(config_path, "r") as f:
                config = make_hashable_key(json.load(f))

            assert config in diffusers_and_ours_config_mapping, (
                "Provided diffusers checkpoint config for VAE is not suppported. "
                "We only support diffusers configs found in Lightricks/LTX-Video."
            )

            config = diffusers_and_ours_config_mapping[config]

            state_dict_path = (
                pretrained_model_name_or_path
                / "vae"
                / "diffusion_pytorch_model.safetensors"
            )

            state_dict = {}
            with safe_open(state_dict_path, framework="pt", device="cpu") as f:
                for k in f.keys():
                    state_dict[k] = f.get_tensor(k)
            for key in list(state_dict.keys()):
                new_key = key
                for replace_key, rename_key in VAE_KEYS_RENAME_DICT.items():
                    new_key = new_key.replace(replace_key, rename_key)

                state_dict[new_key] = state_dict.pop(key)

        elif pretrained_model_name_or_path.is_file() and str(
            pretrained_model_name_or_path
        ).endswith(".safetensors"):
            state_dict = {}
            with safe_open(
                pretrained_model_name_or_path, framework="pt", device="cpu"
            ) as f:
                metadata = f.metadata()
                for k in f.keys():
                    state_dict[k] = f.get_tensor(k)
            configs = json.loads(metadata["config"])
            config = configs["vae"]

        video_vae = cls.from_config(config)
        if "torch_dtype" in kwargs:
            video_vae.to(kwargs["torch_dtype"])
        video_vae.load_state_dict(state_dict)
        return video_vae

    @staticmethod
    def from_config(config):
        assert (
            config["_class_name"] == "CausalVideoAutoencoder"
        ), "config must have _class_name=CausalVideoAutoencoder"
        if isinstance(config["dims"], list):
            config["dims"] = tuple(config["dims"])

        assert config["dims"] in [2, 3, (2, 1)], "dims must be 2, 3 or (2, 1)"

        double_z = config.get("double_z", True)
        latent_log_var = config.get(
            "latent_log_var", "per_channel" if double_z else "none"
        )
        use_quant_conv = config.get("use_quant_conv", True)
        normalize_latent_channels = config.get("normalize_latent_channels", False)

        if use_quant_conv and latent_log_var in ["uniform", "constant"]:
            raise ValueError(
                f"latent_log_var={latent_log_var} requires use_quant_conv=False"
            )

        encoder = Encoder(
            dims=config["dims"],
            in_channels=config.get("in_channels", 3),
            out_channels=config["latent_channels"],
            blocks=config.get("encoder_blocks", config.get("blocks")),
            patch_size=config.get("patch_size", 1),
            latent_log_var=latent_log_var,
            norm_layer=config.get("norm_layer", "group_norm"),
            base_channels=config.get("encoder_base_channels", 128),
            spatial_padding_mode=config.get("spatial_padding_mode", "zeros"),
        )

        decoder = Decoder(
            dims=config["dims"],
            in_channels=config["latent_channels"],
            out_channels=config.get("out_channels", 3),
            blocks=config.get("decoder_blocks", config.get("blocks")),
            patch_size=config.get("patch_size", 1),
            norm_layer=config.get("norm_layer", "group_norm"),
            causal=config.get("causal_decoder", False),
            timestep_conditioning=config.get("timestep_conditioning", False),
            base_channels=config.get("decoder_base_channels", 128),
            spatial_padding_mode=config.get("spatial_padding_mode", "zeros"),
        )

        dims = config["dims"]
        return CausalVideoAutoencoder(
            encoder=encoder,
            decoder=decoder,
            latent_channels=config["latent_channels"],
            dims=dims,
            use_quant_conv=use_quant_conv,
            normalize_latent_channels=normalize_latent_channels,
        )

    @property
    def config(self):
        return SimpleNamespace(
            _class_name="CausalVideoAutoencoder",
            dims=self.dims,
            in_channels=self.encoder.conv_in.in_channels // self.encoder.patch_size**2,
            out_channels=self.decoder.conv_out.out_channels
            // self.decoder.patch_size**2,
            latent_channels=self.decoder.conv_in.in_channels,
            encoder_blocks=self.encoder.blocks_desc,
            decoder_blocks=self.decoder.blocks_desc,
            scaling_factor=1.0,
            norm_layer=self.encoder.norm_layer,
            patch_size=self.encoder.patch_size,
            latent_log_var=self.encoder.latent_log_var,
            use_quant_conv=self.use_quant_conv,
            causal_decoder=self.decoder.causal,
            timestep_conditioning=self.decoder.timestep_conditioning,
            normalize_latent_channels=self.normalize_latent_channels,
        )

    @property
    def is_video_supported(self):
        """
        Check if the model supports video inputs of shape (B, C, F, H, W). Otherwise, the model only supports 2D images.
        """
        return self.dims != 2

    @property
    def spatial_downscale_factor(self):
        return (
            2
            ** len(
                [
                    block
                    for block in self.encoder.blocks_desc
                    if block[0]
                    in [
                        "compress_space",
                        "compress_all",
                        "compress_all_res",
                        "compress_space_res",
                    ]
                ]
            )
            * self.encoder.patch_size
        )

    @property
    def temporal_downscale_factor(self):
        return 2 ** len(
            [
                block
                for block in self.encoder.blocks_desc
                if block[0]
                in [
                    "compress_time",
                    "compress_all",
                    "compress_all_res",
                    "compress_time_res",
                ]
            ]
        )

    def to_json_string(self) -> str:
        import json

        return json.dumps(self.config.__dict__)

    def load_state_dict(self, state_dict: Mapping[str, Any], strict: bool = True):
        if any([key.startswith("vae.") for key in state_dict.keys()]):
            state_dict = {
                key.replace("vae.", ""): value
                for key, value in state_dict.items()
                if key.startswith("vae.")
            }
        ckpt_state_dict = {
            key: value
            for key, value in state_dict.items()
            if not key.startswith(PER_CHANNEL_STATISTICS_PREFIX)
        }

        model_keys = set(name for name, _ in self.named_modules())

        key_mapping = {
            ".resnets.": ".res_blocks.",
            "downsamplers.0": "downsample",
            "upsamplers.0": "upsample",
        }
        converted_state_dict = {}
        for key, value in ckpt_state_dict.items():
            for k, v in key_mapping.items():
                key = key.replace(k, v)

            key_prefix = ".".join(key.split(".")[:-1])
            if "norm" in key and key_prefix not in model_keys:
                logger.info(
                    f"Removing key {key} from state_dict as it is not present in the model"
                )
                continue

            converted_state_dict[key] = value

        super().load_state_dict(converted_state_dict, strict=strict)

        data_dict = {
            key.removeprefix(PER_CHANNEL_STATISTICS_PREFIX): value
            for key, value in state_dict.items()
            if key.startswith(PER_CHANNEL_STATISTICS_PREFIX)
        }
        if len(data_dict) > 0:
            self.register_buffer("std_of_means", data_dict["std-of-means"])
            self.register_buffer(
                "mean_of_means",
                data_dict.get(
                    "mean-of-means", torch.zeros_like(data_dict["std-of-means"])
                ),
            )

    def last_layer(self):
        if hasattr(self.decoder, "conv_out"):
            if isinstance(self.decoder.conv_out, nn.Sequential):
                last_layer = self.decoder.conv_out[-1]
            else:
                last_layer = self.decoder.conv_out
        else:
            last_layer = self.decoder.layers[-1]
        return last_layer

    def set_use_tpu_flash_attention(self):
        for block in self.decoder.up_blocks:
            if isinstance(block, UNetMidBlock3D) and block.attention_blocks:
                for attention_block in block.attention_blocks:
                    attention_block.set_use_tpu_flash_attention()


class Encoder(nn.Module):
    r"""
    The `Encoder` layer of a variational autoencoder that encodes its input into a latent representation.

    Args:
        dims (`int` or `Tuple[int, int]`, *optional*, defaults to 3):
            The number of dimensions to use in convolutions.
        in_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        out_channels (`int`, *optional*, defaults to 3):
            The number of output channels.
        blocks (`List[Tuple[str, int]]`, *optional*, defaults to `[("res_x", 1)]`):
            The blocks to use. Each block is a tuple of the block name and the number of layers.
        base_channels (`int`, *optional*, defaults to 128):
            The number of output channels for the first convolutional layer.
        norm_num_groups (`int`, *optional*, defaults to 32):
            The number of groups for normalization.
        patch_size (`int`, *optional*, defaults to 1):
            The patch size to use. Should be a power of 2.
        norm_layer (`str`, *optional*, defaults to `group_norm`):
            The normalization layer to use. Can be either `group_norm` or `pixel_norm`.
        latent_log_var (`str`, *optional*, defaults to `per_channel`):
            The number of channels for the log variance. Can be either `per_channel`, `uniform`, `constant` or `none`.
    """

    def __init__(
        self,
        dims: Union[int, Tuple[int, int]] = 3,
        in_channels: int = 3,
        out_channels: int = 3,
        blocks: List[Tuple[str, int | dict]] = [("res_x", 1)],
        base_channels: int = 128,
        norm_num_groups: int = 32,
        patch_size: Union[int, Tuple[int]] = 1,
        norm_layer: str = "group_norm",  # group_norm, pixel_norm
        latent_log_var: str = "per_channel",
        spatial_padding_mode: str = "zeros",
    ):
        super().__init__()
        self.patch_size = patch_size
        self.norm_layer = norm_layer
        self.latent_channels = out_channels
        self.latent_log_var = latent_log_var
        self.blocks_desc = blocks

        in_channels = in_channels * patch_size**2
        output_channel = base_channels

        self.conv_in = make_conv_nd(
            dims=dims,
            in_channels=in_channels,
            out_channels=output_channel,
            kernel_size=3,
            stride=1,
            padding=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )

        self.down_blocks = nn.ModuleList([])

        for block_name, block_params in blocks:
            input_channel = output_channel
            if isinstance(block_params, int):
                block_params = {"num_layers": block_params}

            if block_name == "res_x":
                block = UNetMidBlock3D(
                    dims=dims,
                    in_channels=input_channel,
                    num_layers=block_params["num_layers"],
                    resnet_eps=1e-6,
                    resnet_groups=norm_num_groups,
                    norm_layer=norm_layer,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "res_x_y":
                output_channel = block_params.get("multiplier", 2) * output_channel
                block = ResnetBlock3D(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    eps=1e-6,
                    groups=norm_num_groups,
                    norm_layer=norm_layer,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_time":
                block = make_conv_nd(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    kernel_size=3,
                    stride=(2, 1, 1),
                    causal=True,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_space":
                block = make_conv_nd(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    kernel_size=3,
                    stride=(1, 2, 2),
                    causal=True,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_all":
                block = make_conv_nd(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    kernel_size=3,
                    stride=(2, 2, 2),
                    causal=True,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_all_x_y":
                output_channel = block_params.get("multiplier", 2) * output_channel
                block = make_conv_nd(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    kernel_size=3,
                    stride=(2, 2, 2),
                    causal=True,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_all_res":
                output_channel = block_params.get("multiplier", 2) * output_channel
                block = SpaceToDepthDownsample(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    stride=(2, 2, 2),
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_space_res":
                output_channel = block_params.get("multiplier", 2) * output_channel
                block = SpaceToDepthDownsample(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    stride=(1, 2, 2),
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "compress_time_res":
                output_channel = block_params.get("multiplier", 2) * output_channel
                block = SpaceToDepthDownsample(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    stride=(2, 1, 1),
                    spatial_padding_mode=spatial_padding_mode,
                )
            else:
                raise ValueError(f"unknown block: {block_name}")

            self.down_blocks.append(block)

        # out
        if norm_layer == "group_norm":
            self.conv_norm_out = nn.GroupNorm(
                num_channels=output_channel, num_groups=norm_num_groups, eps=1e-6
            )
        elif norm_layer == "pixel_norm":
            self.conv_norm_out = PixelNorm()
        elif norm_layer == "layer_norm":
            self.conv_norm_out = LayerNorm(output_channel, eps=1e-6)

        self.conv_act = nn.SiLU()

        conv_out_channels = out_channels
        if latent_log_var == "per_channel":
            conv_out_channels *= 2
        elif latent_log_var == "uniform":
            conv_out_channels += 1
        elif latent_log_var == "constant":
            conv_out_channels += 1
        elif latent_log_var != "none":
            raise ValueError(f"Invalid latent_log_var: {latent_log_var}")
        self.conv_out = make_conv_nd(
            dims,
            output_channel,
            conv_out_channels,
            3,
            padding=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )

        self.gradient_checkpointing = False

    def forward(self, sample: torch.FloatTensor) -> torch.FloatTensor:
        r"""The forward method of the `Encoder` class."""

        sample = patchify(sample, patch_size_hw=self.patch_size, patch_size_t=1)
        sample = self.conv_in(sample)

        checkpoint_fn = (
            partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
            if self.gradient_checkpointing and self.training
            else lambda x: x
        )

        for down_block in self.down_blocks:
            sample = checkpoint_fn(down_block)(sample)

        sample = self.conv_norm_out(sample)
        sample = self.conv_act(sample)
        sample = self.conv_out(sample)

        if self.latent_log_var == "uniform":
            last_channel = sample[:, -1:, ...]
            num_dims = sample.dim()

            if num_dims == 4:
                # For shape (B, C, H, W)
                repeated_last_channel = last_channel.repeat(
                    1, sample.shape[1] - 2, 1, 1
                )
                sample = torch.cat([sample, repeated_last_channel], dim=1)
            elif num_dims == 5:
                # For shape (B, C, F, H, W)
                repeated_last_channel = last_channel.repeat(
                    1, sample.shape[1] - 2, 1, 1, 1
                )
                sample = torch.cat([sample, repeated_last_channel], dim=1)
            else:
                raise ValueError(f"Invalid input shape: {sample.shape}")
        elif self.latent_log_var == "constant":
            sample = sample[:, :-1, ...]
            approx_ln_0 = (
                -30
            )  # this is the minimal clamp value in DiagonalGaussianDistribution objects
            sample = torch.cat(
                [sample, torch.ones_like(sample, device=sample.device) * approx_ln_0],
                dim=1,
            )

        return sample


class Decoder(nn.Module):
    r"""
    The `Decoder` layer of a variational autoencoder that decodes its latent representation into an output sample.

    Args:
        dims (`int` or `Tuple[int, int]`, *optional*, defaults to 3):
            The number of dimensions to use in convolutions.
        in_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        out_channels (`int`, *optional*, defaults to 3):
            The number of output channels.
        blocks (`List[Tuple[str, int]]`, *optional*, defaults to `[("res_x", 1)]`):
            The blocks to use. Each block is a tuple of the block name and the number of layers.
        base_channels (`int`, *optional*, defaults to 128):
            The number of output channels for the first convolutional layer.
        norm_num_groups (`int`, *optional*, defaults to 32):
            The number of groups for normalization.
        patch_size (`int`, *optional*, defaults to 1):
            The patch size to use. Should be a power of 2.
        norm_layer (`str`, *optional*, defaults to `group_norm`):
            The normalization layer to use. Can be either `group_norm` or `pixel_norm`.
        causal (`bool`, *optional*, defaults to `True`):
            Whether to use causal convolutions or not.
    """

    def __init__(
        self,
        dims,
        in_channels: int = 3,
        out_channels: int = 3,
        blocks: List[Tuple[str, int | dict]] = [("res_x", 1)],
        base_channels: int = 128,
        layers_per_block: int = 2,
        norm_num_groups: int = 32,
        patch_size: int = 1,
        norm_layer: str = "group_norm",
        causal: bool = True,
        timestep_conditioning: bool = False,
        spatial_padding_mode: str = "zeros",
    ):
        super().__init__()
        self.patch_size = patch_size
        self.layers_per_block = layers_per_block
        out_channels = out_channels * patch_size**2
        self.causal = causal
        self.blocks_desc = blocks

        # Compute output channel to be product of all channel-multiplier blocks
        output_channel = base_channels
        for block_name, block_params in list(reversed(blocks)):
            block_params = block_params if isinstance(block_params, dict) else {}
            if block_name == "res_x_y":
                output_channel = output_channel * block_params.get("multiplier", 2)
            if block_name.startswith("compress"):
                output_channel = output_channel * block_params.get("multiplier", 1)

        self.conv_in = make_conv_nd(
            dims,
            in_channels,
            output_channel,
            kernel_size=3,
            stride=1,
            padding=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )

        self.up_blocks = nn.ModuleList([])

        for block_name, block_params in list(reversed(blocks)):
            input_channel = output_channel
            if isinstance(block_params, int):
                block_params = {"num_layers": block_params}

            if block_name == "res_x":
                block = UNetMidBlock3D(
                    dims=dims,
                    in_channels=input_channel,
                    num_layers=block_params["num_layers"],
                    resnet_eps=1e-6,
                    resnet_groups=norm_num_groups,
                    norm_layer=norm_layer,
                    inject_noise=block_params.get("inject_noise", False),
                    timestep_conditioning=timestep_conditioning,
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "attn_res_x":
                block = UNetMidBlock3D(
                    dims=dims,
                    in_channels=input_channel,
                    num_layers=block_params["num_layers"],
                    resnet_groups=norm_num_groups,
                    norm_layer=norm_layer,
                    inject_noise=block_params.get("inject_noise", False),
                    timestep_conditioning=timestep_conditioning,
                    attention_head_dim=block_params["attention_head_dim"],
                    spatial_padding_mode=spatial_padding_mode,
                )
            elif block_name == "res_x_y":
                output_channel = output_channel // block_params.get("multiplier", 2)
                block = ResnetBlock3D(
                    dims=dims,
                    in_channels=input_channel,
                    out_channels=output_channel,
                    eps=1e-6,
                    groups=norm_num_groups,
                    norm_layer=norm_layer,
                    inject_noise=block_params.get("inject_noise", False),
                    timestep_conditioning=False,
                    spatial_padding_mode=spatial_padding_mode,
| 667 |
+
spatial_padding_mode=spatial_padding_mode,
|
| 668 |
+
)
|
| 669 |
+
elif block_name == "compress_time":
|
| 670 |
+
block = DepthToSpaceUpsample(
|
| 671 |
+
dims=dims,
|
| 672 |
+
in_channels=input_channel,
|
| 673 |
+
stride=(2, 1, 1),
|
| 674 |
+
spatial_padding_mode=spatial_padding_mode,
|
| 675 |
+
)
|
| 676 |
+
elif block_name == "compress_space":
|
| 677 |
+
block = DepthToSpaceUpsample(
|
| 678 |
+
dims=dims,
|
| 679 |
+
in_channels=input_channel,
|
| 680 |
+
stride=(1, 2, 2),
|
| 681 |
+
spatial_padding_mode=spatial_padding_mode,
|
| 682 |
+
)
|
| 683 |
+
elif block_name == "compress_all":
|
| 684 |
+
output_channel = output_channel // block_params.get("multiplier", 1)
|
| 685 |
+
block = DepthToSpaceUpsample(
|
| 686 |
+
dims=dims,
|
| 687 |
+
in_channels=input_channel,
|
| 688 |
+
stride=(2, 2, 2),
|
| 689 |
+
residual=block_params.get("residual", False),
|
| 690 |
+
out_channels_reduction_factor=block_params.get("multiplier", 1),
|
| 691 |
+
spatial_padding_mode=spatial_padding_mode,
|
| 692 |
+
)
|
| 693 |
+
else:
|
| 694 |
+
raise ValueError(f"unknown layer: {block_name}")
|
| 695 |
+
|
| 696 |
+
self.up_blocks.append(block)
|
| 697 |
+
|
| 698 |
+
if norm_layer == "group_norm":
|
| 699 |
+
self.conv_norm_out = nn.GroupNorm(
|
| 700 |
+
num_channels=output_channel, num_groups=norm_num_groups, eps=1e-6
|
| 701 |
+
)
|
| 702 |
+
elif norm_layer == "pixel_norm":
|
| 703 |
+
self.conv_norm_out = PixelNorm()
|
| 704 |
+
elif norm_layer == "layer_norm":
|
| 705 |
+
self.conv_norm_out = LayerNorm(output_channel, eps=1e-6)
|
| 706 |
+
|
| 707 |
+
self.conv_act = nn.SiLU()
|
| 708 |
+
self.conv_out = make_conv_nd(
|
| 709 |
+
dims,
|
| 710 |
+
output_channel,
|
| 711 |
+
out_channels,
|
| 712 |
+
3,
|
| 713 |
+
padding=1,
|
| 714 |
+
causal=True,
|
| 715 |
+
spatial_padding_mode=spatial_padding_mode,
|
| 716 |
+
)
|
| 717 |
+
|
| 718 |
+
self.gradient_checkpointing = False
|
| 719 |
+
|
| 720 |
+
self.timestep_conditioning = timestep_conditioning
|
| 721 |
+
|
| 722 |
+
if timestep_conditioning:
|
| 723 |
+
self.timestep_scale_multiplier = nn.Parameter(
|
| 724 |
+
torch.tensor(1000.0, dtype=torch.float32)
|
| 725 |
+
)
|
| 726 |
+
self.last_time_embedder = PixArtAlphaCombinedTimestepSizeEmbeddings(
|
| 727 |
+
output_channel * 2, 0
|
| 728 |
+
)
|
| 729 |
+
self.last_scale_shift_table = nn.Parameter(
|
| 730 |
+
torch.randn(2, output_channel) / output_channel**0.5
|
| 731 |
+
)
|
| 732 |
+
|
| 733 |
+
def forward(
|
| 734 |
+
self,
|
| 735 |
+
sample: torch.FloatTensor,
|
| 736 |
+
target_shape,
|
| 737 |
+
timestep: Optional[torch.Tensor] = None,
|
| 738 |
+
) -> torch.FloatTensor:
|
| 739 |
+
r"""The forward method of the `Decoder` class."""
|
| 740 |
+
assert target_shape is not None, "target_shape must be provided"
|
| 741 |
+
batch_size = sample.shape[0]
|
| 742 |
+
|
| 743 |
+
sample = self.conv_in(sample, causal=self.causal)
|
| 744 |
+
|
| 745 |
+
upscale_dtype = next(iter(self.up_blocks.parameters())).dtype
|
| 746 |
+
|
| 747 |
+
checkpoint_fn = (
|
| 748 |
+
partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
|
| 749 |
+
if self.gradient_checkpointing and self.training
|
| 750 |
+
else lambda x: x
|
| 751 |
+
)
|
| 752 |
+
|
| 753 |
+
sample = sample.to(upscale_dtype)
|
| 754 |
+
|
| 755 |
+
if self.timestep_conditioning:
|
| 756 |
+
assert (
|
| 757 |
+
timestep is not None
|
| 758 |
+
), "should pass timestep with timestep_conditioning=True"
|
| 759 |
+
scaled_timestep = timestep * self.timestep_scale_multiplier
|
| 760 |
+
|
| 761 |
+
for up_block in self.up_blocks:
|
| 762 |
+
if self.timestep_conditioning and isinstance(up_block, UNetMidBlock3D):
|
| 763 |
+
sample = checkpoint_fn(up_block)(
|
| 764 |
+
sample, causal=self.causal, timestep=scaled_timestep
|
| 765 |
+
)
|
| 766 |
+
else:
|
| 767 |
+
sample = checkpoint_fn(up_block)(sample, causal=self.causal)
|
| 768 |
+
|
| 769 |
+
sample = self.conv_norm_out(sample)
|
| 770 |
+
|
| 771 |
+
if self.timestep_conditioning:
|
| 772 |
+
embedded_timestep = self.last_time_embedder(
|
| 773 |
+
timestep=scaled_timestep.flatten(),
|
| 774 |
+
resolution=None,
|
| 775 |
+
aspect_ratio=None,
|
| 776 |
+
batch_size=sample.shape[0],
|
| 777 |
+
hidden_dtype=sample.dtype,
|
| 778 |
+
)
|
| 779 |
+
embedded_timestep = embedded_timestep.view(
|
| 780 |
+
batch_size, embedded_timestep.shape[-1], 1, 1, 1
|
| 781 |
+
)
|
| 782 |
+
ada_values = self.last_scale_shift_table[
|
| 783 |
+
None, ..., None, None, None
|
| 784 |
+
] + embedded_timestep.reshape(
|
| 785 |
+
batch_size,
|
| 786 |
+
2,
|
| 787 |
+
-1,
|
| 788 |
+
embedded_timestep.shape[-3],
|
| 789 |
+
embedded_timestep.shape[-2],
|
| 790 |
+
embedded_timestep.shape[-1],
|
| 791 |
+
)
|
| 792 |
+
shift, scale = ada_values.unbind(dim=1)
|
| 793 |
+
sample = sample * (1 + scale) + shift
|
| 794 |
+
|
| 795 |
+
sample = self.conv_act(sample)
|
| 796 |
+
sample = self.conv_out(sample, causal=self.causal)
|
| 797 |
+
|
| 798 |
+
sample = unpatchify(sample, patch_size_hw=self.patch_size, patch_size_t=1)
|
| 799 |
+
|
| 800 |
+
return sample
|
| 801 |
+
|
| 802 |
+
|
| 803 |
+
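# --- Illustrative usage sketch (editor's note; not part of the committed file) ---
# A minimal call of the Decoder above with timestep conditioning enabled; the block
# layout, channel sizes, and timestep value are illustrative only.
def _example_decoder_call():
    import torch

    decoder = Decoder(
        dims=3,
        in_channels=64,
        out_channels=3,
        blocks=[("res_x", {"num_layers": 1})],
        base_channels=64,
        timestep_conditioning=True,
    )
    latent = torch.randn(1, 64, 2, 8, 8)     # (B, C, F, H, W) latent
    timestep = torch.ones(1) * 0.05          # per-sample decode timestep
    # target_shape is only asserted to be non-None in this forward pass.
    return decoder(latent, target_shape=(1, 3, 2, 8, 8), timestep=timestep)
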
class UNetMidBlock3D(nn.Module):
    """
    A 3D UNet mid-block [`UNetMidBlock3D`] with multiple residual blocks.

    Args:
        in_channels (`int`): The number of input channels.
        dropout (`float`, *optional*, defaults to 0.0): The dropout rate.
        num_layers (`int`, *optional*, defaults to 1): The number of residual blocks.
        resnet_eps (`float`, *optional*, 1e-6 ): The epsilon value for the resnet blocks.
        resnet_groups (`int`, *optional*, defaults to 32):
            The number of groups to use in the group normalization layers of the resnet blocks.
        norm_layer (`str`, *optional*, defaults to `group_norm`):
            The normalization layer to use. Can be either `group_norm` or `pixel_norm`.
        inject_noise (`bool`, *optional*, defaults to `False`):
            Whether to inject noise into the hidden states.
        timestep_conditioning (`bool`, *optional*, defaults to `False`):
            Whether to condition the hidden states on the timestep.
        attention_head_dim (`int`, *optional*, defaults to -1):
            The dimension of the attention head. If -1, no attention is used.

    Returns:
        `torch.FloatTensor`: The output of the last residual block, which is a tensor of shape `(batch_size,
        in_channels, height, width)`.
    """

    def __init__(
        self,
        dims: Union[int, Tuple[int, int]],
        in_channels: int,
        dropout: float = 0.0,
        num_layers: int = 1,
        resnet_eps: float = 1e-6,
        resnet_groups: int = 32,
        norm_layer: str = "group_norm",
        inject_noise: bool = False,
        timestep_conditioning: bool = False,
        attention_head_dim: int = -1,
        spatial_padding_mode: str = "zeros",
    ):
        super().__init__()
        resnet_groups = (
            resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
        )
        self.timestep_conditioning = timestep_conditioning

        if timestep_conditioning:
            self.time_embedder = PixArtAlphaCombinedTimestepSizeEmbeddings(
                in_channels * 4, 0
            )

        self.res_blocks = nn.ModuleList(
            [
                ResnetBlock3D(
                    dims=dims,
                    in_channels=in_channels,
                    out_channels=in_channels,
                    eps=resnet_eps,
                    groups=resnet_groups,
                    dropout=dropout,
                    norm_layer=norm_layer,
                    inject_noise=inject_noise,
                    timestep_conditioning=timestep_conditioning,
                    spatial_padding_mode=spatial_padding_mode,
                )
                for _ in range(num_layers)
            ]
        )

        self.attention_blocks = None

        if attention_head_dim > 0:
            if attention_head_dim > in_channels:
                raise ValueError(
                    "attention_head_dim must be less than or equal to in_channels"
                )

            self.attention_blocks = nn.ModuleList(
                [
                    Attention(
                        query_dim=in_channels,
                        heads=in_channels // attention_head_dim,
                        dim_head=attention_head_dim,
                        bias=True,
                        out_bias=True,
                        qk_norm="rms_norm",
                        residual_connection=True,
                    )
                    for _ in range(num_layers)
                ]
            )

    def forward(
        self,
        hidden_states: torch.FloatTensor,
        causal: bool = True,
        timestep: Optional[torch.Tensor] = None,
    ) -> torch.FloatTensor:
        timestep_embed = None
        if self.timestep_conditioning:
            assert (
                timestep is not None
            ), "should pass timestep with timestep_conditioning=True"
            batch_size = hidden_states.shape[0]
            timestep_embed = self.time_embedder(
                timestep=timestep.flatten(),
                resolution=None,
                aspect_ratio=None,
                batch_size=batch_size,
                hidden_dtype=hidden_states.dtype,
            )
            timestep_embed = timestep_embed.view(
                batch_size, timestep_embed.shape[-1], 1, 1, 1
            )

        if self.attention_blocks:
            for resnet, attention in zip(self.res_blocks, self.attention_blocks):
                hidden_states = resnet(
                    hidden_states, causal=causal, timestep=timestep_embed
                )

                # Reshape the hidden states to be (batch_size, frames * height * width, channel)
                batch_size, channel, frames, height, width = hidden_states.shape
                hidden_states = hidden_states.view(
                    batch_size, channel, frames * height * width
                ).transpose(1, 2)

                if attention.use_tpu_flash_attention:
                    # Pad the second dimension to be divisible by block_k_major (block in flash attention)
                    seq_len = hidden_states.shape[1]
                    block_k_major = 512
                    pad_len = (block_k_major - seq_len % block_k_major) % block_k_major
                    if pad_len > 0:
                        hidden_states = F.pad(
                            hidden_states, (0, 0, 0, pad_len), "constant", 0
                        )

                    # Create a mask with ones for the original sequence length and zeros for the padded indexes
                    mask = torch.ones(
                        (hidden_states.shape[0], seq_len),
                        device=hidden_states.device,
                        dtype=hidden_states.dtype,
                    )
                    if pad_len > 0:
                        mask = F.pad(mask, (0, pad_len), "constant", 0)

                hidden_states = attention(
                    hidden_states,
                    attention_mask=(
                        None if not attention.use_tpu_flash_attention else mask
                    ),
                )

                if attention.use_tpu_flash_attention:
                    # Remove the padding
                    if pad_len > 0:
                        hidden_states = hidden_states[:, :-pad_len, :]

                # Reshape the hidden states back to (batch_size, channel, frames, height, width, channel)
                hidden_states = hidden_states.transpose(-1, -2).reshape(
                    batch_size, channel, frames, height, width
                )
        else:
            for resnet in self.res_blocks:
                hidden_states = resnet(
                    hidden_states, causal=causal, timestep=timestep_embed
                )

        return hidden_states

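# --- Illustrative usage sketch (editor's note; not part of the committed file) ---
# When attention is enabled, the number of heads is derived from the channel count as
# heads = in_channels // attention_head_dim. The sizes below are illustrative only.
def _example_mid_block():
    import torch

    block = UNetMidBlock3D(dims=3, in_channels=64, num_layers=1, attention_head_dim=32)
    x = torch.randn(1, 64, 2, 8, 8)          # (B, C, F, H, W)
    return block(x, causal=True)             # same shape out: (1, 64, 2, 8, 8)
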
class SpaceToDepthDownsample(nn.Module):
    def __init__(self, dims, in_channels, out_channels, stride, spatial_padding_mode):
        super().__init__()
        self.stride = stride
        self.group_size = in_channels * np.prod(stride) // out_channels
        self.conv = make_conv_nd(
            dims=dims,
            in_channels=in_channels,
            out_channels=out_channels // np.prod(stride),
            kernel_size=3,
            stride=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )

    def forward(self, x, causal: bool = True):
        if self.stride[0] == 2:
            x = torch.cat(
                [x[:, :, :1, :, :], x], dim=2
            )  # duplicate first frames for padding

        # skip connection
        x_in = rearrange(
            x,
            "b c (d p1) (h p2) (w p3) -> b (c p1 p2 p3) d h w",
            p1=self.stride[0],
            p2=self.stride[1],
            p3=self.stride[2],
        )
        x_in = rearrange(x_in, "b (c g) d h w -> b c g d h w", g=self.group_size)
        x_in = x_in.mean(dim=2)

        # conv
        x = self.conv(x, causal=causal)
        x = rearrange(
            x,
            "b c (d p1) (h p2) (w p3) -> b (c p1 p2 p3) d h w",
            p1=self.stride[0],
            p2=self.stride[1],
            p3=self.stride[2],
        )

        x = x + x_in

        return x

class DepthToSpaceUpsample(nn.Module):
    def __init__(
        self,
        dims,
        in_channels,
        stride,
        residual=False,
        out_channels_reduction_factor=1,
        spatial_padding_mode="zeros",
    ):
        super().__init__()
        self.stride = stride
        self.out_channels = (
            np.prod(stride) * in_channels // out_channels_reduction_factor
        )
        self.conv = make_conv_nd(
            dims=dims,
            in_channels=in_channels,
            out_channels=self.out_channels,
            kernel_size=3,
            stride=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )
        self.pixel_shuffle = PixelShuffleND(dims=dims, upscale_factors=stride)
        self.residual = residual
        self.out_channels_reduction_factor = out_channels_reduction_factor

    def forward(self, x, causal: bool = True):
        if self.residual:
            # Reshape and duplicate the input to match the output shape
            x_in = self.pixel_shuffle(x)
            num_repeat = np.prod(self.stride) // self.out_channels_reduction_factor
            x_in = x_in.repeat(1, num_repeat, 1, 1, 1)
            if self.stride[0] == 2:
                x_in = x_in[:, :, 1:, :, :]
        x = self.conv(x, causal=causal)
        x = self.pixel_shuffle(x)
        if self.stride[0] == 2:
            x = x[:, :, 1:, :, :]
        if self.residual:
            x = x + x_in
        return x

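# --- Illustrative shape check (editor's note; not part of the committed file) ---
# The upsampler expands channels by prod(stride) // out_channels_reduction_factor in the
# conv, then PixelShuffleND trades those channels back for spatial/temporal resolution.
# The sizes below are illustrative only.
def _example_depth_to_space_upsample():
    import torch

    up = DepthToSpaceUpsample(dims=3, in_channels=64, stride=(2, 2, 2), residual=True)
    x = torch.randn(1, 64, 3, 8, 8)          # (B, C, F, H, W)
    y = up(x, causal=True)
    # conv: 64 -> 512 channels, shuffle: back to 64 channels with doubled F/H/W,
    # then the first (duplicated) frame is dropped because stride[0] == 2.
    assert y.shape == (1, 64, 5, 16, 16)
    return y
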
class LayerNorm(nn.Module):
    def __init__(self, dim, eps, elementwise_affine=True) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(dim, eps=eps, elementwise_affine=elementwise_affine)

    def forward(self, x):
        x = rearrange(x, "b c d h w -> b d h w c")
        x = self.norm(x)
        x = rearrange(x, "b d h w c -> b c d h w")
        return x

class ResnetBlock3D(nn.Module):
    r"""
    A Resnet block.

    Parameters:
        in_channels (`int`): The number of channels in the input.
        out_channels (`int`, *optional*, default to be `None`):
            The number of output channels for the first conv layer. If None, same as `in_channels`.
        dropout (`float`, *optional*, defaults to `0.0`): The dropout probability to use.
        groups (`int`, *optional*, default to `32`): The number of groups to use for the first normalization layer.
        eps (`float`, *optional*, defaults to `1e-6`): The epsilon to use for the normalization.
    """

    def __init__(
        self,
        dims: Union[int, Tuple[int, int]],
        in_channels: int,
        out_channels: Optional[int] = None,
        dropout: float = 0.0,
        groups: int = 32,
        eps: float = 1e-6,
        norm_layer: str = "group_norm",
        inject_noise: bool = False,
        timestep_conditioning: bool = False,
        spatial_padding_mode: str = "zeros",
    ):
        super().__init__()
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
        self.out_channels = out_channels
        self.inject_noise = inject_noise

        if norm_layer == "group_norm":
            self.norm1 = nn.GroupNorm(
                num_groups=groups, num_channels=in_channels, eps=eps, affine=True
            )
        elif norm_layer == "pixel_norm":
            self.norm1 = PixelNorm()
        elif norm_layer == "layer_norm":
            self.norm1 = LayerNorm(in_channels, eps=eps, elementwise_affine=True)

        self.non_linearity = nn.SiLU()

        self.conv1 = make_conv_nd(
            dims,
            in_channels,
            out_channels,
            kernel_size=3,
            stride=1,
            padding=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )

        if inject_noise:
            self.per_channel_scale1 = nn.Parameter(torch.zeros((in_channels, 1, 1)))

        if norm_layer == "group_norm":
            self.norm2 = nn.GroupNorm(
                num_groups=groups, num_channels=out_channels, eps=eps, affine=True
            )
        elif norm_layer == "pixel_norm":
            self.norm2 = PixelNorm()
        elif norm_layer == "layer_norm":
            self.norm2 = LayerNorm(out_channels, eps=eps, elementwise_affine=True)

        self.dropout = torch.nn.Dropout(dropout)

        self.conv2 = make_conv_nd(
            dims,
            out_channels,
            out_channels,
            kernel_size=3,
            stride=1,
            padding=1,
            causal=True,
            spatial_padding_mode=spatial_padding_mode,
        )

        if inject_noise:
            self.per_channel_scale2 = nn.Parameter(torch.zeros((in_channels, 1, 1)))

        self.conv_shortcut = (
            make_linear_nd(
                dims=dims, in_channels=in_channels, out_channels=out_channels
            )
            if in_channels != out_channels
            else nn.Identity()
        )

        self.norm3 = (
            LayerNorm(in_channels, eps=eps, elementwise_affine=True)
            if in_channels != out_channels
            else nn.Identity()
        )

        self.timestep_conditioning = timestep_conditioning

        if timestep_conditioning:
            self.scale_shift_table = nn.Parameter(
                torch.randn(4, in_channels) / in_channels**0.5
            )

    def _feed_spatial_noise(
        self, hidden_states: torch.FloatTensor, per_channel_scale: torch.FloatTensor
    ) -> torch.FloatTensor:
        spatial_shape = hidden_states.shape[-2:]
        device = hidden_states.device
        dtype = hidden_states.dtype

        # similar to the "explicit noise inputs" method in style-gan
        spatial_noise = torch.randn(spatial_shape, device=device, dtype=dtype)[None]
        scaled_noise = (spatial_noise * per_channel_scale)[None, :, None, ...]
        hidden_states = hidden_states + scaled_noise

        return hidden_states

    def forward(
        self,
        input_tensor: torch.FloatTensor,
        causal: bool = True,
        timestep: Optional[torch.Tensor] = None,
    ) -> torch.FloatTensor:
        hidden_states = input_tensor
        batch_size = hidden_states.shape[0]

        hidden_states = self.norm1(hidden_states)
        if self.timestep_conditioning:
            assert (
                timestep is not None
            ), "should pass timestep with timestep_conditioning=True"
            ada_values = self.scale_shift_table[
                None, ..., None, None, None
            ] + timestep.reshape(
                batch_size,
                4,
                -1,
                timestep.shape[-3],
                timestep.shape[-2],
                timestep.shape[-1],
            )
            shift1, scale1, shift2, scale2 = ada_values.unbind(dim=1)

            hidden_states = hidden_states * (1 + scale1) + shift1

        hidden_states = self.non_linearity(hidden_states)

        hidden_states = self.conv1(hidden_states, causal=causal)

        if self.inject_noise:
            hidden_states = self._feed_spatial_noise(
                hidden_states, self.per_channel_scale1
            )

        hidden_states = self.norm2(hidden_states)

        if self.timestep_conditioning:
            hidden_states = hidden_states * (1 + scale2) + shift2

        hidden_states = self.non_linearity(hidden_states)

        hidden_states = self.dropout(hidden_states)

        hidden_states = self.conv2(hidden_states, causal=causal)

        if self.inject_noise:
            hidden_states = self._feed_spatial_noise(
                hidden_states, self.per_channel_scale2
            )

        input_tensor = self.norm3(input_tensor)

        batch_size = input_tensor.shape[0]

        input_tensor = self.conv_shortcut(input_tensor)

        output_tensor = input_tensor + hidden_states

        return output_tensor

def patchify(x, patch_size_hw, patch_size_t=1):
    if patch_size_hw == 1 and patch_size_t == 1:
        return x
    if x.dim() == 4:
        x = rearrange(
            x, "b c (h q) (w r) -> b (c r q) h w", q=patch_size_hw, r=patch_size_hw
        )
    elif x.dim() == 5:
        x = rearrange(
            x,
            "b c (f p) (h q) (w r) -> b (c p r q) f h w",
            p=patch_size_t,
            q=patch_size_hw,
            r=patch_size_hw,
        )
    else:
        raise ValueError(f"Invalid input shape: {x.shape}")

    return x


def unpatchify(x, patch_size_hw, patch_size_t=1):
    if patch_size_hw == 1 and patch_size_t == 1:
        return x

    if x.dim() == 4:
        x = rearrange(
            x, "b (c r q) h w -> b c (h q) (w r)", q=patch_size_hw, r=patch_size_hw
        )
    elif x.dim() == 5:
        x = rearrange(
            x,
            "b (c p r q) f h w -> b c (f p) (h q) (w r)",
            p=patch_size_t,
            q=patch_size_hw,
            r=patch_size_hw,
        )

    return x

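# --- Illustrative shape check (editor's note; not part of the committed file) ---
# patchify folds spatial (and optionally temporal) patches into channels, which is how
# the Encoder consumes pixels before its first convolution. The sizes below are
# illustrative only.
def _example_patchify_shapes():
    import torch

    video = torch.randn(1, 3, 8, 64, 64)                # (B, C, F, H, W)
    patched = patchify(video, patch_size_hw=4, patch_size_t=1)
    assert patched.shape == (1, 3 * 4 * 4, 8, 16, 16)   # channels x16, H and W / 4
    restored = unpatchify(patched, patch_size_hw=4, patch_size_t=1)
    assert restored.shape == video.shape
    return restored
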
def create_video_autoencoder_demo_config(
    latent_channels: int = 64,
):
    encoder_blocks = [
        ("res_x", {"num_layers": 2}),
        ("compress_space_res", {"multiplier": 2}),
        ("compress_time_res", {"multiplier": 2}),
        ("compress_all_res", {"multiplier": 2}),
        ("compress_all_res", {"multiplier": 2}),
        ("res_x", {"num_layers": 1}),
    ]
    decoder_blocks = [
        ("res_x", {"num_layers": 2, "inject_noise": False}),
        ("compress_all", {"residual": True, "multiplier": 2}),
        ("compress_all", {"residual": True, "multiplier": 2}),
        ("compress_all", {"residual": True, "multiplier": 2}),
        ("res_x", {"num_layers": 2, "inject_noise": False}),
    ]
    return {
        "_class_name": "CausalVideoAutoencoder",
        "dims": 3,
        "encoder_blocks": encoder_blocks,
        "decoder_blocks": decoder_blocks,
        "latent_channels": latent_channels,
        "norm_layer": "pixel_norm",
        "patch_size": 4,
        "latent_log_var": "uniform",
        "use_quant_conv": False,
        "causal_decoder": False,
        "timestep_conditioning": True,
        "spatial_padding_mode": "replicate",
    }


def test_vae_patchify_unpatchify():
    import torch

    x = torch.randn(2, 3, 8, 64, 64)
    x_patched = patchify(x, patch_size_hw=4, patch_size_t=4)
    x_unpatched = unpatchify(x_patched, patch_size_hw=4, patch_size_t=4)
    assert torch.allclose(x, x_unpatched)


def demo_video_autoencoder_forward_backward():
    # Configuration for the VideoAutoencoder
    config = create_video_autoencoder_demo_config()

    # Instantiate the VideoAutoencoder with the specified configuration
    video_autoencoder = CausalVideoAutoencoder.from_config(config)

    print(video_autoencoder)
    video_autoencoder.eval()
    # Print the total number of parameters in the video autoencoder
    total_params = sum(p.numel() for p in video_autoencoder.parameters())
    print(f"Total number of parameters in VideoAutoencoder: {total_params:,}")

    # Create a mock input tensor simulating a batch of videos
    # Shape: (batch_size, channels, depth, height, width)
    # E.g., 2 videos, each with 3 color channels, 17 frames, and 64x64 pixels per frame
    input_videos = torch.randn(2, 3, 17, 64, 64)

    # Forward pass: encode and decode the input videos
    latent = video_autoencoder.encode(input_videos).latent_dist.mode()
    print(f"input shape={input_videos.shape}")
    print(f"latent shape={latent.shape}")

    timestep = torch.ones(input_videos.shape[0]) * 0.1
    reconstructed_videos = video_autoencoder.decode(
        latent, target_shape=input_videos.shape, timestep=timestep
    ).sample

    print(f"reconstructed shape={reconstructed_videos.shape}")

    # Validate that a single image gets treated the same way as the first frame
    input_image = input_videos[:, :, :1, :, :]
    image_latent = video_autoencoder.encode(input_image).latent_dist.mode()
    _ = video_autoencoder.decode(
        image_latent, target_shape=image_latent.shape, timestep=timestep
    ).sample

    first_frame_latent = latent[:, :, :1, :, :]

    assert torch.allclose(image_latent, first_frame_latent, atol=1e-6)
    # assert torch.allclose(reconstructed_image, reconstructed_videos[:, :, :1, :, :], atol=1e-6)
    # assert torch.allclose(image_latent, first_frame_latent, atol=1e-6)
    # assert (reconstructed_image == reconstructed_videos[:, :, :1, :, :]).all()

    # Calculate the loss (e.g., mean squared error)
    loss = torch.nn.functional.mse_loss(input_videos, reconstructed_videos)

    # Perform backward pass
    loss.backward()

    print(f"Demo completed with loss: {loss.item()}")


# Call the demo function to execute the forward and backward pass
if __name__ == "__main__":
    demo_video_autoencoder_forward_backward()

ltx_video/models/autoencoders/conv_nd_factory.py
ADDED
@@ -0,0 +1,90 @@
from typing import Tuple, Union

import torch

from ltx_video.models.autoencoders.dual_conv3d import DualConv3d
from ltx_video.models.autoencoders.causal_conv3d import CausalConv3d


def make_conv_nd(
    dims: Union[int, Tuple[int, int]],
    in_channels: int,
    out_channels: int,
    kernel_size: int,
    stride=1,
    padding=0,
    dilation=1,
    groups=1,
    bias=True,
    causal=False,
    spatial_padding_mode="zeros",
    temporal_padding_mode="zeros",
):
    if not (spatial_padding_mode == temporal_padding_mode or causal):
        raise NotImplementedError("spatial and temporal padding modes must be equal")
    if dims == 2:
        return torch.nn.Conv2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias,
            padding_mode=spatial_padding_mode,
        )
    elif dims == 3:
        if causal:
            return CausalConv3d(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=kernel_size,
                stride=stride,
                padding=padding,
                dilation=dilation,
                groups=groups,
                bias=bias,
                spatial_padding_mode=spatial_padding_mode,
            )
        return torch.nn.Conv3d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias,
            padding_mode=spatial_padding_mode,
        )
    elif dims == (2, 1):
        return DualConv3d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            bias=bias,
            padding_mode=spatial_padding_mode,
        )
    else:
        raise ValueError(f"unsupported dimensions: {dims}")


def make_linear_nd(
    dims: int,
    in_channels: int,
    out_channels: int,
    bias=True,
):
    if dims == 2:
        return torch.nn.Conv2d(
            in_channels=in_channels, out_channels=out_channels, kernel_size=1, bias=bias
        )
    elif dims == 3 or dims == (2, 1):
        return torch.nn.Conv3d(
            in_channels=in_channels, out_channels=out_channels, kernel_size=1, bias=bias
        )
    else:
        raise ValueError(f"unsupported dimensions: {dims}")

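# --- Illustrative usage sketch (editor's note; not part of the committed file) ---
# The factory switches between plain 2D/3D convolutions, the causal 3D variant, and the
# factorized DualConv3d depending on `dims` and `causal`. Values below are illustrative.
def _example_make_conv_nd():
    causal_conv = make_conv_nd(
        dims=3, in_channels=16, out_channels=32, kernel_size=3, padding=1, causal=True
    )  # -> CausalConv3d
    dual_conv = make_conv_nd(
        dims=(2, 1), in_channels=16, out_channels=32, kernel_size=3, padding=1
    )  # -> DualConv3d (2D spatial conv followed by 1D temporal conv)
    return causal_conv, dual_conv
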
ltx_video/models/autoencoders/dual_conv3d.py
ADDED
@@ -0,0 +1,217 @@
import math
from typing import Tuple, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


class DualConv3d(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride: Union[int, Tuple[int, int, int]] = 1,
        padding: Union[int, Tuple[int, int, int]] = 0,
        dilation: Union[int, Tuple[int, int, int]] = 1,
        groups=1,
        bias=True,
        padding_mode="zeros",
    ):
        super(DualConv3d, self).__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.padding_mode = padding_mode
        # Ensure kernel_size, stride, padding, and dilation are tuples of length 3
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size, kernel_size)
        if kernel_size == (1, 1, 1):
            raise ValueError(
                "kernel_size must be greater than 1. Use make_linear_nd instead."
            )
        if isinstance(stride, int):
            stride = (stride, stride, stride)
        if isinstance(padding, int):
            padding = (padding, padding, padding)
        if isinstance(dilation, int):
            dilation = (dilation, dilation, dilation)

        # Set parameters for convolutions
        self.groups = groups
        self.bias = bias

        # Define the size of the channels after the first convolution
        intermediate_channels = (
            out_channels if in_channels < out_channels else in_channels
        )

        # Define parameters for the first convolution
        self.weight1 = nn.Parameter(
            torch.Tensor(
                intermediate_channels,
                in_channels // groups,
                1,
                kernel_size[1],
                kernel_size[2],
            )
        )
        self.stride1 = (1, stride[1], stride[2])
        self.padding1 = (0, padding[1], padding[2])
        self.dilation1 = (1, dilation[1], dilation[2])
        if bias:
            self.bias1 = nn.Parameter(torch.Tensor(intermediate_channels))
        else:
            self.register_parameter("bias1", None)

        # Define parameters for the second convolution
        self.weight2 = nn.Parameter(
            torch.Tensor(
                out_channels, intermediate_channels // groups, kernel_size[0], 1, 1
            )
        )
        self.stride2 = (stride[0], 1, 1)
        self.padding2 = (padding[0], 0, 0)
        self.dilation2 = (dilation[0], 1, 1)
        if bias:
            self.bias2 = nn.Parameter(torch.Tensor(out_channels))
        else:
            self.register_parameter("bias2", None)

        # Initialize weights and biases
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.weight1, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.weight2, a=math.sqrt(5))
        if self.bias:
            fan_in1, _ = nn.init._calculate_fan_in_and_fan_out(self.weight1)
            bound1 = 1 / math.sqrt(fan_in1)
            nn.init.uniform_(self.bias1, -bound1, bound1)
            fan_in2, _ = nn.init._calculate_fan_in_and_fan_out(self.weight2)
            bound2 = 1 / math.sqrt(fan_in2)
            nn.init.uniform_(self.bias2, -bound2, bound2)

    def forward(self, x, use_conv3d=False, skip_time_conv=False):
        if use_conv3d:
            return self.forward_with_3d(x=x, skip_time_conv=skip_time_conv)
        else:
            return self.forward_with_2d(x=x, skip_time_conv=skip_time_conv)

    def forward_with_3d(self, x, skip_time_conv):
        # First convolution
        x = F.conv3d(
            x,
            self.weight1,
            self.bias1,
            self.stride1,
            self.padding1,
            self.dilation1,
            self.groups,
            padding_mode=self.padding_mode,
        )

        if skip_time_conv:
            return x

        # Second convolution
        x = F.conv3d(
            x,
            self.weight2,
            self.bias2,
            self.stride2,
            self.padding2,
            self.dilation2,
            self.groups,
            padding_mode=self.padding_mode,
        )

        return x

    def forward_with_2d(self, x, skip_time_conv):
        b, c, d, h, w = x.shape

        # First 2D convolution
        x = rearrange(x, "b c d h w -> (b d) c h w")
        # Squeeze the depth dimension out of weight1 since it's 1
        weight1 = self.weight1.squeeze(2)
        # Select stride, padding, and dilation for the 2D convolution
        stride1 = (self.stride1[1], self.stride1[2])
        padding1 = (self.padding1[1], self.padding1[2])
        dilation1 = (self.dilation1[1], self.dilation1[2])
        x = F.conv2d(
            x,
            weight1,
            self.bias1,
            stride1,
            padding1,
            dilation1,
            self.groups,
            padding_mode=self.padding_mode,
        )

        _, _, h, w = x.shape

        if skip_time_conv:
            x = rearrange(x, "(b d) c h w -> b c d h w", b=b)
            return x

        # Second convolution which is essentially treated as a 1D convolution across the 'd' dimension
        x = rearrange(x, "(b d) c h w -> (b h w) c d", b=b)

        # Reshape weight2 to match the expected dimensions for conv1d
        weight2 = self.weight2.squeeze(-1).squeeze(-1)
        # Use only the relevant dimension for stride, padding, and dilation for the 1D convolution
        stride2 = self.stride2[0]
        padding2 = self.padding2[0]
        dilation2 = self.dilation2[0]
        x = F.conv1d(
            x,
            weight2,
            self.bias2,
            stride2,
            padding2,
            dilation2,
            self.groups,
            padding_mode=self.padding_mode,
        )
        x = rearrange(x, "(b h w) c d -> b c d h w", b=b, h=h, w=w)

        return x

    @property
    def weight(self):
        return self.weight2


def test_dual_conv3d_consistency():
    # Initialize parameters
    in_channels = 3
    out_channels = 5
    kernel_size = (3, 3, 3)
    stride = (2, 2, 2)
    padding = (1, 1, 1)

    # Create an instance of the DualConv3d class
    dual_conv3d = DualConv3d(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        bias=True,
    )

    # Example input tensor
    test_input = torch.randn(1, 3, 10, 10, 10)

    # Perform forward passes with both 3D and 2D settings
    output_conv3d = dual_conv3d(test_input, use_conv3d=True)
    output_2d = dual_conv3d(test_input, use_conv3d=False)

    # Assert that the outputs from both methods are sufficiently close
    assert torch.allclose(
        output_conv3d, output_2d, atol=1e-6
    ), "Outputs are not consistent between 3D and 2D convolutions."

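# --- Editor's note (not part of the committed file) ---
# DualConv3d is a (2+1)D factorization of a full 3D convolution: weight1 applies a
# 1 x k x k spatial convolution and weight2 a k x 1 x 1 temporal convolution, and
# test_dual_conv3d_consistency above checks that forward_with_2d (batched 2D + 1D calls)
# matches forward_with_3d on the same weights.
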
ltx_video/models/autoencoders/latent_upsampler.py
ADDED
@@ -0,0 +1,203 @@
from typing import Optional, Union
from pathlib import Path
import os
import json

import torch
import torch.nn as nn
from einops import rearrange
from diffusers import ConfigMixin, ModelMixin
from safetensors.torch import safe_open

from ltx_video.models.autoencoders.pixel_shuffle import PixelShuffleND


class ResBlock(nn.Module):
    def __init__(
        self, channels: int, mid_channels: Optional[int] = None, dims: int = 3
    ):
        super().__init__()
        if mid_channels is None:
            mid_channels = channels

        Conv = nn.Conv2d if dims == 2 else nn.Conv3d

        self.conv1 = Conv(channels, mid_channels, kernel_size=3, padding=1)
        self.norm1 = nn.GroupNorm(32, mid_channels)
        self.conv2 = Conv(mid_channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(32, channels)
        self.activation = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.conv1(x)
        x = self.norm1(x)
        x = self.activation(x)
        x = self.conv2(x)
        x = self.norm2(x)
        x = self.activation(x + residual)
        return x


class LatentUpsampler(ModelMixin, ConfigMixin):
    """
    Model to spatially upsample VAE latents.

    Args:
        in_channels (`int`): Number of channels in the input latent
        mid_channels (`int`): Number of channels in the middle layers
        num_blocks_per_stage (`int`): Number of ResBlocks to use in each stage (pre/post upsampling)
        dims (`int`): Number of dimensions for convolutions (2 or 3)
        spatial_upsample (`bool`): Whether to spatially upsample the latent
        temporal_upsample (`bool`): Whether to temporally upsample the latent
    """

    def __init__(
        self,
        in_channels: int = 128,
        mid_channels: int = 512,
        num_blocks_per_stage: int = 4,
        dims: int = 3,
        spatial_upsample: bool = True,
        temporal_upsample: bool = False,
    ):
        super().__init__()

        self.in_channels = in_channels
        self.mid_channels = mid_channels
        self.num_blocks_per_stage = num_blocks_per_stage
        self.dims = dims
        self.spatial_upsample = spatial_upsample
        self.temporal_upsample = temporal_upsample

        Conv = nn.Conv2d if dims == 2 else nn.Conv3d

        self.initial_conv = Conv(in_channels, mid_channels, kernel_size=3, padding=1)
        self.initial_norm = nn.GroupNorm(32, mid_channels)
        self.initial_activation = nn.SiLU()

        self.res_blocks = nn.ModuleList(
            [ResBlock(mid_channels, dims=dims) for _ in range(num_blocks_per_stage)]
        )

        if spatial_upsample and temporal_upsample:
            self.upsampler = nn.Sequential(
                nn.Conv3d(mid_channels, 8 * mid_channels, kernel_size=3, padding=1),
                PixelShuffleND(3),
            )
        elif spatial_upsample:
            self.upsampler = nn.Sequential(
                nn.Conv2d(mid_channels, 4 * mid_channels, kernel_size=3, padding=1),
                PixelShuffleND(2),
            )
        elif temporal_upsample:
            self.upsampler = nn.Sequential(
                nn.Conv3d(mid_channels, 2 * mid_channels, kernel_size=3, padding=1),
                PixelShuffleND(1),
            )
        else:
            raise ValueError(
                "Either spatial_upsample or temporal_upsample must be True"
            )

        self.post_upsample_res_blocks = nn.ModuleList(
            [ResBlock(mid_channels, dims=dims) for _ in range(num_blocks_per_stage)]
        )

        self.final_conv = Conv(mid_channels, in_channels, kernel_size=3, padding=1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        b, c, f, h, w = latent.shape

        if self.dims == 2:
            x = rearrange(latent, "b c f h w -> (b f) c h w")
            x = self.initial_conv(x)
            x = self.initial_norm(x)
            x = self.initial_activation(x)

            for block in self.res_blocks:
                x = block(x)

            x = self.upsampler(x)

            for block in self.post_upsample_res_blocks:
                x = block(x)

            x = self.final_conv(x)
            x = rearrange(x, "(b f) c h w -> b c f h w", b=b, f=f)
        else:
            x = self.initial_conv(latent)
            x = self.initial_norm(x)
            x = self.initial_activation(x)

            for block in self.res_blocks:
                x = block(x)

            if self.temporal_upsample:
                x = self.upsampler(x)
                x = x[:, :, 1:, :, :]
            else:
                x = rearrange(x, "b c f h w -> (b f) c h w")
                x = self.upsampler(x)
                x = rearrange(x, "(b f) c h w -> b c f h w", b=b, f=f)

            for block in self.post_upsample_res_blocks:
                x = block(x)

            x = self.final_conv(x)

        return x

    @classmethod
    def from_config(cls, config):
        return cls(
            in_channels=config.get("in_channels", 4),
            mid_channels=config.get("mid_channels", 128),
            num_blocks_per_stage=config.get("num_blocks_per_stage", 4),
            dims=config.get("dims", 2),
            spatial_upsample=config.get("spatial_upsample", True),
            temporal_upsample=config.get("temporal_upsample", False),
        )

    def config(self):
        return {
            "_class_name": "LatentUpsampler",
            "in_channels": self.in_channels,
            "mid_channels": self.mid_channels,
            "num_blocks_per_stage": self.num_blocks_per_stage,
            "dims": self.dims,
            "spatial_upsample": self.spatial_upsample,
            "temporal_upsample": self.temporal_upsample,
        }

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_path: Optional[Union[str, os.PathLike]],
        *args,
        **kwargs,
    ):
        pretrained_model_path = Path(pretrained_model_path)
        if pretrained_model_path.is_file() and str(pretrained_model_path).endswith(
            ".safetensors"
        ):
            state_dict = {}
            with safe_open(pretrained_model_path, framework="pt", device="cpu") as f:
                metadata = f.metadata()
                for k in f.keys():
                    state_dict[k] = f.get_tensor(k)
            config = json.loads(metadata["config"])
            with torch.device("meta"):
                latent_upsampler = LatentUpsampler.from_config(config)
            latent_upsampler.load_state_dict(state_dict, assign=True)
            return latent_upsampler


if __name__ == "__main__":
    latent_upsampler = LatentUpsampler(num_blocks_per_stage=4, dims=3)
    print(latent_upsampler)
    total_params = sum(p.numel() for p in latent_upsampler.parameters())
    print(f"Total number of parameters: {total_params:,}")
    latent = torch.randn(1, 128, 9, 16, 16)
    upsampled_latent = latent_upsampler(latent)
    print(f"Upsampled latent shape: {upsampled_latent.shape}")

ltx_video/models/autoencoders/pixel_norm.py
ADDED
@@ -0,0 +1,12 @@
import torch
from torch import nn


class PixelNorm(nn.Module):
    def __init__(self, dim=1, eps=1e-8):
        super(PixelNorm, self).__init__()
        self.dim = dim
        self.eps = eps

    def forward(self, x):
        return x / torch.sqrt(torch.mean(x**2, dim=self.dim, keepdim=True) + self.eps)

ltx_video/models/autoencoders/pixel_shuffle.py
ADDED
@@ -0,0 +1,33 @@
import torch.nn as nn
from einops import rearrange


class PixelShuffleND(nn.Module):
    def __init__(self, dims, upscale_factors=(2, 2, 2)):
        super().__init__()
        assert dims in [1, 2, 3], "dims must be 1, 2, or 3"
        self.dims = dims
        self.upscale_factors = upscale_factors

    def forward(self, x):
        if self.dims == 3:
            return rearrange(
                x,
                "b (c p1 p2 p3) d h w -> b c (d p1) (h p2) (w p3)",
                p1=self.upscale_factors[0],
                p2=self.upscale_factors[1],
                p3=self.upscale_factors[2],
            )
        elif self.dims == 2:
            return rearrange(
                x,
                "b (c p1 p2) h w -> b c (h p1) (w p2)",
                p1=self.upscale_factors[0],
                p2=self.upscale_factors[1],
            )
        elif self.dims == 1:
            return rearrange(
                x,
                "b (c p1) f h w -> b c (f p1) h w",
                p1=self.upscale_factors[0],
            )

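# --- Illustrative shape check (editor's note; not part of the committed file) ---
# PixelShuffleND trades channel groups for resolution along 1, 2, or 3 axes; the sizes
# below are illustrative only.
def _example_pixel_shuffle_nd():
    import torch

    shuffle_3d = PixelShuffleND(dims=3, upscale_factors=(2, 2, 2))
    x = torch.randn(1, 64, 4, 8, 8)          # (B, C, D, H, W)
    y = shuffle_3d(x)
    assert y.shape == (1, 8, 8, 16, 16)      # channels / 8, each spatial/temporal axis x2
    return y
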
ltx_video/models/autoencoders/vae.py
ADDED
@@ -0,0 +1,380 @@
| 1 |
+
from typing import Optional, Union
|
| 2 |
+
|
| 3 |
+
import torch
|
| 4 |
+
import inspect
|
| 5 |
+
import math
|
| 6 |
+
import torch.nn as nn
|
| 7 |
+
from diffusers import ConfigMixin, ModelMixin
|
| 8 |
+
from diffusers.models.autoencoders.vae import (
|
| 9 |
+
DecoderOutput,
|
| 10 |
+
DiagonalGaussianDistribution,
|
| 11 |
+
)
|
| 12 |
+
from diffusers.models.modeling_outputs import AutoencoderKLOutput
|
| 13 |
+
from ltx_video.models.autoencoders.conv_nd_factory import make_conv_nd
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
class AutoencoderKLWrapper(ModelMixin, ConfigMixin):
|
| 17 |
+
"""Variational Autoencoder (VAE) model with KL loss.
|
| 18 |
+
|
| 19 |
+
VAE from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.
|
| 20 |
+
This model is a wrapper around an encoder and a decoder, and it adds a KL loss term to the reconstruction loss.
|
| 21 |
+
|
| 22 |
+
Args:
|
| 23 |
+
encoder (`nn.Module`):
|
| 24 |
+
Encoder module.
|
| 25 |
+
decoder (`nn.Module`):
|
| 26 |
+
Decoder module.
|
| 27 |
+
latent_channels (`int`, *optional*, defaults to 4):
|
| 28 |
+
Number of latent channels.
|
| 29 |
+
"""
|
| 30 |
+
|
| 31 |
+
def __init__(
|
| 32 |
+
self,
|
| 33 |
+
encoder: nn.Module,
|
| 34 |
+
decoder: nn.Module,
|
| 35 |
+
latent_channels: int = 4,
|
| 36 |
+
dims: int = 2,
|
| 37 |
+
sample_size=512,
|
| 38 |
+
use_quant_conv: bool = True,
|
| 39 |
+
normalize_latent_channels: bool = False,
|
| 40 |
+
):
|
| 41 |
+
super().__init__()
|
| 42 |
+
|
| 43 |
+
# pass init params to Encoder
|
| 44 |
+
self.encoder = encoder
|
| 45 |
+
self.use_quant_conv = use_quant_conv
|
| 46 |
+
self.normalize_latent_channels = normalize_latent_channels
|
| 47 |
+
|
| 48 |
+
# pass init params to Decoder
|
| 49 |
+
quant_dims = 2 if dims == 2 else 3
|
| 50 |
+
self.decoder = decoder
|
| 51 |
+
if use_quant_conv:
|
| 52 |
+
self.quant_conv = make_conv_nd(
|
| 53 |
+
quant_dims, 2 * latent_channels, 2 * latent_channels, 1
|
| 54 |
+
)
|
| 55 |
+
self.post_quant_conv = make_conv_nd(
|
| 56 |
+
quant_dims, latent_channels, latent_channels, 1
|
| 57 |
+
)
|
| 58 |
+
else:
|
| 59 |
+
self.quant_conv = nn.Identity()
|
| 60 |
+
self.post_quant_conv = nn.Identity()
|
| 61 |
+
|
| 62 |
+
if normalize_latent_channels:
|
| 63 |
+
if dims == 2:
|
| 64 |
+
self.latent_norm_out = nn.BatchNorm2d(latent_channels, affine=False)
|
| 65 |
+
else:
|
| 66 |
+
self.latent_norm_out = nn.BatchNorm3d(latent_channels, affine=False)
|
| 67 |
+
else:
|
| 68 |
+
self.latent_norm_out = nn.Identity()
|
| 69 |
+
self.use_z_tiling = False
|
| 70 |
+
self.use_hw_tiling = False
|
| 71 |
+
self.dims = dims
|
| 72 |
+
self.z_sample_size = 1
|
| 73 |
+
|
| 74 |
+
self.decoder_params = inspect.signature(self.decoder.forward).parameters
|
| 75 |
+
|
| 76 |
+
# only relevant if vae tiling is enabled
|
| 77 |
+
self.set_tiling_params(sample_size=sample_size, overlap_factor=0.25)
|
| 78 |
+
|
| 79 |
+
def set_tiling_params(self, sample_size: int = 512, overlap_factor: float = 0.25):
|
| 80 |
+
self.tile_sample_min_size = sample_size
|
| 81 |
+
num_blocks = len(self.encoder.down_blocks)
|
| 82 |
+
self.tile_latent_min_size = int(sample_size / (2 ** (num_blocks - 1)))
|
| 83 |
+
self.tile_overlap_factor = overlap_factor
|
| 84 |
+
|
| 85 |
+
def enable_z_tiling(self, z_sample_size: int = 8):
|
| 86 |
+
r"""
|
| 87 |
+
Enable tiling during VAE decoding.
|
| 88 |
+
|
| 89 |
+
When this option is enabled, the VAE will split the input tensor in tiles to compute decoding in several
|
| 90 |
+
steps. This is useful to save some memory and allow larger batch sizes.
|
| 91 |
+
"""
|
| 92 |
+
self.use_z_tiling = z_sample_size > 1
|
| 93 |
+
self.z_sample_size = z_sample_size
|
| 94 |
+
assert (
|
| 95 |
+
z_sample_size % 8 == 0 or z_sample_size == 1
|
| 96 |
+
), f"z_sample_size must be a multiple of 8 or 1. Got {z_sample_size}."
|
| 97 |
+
|
| 98 |
+
def disable_z_tiling(self):
|
| 99 |
+
r"""
|
| 100 |
+
Disable tiling during VAE decoding. If `use_tiling` was previously invoked, this method will go back to computing
|
| 101 |
+
decoding in one step.
|
| 102 |
+
"""
|
| 103 |
+
self.use_z_tiling = False
|
| 104 |
+
|
| 105 |
+
def enable_hw_tiling(self):
|
| 106 |
+
r"""
|
| 107 |
+
Enable tiling during VAE decoding along the height and width dimension.
|
| 108 |
+
"""
|
| 109 |
+
self.use_hw_tiling = True
|
| 110 |
+
|
| 111 |
+
def disable_hw_tiling(self):
|
| 112 |
+
r"""
|
| 113 |
+
Disable tiling during VAE decoding along the height and width dimension.
|
| 114 |
+
"""
|
| 115 |
+
self.use_hw_tiling = False
|
| 116 |
+
|
| 117 |
+
def _hw_tiled_encode(self, x: torch.FloatTensor, return_dict: bool = True):
|
| 118 |
+
overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor))
|
| 119 |
+
blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor)
|
| 120 |
+
row_limit = self.tile_latent_min_size - blend_extent
|
| 121 |
+
|
| 122 |
+
# Split the image into 512x512 tiles and encode them separately.
|
| 123 |
+
rows = []
|
| 124 |
+
for i in range(0, x.shape[3], overlap_size):
|
| 125 |
+
row = []
|
| 126 |
+
for j in range(0, x.shape[4], overlap_size):
|
| 127 |
+
tile = x[
|
| 128 |
+
:,
|
| 129 |
+
:,
|
| 130 |
+
:,
|
| 131 |
+
i : i + self.tile_sample_min_size,
|
| 132 |
+
j : j + self.tile_sample_min_size,
|
| 133 |
+
]
|
| 134 |
+
tile = self.encoder(tile)
|
| 135 |
+
tile = self.quant_conv(tile)
|
| 136 |
+
row.append(tile)
|
| 137 |
+
rows.append(row)
|
| 138 |
+
result_rows = []
|
| 139 |
+
for i, row in enumerate(rows):
|
| 140 |
+
result_row = []
|
| 141 |
+
for j, tile in enumerate(row):
|
| 142 |
+
# blend the above tile and the left tile
|
| 143 |
+
# to the current tile and add the current tile to the result row
|
| 144 |
+
if i > 0:
|
| 145 |
+
tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
|
| 146 |
+
if j > 0:
|
| 147 |
+
tile = self.blend_h(row[j - 1], tile, blend_extent)
|
| 148 |
+
result_row.append(tile[:, :, :, :row_limit, :row_limit])
|
| 149 |
+
result_rows.append(torch.cat(result_row, dim=4))
|
| 150 |
+
|
| 151 |
+
moments = torch.cat(result_rows, dim=3)
|
| 152 |
+
return moments
|
| 153 |
+
|
| 154 |
+
def blend_z(
|
| 155 |
+
self, a: torch.Tensor, b: torch.Tensor, blend_extent: int
|
| 156 |
+
) -> torch.Tensor:
|
| 157 |
+
blend_extent = min(a.shape[2], b.shape[2], blend_extent)
|
| 158 |
+
for z in range(blend_extent):
|
| 159 |
+
b[:, :, z, :, :] = a[:, :, -blend_extent + z, :, :] * (
|
| 160 |
+
1 - z / blend_extent
|
| 161 |
+
) + b[:, :, z, :, :] * (z / blend_extent)
|
| 162 |
+
return b
|
| 163 |
+
|
| 164 |
+
def blend_v(
|
| 165 |
+
self, a: torch.Tensor, b: torch.Tensor, blend_extent: int
|
| 166 |
+
) -> torch.Tensor:
|
| 167 |
+
blend_extent = min(a.shape[3], b.shape[3], blend_extent)
|
| 168 |
+
for y in range(blend_extent):
|
| 169 |
+
b[:, :, :, y, :] = a[:, :, :, -blend_extent + y, :] * (
|
| 170 |
+
1 - y / blend_extent
|
| 171 |
+
) + b[:, :, :, y, :] * (y / blend_extent)
|
| 172 |
+
return b
|
| 173 |
+
|
| 174 |
+
def blend_h(
|
| 175 |
+
self, a: torch.Tensor, b: torch.Tensor, blend_extent: int
|
| 176 |
+
) -> torch.Tensor:
|
| 177 |
+
blend_extent = min(a.shape[4], b.shape[4], blend_extent)
|
| 178 |
+
for x in range(blend_extent):
|
| 179 |
+
b[:, :, :, :, x] = a[:, :, :, :, -blend_extent + x] * (
|
| 180 |
+
1 - x / blend_extent
|
| 181 |
+
) + b[:, :, :, :, x] * (x / blend_extent)
|
| 182 |
+
return b
|
| 183 |
+
|
| 184 |
+
def _hw_tiled_decode(self, z: torch.FloatTensor, target_shape):
|
| 185 |
+
overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor))
|
| 186 |
+
blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor)
|
| 187 |
+
row_limit = self.tile_sample_min_size - blend_extent
|
| 188 |
+
tile_target_shape = (
|
| 189 |
+
*target_shape[:3],
|
| 190 |
+
self.tile_sample_min_size,
|
| 191 |
+
self.tile_sample_min_size,
|
| 192 |
+
)
|
| 193 |
+
# Split z into overlapping 64x64 tiles and decode them separately.
|
| 194 |
+
# The tiles have an overlap to avoid seams between tiles.
|
| 195 |
+
rows = []
|
| 196 |
+
for i in range(0, z.shape[3], overlap_size):
|
| 197 |
+
row = []
|
| 198 |
+
for j in range(0, z.shape[4], overlap_size):
|
| 199 |
+
tile = z[
|
| 200 |
+
:,
|
| 201 |
+
:,
|
| 202 |
+
:,
|
| 203 |
+
i : i + self.tile_latent_min_size,
|
| 204 |
+
j : j + self.tile_latent_min_size,
|
| 205 |
+
]
|
| 206 |
+
tile = self.post_quant_conv(tile)
|
| 207 |
+
decoded = self.decoder(tile, target_shape=tile_target_shape)
|
| 208 |
+
row.append(decoded)
|
| 209 |
+
rows.append(row)
|
| 210 |
+
result_rows = []
|
| 211 |
+
for i, row in enumerate(rows):
|
| 212 |
+
result_row = []
|
| 213 |
+
for j, tile in enumerate(row):
|
| 214 |
+
# blend the above tile and the left tile
|
| 215 |
+
# to the current tile and add the current tile to the result row
|
| 216 |
+
if i > 0:
|
| 217 |
+
tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
|
| 218 |
+
if j > 0:
|
| 219 |
+
tile = self.blend_h(row[j - 1], tile, blend_extent)
|
| 220 |
+
result_row.append(tile[:, :, :, :row_limit, :row_limit])
|
| 221 |
+
result_rows.append(torch.cat(result_row, dim=4))
|
| 222 |
+
|
| 223 |
+
dec = torch.cat(result_rows, dim=3)
|
| 224 |
+
return dec
|
| 225 |
+
|
| 226 |
+
def encode(
|
| 227 |
+
self, z: torch.FloatTensor, return_dict: bool = True
|
| 228 |
+
) -> Union[DecoderOutput, torch.FloatTensor]:
|
| 229 |
+
if self.use_z_tiling and z.shape[2] > self.z_sample_size > 1:
|
| 230 |
+
num_splits = z.shape[2] // self.z_sample_size
|
| 231 |
+
sizes = [self.z_sample_size] * num_splits
|
| 232 |
+
sizes = (
|
| 233 |
+
sizes + [z.shape[2] - sum(sizes)]
|
| 234 |
+
if z.shape[2] - sum(sizes) > 0
|
| 235 |
+
else sizes
|
| 236 |
+
)
|
| 237 |
+
tiles = z.split(sizes, dim=2)
|
| 238 |
+
moments_tiles = [
|
| 239 |
+
(
|
| 240 |
+
self._hw_tiled_encode(z_tile, return_dict)
|
| 241 |
+
if self.use_hw_tiling
|
| 242 |
+
else self._encode(z_tile)
|
| 243 |
+
)
|
| 244 |
+
for z_tile in tiles
|
| 245 |
+
]
|
| 246 |
+
moments = torch.cat(moments_tiles, dim=2)
|
| 247 |
+
|
| 248 |
+
else:
|
| 249 |
+
moments = (
|
| 250 |
+
self._hw_tiled_encode(z, return_dict)
|
| 251 |
+
if self.use_hw_tiling
|
| 252 |
+
else self._encode(z)
|
| 253 |
+
)
|
| 254 |
+
|
| 255 |
+
posterior = DiagonalGaussianDistribution(moments)
|
| 256 |
+
if not return_dict:
|
| 257 |
+
return (posterior,)
|
| 258 |
+
|
| 259 |
+
return AutoencoderKLOutput(latent_dist=posterior)
|
| 260 |
+
|
| 261 |
+
def _normalize_latent_channels(self, z: torch.FloatTensor) -> torch.FloatTensor:
|
| 262 |
+
if isinstance(self.latent_norm_out, nn.BatchNorm3d):
|
| 263 |
+
_, c, _, _, _ = z.shape
|
| 264 |
+
z = torch.cat(
|
| 265 |
+
[
|
| 266 |
+
self.latent_norm_out(z[:, : c // 2, :, :, :]),
|
| 267 |
+
z[:, c // 2 :, :, :, :],
|
| 268 |
+
],
|
| 269 |
+
dim=1,
|
| 270 |
+
)
|
| 271 |
+
elif isinstance(self.latent_norm_out, nn.BatchNorm2d):
|
| 272 |
+
raise NotImplementedError("BatchNorm2d not supported")
|
| 273 |
+
return z
|
| 274 |
+
|
| 275 |
+
def _unnormalize_latent_channels(self, z: torch.FloatTensor) -> torch.FloatTensor:
|
| 276 |
+
if isinstance(self.latent_norm_out, nn.BatchNorm3d):
|
| 277 |
+
running_mean = self.latent_norm_out.running_mean.view(1, -1, 1, 1, 1)
|
| 278 |
+
running_var = self.latent_norm_out.running_var.view(1, -1, 1, 1, 1)
|
| 279 |
+
eps = self.latent_norm_out.eps
|
| 280 |
+
|
| 281 |
+
z = z * torch.sqrt(running_var + eps) + running_mean
|
| 282 |
+
elif isinstance(self.latent_norm_out, nn.BatchNorm3d):
|
| 283 |
+
raise NotImplementedError("BatchNorm2d not supported")
|
| 284 |
+
return z
|
| 285 |
+
|
| 286 |
+
def _encode(self, x: torch.FloatTensor) -> AutoencoderKLOutput:
|
| 287 |
+
h = self.encoder(x)
|
| 288 |
+
moments = self.quant_conv(h)
|
| 289 |
+
moments = self._normalize_latent_channels(moments)
|
| 290 |
+
return moments
|
| 291 |
+
|
| 292 |
+
def _decode(
|
| 293 |
+
self,
|
| 294 |
+
z: torch.FloatTensor,
|
| 295 |
+
target_shape=None,
|
| 296 |
+
timestep: Optional[torch.Tensor] = None,
|
| 297 |
+
) -> Union[DecoderOutput, torch.FloatTensor]:
|
| 298 |
+
z = self._unnormalize_latent_channels(z)
|
| 299 |
+
z = self.post_quant_conv(z)
|
| 300 |
+
if "timestep" in self.decoder_params:
|
| 301 |
+
dec = self.decoder(z, target_shape=target_shape, timestep=timestep)
|
| 302 |
+
else:
|
| 303 |
+
dec = self.decoder(z, target_shape=target_shape)
|
| 304 |
+
return dec
|
| 305 |
+
|
| 306 |
+
def decode(
|
| 307 |
+
self,
|
| 308 |
+
z: torch.FloatTensor,
|
| 309 |
+
return_dict: bool = True,
|
| 310 |
+
target_shape=None,
|
| 311 |
+
timestep: Optional[torch.Tensor] = None,
|
| 312 |
+
) -> Union[DecoderOutput, torch.FloatTensor]:
|
| 313 |
+
assert target_shape is not None, "target_shape must be provided for decoding"
|
| 314 |
+
if self.use_z_tiling and z.shape[2] > self.z_sample_size > 1:
|
| 315 |
+
reduction_factor = int(
|
| 316 |
+
self.encoder.patch_size_t
|
| 317 |
+
* 2
|
| 318 |
+
** (
|
| 319 |
+
len(self.encoder.down_blocks)
|
| 320 |
+
- 1
|
| 321 |
+
- math.sqrt(self.encoder.patch_size)
|
| 322 |
+
)
|
| 323 |
+
)
|
| 324 |
+
split_size = self.z_sample_size // reduction_factor
|
| 325 |
+
num_splits = z.shape[2] // split_size
|
| 326 |
+
|
| 327 |
+
# copy target shape, and divide frame dimension (=2) by the context size
|
| 328 |
+
target_shape_split = list(target_shape)
|
| 329 |
+
target_shape_split[2] = target_shape[2] // num_splits
|
| 330 |
+
|
| 331 |
+
decoded_tiles = [
|
| 332 |
+
(
|
| 333 |
+
self._hw_tiled_decode(z_tile, target_shape_split)
|
| 334 |
+
if self.use_hw_tiling
|
| 335 |
+
else self._decode(z_tile, target_shape=target_shape_split)
|
| 336 |
+
)
|
| 337 |
+
for z_tile in torch.tensor_split(z, num_splits, dim=2)
|
| 338 |
+
]
|
| 339 |
+
decoded = torch.cat(decoded_tiles, dim=2)
|
| 340 |
+
else:
|
| 341 |
+
decoded = (
|
| 342 |
+
self._hw_tiled_decode(z, target_shape)
|
| 343 |
+
if self.use_hw_tiling
|
| 344 |
+
else self._decode(z, target_shape=target_shape, timestep=timestep)
|
| 345 |
+
)
|
| 346 |
+
|
| 347 |
+
if not return_dict:
|
| 348 |
+
return (decoded,)
|
| 349 |
+
|
| 350 |
+
return DecoderOutput(sample=decoded)
|
| 351 |
+
|
| 352 |
+
def forward(
|
| 353 |
+
self,
|
| 354 |
+
sample: torch.FloatTensor,
|
| 355 |
+
sample_posterior: bool = False,
|
| 356 |
+
return_dict: bool = True,
|
| 357 |
+
generator: Optional[torch.Generator] = None,
|
| 358 |
+
) -> Union[DecoderOutput, torch.FloatTensor]:
|
| 359 |
+
r"""
|
| 360 |
+
Args:
|
| 361 |
+
sample (`torch.FloatTensor`): Input sample.
|
| 362 |
+
sample_posterior (`bool`, *optional*, defaults to `False`):
|
| 363 |
+
Whether to sample from the posterior.
|
| 364 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
| 365 |
+
Whether to return a [`DecoderOutput`] instead of a plain tuple.
|
| 366 |
+
generator (`torch.Generator`, *optional*):
|
| 367 |
+
Generator used to sample from the posterior.
|
| 368 |
+
"""
|
| 369 |
+
x = sample
|
| 370 |
+
posterior = self.encode(x).latent_dist
|
| 371 |
+
if sample_posterior:
|
| 372 |
+
z = posterior.sample(generator=generator)
|
| 373 |
+
else:
|
| 374 |
+
z = posterior.mode()
|
| 375 |
+
dec = self.decode(z, target_shape=sample.shape).sample
|
| 376 |
+
|
| 377 |
+
if not return_dict:
|
| 378 |
+
return (dec,)
|
| 379 |
+
|
| 380 |
+
return DecoderOutput(sample=dec)
|
ltx_video/models/autoencoders/vae_encode.py
ADDED
|
@@ -0,0 +1,247 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from typing import Tuple
|
| 2 |
+
import torch
|
| 3 |
+
from diffusers import AutoencoderKL
|
| 4 |
+
from einops import rearrange
|
| 5 |
+
from torch import Tensor
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
from ltx_video.models.autoencoders.causal_video_autoencoder import (
|
| 9 |
+
CausalVideoAutoencoder,
|
| 10 |
+
)
|
| 11 |
+
from ltx_video.models.autoencoders.video_autoencoder import (
|
| 12 |
+
Downsample3D,
|
| 13 |
+
VideoAutoencoder,
|
| 14 |
+
)
|
| 15 |
+
|
| 16 |
+
try:
|
| 17 |
+
import torch_xla.core.xla_model as xm
|
| 18 |
+
except ImportError:
|
| 19 |
+
xm = None
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
def vae_encode(
|
| 23 |
+
media_items: Tensor,
|
| 24 |
+
vae: AutoencoderKL,
|
| 25 |
+
split_size: int = 1,
|
| 26 |
+
vae_per_channel_normalize=False,
|
| 27 |
+
) -> Tensor:
|
| 28 |
+
"""
|
| 29 |
+
Encodes media items (images or videos) into latent representations using a specified VAE model.
|
| 30 |
+
The function supports processing batches of images or video frames and can handle the processing
|
| 31 |
+
in smaller sub-batches if needed.
|
| 32 |
+
|
| 33 |
+
Args:
|
| 34 |
+
media_items (Tensor): A torch Tensor containing the media items to encode. The expected
|
| 35 |
+
shape is (batch_size, channels, height, width) for images or (batch_size, channels,
|
| 36 |
+
frames, height, width) for videos.
|
| 37 |
+
vae (AutoencoderKL): An instance of the `AutoencoderKL` class from the `diffusers` library,
|
| 38 |
+
pre-configured and loaded with the appropriate model weights.
|
| 39 |
+
split_size (int, optional): The number of sub-batches to split the input batch into for encoding.
|
| 40 |
+
If set to more than 1, the input media items are processed in smaller batches according to
|
| 41 |
+
this value. Defaults to 1, which processes all items in a single batch.
|
| 42 |
+
|
| 43 |
+
Returns:
|
| 44 |
+
Tensor: A torch Tensor of the encoded latent representations. The shape of the tensor is adjusted
|
| 45 |
+
to match the input shape, scaled by the model's configuration.
|
| 46 |
+
|
| 47 |
+
Examples:
|
| 48 |
+
>>> import torch
|
| 49 |
+
>>> from diffusers import AutoencoderKL
|
| 50 |
+
>>> vae = AutoencoderKL.from_pretrained('your-model-name')
|
| 51 |
+
>>> images = torch.rand(10, 3, 8 256, 256) # Example tensor with 10 videos of 8 frames.
|
| 52 |
+
>>> latents = vae_encode(images, vae)
|
| 53 |
+
>>> print(latents.shape) # Output shape will depend on the model's latent configuration.
|
| 54 |
+
|
| 55 |
+
Note:
|
| 56 |
+
In case of a video, the function encodes the media item frame-by frame.
|
| 57 |
+
"""
|
| 58 |
+
is_video_shaped = media_items.dim() == 5
|
| 59 |
+
batch_size, channels = media_items.shape[0:2]
|
| 60 |
+
|
| 61 |
+
if channels != 3:
|
| 62 |
+
raise ValueError(f"Expects tensors with 3 channels, got {channels}.")
|
| 63 |
+
|
| 64 |
+
if is_video_shaped and not isinstance(
|
| 65 |
+
vae, (VideoAutoencoder, CausalVideoAutoencoder)
|
| 66 |
+
):
|
| 67 |
+
media_items = rearrange(media_items, "b c n h w -> (b n) c h w")
|
| 68 |
+
if split_size > 1:
|
| 69 |
+
if len(media_items) % split_size != 0:
|
| 70 |
+
raise ValueError(
|
| 71 |
+
"Error: The batch size must be divisible by 'train.vae_bs_split"
|
| 72 |
+
)
|
| 73 |
+
encode_bs = len(media_items) // split_size
|
| 74 |
+
# latents = [vae.encode(image_batch).latent_dist.sample() for image_batch in media_items.split(encode_bs)]
|
| 75 |
+
latents = []
|
| 76 |
+
if media_items.device.type == "xla":
|
| 77 |
+
xm.mark_step()
|
| 78 |
+
for image_batch in media_items.split(encode_bs):
|
| 79 |
+
latents.append(vae.encode(image_batch).latent_dist.sample())
|
| 80 |
+
if media_items.device.type == "xla":
|
| 81 |
+
xm.mark_step()
|
| 82 |
+
latents = torch.cat(latents, dim=0)
|
| 83 |
+
else:
|
| 84 |
+
latents = vae.encode(media_items).latent_dist.sample()
|
| 85 |
+
|
| 86 |
+
latents = normalize_latents(latents, vae, vae_per_channel_normalize)
|
| 87 |
+
if is_video_shaped and not isinstance(
|
| 88 |
+
vae, (VideoAutoencoder, CausalVideoAutoencoder)
|
| 89 |
+
):
|
| 90 |
+
latents = rearrange(latents, "(b n) c h w -> b c n h w", b=batch_size)
|
| 91 |
+
return latents
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
def vae_decode(
|
| 95 |
+
latents: Tensor,
|
| 96 |
+
vae: AutoencoderKL,
|
| 97 |
+
is_video: bool = True,
|
| 98 |
+
split_size: int = 1,
|
| 99 |
+
vae_per_channel_normalize=False,
|
| 100 |
+
timestep=None,
|
| 101 |
+
) -> Tensor:
|
| 102 |
+
is_video_shaped = latents.dim() == 5
|
| 103 |
+
batch_size = latents.shape[0]
|
| 104 |
+
|
| 105 |
+
if is_video_shaped and not isinstance(
|
| 106 |
+
vae, (VideoAutoencoder, CausalVideoAutoencoder)
|
| 107 |
+
):
|
| 108 |
+
latents = rearrange(latents, "b c n h w -> (b n) c h w")
|
| 109 |
+
if split_size > 1:
|
| 110 |
+
if len(latents) % split_size != 0:
|
| 111 |
+
raise ValueError(
|
| 112 |
+
"Error: The batch size must be divisible by 'train.vae_bs_split"
|
| 113 |
+
)
|
| 114 |
+
encode_bs = len(latents) // split_size
|
| 115 |
+
image_batch = [
|
| 116 |
+
_run_decoder(
|
| 117 |
+
latent_batch, vae, is_video, vae_per_channel_normalize, timestep
|
| 118 |
+
)
|
| 119 |
+
for latent_batch in latents.split(encode_bs)
|
| 120 |
+
]
|
| 121 |
+
images = torch.cat(image_batch, dim=0)
|
| 122 |
+
else:
|
| 123 |
+
images = _run_decoder(
|
| 124 |
+
latents, vae, is_video, vae_per_channel_normalize, timestep
|
| 125 |
+
)
|
| 126 |
+
|
| 127 |
+
if is_video_shaped and not isinstance(
|
| 128 |
+
vae, (VideoAutoencoder, CausalVideoAutoencoder)
|
| 129 |
+
):
|
| 130 |
+
images = rearrange(images, "(b n) c h w -> b c n h w", b=batch_size)
|
| 131 |
+
return images
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
def _run_decoder(
|
| 135 |
+
latents: Tensor,
|
| 136 |
+
vae: AutoencoderKL,
|
| 137 |
+
is_video: bool,
|
| 138 |
+
vae_per_channel_normalize=False,
|
| 139 |
+
timestep=None,
|
| 140 |
+
) -> Tensor:
|
| 141 |
+
if isinstance(vae, (VideoAutoencoder, CausalVideoAutoencoder)):
|
| 142 |
+
*_, fl, hl, wl = latents.shape
|
| 143 |
+
temporal_scale, spatial_scale, _ = get_vae_size_scale_factor(vae)
|
| 144 |
+
latents = latents.to(vae.dtype)
|
| 145 |
+
vae_decode_kwargs = {}
|
| 146 |
+
if timestep is not None:
|
| 147 |
+
vae_decode_kwargs["timestep"] = timestep
|
| 148 |
+
image = vae.decode(
|
| 149 |
+
un_normalize_latents(latents, vae, vae_per_channel_normalize),
|
| 150 |
+
return_dict=False,
|
| 151 |
+
target_shape=(
|
| 152 |
+
1,
|
| 153 |
+
3,
|
| 154 |
+
fl * temporal_scale if is_video else 1,
|
| 155 |
+
hl * spatial_scale,
|
| 156 |
+
wl * spatial_scale,
|
| 157 |
+
),
|
| 158 |
+
**vae_decode_kwargs,
|
| 159 |
+
)[0]
|
| 160 |
+
else:
|
| 161 |
+
image = vae.decode(
|
| 162 |
+
un_normalize_latents(latents, vae, vae_per_channel_normalize),
|
| 163 |
+
return_dict=False,
|
| 164 |
+
)[0]
|
| 165 |
+
return image
|
| 166 |
+
|
| 167 |
+
|
| 168 |
+
def get_vae_size_scale_factor(vae: AutoencoderKL) -> float:
|
| 169 |
+
if isinstance(vae, CausalVideoAutoencoder):
|
| 170 |
+
spatial = vae.spatial_downscale_factor
|
| 171 |
+
temporal = vae.temporal_downscale_factor
|
| 172 |
+
else:
|
| 173 |
+
down_blocks = len(
|
| 174 |
+
[
|
| 175 |
+
block
|
| 176 |
+
for block in vae.encoder.down_blocks
|
| 177 |
+
if isinstance(block.downsample, Downsample3D)
|
| 178 |
+
]
|
| 179 |
+
)
|
| 180 |
+
spatial = vae.config.patch_size * 2**down_blocks
|
| 181 |
+
temporal = (
|
| 182 |
+
vae.config.patch_size_t * 2**down_blocks
|
| 183 |
+
if isinstance(vae, VideoAutoencoder)
|
| 184 |
+
else 1
|
| 185 |
+
)
|
| 186 |
+
|
| 187 |
+
return (temporal, spatial, spatial)
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
def latent_to_pixel_coords(
|
| 191 |
+
latent_coords: Tensor, vae: AutoencoderKL, causal_fix: bool = False
|
| 192 |
+
) -> Tensor:
|
| 193 |
+
"""
|
| 194 |
+
Converts latent coordinates to pixel coordinates by scaling them according to the VAE's
|
| 195 |
+
configuration.
|
| 196 |
+
|
| 197 |
+
Args:
|
| 198 |
+
latent_coords (Tensor): A tensor of shape [batch_size, 3, num_latents]
|
| 199 |
+
containing the latent corner coordinates of each token.
|
| 200 |
+
vae (AutoencoderKL): The VAE model
|
| 201 |
+
causal_fix (bool): Whether to take into account the different temporal scale
|
| 202 |
+
of the first frame. Default = False for backwards compatibility.
|
| 203 |
+
Returns:
|
| 204 |
+
Tensor: A tensor of pixel coordinates corresponding to the input latent coordinates.
|
| 205 |
+
"""
|
| 206 |
+
|
| 207 |
+
scale_factors = get_vae_size_scale_factor(vae)
|
| 208 |
+
causal_fix = isinstance(vae, CausalVideoAutoencoder) and causal_fix
|
| 209 |
+
pixel_coords = latent_to_pixel_coords_from_factors(
|
| 210 |
+
latent_coords, scale_factors, causal_fix
|
| 211 |
+
)
|
| 212 |
+
return pixel_coords
|
| 213 |
+
|
| 214 |
+
|
| 215 |
+
def latent_to_pixel_coords_from_factors(
|
| 216 |
+
latent_coords: Tensor, scale_factors: Tuple, causal_fix: bool = False
|
| 217 |
+
) -> Tensor:
|
| 218 |
+
pixel_coords = (
|
| 219 |
+
latent_coords
|
| 220 |
+
* torch.tensor(scale_factors, device=latent_coords.device)[None, :, None]
|
| 221 |
+
)
|
| 222 |
+
if causal_fix:
|
| 223 |
+
# Fix temporal scale for first frame to 1 due to causality
|
| 224 |
+
pixel_coords[:, 0] = (pixel_coords[:, 0] + 1 - scale_factors[0]).clamp(min=0)
|
| 225 |
+
return pixel_coords
|
| 226 |
+
|
| 227 |
+
|
| 228 |
+
def normalize_latents(
|
| 229 |
+
latents: Tensor, vae: AutoencoderKL, vae_per_channel_normalize: bool = False
|
| 230 |
+
) -> Tensor:
|
| 231 |
+
return (
|
| 232 |
+
(latents - vae.mean_of_means.to(latents.dtype).view(1, -1, 1, 1, 1))
|
| 233 |
+
/ vae.std_of_means.to(latents.dtype).view(1, -1, 1, 1, 1)
|
| 234 |
+
if vae_per_channel_normalize
|
| 235 |
+
else latents * vae.config.scaling_factor
|
| 236 |
+
)
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
def un_normalize_latents(
|
| 240 |
+
latents: Tensor, vae: AutoencoderKL, vae_per_channel_normalize: bool = False
|
| 241 |
+
) -> Tensor:
|
| 242 |
+
return (
|
| 243 |
+
latents * vae.std_of_means.to(latents.dtype).view(1, -1, 1, 1, 1)
|
| 244 |
+
+ vae.mean_of_means.to(latents.dtype).view(1, -1, 1, 1, 1)
|
| 245 |
+
if vae_per_channel_normalize
|
| 246 |
+
else latents / vae.config.scaling_factor
|
| 247 |
+
)
|
ltx_video/models/autoencoders/video_autoencoder.py
ADDED
|
@@ -0,0 +1,1045 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import json
|
| 2 |
+
import os
|
| 3 |
+
from functools import partial
|
| 4 |
+
from types import SimpleNamespace
|
| 5 |
+
from typing import Any, Mapping, Optional, Tuple, Union
|
| 6 |
+
|
| 7 |
+
import torch
|
| 8 |
+
from einops import rearrange
|
| 9 |
+
from torch import nn
|
| 10 |
+
from torch.nn import functional
|
| 11 |
+
|
| 12 |
+
from diffusers.utils import logging
|
| 13 |
+
|
| 14 |
+
from ltx_video.utils.torch_utils import Identity
|
| 15 |
+
from ltx_video.models.autoencoders.conv_nd_factory import make_conv_nd, make_linear_nd
|
| 16 |
+
from ltx_video.models.autoencoders.pixel_norm import PixelNorm
|
| 17 |
+
from ltx_video.models.autoencoders.vae import AutoencoderKLWrapper
|
| 18 |
+
|
| 19 |
+
logger = logging.get_logger(__name__)
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
class VideoAutoencoder(AutoencoderKLWrapper):
|
| 23 |
+
@classmethod
|
| 24 |
+
def from_pretrained(
|
| 25 |
+
cls,
|
| 26 |
+
pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
|
| 27 |
+
*args,
|
| 28 |
+
**kwargs,
|
| 29 |
+
):
|
| 30 |
+
config_local_path = pretrained_model_name_or_path / "config.json"
|
| 31 |
+
config = cls.load_config(config_local_path, **kwargs)
|
| 32 |
+
video_vae = cls.from_config(config)
|
| 33 |
+
video_vae.to(kwargs["torch_dtype"])
|
| 34 |
+
|
| 35 |
+
model_local_path = pretrained_model_name_or_path / "autoencoder.pth"
|
| 36 |
+
ckpt_state_dict = torch.load(model_local_path)
|
| 37 |
+
video_vae.load_state_dict(ckpt_state_dict)
|
| 38 |
+
|
| 39 |
+
statistics_local_path = (
|
| 40 |
+
pretrained_model_name_or_path / "per_channel_statistics.json"
|
| 41 |
+
)
|
| 42 |
+
if statistics_local_path.exists():
|
| 43 |
+
with open(statistics_local_path, "r") as file:
|
| 44 |
+
data = json.load(file)
|
| 45 |
+
transposed_data = list(zip(*data["data"]))
|
| 46 |
+
data_dict = {
|
| 47 |
+
col: torch.tensor(vals)
|
| 48 |
+
for col, vals in zip(data["columns"], transposed_data)
|
| 49 |
+
}
|
| 50 |
+
video_vae.register_buffer("std_of_means", data_dict["std-of-means"])
|
| 51 |
+
video_vae.register_buffer(
|
| 52 |
+
"mean_of_means",
|
| 53 |
+
data_dict.get(
|
| 54 |
+
"mean-of-means", torch.zeros_like(data_dict["std-of-means"])
|
| 55 |
+
),
|
| 56 |
+
)
|
| 57 |
+
|
| 58 |
+
return video_vae
|
| 59 |
+
|
| 60 |
+
@staticmethod
|
| 61 |
+
def from_config(config):
|
| 62 |
+
assert (
|
| 63 |
+
config["_class_name"] == "VideoAutoencoder"
|
| 64 |
+
), "config must have _class_name=VideoAutoencoder"
|
| 65 |
+
if isinstance(config["dims"], list):
|
| 66 |
+
config["dims"] = tuple(config["dims"])
|
| 67 |
+
|
| 68 |
+
assert config["dims"] in [2, 3, (2, 1)], "dims must be 2, 3 or (2, 1)"
|
| 69 |
+
|
| 70 |
+
double_z = config.get("double_z", True)
|
| 71 |
+
latent_log_var = config.get(
|
| 72 |
+
"latent_log_var", "per_channel" if double_z else "none"
|
| 73 |
+
)
|
| 74 |
+
use_quant_conv = config.get("use_quant_conv", True)
|
| 75 |
+
|
| 76 |
+
if use_quant_conv and latent_log_var == "uniform":
|
| 77 |
+
raise ValueError("uniform latent_log_var requires use_quant_conv=False")
|
| 78 |
+
|
| 79 |
+
encoder = Encoder(
|
| 80 |
+
dims=config["dims"],
|
| 81 |
+
in_channels=config.get("in_channels", 3),
|
| 82 |
+
out_channels=config["latent_channels"],
|
| 83 |
+
block_out_channels=config["block_out_channels"],
|
| 84 |
+
patch_size=config.get("patch_size", 1),
|
| 85 |
+
latent_log_var=latent_log_var,
|
| 86 |
+
norm_layer=config.get("norm_layer", "group_norm"),
|
| 87 |
+
patch_size_t=config.get("patch_size_t", config.get("patch_size", 1)),
|
| 88 |
+
add_channel_padding=config.get("add_channel_padding", False),
|
| 89 |
+
)
|
| 90 |
+
|
| 91 |
+
decoder = Decoder(
|
| 92 |
+
dims=config["dims"],
|
| 93 |
+
in_channels=config["latent_channels"],
|
| 94 |
+
out_channels=config.get("out_channels", 3),
|
| 95 |
+
block_out_channels=config["block_out_channels"],
|
| 96 |
+
patch_size=config.get("patch_size", 1),
|
| 97 |
+
norm_layer=config.get("norm_layer", "group_norm"),
|
| 98 |
+
patch_size_t=config.get("patch_size_t", config.get("patch_size", 1)),
|
| 99 |
+
add_channel_padding=config.get("add_channel_padding", False),
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
dims = config["dims"]
|
| 103 |
+
return VideoAutoencoder(
|
| 104 |
+
encoder=encoder,
|
| 105 |
+
decoder=decoder,
|
| 106 |
+
latent_channels=config["latent_channels"],
|
| 107 |
+
dims=dims,
|
| 108 |
+
use_quant_conv=use_quant_conv,
|
| 109 |
+
)
|
| 110 |
+
|
| 111 |
+
@property
|
| 112 |
+
def config(self):
|
| 113 |
+
return SimpleNamespace(
|
| 114 |
+
_class_name="VideoAutoencoder",
|
| 115 |
+
dims=self.dims,
|
| 116 |
+
in_channels=self.encoder.conv_in.in_channels
|
| 117 |
+
// (self.encoder.patch_size_t * self.encoder.patch_size**2),
|
| 118 |
+
out_channels=self.decoder.conv_out.out_channels
|
| 119 |
+
// (self.decoder.patch_size_t * self.decoder.patch_size**2),
|
| 120 |
+
latent_channels=self.decoder.conv_in.in_channels,
|
| 121 |
+
block_out_channels=[
|
| 122 |
+
self.encoder.down_blocks[i].res_blocks[-1].conv1.out_channels
|
| 123 |
+
for i in range(len(self.encoder.down_blocks))
|
| 124 |
+
],
|
| 125 |
+
scaling_factor=1.0,
|
| 126 |
+
norm_layer=self.encoder.norm_layer,
|
| 127 |
+
patch_size=self.encoder.patch_size,
|
| 128 |
+
latent_log_var=self.encoder.latent_log_var,
|
| 129 |
+
use_quant_conv=self.use_quant_conv,
|
| 130 |
+
patch_size_t=self.encoder.patch_size_t,
|
| 131 |
+
add_channel_padding=self.encoder.add_channel_padding,
|
| 132 |
+
)
|
| 133 |
+
|
| 134 |
+
@property
|
| 135 |
+
def is_video_supported(self):
|
| 136 |
+
"""
|
| 137 |
+
Check if the model supports video inputs of shape (B, C, F, H, W). Otherwise, the model only supports 2D images.
|
| 138 |
+
"""
|
| 139 |
+
return self.dims != 2
|
| 140 |
+
|
| 141 |
+
@property
|
| 142 |
+
def downscale_factor(self):
|
| 143 |
+
return self.encoder.downsample_factor
|
| 144 |
+
|
| 145 |
+
def to_json_string(self) -> str:
|
| 146 |
+
import json
|
| 147 |
+
|
| 148 |
+
return json.dumps(self.config.__dict__)
|
| 149 |
+
|
| 150 |
+
def load_state_dict(self, state_dict: Mapping[str, Any], strict: bool = True):
|
| 151 |
+
model_keys = set(name for name, _ in self.named_parameters())
|
| 152 |
+
|
| 153 |
+
key_mapping = {
|
| 154 |
+
".resnets.": ".res_blocks.",
|
| 155 |
+
"downsamplers.0": "downsample",
|
| 156 |
+
"upsamplers.0": "upsample",
|
| 157 |
+
}
|
| 158 |
+
|
| 159 |
+
converted_state_dict = {}
|
| 160 |
+
for key, value in state_dict.items():
|
| 161 |
+
for k, v in key_mapping.items():
|
| 162 |
+
key = key.replace(k, v)
|
| 163 |
+
|
| 164 |
+
if "norm" in key and key not in model_keys:
|
| 165 |
+
logger.info(
|
| 166 |
+
f"Removing key {key} from state_dict as it is not present in the model"
|
| 167 |
+
)
|
| 168 |
+
continue
|
| 169 |
+
|
| 170 |
+
converted_state_dict[key] = value
|
| 171 |
+
|
| 172 |
+
super().load_state_dict(converted_state_dict, strict=strict)
|
| 173 |
+
|
| 174 |
+
def last_layer(self):
|
| 175 |
+
if hasattr(self.decoder, "conv_out"):
|
| 176 |
+
if isinstance(self.decoder.conv_out, nn.Sequential):
|
| 177 |
+
last_layer = self.decoder.conv_out[-1]
|
| 178 |
+
else:
|
| 179 |
+
last_layer = self.decoder.conv_out
|
| 180 |
+
else:
|
| 181 |
+
last_layer = self.decoder.layers[-1]
|
| 182 |
+
return last_layer
|
| 183 |
+
|
| 184 |
+
|
| 185 |
+
class Encoder(nn.Module):
|
| 186 |
+
r"""
|
| 187 |
+
The `Encoder` layer of a variational autoencoder that encodes its input into a latent representation.
|
| 188 |
+
|
| 189 |
+
Args:
|
| 190 |
+
in_channels (`int`, *optional*, defaults to 3):
|
| 191 |
+
The number of input channels.
|
| 192 |
+
out_channels (`int`, *optional*, defaults to 3):
|
| 193 |
+
The number of output channels.
|
| 194 |
+
block_out_channels (`Tuple[int, ...]`, *optional*, defaults to `(64,)`):
|
| 195 |
+
The number of output channels for each block.
|
| 196 |
+
layers_per_block (`int`, *optional*, defaults to 2):
|
| 197 |
+
The number of layers per block.
|
| 198 |
+
norm_num_groups (`int`, *optional*, defaults to 32):
|
| 199 |
+
The number of groups for normalization.
|
| 200 |
+
patch_size (`int`, *optional*, defaults to 1):
|
| 201 |
+
The patch size to use. Should be a power of 2.
|
| 202 |
+
norm_layer (`str`, *optional*, defaults to `group_norm`):
|
| 203 |
+
The normalization layer to use. Can be either `group_norm` or `pixel_norm`.
|
| 204 |
+
latent_log_var (`str`, *optional*, defaults to `per_channel`):
|
| 205 |
+
The number of channels for the log variance. Can be either `per_channel`, `uniform`, or `none`.
|
| 206 |
+
"""
|
| 207 |
+
|
| 208 |
+
def __init__(
|
| 209 |
+
self,
|
| 210 |
+
dims: Union[int, Tuple[int, int]] = 3,
|
| 211 |
+
in_channels: int = 3,
|
| 212 |
+
out_channels: int = 3,
|
| 213 |
+
block_out_channels: Tuple[int, ...] = (64,),
|
| 214 |
+
layers_per_block: int = 2,
|
| 215 |
+
norm_num_groups: int = 32,
|
| 216 |
+
patch_size: Union[int, Tuple[int]] = 1,
|
| 217 |
+
norm_layer: str = "group_norm", # group_norm, pixel_norm
|
| 218 |
+
latent_log_var: str = "per_channel",
|
| 219 |
+
patch_size_t: Optional[int] = None,
|
| 220 |
+
add_channel_padding: Optional[bool] = False,
|
| 221 |
+
):
|
| 222 |
+
super().__init__()
|
| 223 |
+
self.patch_size = patch_size
|
| 224 |
+
self.patch_size_t = patch_size_t if patch_size_t is not None else patch_size
|
| 225 |
+
self.add_channel_padding = add_channel_padding
|
| 226 |
+
self.layers_per_block = layers_per_block
|
| 227 |
+
self.norm_layer = norm_layer
|
| 228 |
+
self.latent_channels = out_channels
|
| 229 |
+
self.latent_log_var = latent_log_var
|
| 230 |
+
if add_channel_padding:
|
| 231 |
+
in_channels = in_channels * self.patch_size**3
|
| 232 |
+
else:
|
| 233 |
+
in_channels = in_channels * self.patch_size_t * self.patch_size**2
|
| 234 |
+
self.in_channels = in_channels
|
| 235 |
+
output_channel = block_out_channels[0]
|
| 236 |
+
|
| 237 |
+
self.conv_in = make_conv_nd(
|
| 238 |
+
dims=dims,
|
| 239 |
+
in_channels=in_channels,
|
| 240 |
+
out_channels=output_channel,
|
| 241 |
+
kernel_size=3,
|
| 242 |
+
stride=1,
|
| 243 |
+
padding=1,
|
| 244 |
+
)
|
| 245 |
+
|
| 246 |
+
self.down_blocks = nn.ModuleList([])
|
| 247 |
+
|
| 248 |
+
for i in range(len(block_out_channels)):
|
| 249 |
+
input_channel = output_channel
|
| 250 |
+
output_channel = block_out_channels[i]
|
| 251 |
+
is_final_block = i == len(block_out_channels) - 1
|
| 252 |
+
|
| 253 |
+
down_block = DownEncoderBlock3D(
|
| 254 |
+
dims=dims,
|
| 255 |
+
in_channels=input_channel,
|
| 256 |
+
out_channels=output_channel,
|
| 257 |
+
num_layers=self.layers_per_block,
|
| 258 |
+
add_downsample=not is_final_block and 2**i >= patch_size,
|
| 259 |
+
resnet_eps=1e-6,
|
| 260 |
+
downsample_padding=0,
|
| 261 |
+
resnet_groups=norm_num_groups,
|
| 262 |
+
norm_layer=norm_layer,
|
| 263 |
+
)
|
| 264 |
+
self.down_blocks.append(down_block)
|
| 265 |
+
|
| 266 |
+
self.mid_block = UNetMidBlock3D(
|
| 267 |
+
dims=dims,
|
| 268 |
+
in_channels=block_out_channels[-1],
|
| 269 |
+
num_layers=self.layers_per_block,
|
| 270 |
+
resnet_eps=1e-6,
|
| 271 |
+
resnet_groups=norm_num_groups,
|
| 272 |
+
norm_layer=norm_layer,
|
| 273 |
+
)
|
| 274 |
+
|
| 275 |
+
# out
|
| 276 |
+
if norm_layer == "group_norm":
|
| 277 |
+
self.conv_norm_out = nn.GroupNorm(
|
| 278 |
+
num_channels=block_out_channels[-1],
|
| 279 |
+
num_groups=norm_num_groups,
|
| 280 |
+
eps=1e-6,
|
| 281 |
+
)
|
| 282 |
+
elif norm_layer == "pixel_norm":
|
| 283 |
+
self.conv_norm_out = PixelNorm()
|
| 284 |
+
self.conv_act = nn.SiLU()
|
| 285 |
+
|
| 286 |
+
conv_out_channels = out_channels
|
| 287 |
+
if latent_log_var == "per_channel":
|
| 288 |
+
conv_out_channels *= 2
|
| 289 |
+
elif latent_log_var == "uniform":
|
| 290 |
+
conv_out_channels += 1
|
| 291 |
+
elif latent_log_var != "none":
|
| 292 |
+
raise ValueError(f"Invalid latent_log_var: {latent_log_var}")
|
| 293 |
+
self.conv_out = make_conv_nd(
|
| 294 |
+
dims, block_out_channels[-1], conv_out_channels, 3, padding=1
|
| 295 |
+
)
|
| 296 |
+
|
| 297 |
+
self.gradient_checkpointing = False
|
| 298 |
+
|
| 299 |
+
@property
|
| 300 |
+
def downscale_factor(self):
|
| 301 |
+
return (
|
| 302 |
+
2
|
| 303 |
+
** len(
|
| 304 |
+
[
|
| 305 |
+
block
|
| 306 |
+
for block in self.down_blocks
|
| 307 |
+
if isinstance(block.downsample, Downsample3D)
|
| 308 |
+
]
|
| 309 |
+
)
|
| 310 |
+
* self.patch_size
|
| 311 |
+
)
|
| 312 |
+
|
| 313 |
+
def forward(
|
| 314 |
+
self, sample: torch.FloatTensor, return_features=False
|
| 315 |
+
) -> torch.FloatTensor:
|
| 316 |
+
r"""The forward method of the `Encoder` class."""
|
| 317 |
+
|
| 318 |
+
downsample_in_time = sample.shape[2] != 1
|
| 319 |
+
|
| 320 |
+
# patchify
|
| 321 |
+
patch_size_t = self.patch_size_t if downsample_in_time else 1
|
| 322 |
+
sample = patchify(
|
| 323 |
+
sample,
|
| 324 |
+
patch_size_hw=self.patch_size,
|
| 325 |
+
patch_size_t=patch_size_t,
|
| 326 |
+
add_channel_padding=self.add_channel_padding,
|
| 327 |
+
)
|
| 328 |
+
|
| 329 |
+
sample = self.conv_in(sample)
|
| 330 |
+
|
| 331 |
+
checkpoint_fn = (
|
| 332 |
+
partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
|
| 333 |
+
if self.gradient_checkpointing and self.training
|
| 334 |
+
else lambda x: x
|
| 335 |
+
)
|
| 336 |
+
|
| 337 |
+
if return_features:
|
| 338 |
+
features = []
|
| 339 |
+
for down_block in self.down_blocks:
|
| 340 |
+
sample = checkpoint_fn(down_block)(
|
| 341 |
+
sample, downsample_in_time=downsample_in_time
|
| 342 |
+
)
|
| 343 |
+
if return_features:
|
| 344 |
+
features.append(sample)
|
| 345 |
+
|
| 346 |
+
sample = checkpoint_fn(self.mid_block)(sample)
|
| 347 |
+
|
| 348 |
+
# post-process
|
| 349 |
+
sample = self.conv_norm_out(sample)
|
| 350 |
+
sample = self.conv_act(sample)
|
| 351 |
+
sample = self.conv_out(sample)
|
| 352 |
+
|
| 353 |
+
if self.latent_log_var == "uniform":
|
| 354 |
+
last_channel = sample[:, -1:, ...]
|
| 355 |
+
num_dims = sample.dim()
|
| 356 |
+
|
| 357 |
+
if num_dims == 4:
|
| 358 |
+
# For shape (B, C, H, W)
|
| 359 |
+
repeated_last_channel = last_channel.repeat(
|
| 360 |
+
1, sample.shape[1] - 2, 1, 1
|
| 361 |
+
)
|
| 362 |
+
sample = torch.cat([sample, repeated_last_channel], dim=1)
|
| 363 |
+
elif num_dims == 5:
|
| 364 |
+
# For shape (B, C, F, H, W)
|
| 365 |
+
repeated_last_channel = last_channel.repeat(
|
| 366 |
+
1, sample.shape[1] - 2, 1, 1, 1
|
| 367 |
+
)
|
| 368 |
+
sample = torch.cat([sample, repeated_last_channel], dim=1)
|
| 369 |
+
else:
|
| 370 |
+
raise ValueError(f"Invalid input shape: {sample.shape}")
|
| 371 |
+
|
| 372 |
+
if return_features:
|
| 373 |
+
features.append(sample[:, : self.latent_channels, ...])
|
| 374 |
+
return sample, features
|
| 375 |
+
return sample
|
| 376 |
+
|
| 377 |
+
|
| 378 |
+
class Decoder(nn.Module):
|
| 379 |
+
r"""
|
| 380 |
+
The `Decoder` layer of a variational autoencoder that decodes its latent representation into an output sample.
|
| 381 |
+
|
| 382 |
+
Args:
|
| 383 |
+
in_channels (`int`, *optional*, defaults to 3):
|
| 384 |
+
The number of input channels.
|
| 385 |
+
out_channels (`int`, *optional*, defaults to 3):
|
| 386 |
+
The number of output channels.
|
| 387 |
+
block_out_channels (`Tuple[int, ...]`, *optional*, defaults to `(64,)`):
|
| 388 |
+
The number of output channels for each block.
|
| 389 |
+
layers_per_block (`int`, *optional*, defaults to 2):
|
| 390 |
+
The number of layers per block.
|
| 391 |
+
norm_num_groups (`int`, *optional*, defaults to 32):
|
| 392 |
+
The number of groups for normalization.
|
| 393 |
+
patch_size (`int`, *optional*, defaults to 1):
|
| 394 |
+
The patch size to use. Should be a power of 2.
|
| 395 |
+
norm_layer (`str`, *optional*, defaults to `group_norm`):
|
| 396 |
+
The normalization layer to use. Can be either `group_norm` or `pixel_norm`.
|
| 397 |
+
"""
|
| 398 |
+
|
| 399 |
+
def __init__(
|
| 400 |
+
self,
|
| 401 |
+
dims,
|
| 402 |
+
in_channels: int = 3,
|
| 403 |
+
out_channels: int = 3,
|
| 404 |
+
block_out_channels: Tuple[int, ...] = (64,),
|
| 405 |
+
layers_per_block: int = 2,
|
| 406 |
+
norm_num_groups: int = 32,
|
| 407 |
+
patch_size: int = 1,
|
| 408 |
+
norm_layer: str = "group_norm",
|
| 409 |
+
patch_size_t: Optional[int] = None,
|
| 410 |
+
add_channel_padding: Optional[bool] = False,
|
| 411 |
+
):
|
| 412 |
+
super().__init__()
|
| 413 |
+
self.patch_size = patch_size
|
| 414 |
+
self.patch_size_t = patch_size_t if patch_size_t is not None else patch_size
|
| 415 |
+
self.add_channel_padding = add_channel_padding
|
| 416 |
+
self.layers_per_block = layers_per_block
|
| 417 |
+
if add_channel_padding:
|
| 418 |
+
out_channels = out_channels * self.patch_size**3
|
| 419 |
+
else:
|
| 420 |
+
out_channels = out_channels * self.patch_size_t * self.patch_size**2
|
| 421 |
+
self.out_channels = out_channels
|
| 422 |
+
|
| 423 |
+
self.conv_in = make_conv_nd(
|
| 424 |
+
dims,
|
| 425 |
+
in_channels,
|
| 426 |
+
block_out_channels[-1],
|
| 427 |
+
kernel_size=3,
|
| 428 |
+
stride=1,
|
| 429 |
+
padding=1,
|
| 430 |
+
)
|
| 431 |
+
|
| 432 |
+
self.mid_block = None
|
| 433 |
+
self.up_blocks = nn.ModuleList([])
|
| 434 |
+
|
| 435 |
+
self.mid_block = UNetMidBlock3D(
|
| 436 |
+
dims=dims,
|
| 437 |
+
in_channels=block_out_channels[-1],
|
| 438 |
+
num_layers=self.layers_per_block,
|
| 439 |
+
resnet_eps=1e-6,
|
| 440 |
+
resnet_groups=norm_num_groups,
|
| 441 |
+
norm_layer=norm_layer,
|
| 442 |
+
)
|
| 443 |
+
|
| 444 |
+
reversed_block_out_channels = list(reversed(block_out_channels))
|
| 445 |
+
output_channel = reversed_block_out_channels[0]
|
| 446 |
+
for i in range(len(reversed_block_out_channels)):
|
| 447 |
+
prev_output_channel = output_channel
|
| 448 |
+
output_channel = reversed_block_out_channels[i]
|
| 449 |
+
|
| 450 |
+
is_final_block = i == len(block_out_channels) - 1
|
| 451 |
+
|
| 452 |
+
up_block = UpDecoderBlock3D(
|
| 453 |
+
dims=dims,
|
| 454 |
+
num_layers=self.layers_per_block + 1,
|
| 455 |
+
in_channels=prev_output_channel,
|
| 456 |
+
out_channels=output_channel,
|
| 457 |
+
add_upsample=not is_final_block
|
| 458 |
+
and 2 ** (len(block_out_channels) - i - 1) > patch_size,
|
| 459 |
+
resnet_eps=1e-6,
|
| 460 |
+
resnet_groups=norm_num_groups,
|
| 461 |
+
norm_layer=norm_layer,
|
| 462 |
+
)
|
| 463 |
+
self.up_blocks.append(up_block)
|
| 464 |
+
|
| 465 |
+
if norm_layer == "group_norm":
|
| 466 |
+
self.conv_norm_out = nn.GroupNorm(
|
| 467 |
+
num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=1e-6
|
| 468 |
+
)
|
| 469 |
+
elif norm_layer == "pixel_norm":
|
| 470 |
+
self.conv_norm_out = PixelNorm()
|
| 471 |
+
|
| 472 |
+
self.conv_act = nn.SiLU()
|
| 473 |
+
self.conv_out = make_conv_nd(
|
| 474 |
+
dims, block_out_channels[0], out_channels, 3, padding=1
|
| 475 |
+
)
|
| 476 |
+
|
| 477 |
+
self.gradient_checkpointing = False
|
| 478 |
+
|
| 479 |
+
def forward(self, sample: torch.FloatTensor, target_shape) -> torch.FloatTensor:
|
| 480 |
+
r"""The forward method of the `Decoder` class."""
|
| 481 |
+
assert target_shape is not None, "target_shape must be provided"
|
| 482 |
+
upsample_in_time = sample.shape[2] < target_shape[2]
|
| 483 |
+
|
| 484 |
+
sample = self.conv_in(sample)
|
| 485 |
+
|
| 486 |
+
upscale_dtype = next(iter(self.up_blocks.parameters())).dtype
|
| 487 |
+
|
| 488 |
+
checkpoint_fn = (
|
| 489 |
+
partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
|
| 490 |
+
if self.gradient_checkpointing and self.training
|
| 491 |
+
else lambda x: x
|
| 492 |
+
)
|
| 493 |
+
|
| 494 |
+
sample = checkpoint_fn(self.mid_block)(sample)
|
| 495 |
+
sample = sample.to(upscale_dtype)
|
| 496 |
+
|
| 497 |
+
for up_block in self.up_blocks:
|
| 498 |
+
sample = checkpoint_fn(up_block)(sample, upsample_in_time=upsample_in_time)
|
| 499 |
+
|
| 500 |
+
# post-process
|
| 501 |
+
sample = self.conv_norm_out(sample)
|
| 502 |
+
sample = self.conv_act(sample)
|
| 503 |
+
sample = self.conv_out(sample)
|
| 504 |
+
|
| 505 |
+
# un-patchify
|
| 506 |
+
patch_size_t = self.patch_size_t if upsample_in_time else 1
|
| 507 |
+
sample = unpatchify(
|
| 508 |
+
sample,
|
| 509 |
+
patch_size_hw=self.patch_size,
|
| 510 |
+
patch_size_t=patch_size_t,
|
| 511 |
+
add_channel_padding=self.add_channel_padding,
|
| 512 |
+
)
|
| 513 |
+
|
| 514 |
+
return sample
|
| 515 |
+
|
| 516 |
+
|
| 517 |
+
class DownEncoderBlock3D(nn.Module):
|
| 518 |
+
def __init__(
|
| 519 |
+
self,
|
| 520 |
+
dims: Union[int, Tuple[int, int]],
|
| 521 |
+
in_channels: int,
|
| 522 |
+
out_channels: int,
|
| 523 |
+
dropout: float = 0.0,
|
| 524 |
+
num_layers: int = 1,
|
| 525 |
+
resnet_eps: float = 1e-6,
|
| 526 |
+
resnet_groups: int = 32,
|
| 527 |
+
add_downsample: bool = True,
|
| 528 |
+
downsample_padding: int = 1,
|
| 529 |
+
norm_layer: str = "group_norm",
|
| 530 |
+
):
|
| 531 |
+
super().__init__()
|
| 532 |
+
res_blocks = []
|
| 533 |
+
|
| 534 |
+
for i in range(num_layers):
|
| 535 |
+
in_channels = in_channels if i == 0 else out_channels
|
| 536 |
+
res_blocks.append(
|
| 537 |
+
ResnetBlock3D(
|
| 538 |
+
dims=dims,
|
| 539 |
+
in_channels=in_channels,
|
| 540 |
+
out_channels=out_channels,
|
| 541 |
+
eps=resnet_eps,
|
| 542 |
+
groups=resnet_groups,
|
| 543 |
+
dropout=dropout,
|
| 544 |
+
norm_layer=norm_layer,
|
| 545 |
+
)
|
| 546 |
+
)
|
| 547 |
+
|
| 548 |
+
self.res_blocks = nn.ModuleList(res_blocks)
|
| 549 |
+
|
| 550 |
+
if add_downsample:
|
| 551 |
+
self.downsample = Downsample3D(
|
| 552 |
+
dims,
|
| 553 |
+
out_channels,
|
| 554 |
+
out_channels=out_channels,
|
| 555 |
+
padding=downsample_padding,
|
| 556 |
+
)
|
| 557 |
+
else:
|
| 558 |
+
self.downsample = Identity()
|
| 559 |
+
|
| 560 |
+
def forward(
|
| 561 |
+
self, hidden_states: torch.FloatTensor, downsample_in_time
|
| 562 |
+
) -> torch.FloatTensor:
|
| 563 |
+
for resnet in self.res_blocks:
|
| 564 |
+
hidden_states = resnet(hidden_states)
|
| 565 |
+
|
| 566 |
+
hidden_states = self.downsample(
|
| 567 |
+
hidden_states, downsample_in_time=downsample_in_time
|
| 568 |
+
)
|
| 569 |
+
|
| 570 |
+
return hidden_states
|
| 571 |
+
|
| 572 |
+
|
| 573 |
+
class UNetMidBlock3D(nn.Module):
|
| 574 |
+
"""
|
| 575 |
+
A 3D UNet mid-block [`UNetMidBlock3D`] with multiple residual blocks.
|
| 576 |
+
|
| 577 |
+
Args:
|
| 578 |
+
in_channels (`int`): The number of input channels.
|
| 579 |
+
dropout (`float`, *optional*, defaults to 0.0): The dropout rate.
|
| 580 |
+
num_layers (`int`, *optional*, defaults to 1): The number of residual blocks.
|
| 581 |
+
resnet_eps (`float`, *optional*, 1e-6 ): The epsilon value for the resnet blocks.
|
| 582 |
+
resnet_groups (`int`, *optional*, defaults to 32):
|
| 583 |
+
The number of groups to use in the group normalization layers of the resnet blocks.
|
| 584 |
+
|
| 585 |
+
Returns:
|
| 586 |
+
`torch.FloatTensor`: The output of the last residual block, which is a tensor of shape `(batch_size,
|
| 587 |
+
in_channels, height, width)`.
|
| 588 |
+
|
| 589 |
+
"""
|
| 590 |
+
|
| 591 |
+
def __init__(
|
| 592 |
+
self,
|
| 593 |
+
dims: Union[int, Tuple[int, int]],
|
| 594 |
+
in_channels: int,
|
| 595 |
+
dropout: float = 0.0,
|
| 596 |
+
num_layers: int = 1,
|
| 597 |
+
resnet_eps: float = 1e-6,
|
| 598 |
+
resnet_groups: int = 32,
|
| 599 |
+
norm_layer: str = "group_norm",
|
| 600 |
+
):
|
| 601 |
+
super().__init__()
|
| 602 |
+
resnet_groups = (
|
| 603 |
+
resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
|
| 604 |
+
)
|
| 605 |
+
|
| 606 |
+
self.res_blocks = nn.ModuleList(
|
| 607 |
+
[
|
| 608 |
+
ResnetBlock3D(
|
| 609 |
+
dims=dims,
|
| 610 |
+
in_channels=in_channels,
|
| 611 |
+
out_channels=in_channels,
|
| 612 |
+
eps=resnet_eps,
|
| 613 |
+
groups=resnet_groups,
|
| 614 |
+
dropout=dropout,
|
| 615 |
+
norm_layer=norm_layer,
|
| 616 |
+
)
|
| 617 |
+
for _ in range(num_layers)
|
| 618 |
+
]
|
| 619 |
+
)
|
| 620 |
+
|
| 621 |
+
def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
|
| 622 |
+
for resnet in self.res_blocks:
|
| 623 |
+
hidden_states = resnet(hidden_states)
|
| 624 |
+
|
| 625 |
+
return hidden_states
|
| 626 |
+
|
| 627 |
+
|
| 628 |
+
class UpDecoderBlock3D(nn.Module):
|
| 629 |
+
def __init__(
|
| 630 |
+
self,
|
| 631 |
+
dims: Union[int, Tuple[int, int]],
|
| 632 |
+
in_channels: int,
|
| 633 |
+
out_channels: int,
|
| 634 |
+
resolution_idx: Optional[int] = None,
|
| 635 |
+
dropout: float = 0.0,
|
| 636 |
+
num_layers: int = 1,
|
| 637 |
+
resnet_eps: float = 1e-6,
|
| 638 |
+
resnet_groups: int = 32,
|
| 639 |
+
add_upsample: bool = True,
|
| 640 |
+
norm_layer: str = "group_norm",
|
| 641 |
+
):
|
| 642 |
+
super().__init__()
|
| 643 |
+
res_blocks = []
|
| 644 |
+
|
| 645 |
+
for i in range(num_layers):
|
| 646 |
+
input_channels = in_channels if i == 0 else out_channels
|
| 647 |
+
|
| 648 |
+
res_blocks.append(
|
| 649 |
+
ResnetBlock3D(
|
| 650 |
+
dims=dims,
|
| 651 |
+
in_channels=input_channels,
|
| 652 |
+
out_channels=out_channels,
|
| 653 |
+
eps=resnet_eps,
|
| 654 |
+
groups=resnet_groups,
|
| 655 |
+
dropout=dropout,
|
| 656 |
+
norm_layer=norm_layer,
|
| 657 |
+
)
|
| 658 |
+
)
|
| 659 |
+
|
| 660 |
+
self.res_blocks = nn.ModuleList(res_blocks)
|
| 661 |
+
|
| 662 |
+
if add_upsample:
|
| 663 |
+
self.upsample = Upsample3D(
|
| 664 |
+
dims=dims, channels=out_channels, out_channels=out_channels
|
| 665 |
+
)
|
| 666 |
+
else:
|
| 667 |
+
self.upsample = Identity()
|
| 668 |
+
|
| 669 |
+
self.resolution_idx = resolution_idx
|
| 670 |
+
|
| 671 |
+
def forward(
|
| 672 |
+
self, hidden_states: torch.FloatTensor, upsample_in_time=True
|
| 673 |
+
) -> torch.FloatTensor:
|
| 674 |
+
for resnet in self.res_blocks:
|
| 675 |
+
hidden_states = resnet(hidden_states)
|
| 676 |
+
|
| 677 |
+
hidden_states = self.upsample(hidden_states, upsample_in_time=upsample_in_time)
|
| 678 |
+
|
| 679 |
+
return hidden_states
|
| 680 |
+
|
| 681 |
+
|
| 682 |
+
class ResnetBlock3D(nn.Module):
|
| 683 |
+
r"""
|
| 684 |
+
A Resnet block.
|
| 685 |
+
|
| 686 |
+
Parameters:
|
| 687 |
+
in_channels (`int`): The number of channels in the input.
|
| 688 |
+
out_channels (`int`, *optional*, default to be `None`):
|
| 689 |
+
The number of output channels for the first conv layer. If None, same as `in_channels`.
|
| 690 |
+
dropout (`float`, *optional*, defaults to `0.0`): The dropout probability to use.
|
| 691 |
+
groups (`int`, *optional*, default to `32`): The number of groups to use for the first normalization layer.
|
| 692 |
+
eps (`float`, *optional*, defaults to `1e-6`): The epsilon to use for the normalization.
|
| 693 |
+
"""
|
| 694 |
+
|
| 695 |
+
def __init__(
|
| 696 |
+
self,
|
| 697 |
+
dims: Union[int, Tuple[int, int]],
|
| 698 |
+
in_channels: int,
|
| 699 |
+
out_channels: Optional[int] = None,
|
| 700 |
+
conv_shortcut: bool = False,
|
| 701 |
+
dropout: float = 0.0,
|
| 702 |
+
groups: int = 32,
|
| 703 |
+
eps: float = 1e-6,
|
| 704 |
+
norm_layer: str = "group_norm",
|
| 705 |
+
):
|
| 706 |
+
super().__init__()
|
| 707 |
+
self.in_channels = in_channels
|
| 708 |
+
out_channels = in_channels if out_channels is None else out_channels
|
| 709 |
+
self.out_channels = out_channels
|
| 710 |
+
self.use_conv_shortcut = conv_shortcut
|
| 711 |
+
|
| 712 |
+
if norm_layer == "group_norm":
|
| 713 |
+
self.norm1 = torch.nn.GroupNorm(
|
| 714 |
+
num_groups=groups, num_channels=in_channels, eps=eps, affine=True
|
| 715 |
+
)
|
| 716 |
+
elif norm_layer == "pixel_norm":
|
| 717 |
+
self.norm1 = PixelNorm()
|
| 718 |
+
|
| 719 |
+
self.non_linearity = nn.SiLU()
|
| 720 |
+
|
| 721 |
+
self.conv1 = make_conv_nd(
|
| 722 |
+
dims, in_channels, out_channels, kernel_size=3, stride=1, padding=1
|
| 723 |
+
)
|
| 724 |
+
|
| 725 |
+
if norm_layer == "group_norm":
|
| 726 |
+
self.norm2 = torch.nn.GroupNorm(
|
| 727 |
+
num_groups=groups, num_channels=out_channels, eps=eps, affine=True
|
| 728 |
+
)
|
| 729 |
+
elif norm_layer == "pixel_norm":
|
| 730 |
+
self.norm2 = PixelNorm()
|
| 731 |
+
|
| 732 |
+
self.dropout = torch.nn.Dropout(dropout)
|
| 733 |
+
|
| 734 |
+
self.conv2 = make_conv_nd(
|
| 735 |
+
dims, out_channels, out_channels, kernel_size=3, stride=1, padding=1
|
| 736 |
+
)
|
| 737 |
+
|
| 738 |
+
self.conv_shortcut = (
|
| 739 |
+
make_linear_nd(
|
| 740 |
+
dims=dims, in_channels=in_channels, out_channels=out_channels
|
| 741 |
+
)
|
| 742 |
+
if in_channels != out_channels
|
| 743 |
+
else nn.Identity()
|
| 744 |
+
)
|
| 745 |
+
|
| 746 |
+
def forward(
|
| 747 |
+
self,
|
| 748 |
+
input_tensor: torch.FloatTensor,
|
| 749 |
+
) -> torch.FloatTensor:
|
| 750 |
+
hidden_states = input_tensor
|
| 751 |
+
|
| 752 |
+
hidden_states = self.norm1(hidden_states)
|
| 753 |
+
|
| 754 |
+
hidden_states = self.non_linearity(hidden_states)
|
| 755 |
+
|
| 756 |
+
hidden_states = self.conv1(hidden_states)
|
| 757 |
+
|
| 758 |
+
hidden_states = self.norm2(hidden_states)
|
| 759 |
+
|
| 760 |
+
hidden_states = self.non_linearity(hidden_states)
|
| 761 |
+
|
| 762 |
+
hidden_states = self.dropout(hidden_states)
|
| 763 |
+
|
| 764 |
+
hidden_states = self.conv2(hidden_states)
|
| 765 |
+
|
| 766 |
+
input_tensor = self.conv_shortcut(input_tensor)
|
| 767 |
+
|
| 768 |
+
output_tensor = input_tensor + hidden_states
|
| 769 |
+
|
| 770 |
+
return output_tensor
|
| 771 |
+
|
| 772 |
+
|
| 773 |
+
class Downsample3D(nn.Module):
|
| 774 |
+
def __init__(
|
| 775 |
+
self,
|
| 776 |
+
dims,
|
| 777 |
+
in_channels: int,
|
| 778 |
+
out_channels: int,
|
| 779 |
+
kernel_size: int = 3,
|
| 780 |
+
padding: int = 1,
|
| 781 |
+
):
|
| 782 |
+
super().__init__()
|
| 783 |
+
stride: int = 2
|
| 784 |
+
self.padding = padding
|
| 785 |
+
self.in_channels = in_channels
|
| 786 |
+
self.dims = dims
|
| 787 |
+
self.conv = make_conv_nd(
|
| 788 |
+
dims=dims,
|
| 789 |
+
in_channels=in_channels,
|
| 790 |
+
out_channels=out_channels,
|
| 791 |
+
kernel_size=kernel_size,
|
| 792 |
+
stride=stride,
|
| 793 |
+
padding=padding,
|
| 794 |
+
)
|
| 795 |
+
|
| 796 |
+
def forward(self, x, downsample_in_time=True):
|
| 797 |
+
conv = self.conv
|
| 798 |
+
if self.padding == 0:
|
| 799 |
+
if self.dims == 2:
|
| 800 |
+
padding = (0, 1, 0, 1)
|
| 801 |
+
else:
|
| 802 |
+
padding = (0, 1, 0, 1, 0, 1 if downsample_in_time else 0)
|
| 803 |
+
|
| 804 |
+
x = functional.pad(x, padding, mode="constant", value=0)
|
| 805 |
+
|
| 806 |
+
if self.dims == (2, 1) and not downsample_in_time:
|
| 807 |
+
return conv(x, skip_time_conv=True)
|
| 808 |
+
|
| 809 |
+
return conv(x)
|
| 810 |
+
|
| 811 |
+
|
| 812 |
+
class Upsample3D(nn.Module):
|
| 813 |
+
"""
|
| 814 |
+
An upsampling layer for 3D tensors of shape (B, C, D, H, W).
|
| 815 |
+
|
| 816 |
+
:param channels: channels in the inputs and outputs.
|
| 817 |
+
"""
|
| 818 |
+
|
| 819 |
+
def __init__(self, dims, channels, out_channels=None):
|
| 820 |
+
super().__init__()
|
| 821 |
+
self.dims = dims
|
| 822 |
+
self.channels = channels
|
| 823 |
+
self.out_channels = out_channels or channels
|
| 824 |
+
self.conv = make_conv_nd(
|
| 825 |
+
dims, channels, out_channels, kernel_size=3, padding=1, bias=True
|
| 826 |
+
)
|
| 827 |
+
|
| 828 |
+
def forward(self, x, upsample_in_time):
|
| 829 |
+
if self.dims == 2:
|
| 830 |
+
x = functional.interpolate(
|
| 831 |
+
x, (x.shape[2] * 2, x.shape[3] * 2), mode="nearest"
|
| 832 |
+
)
|
| 833 |
+
else:
|
| 834 |
+
time_scale_factor = 2 if upsample_in_time else 1
|
| 835 |
+
# print("before:", x.shape)
|
| 836 |
+
b, c, d, h, w = x.shape
|
| 837 |
+
x = rearrange(x, "b c d h w -> (b d) c h w")
|
| 838 |
+
# height and width interpolate
|
| 839 |
+
x = functional.interpolate(
|
| 840 |
+
x, (x.shape[2] * 2, x.shape[3] * 2), mode="nearest"
|
| 841 |
+
)
|
| 842 |
+
_, _, h, w = x.shape
|
| 843 |
+
|
| 844 |
+
if not upsample_in_time and self.dims == (2, 1):
|
| 845 |
+
x = rearrange(x, "(b d) c h w -> b c d h w ", b=b, h=h, w=w)
|
| 846 |
+
return self.conv(x, skip_time_conv=True)
|
| 847 |
+
|
| 848 |
+
# Second ** upsampling ** which is essentially treated as a 1D convolution across the 'd' dimension
|
| 849 |
+
x = rearrange(x, "(b d) c h w -> (b h w) c 1 d", b=b)
|
| 850 |
+
|
| 851 |
+
# (b h w) c 1 d
|
| 852 |
+
new_d = x.shape[-1] * time_scale_factor
|
| 853 |
+
x = functional.interpolate(x, (1, new_d), mode="nearest")
|
| 854 |
+
# (b h w) c 1 new_d
|
| 855 |
+
x = rearrange(
|
| 856 |
+
x, "(b h w) c 1 new_d -> b c new_d h w", b=b, h=h, w=w, new_d=new_d
|
| 857 |
+
)
|
| 858 |
+
# b c d h w
|
| 859 |
+
|
| 860 |
+
# x = functional.interpolate(
|
| 861 |
+
# x, (x.shape[2] * time_scale_factor, x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
|
| 862 |
+
# )
|
| 863 |
+
# print("after:", x.shape)
|
| 864 |
+
|
| 865 |
+
return self.conv(x)
|
| 866 |
+
|
| 867 |
+
|
| 868 |
+
def patchify(x, patch_size_hw, patch_size_t=1, add_channel_padding=False):
|
| 869 |
+
if patch_size_hw == 1 and patch_size_t == 1:
|
| 870 |
+
return x
|
| 871 |
+
if x.dim() == 4:
|
| 872 |
+
x = rearrange(
|
| 873 |
+
x, "b c (h q) (w r) -> b (c r q) h w", q=patch_size_hw, r=patch_size_hw
|
| 874 |
+
)
|
| 875 |
+
elif x.dim() == 5:
|
| 876 |
+
x = rearrange(
|
| 877 |
+
x,
|
| 878 |
+
"b c (f p) (h q) (w r) -> b (c p r q) f h w",
|
| 879 |
+
p=patch_size_t,
|
| 880 |
+
q=patch_size_hw,
|
| 881 |
+
r=patch_size_hw,
|
| 882 |
+
)
|
| 883 |
+
else:
|
| 884 |
+
raise ValueError(f"Invalid input shape: {x.shape}")
|
| 885 |
+
|
| 886 |
+
if (
|
| 887 |
+
(x.dim() == 5)
|
| 888 |
+
and (patch_size_hw > patch_size_t)
|
| 889 |
+
and (patch_size_t > 1 or add_channel_padding)
|
| 890 |
+
):
|
| 891 |
+
channels_to_pad = x.shape[1] * (patch_size_hw // patch_size_t) - x.shape[1]
|
| 892 |
+
padding_zeros = torch.zeros(
|
| 893 |
+
x.shape[0],
|
| 894 |
+
channels_to_pad,
|
| 895 |
+
x.shape[2],
|
| 896 |
+
x.shape[3],
|
| 897 |
+
x.shape[4],
|
| 898 |
+
device=x.device,
|
| 899 |
+
dtype=x.dtype,
|
| 900 |
+
)
|
| 901 |
+
x = torch.cat([padding_zeros, x], dim=1)
|
| 902 |
+
|
| 903 |
+
return x
|
| 904 |
+
|
| 905 |
+
|
| 906 |
+
def unpatchify(x, patch_size_hw, patch_size_t=1, add_channel_padding=False):
|
| 907 |
+
if patch_size_hw == 1 and patch_size_t == 1:
|
| 908 |
+
return x
|
| 909 |
+
|
| 910 |
+
if (
|
| 911 |
+
(x.dim() == 5)
|
| 912 |
+
and (patch_size_hw > patch_size_t)
|
| 913 |
+
and (patch_size_t > 1 or add_channel_padding)
|
| 914 |
+
):
|
| 915 |
+
channels_to_keep = int(x.shape[1] * (patch_size_t / patch_size_hw))
|
| 916 |
+
x = x[:, :channels_to_keep, :, :, :]
|
| 917 |
+
|
| 918 |
+
if x.dim() == 4:
|
| 919 |
+
x = rearrange(
|
| 920 |
+
x, "b (c r q) h w -> b c (h q) (w r)", q=patch_size_hw, r=patch_size_hw
|
| 921 |
+
)
|
| 922 |
+
elif x.dim() == 5:
|
| 923 |
+
x = rearrange(
|
| 924 |
+
x,
|
| 925 |
+
"b (c p r q) f h w -> b c (f p) (h q) (w r)",
|
| 926 |
+
p=patch_size_t,
|
| 927 |
+
q=patch_size_hw,
|
| 928 |
+
r=patch_size_hw,
|
| 929 |
+
)
|
| 930 |
+
|
| 931 |
+
return x
|
| 932 |
+
|
| 933 |
+
|
| 934 |
+
def create_video_autoencoder_config(
|
| 935 |
+
latent_channels: int = 4,
|
| 936 |
+
):
|
| 937 |
+
config = {
|
| 938 |
+
"_class_name": "VideoAutoencoder",
|
| 939 |
+
"dims": (
|
| 940 |
+
2,
|
| 941 |
+
1,
|
| 942 |
+
), # 2 for Conv2, 3 for Conv3d, (2, 1) for Conv2d followed by Conv1d
|
| 943 |
+
"in_channels": 3, # Number of input color channels (e.g., RGB)
|
| 944 |
+
"out_channels": 3, # Number of output color channels
|
| 945 |
+
"latent_channels": latent_channels, # Number of channels in the latent space representation
|
| 946 |
+
"block_out_channels": [
|
| 947 |
+
128,
|
| 948 |
+
256,
|
| 949 |
+
512,
|
| 950 |
+
512,
|
| 951 |
+
], # Number of output channels of each encoder / decoder inner block
|
| 952 |
+
"patch_size": 1,
|
| 953 |
+
}
|
| 954 |
+
|
| 955 |
+
return config
|
| 956 |
+
|
| 957 |
+
|
| 958 |
+
def create_video_autoencoder_pathify4x4x4_config(
|
| 959 |
+
latent_channels: int = 4,
|
| 960 |
+
):
|
| 961 |
+
config = {
|
| 962 |
+
"_class_name": "VideoAutoencoder",
|
| 963 |
+
"dims": (
|
| 964 |
+
2,
|
| 965 |
+
1,
|
| 966 |
+
), # 2 for Conv2, 3 for Conv3d, (2, 1) for Conv2d followed by Conv1d
|
| 967 |
+
"in_channels": 3, # Number of input color channels (e.g., RGB)
|
| 968 |
+
"out_channels": 3, # Number of output color channels
|
| 969 |
+
"latent_channels": latent_channels, # Number of channels in the latent space representation
|
| 970 |
+
"block_out_channels": [512]
|
| 971 |
+
* 4, # Number of output channels of each encoder / decoder inner block
|
| 972 |
+
"patch_size": 4,
|
| 973 |
+
"latent_log_var": "uniform",
|
| 974 |
+
}
|
| 975 |
+
|
| 976 |
+
return config
|
| 977 |
+
|
| 978 |
+
|
| 979 |
+
def create_video_autoencoder_pathify4x4_config(
|
| 980 |
+
latent_channels: int = 4,
|
| 981 |
+
):
|
| 982 |
+
config = {
|
| 983 |
+
"_class_name": "VideoAutoencoder",
|
| 984 |
+
"dims": 2, # 2 for Conv2, 3 for Conv3d, (2, 1) for Conv2d followed by Conv1d
|
| 985 |
+
"in_channels": 3, # Number of input color channels (e.g., RGB)
|
| 986 |
+
"out_channels": 3, # Number of output color channels
|
| 987 |
+
"latent_channels": latent_channels, # Number of channels in the latent space representation
|
| 988 |
+
"block_out_channels": [512]
|
| 989 |
+
* 4, # Number of output channels of each encoder / decoder inner block
|
| 990 |
+
"patch_size": 4,
|
| 991 |
+
"norm_layer": "pixel_norm",
|
| 992 |
+
}
|
| 993 |
+
|
| 994 |
+
return config
|
| 995 |
+
|
| 996 |
+
|
| 997 |
+
def test_vae_patchify_unpatchify():
|
| 998 |
+
import torch
|
| 999 |
+
|
| 1000 |
+
x = torch.randn(2, 3, 8, 64, 64)
|
| 1001 |
+
x_patched = patchify(x, patch_size_hw=4, patch_size_t=4)
|
| 1002 |
+
x_unpatched = unpatchify(x_patched, patch_size_hw=4, patch_size_t=4)
|
| 1003 |
+
assert torch.allclose(x, x_unpatched)
|
| 1004 |
+
|
| 1005 |
+
|
| 1006 |
+
def demo_video_autoencoder_forward_backward():
|
| 1007 |
+
# Configuration for the VideoAutoencoder
|
| 1008 |
+
config = create_video_autoencoder_pathify4x4x4_config()
|
| 1009 |
+
|
| 1010 |
+
# Instantiate the VideoAutoencoder with the specified configuration
|
| 1011 |
+
video_autoencoder = VideoAutoencoder.from_config(config)
|
| 1012 |
+
|
| 1013 |
+
print(video_autoencoder)
|
| 1014 |
+
|
| 1015 |
+
# Print the total number of parameters in the video autoencoder
|
| 1016 |
+
total_params = sum(p.numel() for p in video_autoencoder.parameters())
|
| 1017 |
+
print(f"Total number of parameters in VideoAutoencoder: {total_params:,}")
|
| 1018 |
+
|
| 1019 |
+
# Create a mock input tensor simulating a batch of videos
|
| 1020 |
+
# Shape: (batch_size, channels, depth, height, width)
|
| 1021 |
+
# E.g., 4 videos, each with 3 color channels, 16 frames, and 64x64 pixels per frame
|
| 1022 |
+
input_videos = torch.randn(2, 3, 8, 64, 64)
|
| 1023 |
+
|
| 1024 |
+
# Forward pass: encode and decode the input videos
|
| 1025 |
+
latent = video_autoencoder.encode(input_videos).latent_dist.mode()
|
| 1026 |
+
print(f"input shape={input_videos.shape}")
|
| 1027 |
+
print(f"latent shape={latent.shape}")
|
| 1028 |
+
reconstructed_videos = video_autoencoder.decode(
|
| 1029 |
+
latent, target_shape=input_videos.shape
|
| 1030 |
+
).sample
|
| 1031 |
+
|
| 1032 |
+
print(f"reconstructed shape={reconstructed_videos.shape}")
|
| 1033 |
+
|
| 1034 |
+
# Calculate the loss (e.g., mean squared error)
|
| 1035 |
+
loss = torch.nn.functional.mse_loss(input_videos, reconstructed_videos)
|
| 1036 |
+
|
| 1037 |
+
# Perform backward pass
|
| 1038 |
+
loss.backward()
|
| 1039 |
+
|
| 1040 |
+
print(f"Demo completed with loss: {loss.item()}")
|
| 1041 |
+
|
| 1042 |
+
|
| 1043 |
+
# Ensure to call the demo function to execute the forward and backward pass
|
| 1044 |
+
if __name__ == "__main__":
|
| 1045 |
+
demo_video_autoencoder_forward_backward()
|
ltx_video/models/transformers/__init__.py
ADDED
|
File without changes
|
ltx_video/models/transformers/attention.py
ADDED
|
@@ -0,0 +1,1264 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import inspect
|
| 2 |
+
from importlib import import_module
|
| 3 |
+
from typing import Any, Dict, Optional, Tuple
|
| 4 |
+
|
| 5 |
+
import torch
|
| 6 |
+
import torch.nn.functional as F
|
| 7 |
+
from diffusers.models.activations import GEGLU, GELU, ApproximateGELU
|
| 8 |
+
from diffusers.models.attention import _chunked_feed_forward
|
| 9 |
+
from diffusers.models.attention_processor import (
|
| 10 |
+
LoRAAttnAddedKVProcessor,
|
| 11 |
+
LoRAAttnProcessor,
|
| 12 |
+
LoRAAttnProcessor2_0,
|
| 13 |
+
LoRAXFormersAttnProcessor,
|
| 14 |
+
SpatialNorm,
|
| 15 |
+
)
|
| 16 |
+
from diffusers.models.lora import LoRACompatibleLinear
|
| 17 |
+
from diffusers.models.normalization import RMSNorm
|
| 18 |
+
from diffusers.utils import deprecate, logging
|
| 19 |
+
from diffusers.utils.torch_utils import maybe_allow_in_graph
|
| 20 |
+
from einops import rearrange
|
| 21 |
+
from torch import nn
|
| 22 |
+
|
| 23 |
+
from ltx_video.utils.skip_layer_strategy import SkipLayerStrategy
|
| 24 |
+
|
| 25 |
+
try:
|
| 26 |
+
from torch_xla.experimental.custom_kernel import flash_attention
|
| 27 |
+
except ImportError:
|
| 28 |
+
# workaround for automatic tests. Currently this function is manually patched
|
| 29 |
+
# to the torch_xla lib on setup of container
|
| 30 |
+
pass
|
| 31 |
+
|
| 32 |
+
# code adapted from https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py
|
| 33 |
+
|
| 34 |
+
logger = logging.get_logger(__name__)
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
@maybe_allow_in_graph
|
| 38 |
+
class BasicTransformerBlock(nn.Module):
|
| 39 |
+
r"""
|
| 40 |
+
A basic Transformer block.
|
| 41 |
+
|
| 42 |
+
Parameters:
|
| 43 |
+
dim (`int`): The number of channels in the input and output.
|
| 44 |
+
num_attention_heads (`int`): The number of heads to use for multi-head attention.
|
| 45 |
+
attention_head_dim (`int`): The number of channels in each head.
|
| 46 |
+
dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
|
| 47 |
+
cross_attention_dim (`int`, *optional*): The size of the encoder_hidden_states vector for cross attention.
|
| 48 |
+
activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
|
| 49 |
+
num_embeds_ada_norm (:
|
| 50 |
+
obj: `int`, *optional*): The number of diffusion steps used during training. See `Transformer2DModel`.
|
| 51 |
+
attention_bias (:
|
| 52 |
+
obj: `bool`, *optional*, defaults to `False`): Configure if the attentions should contain a bias parameter.
|
| 53 |
+
only_cross_attention (`bool`, *optional*):
|
| 54 |
+
Whether to use only cross-attention layers. In this case two cross attention layers are used.
|
| 55 |
+
double_self_attention (`bool`, *optional*):
|
| 56 |
+
Whether to use two self-attention layers. In this case no cross attention layers are used.
|
| 57 |
+
upcast_attention (`bool`, *optional*):
|
| 58 |
+
Whether to upcast the attention computation to float32. This is useful for mixed precision training.
|
| 59 |
+
norm_elementwise_affine (`bool`, *optional*, defaults to `True`):
|
| 60 |
+
Whether to use learnable elementwise affine parameters for normalization.
|
| 61 |
+
qk_norm (`str`, *optional*, defaults to None):
|
| 62 |
+
Set to 'layer_norm' or `rms_norm` to perform query and key normalization.
|
| 63 |
+
adaptive_norm (`str`, *optional*, defaults to `"single_scale_shift"`):
|
| 64 |
+
The type of adaptive norm to use. Can be `"single_scale_shift"`, `"single_scale"` or "none".
|
| 65 |
+
standardization_norm (`str`, *optional*, defaults to `"layer_norm"`):
|
| 66 |
+
The type of pre-normalization to use. Can be `"layer_norm"` or `"rms_norm"`.
|
| 67 |
+
final_dropout (`bool` *optional*, defaults to False):
|
| 68 |
+
Whether to apply a final dropout after the last feed-forward layer.
|
| 69 |
+
attention_type (`str`, *optional*, defaults to `"default"`):
|
| 70 |
+
The type of attention to use. Can be `"default"` or `"gated"` or `"gated-text-image"`.
|
| 71 |
+
positional_embeddings (`str`, *optional*, defaults to `None`):
|
| 72 |
+
The type of positional embeddings to apply to.
|
| 73 |
+
num_positional_embeddings (`int`, *optional*, defaults to `None`):
|
| 74 |
+
The maximum number of positional embeddings to apply.
|
| 75 |
+
"""
|
| 76 |
+
|
| 77 |
+
def __init__(
|
| 78 |
+
self,
|
| 79 |
+
dim: int,
|
| 80 |
+
num_attention_heads: int,
|
| 81 |
+
attention_head_dim: int,
|
| 82 |
+
dropout=0.0,
|
| 83 |
+
cross_attention_dim: Optional[int] = None,
|
| 84 |
+
activation_fn: str = "geglu",
|
| 85 |
+
num_embeds_ada_norm: Optional[int] = None, # pylint: disable=unused-argument
|
| 86 |
+
attention_bias: bool = False,
|
| 87 |
+
only_cross_attention: bool = False,
|
| 88 |
+
double_self_attention: bool = False,
|
| 89 |
+
upcast_attention: bool = False,
|
| 90 |
+
norm_elementwise_affine: bool = True,
|
| 91 |
+
adaptive_norm: str = "single_scale_shift", # 'single_scale_shift', 'single_scale' or 'none'
|
| 92 |
+
standardization_norm: str = "layer_norm", # 'layer_norm' or 'rms_norm'
|
| 93 |
+
norm_eps: float = 1e-5,
|
| 94 |
+
qk_norm: Optional[str] = None,
|
| 95 |
+
final_dropout: bool = False,
|
| 96 |
+
attention_type: str = "default", # pylint: disable=unused-argument
|
| 97 |
+
ff_inner_dim: Optional[int] = None,
|
| 98 |
+
ff_bias: bool = True,
|
| 99 |
+
attention_out_bias: bool = True,
|
| 100 |
+
use_tpu_flash_attention: bool = False,
|
| 101 |
+
use_rope: bool = False,
|
| 102 |
+
):
|
| 103 |
+
super().__init__()
|
| 104 |
+
self.only_cross_attention = only_cross_attention
|
| 105 |
+
self.use_tpu_flash_attention = use_tpu_flash_attention
|
| 106 |
+
self.adaptive_norm = adaptive_norm
|
| 107 |
+
|
| 108 |
+
assert standardization_norm in ["layer_norm", "rms_norm"]
|
| 109 |
+
assert adaptive_norm in ["single_scale_shift", "single_scale", "none"]
|
| 110 |
+
|
| 111 |
+
make_norm_layer = (
|
| 112 |
+
nn.LayerNorm if standardization_norm == "layer_norm" else RMSNorm
|
| 113 |
+
)
|
| 114 |
+
|
| 115 |
+
# Define 3 blocks. Each block has its own normalization layer.
|
| 116 |
+
# 1. Self-Attn
|
| 117 |
+
self.norm1 = make_norm_layer(
|
| 118 |
+
dim, elementwise_affine=norm_elementwise_affine, eps=norm_eps
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
self.attn1 = Attention(
|
| 122 |
+
query_dim=dim,
|
| 123 |
+
heads=num_attention_heads,
|
| 124 |
+
dim_head=attention_head_dim,
|
| 125 |
+
dropout=dropout,
|
| 126 |
+
bias=attention_bias,
|
| 127 |
+
cross_attention_dim=cross_attention_dim if only_cross_attention else None,
|
| 128 |
+
upcast_attention=upcast_attention,
|
| 129 |
+
out_bias=attention_out_bias,
|
| 130 |
+
use_tpu_flash_attention=use_tpu_flash_attention,
|
| 131 |
+
qk_norm=qk_norm,
|
| 132 |
+
use_rope=use_rope,
|
| 133 |
+
)
|
| 134 |
+
|
| 135 |
+
# 2. Cross-Attn
|
| 136 |
+
if cross_attention_dim is not None or double_self_attention:
|
| 137 |
+
self.attn2 = Attention(
|
| 138 |
+
query_dim=dim,
|
| 139 |
+
cross_attention_dim=(
|
| 140 |
+
cross_attention_dim if not double_self_attention else None
|
| 141 |
+
),
|
| 142 |
+
heads=num_attention_heads,
|
| 143 |
+
dim_head=attention_head_dim,
|
| 144 |
+
dropout=dropout,
|
| 145 |
+
bias=attention_bias,
|
| 146 |
+
upcast_attention=upcast_attention,
|
| 147 |
+
out_bias=attention_out_bias,
|
| 148 |
+
use_tpu_flash_attention=use_tpu_flash_attention,
|
| 149 |
+
qk_norm=qk_norm,
|
| 150 |
+
use_rope=use_rope,
|
| 151 |
+
) # is self-attn if encoder_hidden_states is none
|
| 152 |
+
|
| 153 |
+
if adaptive_norm == "none":
|
| 154 |
+
self.attn2_norm = make_norm_layer(
|
| 155 |
+
dim, norm_eps, norm_elementwise_affine
|
| 156 |
+
)
|
| 157 |
+
else:
|
| 158 |
+
self.attn2 = None
|
| 159 |
+
self.attn2_norm = None
|
| 160 |
+
|
| 161 |
+
self.norm2 = make_norm_layer(dim, norm_eps, norm_elementwise_affine)
|
| 162 |
+
|
| 163 |
+
# 3. Feed-forward
|
| 164 |
+
self.ff = FeedForward(
|
| 165 |
+
dim,
|
| 166 |
+
dropout=dropout,
|
| 167 |
+
activation_fn=activation_fn,
|
| 168 |
+
final_dropout=final_dropout,
|
| 169 |
+
inner_dim=ff_inner_dim,
|
| 170 |
+
bias=ff_bias,
|
| 171 |
+
)
|
| 172 |
+
|
| 173 |
+
# 5. Scale-shift for PixArt-Alpha.
|
| 174 |
+
if adaptive_norm != "none":
|
| 175 |
+
num_ada_params = 4 if adaptive_norm == "single_scale" else 6
|
| 176 |
+
self.scale_shift_table = nn.Parameter(
|
| 177 |
+
torch.randn(num_ada_params, dim) / dim**0.5
|
| 178 |
+
)
|
| 179 |
+
|
| 180 |
+
# let chunk size default to None
|
| 181 |
+
self._chunk_size = None
|
| 182 |
+
self._chunk_dim = 0
|
| 183 |
+
|
| 184 |
+
def set_use_tpu_flash_attention(self):
|
| 185 |
+
r"""
|
| 186 |
+
Function sets the flag in this object and propagates down the children. The flag will enforce the usage of TPU
|
| 187 |
+
attention kernel.
|
| 188 |
+
"""
|
| 189 |
+
self.use_tpu_flash_attention = True
|
| 190 |
+
self.attn1.set_use_tpu_flash_attention()
|
| 191 |
+
self.attn2.set_use_tpu_flash_attention()
|
| 192 |
+
|
| 193 |
+
def set_chunk_feed_forward(self, chunk_size: Optional[int], dim: int = 0):
|
| 194 |
+
# Sets chunk feed-forward
|
| 195 |
+
self._chunk_size = chunk_size
|
| 196 |
+
self._chunk_dim = dim
|
| 197 |
+
|
| 198 |
+
def forward(
|
| 199 |
+
self,
|
| 200 |
+
hidden_states: torch.FloatTensor,
|
| 201 |
+
freqs_cis: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
|
| 202 |
+
attention_mask: Optional[torch.FloatTensor] = None,
|
| 203 |
+
encoder_hidden_states: Optional[torch.FloatTensor] = None,
|
| 204 |
+
encoder_attention_mask: Optional[torch.FloatTensor] = None,
|
| 205 |
+
timestep: Optional[torch.LongTensor] = None,
|
| 206 |
+
cross_attention_kwargs: Dict[str, Any] = None,
|
| 207 |
+
class_labels: Optional[torch.LongTensor] = None,
|
| 208 |
+
skip_layer_mask: Optional[torch.Tensor] = None,
|
| 209 |
+
skip_layer_strategy: Optional[SkipLayerStrategy] = None,
|
| 210 |
+
) -> torch.FloatTensor:
|
| 211 |
+
if cross_attention_kwargs is not None:
|
| 212 |
+
if cross_attention_kwargs.get("scale", None) is not None:
|
| 213 |
+
logger.warning(
|
| 214 |
+
"Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored."
|
| 215 |
+
)
|
| 216 |
+
|
| 217 |
+
# Notice that normalization is always applied before the real computation in the following blocks.
|
| 218 |
+
# 0. Self-Attention
|
| 219 |
+
batch_size = hidden_states.shape[0]
|
| 220 |
+
|
| 221 |
+
original_hidden_states = hidden_states
|
| 222 |
+
|
| 223 |
+
norm_hidden_states = self.norm1(hidden_states)
|
| 224 |
+
|
| 225 |
+
# Apply ada_norm_single
|
| 226 |
+
if self.adaptive_norm in ["single_scale_shift", "single_scale"]:
|
| 227 |
+
assert timestep.ndim == 3 # [batch, 1 or num_tokens, embedding_dim]
|
| 228 |
+
num_ada_params = self.scale_shift_table.shape[0]
|
| 229 |
+
ada_values = self.scale_shift_table[None, None] + timestep.reshape(
|
| 230 |
+
batch_size, timestep.shape[1], num_ada_params, -1
|
| 231 |
+
)
|
| 232 |
+
if self.adaptive_norm == "single_scale_shift":
|
| 233 |
+
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
|
| 234 |
+
ada_values.unbind(dim=2)
|
| 235 |
+
)
|
| 236 |
+
norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
|
| 237 |
+
else:
|
| 238 |
+
scale_msa, gate_msa, scale_mlp, gate_mlp = ada_values.unbind(dim=2)
|
| 239 |
+
norm_hidden_states = norm_hidden_states * (1 + scale_msa)
|
| 240 |
+
elif self.adaptive_norm == "none":
|
| 241 |
+
scale_msa, gate_msa, scale_mlp, gate_mlp = None, None, None, None
|
| 242 |
+
else:
|
| 243 |
+
raise ValueError(f"Unknown adaptive norm type: {self.adaptive_norm}")
|
| 244 |
+
|
| 245 |
+
norm_hidden_states = norm_hidden_states.squeeze(
|
| 246 |
+
1
|
| 247 |
+
) # TODO: Check if this is needed
|
| 248 |
+
|
| 249 |
+
# 1. Prepare GLIGEN inputs
|
| 250 |
+
cross_attention_kwargs = (
|
| 251 |
+
cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
|
| 252 |
+
)
|
| 253 |
+
|
| 254 |
+
attn_output = self.attn1(
|
| 255 |
+
norm_hidden_states,
|
| 256 |
+
freqs_cis=freqs_cis,
|
| 257 |
+
encoder_hidden_states=(
|
| 258 |
+
encoder_hidden_states if self.only_cross_attention else None
|
| 259 |
+
),
|
| 260 |
+
attention_mask=attention_mask,
|
| 261 |
+
skip_layer_mask=skip_layer_mask,
|
| 262 |
+
skip_layer_strategy=skip_layer_strategy,
|
| 263 |
+
**cross_attention_kwargs,
|
| 264 |
+
)
|
| 265 |
+
if gate_msa is not None:
|
| 266 |
+
attn_output = gate_msa * attn_output
|
| 267 |
+
|
| 268 |
+
hidden_states = attn_output + hidden_states
|
| 269 |
+
if hidden_states.ndim == 4:
|
| 270 |
+
hidden_states = hidden_states.squeeze(1)
|
| 271 |
+
|
| 272 |
+
# 3. Cross-Attention
|
| 273 |
+
if self.attn2 is not None:
|
| 274 |
+
if self.adaptive_norm == "none":
|
| 275 |
+
attn_input = self.attn2_norm(hidden_states)
|
| 276 |
+
else:
|
| 277 |
+
attn_input = hidden_states
|
| 278 |
+
attn_output = self.attn2(
|
| 279 |
+
attn_input,
|
| 280 |
+
freqs_cis=freqs_cis,
|
| 281 |
+
encoder_hidden_states=encoder_hidden_states,
|
| 282 |
+
attention_mask=encoder_attention_mask,
|
| 283 |
+
**cross_attention_kwargs,
|
| 284 |
+
)
|
| 285 |
+
hidden_states = attn_output + hidden_states
|
| 286 |
+
|
| 287 |
+
# 4. Feed-forward
|
| 288 |
+
norm_hidden_states = self.norm2(hidden_states)
|
| 289 |
+
if self.adaptive_norm == "single_scale_shift":
|
| 290 |
+
norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
|
| 291 |
+
elif self.adaptive_norm == "single_scale":
|
| 292 |
+
norm_hidden_states = norm_hidden_states * (1 + scale_mlp)
|
| 293 |
+
elif self.adaptive_norm == "none":
|
| 294 |
+
pass
|
| 295 |
+
else:
|
| 296 |
+
raise ValueError(f"Unknown adaptive norm type: {self.adaptive_norm}")
|
| 297 |
+
|
| 298 |
+
if self._chunk_size is not None:
|
| 299 |
+
# "feed_forward_chunk_size" can be used to save memory
|
| 300 |
+
ff_output = _chunked_feed_forward(
|
| 301 |
+
self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size
|
| 302 |
+
)
|
| 303 |
+
else:
|
| 304 |
+
ff_output = self.ff(norm_hidden_states)
|
| 305 |
+
if gate_mlp is not None:
|
| 306 |
+
ff_output = gate_mlp * ff_output
|
| 307 |
+
|
| 308 |
+
hidden_states = ff_output + hidden_states
|
| 309 |
+
if hidden_states.ndim == 4:
|
| 310 |
+
hidden_states = hidden_states.squeeze(1)
|
| 311 |
+
|
| 312 |
+
if (
|
| 313 |
+
skip_layer_mask is not None
|
| 314 |
+
and skip_layer_strategy == SkipLayerStrategy.TransformerBlock
|
| 315 |
+
):
|
| 316 |
+
skip_layer_mask = skip_layer_mask.view(-1, 1, 1)
|
| 317 |
+
hidden_states = hidden_states * skip_layer_mask + original_hidden_states * (
|
| 318 |
+
1.0 - skip_layer_mask
|
| 319 |
+
)
|
| 320 |
+
|
| 321 |
+
return hidden_states
|
| 322 |
+
|
| 323 |
+
|
| 324 |
+
@maybe_allow_in_graph
|
| 325 |
+
class Attention(nn.Module):
|
| 326 |
+
r"""
|
| 327 |
+
A cross attention layer.
|
| 328 |
+
|
| 329 |
+
Parameters:
|
| 330 |
+
query_dim (`int`):
|
| 331 |
+
The number of channels in the query.
|
| 332 |
+
cross_attention_dim (`int`, *optional*):
|
| 333 |
+
The number of channels in the encoder_hidden_states. If not given, defaults to `query_dim`.
|
| 334 |
+
heads (`int`, *optional*, defaults to 8):
|
| 335 |
+
The number of heads to use for multi-head attention.
|
| 336 |
+
dim_head (`int`, *optional*, defaults to 64):
|
| 337 |
+
The number of channels in each head.
|
| 338 |
+
dropout (`float`, *optional*, defaults to 0.0):
|
| 339 |
+
The dropout probability to use.
|
| 340 |
+
bias (`bool`, *optional*, defaults to False):
|
| 341 |
+
Set to `True` for the query, key, and value linear layers to contain a bias parameter.
|
| 342 |
+
upcast_attention (`bool`, *optional*, defaults to False):
|
| 343 |
+
Set to `True` to upcast the attention computation to `float32`.
|
| 344 |
+
upcast_softmax (`bool`, *optional*, defaults to False):
|
| 345 |
+
Set to `True` to upcast the softmax computation to `float32`.
|
| 346 |
+
cross_attention_norm (`str`, *optional*, defaults to `None`):
|
| 347 |
+
The type of normalization to use for the cross attention. Can be `None`, `layer_norm`, or `group_norm`.
|
| 348 |
+
cross_attention_norm_num_groups (`int`, *optional*, defaults to 32):
|
| 349 |
+
The number of groups to use for the group norm in the cross attention.
|
| 350 |
+
added_kv_proj_dim (`int`, *optional*, defaults to `None`):
|
| 351 |
+
The number of channels to use for the added key and value projections. If `None`, no projection is used.
|
| 352 |
+
norm_num_groups (`int`, *optional*, defaults to `None`):
|
| 353 |
+
The number of groups to use for the group norm in the attention.
|
| 354 |
+
spatial_norm_dim (`int`, *optional*, defaults to `None`):
|
| 355 |
+
The number of channels to use for the spatial normalization.
|
| 356 |
+
out_bias (`bool`, *optional*, defaults to `True`):
|
| 357 |
+
Set to `True` to use a bias in the output linear layer.
|
| 358 |
+
scale_qk (`bool`, *optional*, defaults to `True`):
|
| 359 |
+
Set to `True` to scale the query and key by `1 / sqrt(dim_head)`.
|
| 360 |
+
qk_norm (`str`, *optional*, defaults to None):
|
| 361 |
+
Set to 'layer_norm' or `rms_norm` to perform query and key normalization.
|
| 362 |
+
only_cross_attention (`bool`, *optional*, defaults to `False`):
|
| 363 |
+
Set to `True` to only use cross attention and not added_kv_proj_dim. Can only be set to `True` if
|
| 364 |
+
`added_kv_proj_dim` is not `None`.
|
| 365 |
+
eps (`float`, *optional*, defaults to 1e-5):
|
| 366 |
+
An additional value added to the denominator in group normalization that is used for numerical stability.
|
| 367 |
+
rescale_output_factor (`float`, *optional*, defaults to 1.0):
|
| 368 |
+
A factor to rescale the output by dividing it with this value.
|
| 369 |
+
residual_connection (`bool`, *optional*, defaults to `False`):
|
| 370 |
+
Set to `True` to add the residual connection to the output.
|
| 371 |
+
_from_deprecated_attn_block (`bool`, *optional*, defaults to `False`):
|
| 372 |
+
Set to `True` if the attention block is loaded from a deprecated state dict.
|
| 373 |
+
processor (`AttnProcessor`, *optional*, defaults to `None`):
|
| 374 |
+
The attention processor to use. If `None`, defaults to `AttnProcessor2_0` if `torch 2.x` is used and
|
| 375 |
+
`AttnProcessor` otherwise.
|
| 376 |
+
"""
|
| 377 |
+
|
| 378 |
+
def __init__(
|
| 379 |
+
self,
|
| 380 |
+
query_dim: int,
|
| 381 |
+
cross_attention_dim: Optional[int] = None,
|
| 382 |
+
heads: int = 8,
|
| 383 |
+
dim_head: int = 64,
|
| 384 |
+
dropout: float = 0.0,
|
| 385 |
+
bias: bool = False,
|
| 386 |
+
upcast_attention: bool = False,
|
| 387 |
+
upcast_softmax: bool = False,
|
| 388 |
+
cross_attention_norm: Optional[str] = None,
|
| 389 |
+
cross_attention_norm_num_groups: int = 32,
|
| 390 |
+
added_kv_proj_dim: Optional[int] = None,
|
| 391 |
+
norm_num_groups: Optional[int] = None,
|
| 392 |
+
spatial_norm_dim: Optional[int] = None,
|
| 393 |
+
out_bias: bool = True,
|
| 394 |
+
scale_qk: bool = True,
|
| 395 |
+
qk_norm: Optional[str] = None,
|
| 396 |
+
only_cross_attention: bool = False,
|
| 397 |
+
eps: float = 1e-5,
|
| 398 |
+
rescale_output_factor: float = 1.0,
|
| 399 |
+
residual_connection: bool = False,
|
| 400 |
+
_from_deprecated_attn_block: bool = False,
|
| 401 |
+
processor: Optional["AttnProcessor"] = None,
|
| 402 |
+
out_dim: int = None,
|
| 403 |
+
use_tpu_flash_attention: bool = False,
|
| 404 |
+
use_rope: bool = False,
|
| 405 |
+
):
|
| 406 |
+
super().__init__()
|
| 407 |
+
self.inner_dim = out_dim if out_dim is not None else dim_head * heads
|
| 408 |
+
self.query_dim = query_dim
|
| 409 |
+
self.use_bias = bias
|
| 410 |
+
self.is_cross_attention = cross_attention_dim is not None
|
| 411 |
+
self.cross_attention_dim = (
|
| 412 |
+
cross_attention_dim if cross_attention_dim is not None else query_dim
|
| 413 |
+
)
|
| 414 |
+
self.upcast_attention = upcast_attention
|
| 415 |
+
self.upcast_softmax = upcast_softmax
|
| 416 |
+
self.rescale_output_factor = rescale_output_factor
|
| 417 |
+
self.residual_connection = residual_connection
|
| 418 |
+
self.dropout = dropout
|
| 419 |
+
self.fused_projections = False
|
| 420 |
+
self.out_dim = out_dim if out_dim is not None else query_dim
|
| 421 |
+
self.use_tpu_flash_attention = use_tpu_flash_attention
|
| 422 |
+
self.use_rope = use_rope
|
| 423 |
+
|
| 424 |
+
# we make use of this private variable to know whether this class is loaded
|
| 425 |
+
# with an deprecated state dict so that we can convert it on the fly
|
| 426 |
+
self._from_deprecated_attn_block = _from_deprecated_attn_block
|
| 427 |
+
|
| 428 |
+
self.scale_qk = scale_qk
|
| 429 |
+
self.scale = dim_head**-0.5 if self.scale_qk else 1.0
|
| 430 |
+
|
| 431 |
+
if qk_norm is None:
|
| 432 |
+
self.q_norm = nn.Identity()
|
| 433 |
+
self.k_norm = nn.Identity()
|
| 434 |
+
elif qk_norm == "rms_norm":
|
| 435 |
+
self.q_norm = RMSNorm(dim_head * heads, eps=1e-5)
|
| 436 |
+
self.k_norm = RMSNorm(dim_head * heads, eps=1e-5)
|
| 437 |
+
elif qk_norm == "layer_norm":
|
| 438 |
+
self.q_norm = nn.LayerNorm(dim_head * heads, eps=1e-5)
|
| 439 |
+
self.k_norm = nn.LayerNorm(dim_head * heads, eps=1e-5)
|
| 440 |
+
else:
|
| 441 |
+
raise ValueError(f"Unsupported qk_norm method: {qk_norm}")
|
| 442 |
+
|
| 443 |
+
self.heads = out_dim // dim_head if out_dim is not None else heads
|
| 444 |
+
# for slice_size > 0 the attention score computation
|
| 445 |
+
# is split across the batch axis to save memory
|
| 446 |
+
# You can set slice_size with `set_attention_slice`
|
| 447 |
+
self.sliceable_head_dim = heads
|
| 448 |
+
|
| 449 |
+
self.added_kv_proj_dim = added_kv_proj_dim
|
| 450 |
+
self.only_cross_attention = only_cross_attention
|
| 451 |
+
|
| 452 |
+
if self.added_kv_proj_dim is None and self.only_cross_attention:
|
| 453 |
+
raise ValueError(
|
| 454 |
+
"`only_cross_attention` can only be set to True if `added_kv_proj_dim` is not None. Make sure to set either `only_cross_attention=False` or define `added_kv_proj_dim`."
|
| 455 |
+
)
|
| 456 |
+
|
| 457 |
+
if norm_num_groups is not None:
|
| 458 |
+
self.group_norm = nn.GroupNorm(
|
| 459 |
+
num_channels=query_dim, num_groups=norm_num_groups, eps=eps, affine=True
|
| 460 |
+
)
|
| 461 |
+
else:
|
| 462 |
+
self.group_norm = None
|
| 463 |
+
|
| 464 |
+
if spatial_norm_dim is not None:
|
| 465 |
+
self.spatial_norm = SpatialNorm(
|
| 466 |
+
f_channels=query_dim, zq_channels=spatial_norm_dim
|
| 467 |
+
)
|
| 468 |
+
else:
|
| 469 |
+
self.spatial_norm = None
|
| 470 |
+
|
| 471 |
+
if cross_attention_norm is None:
|
| 472 |
+
self.norm_cross = None
|
| 473 |
+
elif cross_attention_norm == "layer_norm":
|
| 474 |
+
self.norm_cross = nn.LayerNorm(self.cross_attention_dim)
|
| 475 |
+
elif cross_attention_norm == "group_norm":
|
| 476 |
+
if self.added_kv_proj_dim is not None:
|
| 477 |
+
# The given `encoder_hidden_states` are initially of shape
|
| 478 |
+
# (batch_size, seq_len, added_kv_proj_dim) before being projected
|
| 479 |
+
# to (batch_size, seq_len, cross_attention_dim). The norm is applied
|
| 480 |
+
# before the projection, so we need to use `added_kv_proj_dim` as
|
| 481 |
+
# the number of channels for the group norm.
|
| 482 |
+
norm_cross_num_channels = added_kv_proj_dim
|
| 483 |
+
else:
|
| 484 |
+
norm_cross_num_channels = self.cross_attention_dim
|
| 485 |
+
|
| 486 |
+
self.norm_cross = nn.GroupNorm(
|
| 487 |
+
num_channels=norm_cross_num_channels,
|
| 488 |
+
num_groups=cross_attention_norm_num_groups,
|
| 489 |
+
eps=1e-5,
|
| 490 |
+
affine=True,
|
| 491 |
+
)
|
| 492 |
+
else:
|
| 493 |
+
raise ValueError(
|
| 494 |
+
f"unknown cross_attention_norm: {cross_attention_norm}. Should be None, 'layer_norm' or 'group_norm'"
|
| 495 |
+
)
|
| 496 |
+
|
| 497 |
+
linear_cls = nn.Linear
|
| 498 |
+
|
| 499 |
+
self.linear_cls = linear_cls
|
| 500 |
+
self.to_q = linear_cls(query_dim, self.inner_dim, bias=bias)
|
| 501 |
+
|
| 502 |
+
if not self.only_cross_attention:
|
| 503 |
+
# only relevant for the `AddedKVProcessor` classes
|
| 504 |
+
self.to_k = linear_cls(self.cross_attention_dim, self.inner_dim, bias=bias)
|
| 505 |
+
self.to_v = linear_cls(self.cross_attention_dim, self.inner_dim, bias=bias)
|
| 506 |
+
else:
|
| 507 |
+
self.to_k = None
|
| 508 |
+
self.to_v = None
|
| 509 |
+
|
| 510 |
+
if self.added_kv_proj_dim is not None:
|
| 511 |
+
self.add_k_proj = linear_cls(added_kv_proj_dim, self.inner_dim)
|
| 512 |
+
self.add_v_proj = linear_cls(added_kv_proj_dim, self.inner_dim)
|
| 513 |
+
|
| 514 |
+
self.to_out = nn.ModuleList([])
|
| 515 |
+
self.to_out.append(linear_cls(self.inner_dim, self.out_dim, bias=out_bias))
|
| 516 |
+
self.to_out.append(nn.Dropout(dropout))
|
| 517 |
+
|
| 518 |
+
# set attention processor
|
| 519 |
+
# We use the AttnProcessor2_0 by default when torch 2.x is used which uses
|
| 520 |
+
# torch.nn.functional.scaled_dot_product_attention for native Flash/memory_efficient_attention
|
| 521 |
+
# but only if it has the default `scale` argument. TODO remove scale_qk check when we move to torch 2.1
|
| 522 |
+
if processor is None:
|
| 523 |
+
processor = AttnProcessor2_0()
|
| 524 |
+
self.set_processor(processor)
|
| 525 |
+
|
| 526 |
+
def set_use_tpu_flash_attention(self):
|
| 527 |
+
r"""
|
| 528 |
+
Function sets the flag in this object. The flag will enforce the usage of TPU attention kernel.
|
| 529 |
+
"""
|
| 530 |
+
self.use_tpu_flash_attention = True
|
| 531 |
+
|
| 532 |
+
def set_processor(self, processor: "AttnProcessor") -> None:
|
| 533 |
+
r"""
|
| 534 |
+
Set the attention processor to use.
|
| 535 |
+
|
| 536 |
+
Args:
|
| 537 |
+
processor (`AttnProcessor`):
|
| 538 |
+
The attention processor to use.
|
| 539 |
+
"""
|
| 540 |
+
# if current processor is in `self._modules` and if passed `processor` is not, we need to
|
| 541 |
+
# pop `processor` from `self._modules`
|
| 542 |
+
if (
|
| 543 |
+
hasattr(self, "processor")
|
| 544 |
+
and isinstance(self.processor, torch.nn.Module)
|
| 545 |
+
and not isinstance(processor, torch.nn.Module)
|
| 546 |
+
):
|
| 547 |
+
logger.info(
|
| 548 |
+
f"You are removing possibly trained weights of {self.processor} with {processor}"
|
| 549 |
+
)
|
| 550 |
+
self._modules.pop("processor")
|
| 551 |
+
|
| 552 |
+
self.processor = processor
|
| 553 |
+
|
| 554 |
+
def get_processor(
|
| 555 |
+
self, return_deprecated_lora: bool = False
|
| 556 |
+
) -> "AttentionProcessor": # noqa: F821
|
| 557 |
+
r"""
|
| 558 |
+
Get the attention processor in use.
|
| 559 |
+
|
| 560 |
+
Args:
|
| 561 |
+
return_deprecated_lora (`bool`, *optional*, defaults to `False`):
|
| 562 |
+
Set to `True` to return the deprecated LoRA attention processor.
|
| 563 |
+
|
| 564 |
+
Returns:
|
| 565 |
+
"AttentionProcessor": The attention processor in use.
|
| 566 |
+
"""
|
| 567 |
+
if not return_deprecated_lora:
|
| 568 |
+
return self.processor
|
| 569 |
+
|
| 570 |
+
# TODO(Sayak, Patrick). The rest of the function is needed to ensure backwards compatible
|
| 571 |
+
# serialization format for LoRA Attention Processors. It should be deleted once the integration
|
| 572 |
+
# with PEFT is completed.
|
| 573 |
+
is_lora_activated = {
|
| 574 |
+
name: module.lora_layer is not None
|
| 575 |
+
for name, module in self.named_modules()
|
| 576 |
+
if hasattr(module, "lora_layer")
|
| 577 |
+
}
|
| 578 |
+
|
| 579 |
+
# 1. if no layer has a LoRA activated we can return the processor as usual
|
| 580 |
+
if not any(is_lora_activated.values()):
|
| 581 |
+
return self.processor
|
| 582 |
+
|
| 583 |
+
# If doesn't apply LoRA do `add_k_proj` or `add_v_proj`
|
| 584 |
+
is_lora_activated.pop("add_k_proj", None)
|
| 585 |
+
is_lora_activated.pop("add_v_proj", None)
|
| 586 |
+
# 2. else it is not posssible that only some layers have LoRA activated
|
| 587 |
+
if not all(is_lora_activated.values()):
|
| 588 |
+
raise ValueError(
|
| 589 |
+
f"Make sure that either all layers or no layers have LoRA activated, but have {is_lora_activated}"
|
| 590 |
+
)
|
| 591 |
+
|
| 592 |
+
# 3. And we need to merge the current LoRA layers into the corresponding LoRA attention processor
|
| 593 |
+
non_lora_processor_cls_name = self.processor.__class__.__name__
|
| 594 |
+
lora_processor_cls = getattr(
|
| 595 |
+
import_module(__name__), "LoRA" + non_lora_processor_cls_name
|
| 596 |
+
)
|
| 597 |
+
|
| 598 |
+
hidden_size = self.inner_dim
|
| 599 |
+
|
| 600 |
+
# now create a LoRA attention processor from the LoRA layers
|
| 601 |
+
if lora_processor_cls in [
|
| 602 |
+
LoRAAttnProcessor,
|
| 603 |
+
LoRAAttnProcessor2_0,
|
| 604 |
+
            LoRAXFormersAttnProcessor,
        ]:
            kwargs = {
                "cross_attention_dim": self.cross_attention_dim,
                "rank": self.to_q.lora_layer.rank,
                "network_alpha": self.to_q.lora_layer.network_alpha,
                "q_rank": self.to_q.lora_layer.rank,
                "q_hidden_size": self.to_q.lora_layer.out_features,
                "k_rank": self.to_k.lora_layer.rank,
                "k_hidden_size": self.to_k.lora_layer.out_features,
                "v_rank": self.to_v.lora_layer.rank,
                "v_hidden_size": self.to_v.lora_layer.out_features,
                "out_rank": self.to_out[0].lora_layer.rank,
                "out_hidden_size": self.to_out[0].lora_layer.out_features,
            }

            if hasattr(self.processor, "attention_op"):
                kwargs["attention_op"] = self.processor.attention_op

            lora_processor = lora_processor_cls(hidden_size, **kwargs)
            lora_processor.to_q_lora.load_state_dict(self.to_q.lora_layer.state_dict())
            lora_processor.to_k_lora.load_state_dict(self.to_k.lora_layer.state_dict())
            lora_processor.to_v_lora.load_state_dict(self.to_v.lora_layer.state_dict())
            lora_processor.to_out_lora.load_state_dict(
                self.to_out[0].lora_layer.state_dict()
            )
        elif lora_processor_cls == LoRAAttnAddedKVProcessor:
            lora_processor = lora_processor_cls(
                hidden_size,
                cross_attention_dim=self.add_k_proj.weight.shape[0],
                rank=self.to_q.lora_layer.rank,
                network_alpha=self.to_q.lora_layer.network_alpha,
            )
            lora_processor.to_q_lora.load_state_dict(self.to_q.lora_layer.state_dict())
            lora_processor.to_k_lora.load_state_dict(self.to_k.lora_layer.state_dict())
            lora_processor.to_v_lora.load_state_dict(self.to_v.lora_layer.state_dict())
            lora_processor.to_out_lora.load_state_dict(
                self.to_out[0].lora_layer.state_dict()
            )

            # only save if used
            if self.add_k_proj.lora_layer is not None:
                lora_processor.add_k_proj_lora.load_state_dict(
                    self.add_k_proj.lora_layer.state_dict()
                )
                lora_processor.add_v_proj_lora.load_state_dict(
                    self.add_v_proj.lora_layer.state_dict()
                )
            else:
                lora_processor.add_k_proj_lora = None
                lora_processor.add_v_proj_lora = None
        else:
            raise ValueError(f"{lora_processor_cls} does not exist.")

        return lora_processor

    def forward(
        self,
        hidden_states: torch.FloatTensor,
        freqs_cis: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        skip_layer_mask: Optional[torch.Tensor] = None,
        skip_layer_strategy: Optional[SkipLayerStrategy] = None,
        **cross_attention_kwargs,
    ) -> torch.Tensor:
        r"""
        The forward method of the `Attention` class.

        Args:
            hidden_states (`torch.Tensor`):
                The hidden states of the query.
            encoder_hidden_states (`torch.Tensor`, *optional*):
                The hidden states of the encoder.
            attention_mask (`torch.Tensor`, *optional*):
                The attention mask to use. If `None`, no mask is applied.
            skip_layer_mask (`torch.Tensor`, *optional*):
                The skip layer mask to use. If `None`, no mask is applied.
            skip_layer_strategy (`SkipLayerStrategy`, *optional*, defaults to `None`):
                Controls which layers to skip for spatiotemporal guidance.
            **cross_attention_kwargs:
                Additional keyword arguments to pass along to the cross attention.

        Returns:
            `torch.Tensor`: The output of the attention layer.
        """
        # The `Attention` class can call different attention processors / attention functions
        # here we simply pass along all tensors to the selected processor class
        # For standard processors that are defined here, `**cross_attention_kwargs` is empty

        attn_parameters = set(
            inspect.signature(self.processor.__call__).parameters.keys()
        )
        unused_kwargs = [
            k for k, _ in cross_attention_kwargs.items() if k not in attn_parameters
        ]
        if len(unused_kwargs) > 0:
            logger.warning(
                f"cross_attention_kwargs {unused_kwargs} are not expected by"
                f" {self.processor.__class__.__name__} and will be ignored."
            )
        cross_attention_kwargs = {
            k: w for k, w in cross_attention_kwargs.items() if k in attn_parameters
        }

        return self.processor(
            self,
            hidden_states,
            freqs_cis=freqs_cis,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=attention_mask,
            skip_layer_mask=skip_layer_mask,
            skip_layer_strategy=skip_layer_strategy,
            **cross_attention_kwargs,
        )

    def batch_to_head_dim(self, tensor: torch.Tensor) -> torch.Tensor:
        r"""
        Reshape the tensor from `[batch_size, seq_len, dim]` to `[batch_size // heads, seq_len, dim * heads]`. `heads`
        is the number of heads initialized while constructing the `Attention` class.

        Args:
            tensor (`torch.Tensor`): The tensor to reshape.

        Returns:
            `torch.Tensor`: The reshaped tensor.
        """
        head_size = self.heads
        batch_size, seq_len, dim = tensor.shape
        tensor = tensor.reshape(batch_size // head_size, head_size, seq_len, dim)
        tensor = tensor.permute(0, 2, 1, 3).reshape(
            batch_size // head_size, seq_len, dim * head_size
        )
        return tensor

    def head_to_batch_dim(self, tensor: torch.Tensor, out_dim: int = 3) -> torch.Tensor:
        r"""
        Reshape the tensor from `[batch_size, seq_len, dim]` to `[batch_size, seq_len, heads, dim // heads]` `heads` is
        the number of heads initialized while constructing the `Attention` class.

        Args:
            tensor (`torch.Tensor`): The tensor to reshape.
            out_dim (`int`, *optional*, defaults to `3`): The output dimension of the tensor. If `3`, the tensor is
                reshaped to `[batch_size * heads, seq_len, dim // heads]`.

        Returns:
            `torch.Tensor`: The reshaped tensor.
        """

        head_size = self.heads
        if tensor.ndim == 3:
            batch_size, seq_len, dim = tensor.shape
            extra_dim = 1
        else:
            batch_size, extra_dim, seq_len, dim = tensor.shape
        tensor = tensor.reshape(
            batch_size, seq_len * extra_dim, head_size, dim // head_size
        )
        tensor = tensor.permute(0, 2, 1, 3)

        if out_dim == 3:
            tensor = tensor.reshape(
                batch_size * head_size, seq_len * extra_dim, dim // head_size
            )

        return tensor

    def get_attention_scores(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        attention_mask: torch.Tensor = None,
    ) -> torch.Tensor:
        r"""
        Compute the attention scores.

        Args:
            query (`torch.Tensor`): The query tensor.
            key (`torch.Tensor`): The key tensor.
            attention_mask (`torch.Tensor`, *optional*): The attention mask to use. If `None`, no mask is applied.

        Returns:
            `torch.Tensor`: The attention probabilities/scores.
        """
        dtype = query.dtype
        if self.upcast_attention:
            query = query.float()
            key = key.float()

        if attention_mask is None:
            baddbmm_input = torch.empty(
                query.shape[0],
                query.shape[1],
                key.shape[1],
                dtype=query.dtype,
                device=query.device,
            )
            beta = 0
        else:
            baddbmm_input = attention_mask
            beta = 1

        attention_scores = torch.baddbmm(
            baddbmm_input,
            query,
            key.transpose(-1, -2),
            beta=beta,
            alpha=self.scale,
        )
        del baddbmm_input

        if self.upcast_softmax:
            attention_scores = attention_scores.float()

        attention_probs = attention_scores.softmax(dim=-1)
        del attention_scores

        attention_probs = attention_probs.to(dtype)

        return attention_probs

    def prepare_attention_mask(
        self,
        attention_mask: torch.Tensor,
        target_length: int,
        batch_size: int,
        out_dim: int = 3,
    ) -> torch.Tensor:
        r"""
        Prepare the attention mask for the attention computation.

        Args:
            attention_mask (`torch.Tensor`):
                The attention mask to prepare.
            target_length (`int`):
                The target length of the attention mask. This is the length of the attention mask after padding.
            batch_size (`int`):
                The batch size, which is used to repeat the attention mask.
            out_dim (`int`, *optional*, defaults to `3`):
                The output dimension of the attention mask. Can be either `3` or `4`.

        Returns:
            `torch.Tensor`: The prepared attention mask.
        """
        head_size = self.heads
        if attention_mask is None:
            return attention_mask

        current_length: int = attention_mask.shape[-1]
        if current_length != target_length:
            if attention_mask.device.type == "mps":
                # HACK: MPS: Does not support padding by greater than dimension of input tensor.
                # Instead, we can manually construct the padding tensor.
                padding_shape = (
                    attention_mask.shape[0],
                    attention_mask.shape[1],
                    target_length,
                )
                padding = torch.zeros(
                    padding_shape,
                    dtype=attention_mask.dtype,
                    device=attention_mask.device,
                )
                attention_mask = torch.cat([attention_mask, padding], dim=2)
            else:
                # TODO: for pipelines such as stable-diffusion, padding cross-attn mask:
                # we want to instead pad by (0, remaining_length), where remaining_length is:
                # remaining_length: int = target_length - current_length
                # TODO: re-enable tests/models/test_models_unet_2d_condition.py#test_model_xattn_padding
                attention_mask = F.pad(attention_mask, (0, target_length), value=0.0)

        if out_dim == 3:
            if attention_mask.shape[0] < batch_size * head_size:
                attention_mask = attention_mask.repeat_interleave(head_size, dim=0)
        elif out_dim == 4:
            attention_mask = attention_mask.unsqueeze(1)
            attention_mask = attention_mask.repeat_interleave(head_size, dim=1)

        return attention_mask

    def norm_encoder_hidden_states(
        self, encoder_hidden_states: torch.Tensor
    ) -> torch.Tensor:
        r"""
        Normalize the encoder hidden states. Requires `self.norm_cross` to be specified when constructing the
        `Attention` class.

        Args:
            encoder_hidden_states (`torch.Tensor`): Hidden states of the encoder.

        Returns:
            `torch.Tensor`: The normalized encoder hidden states.
        """
        assert (
            self.norm_cross is not None
        ), "self.norm_cross must be defined to call self.norm_encoder_hidden_states"

        if isinstance(self.norm_cross, nn.LayerNorm):
            encoder_hidden_states = self.norm_cross(encoder_hidden_states)
        elif isinstance(self.norm_cross, nn.GroupNorm):
            # Group norm norms along the channels dimension and expects
            # input to be in the shape of (N, C, *). In this case, we want
            # to norm along the hidden dimension, so we need to move
            # (batch_size, sequence_length, hidden_size) ->
            # (batch_size, hidden_size, sequence_length)
            encoder_hidden_states = encoder_hidden_states.transpose(1, 2)
            encoder_hidden_states = self.norm_cross(encoder_hidden_states)
            encoder_hidden_states = encoder_hidden_states.transpose(1, 2)
        else:
            assert False

        return encoder_hidden_states

    @staticmethod
    def apply_rotary_emb(
        input_tensor: torch.Tensor,
        freqs_cis: Tuple[torch.FloatTensor, torch.FloatTensor],
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        cos_freqs = freqs_cis[0]
        sin_freqs = freqs_cis[1]

        t_dup = rearrange(input_tensor, "... (d r) -> ... d r", r=2)
        t1, t2 = t_dup.unbind(dim=-1)
        t_dup = torch.stack((-t2, t1), dim=-1)
        input_tensor_rot = rearrange(t_dup, "... d r -> ... (d r)")

        out = input_tensor * cos_freqs + input_tensor_rot * sin_freqs

        return out

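`apply_rotary_emb` above rotates each adjacent pair of channels by a per-position angle supplied as precomputed cosine/sine tables. A minimal standalone sketch of the same pairwise rotation follows; the tensor shapes and angles are illustrative assumptions, not values taken from the model configuration.

```python
# Standalone sketch of the pairwise rotation used by Attention.apply_rotary_emb above.
# Shapes and angles here are illustrative assumptions.
import torch
from einops import rearrange


def rotate(x: torch.Tensor, cos_f: torch.Tensor, sin_f: torch.Tensor) -> torch.Tensor:
    # Pair up channels (x0, x1), map each pair to (-x1, x0), then mix with cos/sin.
    pairs = rearrange(x, "... (d r) -> ... d r", r=2)
    x1, x2 = pairs.unbind(dim=-1)
    rotated = rearrange(torch.stack((-x2, x1), dim=-1), "... d r -> ... (d r)")
    return x * cos_f + rotated * sin_f


tokens = torch.randn(1, 16, 64)         # (batch, sequence, channels)
angles = torch.rand(1, 16, 32) * 3.14   # one angle per channel pair
cos_f = angles.cos().repeat_interleave(2, dim=-1)
sin_f = angles.sin().repeat_interleave(2, dim=-1)

out = rotate(tokens, cos_f, sin_f)
# The rotation is norm-preserving per channel pair, hence per token.
print(out.shape, torch.allclose(tokens.norm(dim=-1), out.norm(dim=-1), atol=1e-5))
```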
class AttnProcessor2_0:
    r"""
    Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0).
    """

    def __init__(self):
        pass

    def __call__(
        self,
        attn: Attention,
        hidden_states: torch.FloatTensor,
        freqs_cis: Tuple[torch.FloatTensor, torch.FloatTensor],
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        temb: Optional[torch.FloatTensor] = None,
        skip_layer_mask: Optional[torch.FloatTensor] = None,
        skip_layer_strategy: Optional[SkipLayerStrategy] = None,
        *args,
        **kwargs,
    ) -> torch.FloatTensor:
        if len(args) > 0 or kwargs.get("scale", None) is not None:
            deprecation_message = "The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`."
            deprecate("scale", "1.0.0", deprecation_message)

        residual = hidden_states
        if attn.spatial_norm is not None:
            hidden_states = attn.spatial_norm(hidden_states, temb)

        input_ndim = hidden_states.ndim

        if input_ndim == 4:
            batch_size, channel, height, width = hidden_states.shape
            hidden_states = hidden_states.view(
                batch_size, channel, height * width
            ).transpose(1, 2)

        batch_size, sequence_length, _ = (
            hidden_states.shape
            if encoder_hidden_states is None
            else encoder_hidden_states.shape
        )

        if skip_layer_mask is not None:
            skip_layer_mask = skip_layer_mask.reshape(batch_size, 1, 1)

        if (attention_mask is not None) and (not attn.use_tpu_flash_attention):
            attention_mask = attn.prepare_attention_mask(
                attention_mask, sequence_length, batch_size
            )
            # scaled_dot_product_attention expects attention_mask shape to be
            # (batch, heads, source_length, target_length)
            attention_mask = attention_mask.view(
                batch_size, attn.heads, -1, attention_mask.shape[-1]
            )

        if attn.group_norm is not None:
            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(
                1, 2
            )

        query = attn.to_q(hidden_states)
        query = attn.q_norm(query)

        if encoder_hidden_states is not None:
            if attn.norm_cross:
                encoder_hidden_states = attn.norm_encoder_hidden_states(
                    encoder_hidden_states
                )
            key = attn.to_k(encoder_hidden_states)
            key = attn.k_norm(key)
        else:  # if no context provided do self-attention
            encoder_hidden_states = hidden_states
            key = attn.to_k(hidden_states)
            key = attn.k_norm(key)
            if attn.use_rope:
                key = attn.apply_rotary_emb(key, freqs_cis)
                query = attn.apply_rotary_emb(query, freqs_cis)

        value = attn.to_v(encoder_hidden_states)
        value_for_stg = value

        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads

        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        # the output of sdp = (batch, num_heads, seq_len, head_dim)

        if attn.use_tpu_flash_attention:  # use tpu attention offload 'flash attention'
            q_segment_indexes = None
            if (
                attention_mask is not None
            ):  # if mask is required need to tune both segmenIds fields
                # attention_mask = torch.squeeze(attention_mask).to(torch.float32)
                attention_mask = attention_mask.to(torch.float32)
                q_segment_indexes = torch.ones(
                    batch_size, query.shape[2], device=query.device, dtype=torch.float32
                )
                assert (
                    attention_mask.shape[1] == key.shape[2]
                ), f"ERROR: KEY SHAPE must be same as attention mask [{key.shape[2]}, {attention_mask.shape[1]}]"

            assert (
                query.shape[2] % 128 == 0
            ), f"ERROR: QUERY SHAPE must be divisible by 128 (TPU limitation) [{query.shape[2]}]"
            assert (
                key.shape[2] % 128 == 0
            ), f"ERROR: KEY SHAPE must be divisible by 128 (TPU limitation) [{key.shape[2]}]"

            # run the TPU kernel implemented in jax with pallas
            hidden_states_a = flash_attention(
                q=query,
                k=key,
                v=value,
                q_segment_ids=q_segment_indexes,
                kv_segment_ids=attention_mask,
                sm_scale=attn.scale,
            )
        else:
            hidden_states_a = F.scaled_dot_product_attention(
                query,
                key,
                value,
                attn_mask=attention_mask,
                dropout_p=0.0,
                is_causal=False,
            )

        hidden_states_a = hidden_states_a.transpose(1, 2).reshape(
            batch_size, -1, attn.heads * head_dim
        )
        hidden_states_a = hidden_states_a.to(query.dtype)

        if (
            skip_layer_mask is not None
            and skip_layer_strategy == SkipLayerStrategy.AttentionSkip
        ):
            hidden_states = hidden_states_a * skip_layer_mask + hidden_states * (
                1.0 - skip_layer_mask
            )
        elif (
            skip_layer_mask is not None
            and skip_layer_strategy == SkipLayerStrategy.AttentionValues
        ):
            hidden_states = hidden_states_a * skip_layer_mask + value_for_stg * (
                1.0 - skip_layer_mask
            )
        else:
            hidden_states = hidden_states_a

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        if input_ndim == 4:
            hidden_states = hidden_states.transpose(-1, -2).reshape(
                batch_size, channel, height, width
            )
        if (
            skip_layer_mask is not None
            and skip_layer_strategy == SkipLayerStrategy.Residual
        ):
            skip_layer_mask = skip_layer_mask.reshape(batch_size, 1, 1, 1)

        if attn.residual_connection:
            if (
                skip_layer_mask is not None
                and skip_layer_strategy == SkipLayerStrategy.Residual
            ):
                hidden_states = hidden_states + residual * skip_layer_mask
            else:
                hidden_states = hidden_states + residual

        hidden_states = hidden_states / attn.rescale_output_factor

        return hidden_states

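The `skip_layer_mask` blending in `AttnProcessor2_0` above is what realizes the spatiotemporal-guidance (STG) perturbation at the attention level: entries where the mask is 0 replace the attention output with the unperturbed input (AttentionSkip) or with the raw values (AttentionValues). A small numeric sketch of that convex blend follows; the tensors are made up for illustration.

```python
# Numeric sketch of the skip-layer blend used in AttnProcessor2_0 above.
# Tensors are made up; in the model, skip_layer_mask comes from
# Transformer3DModel.create_skip_layer_mask and is 0 for the perturbed batch entries.
import torch

batch, seq, dim = 2, 4, 8
attn_out = torch.randn(batch, seq, dim)   # hidden_states_a in the processor
attn_in = torch.randn(batch, seq, dim)    # hidden_states before attention
skip_layer_mask = torch.tensor([1.0, 0.0]).reshape(batch, 1, 1)

# AttentionSkip strategy: where the mask is 0, attention is effectively bypassed.
blended = attn_out * skip_layer_mask + attn_in * (1.0 - skip_layer_mask)

assert torch.equal(blended[0], attn_out[0])  # mask 1 -> normal attention output
assert torch.equal(blended[1], attn_in[1])   # mask 0 -> layer skipped
```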
class AttnProcessor:
    r"""
    Default processor for performing attention-related computations.
    """

    def __call__(
        self,
        attn: Attention,
        hidden_states: torch.FloatTensor,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        temb: Optional[torch.FloatTensor] = None,
        *args,
        **kwargs,
    ) -> torch.Tensor:
        if len(args) > 0 or kwargs.get("scale", None) is not None:
            deprecation_message = "The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`."
            deprecate("scale", "1.0.0", deprecation_message)

        residual = hidden_states

        if attn.spatial_norm is not None:
            hidden_states = attn.spatial_norm(hidden_states, temb)

        input_ndim = hidden_states.ndim

        if input_ndim == 4:
            batch_size, channel, height, width = hidden_states.shape
            hidden_states = hidden_states.view(
                batch_size, channel, height * width
            ).transpose(1, 2)

        batch_size, sequence_length, _ = (
            hidden_states.shape
            if encoder_hidden_states is None
            else encoder_hidden_states.shape
        )
        attention_mask = attn.prepare_attention_mask(
            attention_mask, sequence_length, batch_size
        )

        if attn.group_norm is not None:
            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(
                1, 2
            )

        query = attn.to_q(hidden_states)

        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        elif attn.norm_cross:
            encoder_hidden_states = attn.norm_encoder_hidden_states(
                encoder_hidden_states
            )

        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        query = attn.q_norm(query)
        key = attn.k_norm(key)

        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        if input_ndim == 4:
            hidden_states = hidden_states.transpose(-1, -2).reshape(
                batch_size, channel, height, width
            )

        if attn.residual_connection:
            hidden_states = hidden_states + residual

        hidden_states = hidden_states / attn.rescale_output_factor

        return hidden_states


class FeedForward(nn.Module):
    r"""
    A feed-forward layer.

    Parameters:
        dim (`int`): The number of channels in the input.
        dim_out (`int`, *optional*): The number of channels in the output. If not given, defaults to `dim`.
        mult (`int`, *optional*, defaults to 4): The multiplier to use for the hidden dimension.
        dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
        activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
        final_dropout (`bool` *optional*, defaults to False): Apply a final dropout.
        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
    """

    def __init__(
        self,
        dim: int,
        dim_out: Optional[int] = None,
        mult: int = 4,
        dropout: float = 0.0,
        activation_fn: str = "geglu",
        final_dropout: bool = False,
        inner_dim=None,
        bias: bool = True,
    ):
        super().__init__()
        if inner_dim is None:
            inner_dim = int(dim * mult)
        dim_out = dim_out if dim_out is not None else dim
        linear_cls = nn.Linear

        if activation_fn == "gelu":
            act_fn = GELU(dim, inner_dim, bias=bias)
        elif activation_fn == "gelu-approximate":
            act_fn = GELU(dim, inner_dim, approximate="tanh", bias=bias)
        elif activation_fn == "geglu":
            act_fn = GEGLU(dim, inner_dim, bias=bias)
        elif activation_fn == "geglu-approximate":
            act_fn = ApproximateGELU(dim, inner_dim, bias=bias)
        else:
            raise ValueError(f"Unsupported activation function: {activation_fn}")

        self.net = nn.ModuleList([])
        # project in
        self.net.append(act_fn)
        # project dropout
        self.net.append(nn.Dropout(dropout))
        # project out
        self.net.append(linear_cls(inner_dim, dim_out, bias=bias))
        # FF as used in Vision Transformer, MLP-Mixer, etc. have a final dropout
        if final_dropout:
            self.net.append(nn.Dropout(dropout))

    def forward(self, hidden_states: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        compatible_cls = (GEGLU, LoRACompatibleLinear)
        for module in self.net:
            if isinstance(module, compatible_cls):
                hidden_states = module(hidden_states, scale)
            else:
                hidden_states = module(hidden_states)
        return hidden_states
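With the default `"geglu"` activation, the `FeedForward` block above amounts to one projection to twice the inner width, a GELU gate, and a projection back down. The sketch below mirrors that structure with plain `nn.Linear` layers rather than the exact diffusers activation classes; the dimensions are arbitrary assumptions.

```python
# Minimal sketch of the "geglu" path of the FeedForward block above.
# Dimensions are arbitrary; this mirrors the structure, not the exact diffusers classes.
import torch
import torch.nn.functional as F
from torch import nn

dim, mult = 32, 4
inner_dim = dim * mult

proj_in = nn.Linear(dim, inner_dim * 2)   # what GEGLU does internally
proj_out = nn.Linear(inner_dim, dim)

x = torch.randn(1, 16, dim)               # (batch, tokens, channels)
hidden, gate = proj_in(x).chunk(2, dim=-1)
hidden = hidden * F.gelu(gate)            # gated GELU
out = proj_out(hidden)
print(out.shape)                          # torch.Size([1, 16, 32])
```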
ltx_video/models/transformers/embeddings.py
ADDED
@@ -0,0 +1,129 @@
# Adapted from: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py
import math

import numpy as np
import torch
from einops import rearrange
from torch import nn


def get_timestep_embedding(
    timesteps: torch.Tensor,
    embedding_dim: int,
    flip_sin_to_cos: bool = False,
    downscale_freq_shift: float = 1,
    scale: float = 1,
    max_period: int = 10000,
):
    """
    This matches the implementation in Denoising Diffusion Probabilistic Models: Create sinusoidal timestep embeddings.

    :param timesteps: a 1-D Tensor of N indices, one per batch element.
                      These may be fractional.
    :param embedding_dim: the dimension of the output. :param max_period: controls the minimum frequency of the
    embeddings. :return: an [N x dim] Tensor of positional embeddings.
    """
    assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"

    half_dim = embedding_dim // 2
    exponent = -math.log(max_period) * torch.arange(
        start=0, end=half_dim, dtype=torch.float32, device=timesteps.device
    )
    exponent = exponent / (half_dim - downscale_freq_shift)

    emb = torch.exp(exponent)
    emb = timesteps[:, None].float() * emb[None, :]

    # scale embeddings
    emb = scale * emb

    # concat sine and cosine embeddings
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)

    # flip sine and cosine embeddings
    if flip_sin_to_cos:
        emb = torch.cat([emb[:, half_dim:], emb[:, :half_dim]], dim=-1)

    # zero pad
    if embedding_dim % 2 == 1:
        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
    return emb


def get_3d_sincos_pos_embed(embed_dim, grid, w, h, f):
    """
    grid_size: int of the grid height and width return: pos_embed: [grid_size*grid_size, embed_dim] or
    [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
    """
    grid = rearrange(grid, "c (f h w) -> c f h w", h=h, w=w)
    grid = rearrange(grid, "c f h w -> c h w f", h=h, w=w)
    grid = grid.reshape([3, 1, w, h, f])
    pos_embed = get_3d_sincos_pos_embed_from_grid(embed_dim, grid)
    pos_embed = pos_embed.transpose(1, 0, 2, 3)
    return rearrange(pos_embed, "h w f c -> (f h w) c")


def get_3d_sincos_pos_embed_from_grid(embed_dim, grid):
    if embed_dim % 3 != 0:
        raise ValueError("embed_dim must be divisible by 3")

    # use half of dimensions to encode grid_h
    emb_f = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, grid[0])  # (H*W*T, D/3)
    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, grid[1])  # (H*W*T, D/3)
    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, grid[2])  # (H*W*T, D/3)

    emb = np.concatenate([emb_h, emb_w, emb_f], axis=-1)  # (H*W*T, D)
    return emb


def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
    """
    embed_dim: output dimension for each position pos: a list of positions to be encoded: size (M,) out: (M, D)
    """
    if embed_dim % 2 != 0:
        raise ValueError("embed_dim must be divisible by 2")

    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.0
    omega = 1.0 / 10000**omega  # (D/2,)

    pos_shape = pos.shape

    pos = pos.reshape(-1)
    out = np.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
    out = out.reshape([*pos_shape, -1])[0]

    emb_sin = np.sin(out)  # (M, D/2)
    emb_cos = np.cos(out)  # (M, D/2)

    emb = np.concatenate([emb_sin, emb_cos], axis=-1)  # (M, D)
    return emb


class SinusoidalPositionalEmbedding(nn.Module):
    """Apply positional information to a sequence of embeddings.

    Takes in a sequence of embeddings with shape (batch_size, seq_length, embed_dim) and adds positional embeddings to
    them

    Args:
        embed_dim: (int): Dimension of the positional embedding.
        max_seq_length: Maximum sequence length to apply positional embeddings

    """

    def __init__(self, embed_dim: int, max_seq_length: int = 32):
        super().__init__()
        position = torch.arange(max_seq_length).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim)
        )
        pe = torch.zeros(1, max_seq_length, embed_dim)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        _, seq_length, _ = x.shape
        x = x + self.pe[:, :seq_length]
        return x
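A short usage sketch of `get_timestep_embedding` as defined above; the timestep values and embedding width are arbitrary, and the import assumes the `ltx_video` package from this repository is on the path.

```python
# Usage sketch for the sinusoidal timestep embedding above; values are arbitrary.
import torch
from ltx_video.models.transformers.embeddings import get_timestep_embedding

timesteps = torch.tensor([0.0, 250.0, 999.0])   # one (possibly fractional) step per sample
emb = get_timestep_embedding(timesteps, embedding_dim=256, flip_sin_to_cos=True)
print(emb.shape)  # torch.Size([3, 256]); cosines come first after the flip
```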
ltx_video/models/transformers/symmetric_patchifier.py
ADDED
@@ -0,0 +1,84 @@
from abc import ABC, abstractmethod
from typing import Tuple

import torch
from diffusers.configuration_utils import ConfigMixin
from einops import rearrange
from torch import Tensor


class Patchifier(ConfigMixin, ABC):
    def __init__(self, patch_size: int):
        super().__init__()
        self._patch_size = (1, patch_size, patch_size)

    @abstractmethod
    def patchify(self, latents: Tensor) -> Tuple[Tensor, Tensor]:
        raise NotImplementedError("Patchify method not implemented")

    @abstractmethod
    def unpatchify(
        self,
        latents: Tensor,
        output_height: int,
        output_width: int,
        out_channels: int,
    ) -> Tuple[Tensor, Tensor]:
        pass

    @property
    def patch_size(self):
        return self._patch_size

    def get_latent_coords(
        self, latent_num_frames, latent_height, latent_width, batch_size, device
    ):
        """
        Return a tensor of shape [batch_size, 3, num_patches] containing the
        top-left corner latent coordinates of each latent patch.
        The tensor is repeated for each batch element.
        """
        latent_sample_coords = torch.meshgrid(
            torch.arange(0, latent_num_frames, self._patch_size[0], device=device),
            torch.arange(0, latent_height, self._patch_size[1], device=device),
            torch.arange(0, latent_width, self._patch_size[2], device=device),
        )
        latent_sample_coords = torch.stack(latent_sample_coords, dim=0)
        latent_coords = latent_sample_coords.unsqueeze(0).repeat(batch_size, 1, 1, 1, 1)
        latent_coords = rearrange(
            latent_coords, "b c f h w -> b c (f h w)", b=batch_size
        )
        return latent_coords


class SymmetricPatchifier(Patchifier):
    def patchify(self, latents: Tensor) -> Tuple[Tensor, Tensor]:
        b, _, f, h, w = latents.shape
        latent_coords = self.get_latent_coords(f, h, w, b, latents.device)
        latents = rearrange(
            latents,
            "b c (f p1) (h p2) (w p3) -> b (f h w) (c p1 p2 p3)",
            p1=self._patch_size[0],
            p2=self._patch_size[1],
            p3=self._patch_size[2],
        )
        return latents, latent_coords

    def unpatchify(
        self,
        latents: Tensor,
        output_height: int,
        output_width: int,
        out_channels: int,
    ) -> Tuple[Tensor, Tensor]:
        output_height = output_height // self._patch_size[1]
        output_width = output_width // self._patch_size[2]
        latents = rearrange(
            latents,
            "b (f h w) (c p q) -> b c f (h p) (w q)",
            h=output_height,
            w=output_width,
            p=self._patch_size[1],
            q=self._patch_size[2],
        )
        return latents
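A round-trip sketch for `SymmetricPatchifier` above: with `patch_size=1` the patchify/unpatchify pair is a pure reshape between a 5D latent and a token sequence plus per-token coordinates. The latent shape is an arbitrary example, and the import assumes the `ltx_video` package from this repository is importable.

```python
# Round-trip sketch for SymmetricPatchifier above; the latent shape is an arbitrary example.
import torch
from ltx_video.models.transformers.symmetric_patchifier import SymmetricPatchifier

patchifier = SymmetricPatchifier(patch_size=1)
latents = torch.randn(1, 128, 9, 16, 24)   # (batch, channels, frames, height, width)

tokens, coords = patchifier.patchify(latents)
print(tokens.shape, coords.shape)          # [1, 9*16*24, 128] and [1, 3, 9*16*24]

restored = patchifier.unpatchify(tokens, output_height=16, output_width=24, out_channels=128)
print(torch.equal(restored, latents))      # True: with patch_size=1 this is a pure reshape
```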
ltx_video/models/transformers/transformer3d.py
ADDED
@@ -0,0 +1,507 @@
# Adapted from: https://github.com/huggingface/diffusers/blob/v0.26.3/src/diffusers/models/transformers/transformer_2d.py
import math
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union
import os
import json
import glob
from pathlib import Path

import torch
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.embeddings import PixArtAlphaTextProjection
from diffusers.models.modeling_utils import ModelMixin
from diffusers.models.normalization import AdaLayerNormSingle
from diffusers.utils import BaseOutput, is_torch_version
from diffusers.utils import logging
from torch import nn
from safetensors import safe_open


from ltx_video.models.transformers.attention import BasicTransformerBlock
from ltx_video.utils.skip_layer_strategy import SkipLayerStrategy

from ltx_video.utils.diffusers_config_mapping import (
    diffusers_and_ours_config_mapping,
    make_hashable_key,
    TRANSFORMER_KEYS_RENAME_DICT,
)


logger = logging.get_logger(__name__)


@dataclass
class Transformer3DModelOutput(BaseOutput):
    """
    The output of [`Transformer2DModel`].

    Args:
        sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` or `(batch size, num_vector_embeds - 1, num_latent_pixels)` if [`Transformer2DModel`] is discrete):
            The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability
            distributions for the unnoised latent pixels.
    """

    sample: torch.FloatTensor


class Transformer3DModel(ModelMixin, ConfigMixin):
    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        num_attention_heads: int = 16,
        attention_head_dim: int = 88,
        in_channels: Optional[int] = None,
        out_channels: Optional[int] = None,
        num_layers: int = 1,
        dropout: float = 0.0,
        norm_num_groups: int = 32,
        cross_attention_dim: Optional[int] = None,
        attention_bias: bool = False,
        num_vector_embeds: Optional[int] = None,
        activation_fn: str = "geglu",
        num_embeds_ada_norm: Optional[int] = None,
        use_linear_projection: bool = False,
        only_cross_attention: bool = False,
        double_self_attention: bool = False,
        upcast_attention: bool = False,
        adaptive_norm: str = "single_scale_shift",  # 'single_scale_shift' or 'single_scale'
        standardization_norm: str = "layer_norm",  # 'layer_norm' or 'rms_norm'
        norm_elementwise_affine: bool = True,
        norm_eps: float = 1e-5,
        attention_type: str = "default",
        caption_channels: int = None,
        use_tpu_flash_attention: bool = False,  # if True uses the TPU attention offload ('flash attention')
        qk_norm: Optional[str] = None,
        positional_embedding_type: str = "rope",
        positional_embedding_theta: Optional[float] = None,
        positional_embedding_max_pos: Optional[List[int]] = None,
        timestep_scale_multiplier: Optional[float] = None,
        causal_temporal_positioning: bool = False,  # For backward compatibility, will be deprecated
    ):
        super().__init__()
        self.use_tpu_flash_attention = (
            use_tpu_flash_attention  # FIXME: push config down to the attention modules
        )
        self.use_linear_projection = use_linear_projection
        self.num_attention_heads = num_attention_heads
        self.attention_head_dim = attention_head_dim
        inner_dim = num_attention_heads * attention_head_dim
        self.inner_dim = inner_dim
        self.patchify_proj = nn.Linear(in_channels, inner_dim, bias=True)
        self.positional_embedding_type = positional_embedding_type
        self.positional_embedding_theta = positional_embedding_theta
        self.positional_embedding_max_pos = positional_embedding_max_pos
        self.use_rope = self.positional_embedding_type == "rope"
        self.timestep_scale_multiplier = timestep_scale_multiplier

        if self.positional_embedding_type == "absolute":
            raise ValueError("Absolute positional embedding is no longer supported")
        elif self.positional_embedding_type == "rope":
            if positional_embedding_theta is None:
                raise ValueError(
                    "If `positional_embedding_type` type is rope, `positional_embedding_theta` must also be defined"
                )
            if positional_embedding_max_pos is None:
                raise ValueError(
                    "If `positional_embedding_type` type is rope, `positional_embedding_max_pos` must also be defined"
                )

        # 3. Define transformers blocks
        self.transformer_blocks = nn.ModuleList(
            [
                BasicTransformerBlock(
                    inner_dim,
                    num_attention_heads,
                    attention_head_dim,
                    dropout=dropout,
                    cross_attention_dim=cross_attention_dim,
                    activation_fn=activation_fn,
                    num_embeds_ada_norm=num_embeds_ada_norm,
                    attention_bias=attention_bias,
                    only_cross_attention=only_cross_attention,
                    double_self_attention=double_self_attention,
                    upcast_attention=upcast_attention,
                    adaptive_norm=adaptive_norm,
                    standardization_norm=standardization_norm,
                    norm_elementwise_affine=norm_elementwise_affine,
                    norm_eps=norm_eps,
                    attention_type=attention_type,
                    use_tpu_flash_attention=use_tpu_flash_attention,
                    qk_norm=qk_norm,
                    use_rope=self.use_rope,
                )
                for d in range(num_layers)
            ]
        )

        # 4. Define output layers
        self.out_channels = in_channels if out_channels is None else out_channels
        self.norm_out = nn.LayerNorm(inner_dim, elementwise_affine=False, eps=1e-6)
        self.scale_shift_table = nn.Parameter(
            torch.randn(2, inner_dim) / inner_dim**0.5
        )
        self.proj_out = nn.Linear(inner_dim, self.out_channels)

        self.adaln_single = AdaLayerNormSingle(
            inner_dim, use_additional_conditions=False
        )
        if adaptive_norm == "single_scale":
            self.adaln_single.linear = nn.Linear(inner_dim, 4 * inner_dim, bias=True)

        self.caption_projection = None
        if caption_channels is not None:
            self.caption_projection = PixArtAlphaTextProjection(
                in_features=caption_channels, hidden_size=inner_dim
            )

        self.gradient_checkpointing = False

    def set_use_tpu_flash_attention(self):
        r"""
        Function sets the flag in this object and propagates down the children. The flag will enforce the usage of TPU
        attention kernel.
        """
        logger.info("ENABLE TPU FLASH ATTENTION -> TRUE")
        self.use_tpu_flash_attention = True
        # push config down to the attention modules
        for block in self.transformer_blocks:
            block.set_use_tpu_flash_attention()

    def create_skip_layer_mask(
        self,
        batch_size: int,
        num_conds: int,
        ptb_index: int,
        skip_block_list: Optional[List[int]] = None,
    ):
        if skip_block_list is None or len(skip_block_list) == 0:
            return None
        num_layers = len(self.transformer_blocks)
        mask = torch.ones(
            (num_layers, batch_size * num_conds), device=self.device, dtype=self.dtype
        )
        for block_idx in skip_block_list:
            mask[block_idx, ptb_index::num_conds] = 0
        return mask

    def _set_gradient_checkpointing(self, module, value=False):
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing = value

    def get_fractional_positions(self, indices_grid):
        fractional_positions = torch.stack(
            [
                indices_grid[:, i] / self.positional_embedding_max_pos[i]
                for i in range(3)
            ],
            dim=-1,
        )
        return fractional_positions

    def precompute_freqs_cis(self, indices_grid, spacing="exp"):
        dtype = torch.float32  # We need full precision in the freqs_cis computation.
        dim = self.inner_dim
        theta = self.positional_embedding_theta

        fractional_positions = self.get_fractional_positions(indices_grid)

        start = 1
        end = theta
        device = fractional_positions.device
        if spacing == "exp":
            indices = theta ** (
                torch.linspace(
                    math.log(start, theta),
                    math.log(end, theta),
                    dim // 6,
                    device=device,
                    dtype=dtype,
                )
            )
            indices = indices.to(dtype=dtype)
        elif spacing == "exp_2":
            indices = 1.0 / theta ** (torch.arange(0, dim, 6, device=device) / dim)
            indices = indices.to(dtype=dtype)
        elif spacing == "linear":
            indices = torch.linspace(start, end, dim // 6, device=device, dtype=dtype)
        elif spacing == "sqrt":
            indices = torch.linspace(
                start**2, end**2, dim // 6, device=device, dtype=dtype
            ).sqrt()

        indices = indices * math.pi / 2

        if spacing == "exp_2":
            freqs = (
                (indices * fractional_positions.unsqueeze(-1))
                .transpose(-1, -2)
                .flatten(2)
            )
        else:
            freqs = (
                (indices * (fractional_positions.unsqueeze(-1) * 2 - 1))
                .transpose(-1, -2)
                .flatten(2)
            )

        cos_freq = freqs.cos().repeat_interleave(2, dim=-1)
        sin_freq = freqs.sin().repeat_interleave(2, dim=-1)
        if dim % 6 != 0:
            cos_padding = torch.ones_like(cos_freq[:, :, : dim % 6])
            sin_padding = torch.zeros_like(cos_freq[:, :, : dim % 6])
            cos_freq = torch.cat([cos_padding, cos_freq], dim=-1)
            sin_freq = torch.cat([sin_padding, sin_freq], dim=-1)
        return cos_freq.to(self.dtype), sin_freq.to(self.dtype)
def load_state_dict(
|
| 260 |
+
self,
|
| 261 |
+
state_dict: Dict,
|
| 262 |
+
*args,
|
| 263 |
+
**kwargs,
|
| 264 |
+
):
|
| 265 |
+
if any([key.startswith("model.diffusion_model.") for key in state_dict.keys()]):
|
| 266 |
+
state_dict = {
|
| 267 |
+
key.replace("model.diffusion_model.", ""): value
|
| 268 |
+
for key, value in state_dict.items()
|
| 269 |
+
if key.startswith("model.diffusion_model.")
|
| 270 |
+
}
|
| 271 |
+
super().load_state_dict(state_dict, *args, **kwargs)
|
| 272 |
+
|
| 273 |
+
@classmethod
|
| 274 |
+
def from_pretrained(
|
| 275 |
+
cls,
|
| 276 |
+
pretrained_model_path: Optional[Union[str, os.PathLike]],
|
| 277 |
+
*args,
|
| 278 |
+
**kwargs,
|
| 279 |
+
):
|
| 280 |
+
pretrained_model_path = Path(pretrained_model_path)
|
| 281 |
+
if pretrained_model_path.is_dir():
|
| 282 |
+
config_path = pretrained_model_path / "transformer" / "config.json"
|
| 283 |
+
with open(config_path, "r") as f:
|
| 284 |
+
config = make_hashable_key(json.load(f))
|
| 285 |
+
|
| 286 |
+
assert config in diffusers_and_ours_config_mapping, (
|
| 287 |
+
"Provided diffusers checkpoint config for transformer is not suppported. "
|
| 288 |
+
"We only support diffusers configs found in Lightricks/LTX-Video."
|
| 289 |
+
)
|
| 290 |
+
|
| 291 |
+
config = diffusers_and_ours_config_mapping[config]
|
| 292 |
+
state_dict = {}
|
| 293 |
+
ckpt_paths = (
|
| 294 |
+
pretrained_model_path
|
| 295 |
+
/ "transformer"
|
| 296 |
+
/ "diffusion_pytorch_model*.safetensors"
|
| 297 |
+
)
|
| 298 |
+
dict_list = glob.glob(str(ckpt_paths))
|
| 299 |
+
for dict_path in dict_list:
|
| 300 |
+
part_dict = {}
|
| 301 |
+
with safe_open(dict_path, framework="pt", device="cpu") as f:
|
| 302 |
+
for k in f.keys():
|
| 303 |
+
part_dict[k] = f.get_tensor(k)
|
| 304 |
+
state_dict.update(part_dict)
|
| 305 |
+
|
| 306 |
+
for key in list(state_dict.keys()):
|
| 307 |
+
new_key = key
|
| 308 |
+
for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
|
| 309 |
+
new_key = new_key.replace(replace_key, rename_key)
|
| 310 |
+
state_dict[new_key] = state_dict.pop(key)
|
| 311 |
+
|
| 312 |
+
with torch.device("meta"):
|
| 313 |
+
transformer = cls.from_config(config)
|
| 314 |
+
transformer.load_state_dict(state_dict, assign=True, strict=True)
|
| 315 |
+
elif pretrained_model_path.is_file() and str(pretrained_model_path).endswith(
|
| 316 |
+
".safetensors"
|
| 317 |
+
):
|
| 318 |
+
comfy_single_file_state_dict = {}
|
| 319 |
+
with safe_open(pretrained_model_path, framework="pt", device="cpu") as f:
|
| 320 |
+
metadata = f.metadata()
|
| 321 |
+
for k in f.keys():
|
| 322 |
+
comfy_single_file_state_dict[k] = f.get_tensor(k)
|
| 323 |
+
configs = json.loads(metadata["config"])
|
| 324 |
+
transformer_config = configs["transformer"]
|
| 325 |
+
with torch.device("meta"):
|
| 326 |
+
transformer = Transformer3DModel.from_config(transformer_config)
|
| 327 |
+
transformer.load_state_dict(comfy_single_file_state_dict, assign=True)
|
| 328 |
+
return transformer
|
| 329 |
+
|
| 330 |
+
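Because the model registers its constructor arguments through `@register_to_config`, it can also be built directly from a plain config dict via `from_config` (inherited from diffusers' `ConfigMixin`), which is what `from_pretrained` above does internally before loading weights. The sketch below shows that path with small, illustrative values; this is not an official LTX-Video configuration.

```python
# Sketch: instantiating the transformer from a config dict via from_config.
# The values below are illustrative assumptions, not an official LTX-Video configuration.
from ltx_video.models.transformers.transformer3d import Transformer3DModel

config = {
    "num_attention_heads": 4,
    "attention_head_dim": 32,
    "in_channels": 128,
    "out_channels": 128,
    "num_layers": 2,
    "cross_attention_dim": 2048,
    "caption_channels": 2048,
    "positional_embedding_type": "rope",
    "positional_embedding_theta": 10000.0,
    "positional_embedding_max_pos": [20, 2048, 2048],
}
model = Transformer3DModel.from_config(config)
print(sum(p.numel() for p in model.parameters()))
```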
    def forward(
        self,
        hidden_states: torch.Tensor,
        indices_grid: torch.Tensor,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        timestep: Optional[torch.LongTensor] = None,
        class_labels: Optional[torch.LongTensor] = None,
        cross_attention_kwargs: Dict[str, Any] = None,
        attention_mask: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        skip_layer_mask: Optional[torch.Tensor] = None,
        skip_layer_strategy: Optional[SkipLayerStrategy] = None,
        return_dict: bool = True,
    ):
        """
        The [`Transformer2DModel`] forward method.

        Args:
            hidden_states (`torch.LongTensor` of shape `(batch size, num latent pixels)` if discrete, `torch.FloatTensor` of shape `(batch size, channel, height, width)` if continuous):
                Input `hidden_states`.
            indices_grid (`torch.LongTensor` of shape `(batch size, 3, num latent pixels)`):
            encoder_hidden_states ( `torch.FloatTensor` of shape `(batch size, sequence len, embed dims)`, *optional*):
                Conditional embeddings for cross attention layer. If not given, cross-attention defaults to
                self-attention.
            timestep ( `torch.LongTensor`, *optional*):
                Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`.
            class_labels ( `torch.LongTensor` of shape `(batch size, num classes)`, *optional*):
                Used to indicate class labels conditioning. Optional class labels to be applied as an embedding in
                `AdaLayerZeroNorm`.
            cross_attention_kwargs ( `Dict[str, Any]`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            attention_mask ( `torch.Tensor`, *optional*):
                An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
                is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
                negative values to the attention scores corresponding to "discard" tokens.
            encoder_attention_mask ( `torch.Tensor`, *optional*):
                Cross-attention mask applied to `encoder_hidden_states`. Two formats supported:

                    * Mask `(batch, sequence_length)` True = keep, False = discard.
                    * Bias `(batch, 1, sequence_length)` 0 = keep, -10000 = discard.

                If `ndim == 2`: will be interpreted as a mask, then converted into a bias consistent with the format
                above. This bias will be added to the cross-attention scores.
            skip_layer_mask ( `torch.Tensor`, *optional*):
                A mask of shape `(num_layers, batch)` that indicates which layers to skip. `0` at position
                `layer, batch_idx` indicates that the layer should be skipped for the corresponding batch index.
            skip_layer_strategy ( `SkipLayerStrategy`, *optional*, defaults to `None`):
                Controls which layers are skipped when calculating a perturbed latent for spatiotemporal guidance.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] instead of a plain
                tuple.

        Returns:
            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
            `tuple` where the first element is the sample tensor.
        """
        # for tpu attention offload 2d token masks are used. No need to transform.
        if not self.use_tpu_flash_attention:
            # ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
            # we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
            # we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
            # expects mask of shape:
            #   [batch, key_tokens]
            # adds singleton query_tokens dimension:
            #   [batch, 1, key_tokens]
            # this helps to broadcast it as a bias over attention scores, which will be in one of the following shapes:
            #   [batch, heads, query_tokens, key_tokens] (e.g. torch sdp attn)
            #   [batch * heads, query_tokens, key_tokens] (e.g. xformers or classic attn)
            if attention_mask is not None and attention_mask.ndim == 2:
                # assume that mask is expressed as:
                #   (1 = keep, 0 = discard)
                # convert mask into a bias that can be added to attention scores:
                #   (keep = +0, discard = -10000.0)
                attention_mask = (1 - attention_mask.to(hidden_states.dtype)) * -10000.0
                attention_mask = attention_mask.unsqueeze(1)

            # convert encoder_attention_mask to a bias the same way we do for attention_mask
            if encoder_attention_mask is not None and encoder_attention_mask.ndim == 2:
                encoder_attention_mask = (
                    1 - encoder_attention_mask.to(hidden_states.dtype)
                ) * -10000.0
                encoder_attention_mask = encoder_attention_mask.unsqueeze(1)

        # 1. Input
        hidden_states = self.patchify_proj(hidden_states)

        if self.timestep_scale_multiplier:
            timestep = self.timestep_scale_multiplier * timestep

        freqs_cis = self.precompute_freqs_cis(indices_grid)

        batch_size = hidden_states.shape[0]
        timestep, embedded_timestep = self.adaln_single(
            timestep.flatten(),
            {"resolution": None, "aspect_ratio": None},
            batch_size=batch_size,
            hidden_dtype=hidden_states.dtype,
        )
        # Second dimension is 1 or number of tokens (if timestep_per_token)
        timestep = timestep.view(batch_size, -1, timestep.shape[-1])
        embedded_timestep = embedded_timestep.view(
            batch_size, -1, embedded_timestep.shape[-1]
        )

        # 2. Blocks
        if self.caption_projection is not None:
            batch_size = hidden_states.shape[0]
            encoder_hidden_states = self.caption_projection(encoder_hidden_states)
            encoder_hidden_states = encoder_hidden_states.view(
                batch_size, -1, hidden_states.shape[-1]
            )

        for block_idx, block in enumerate(self.transformer_blocks):
            if self.training and self.gradient_checkpointing:

                def create_custom_forward(module, return_dict=None):
                    def custom_forward(*inputs):
                        if return_dict is not None:
                            return module(*inputs, return_dict=return_dict)
                        else:
                            return module(*inputs)

                    return custom_forward

                ckpt_kwargs: Dict[str, Any] = (
                    {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
                )
                hidden_states = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(block),
                    hidden_states,
                    freqs_cis,
                    attention_mask,
                    encoder_hidden_states,
                    encoder_attention_mask,
                    timestep,
                    cross_attention_kwargs,
                    class_labels,
                    (
                        skip_layer_mask[block_idx]
                        if skip_layer_mask is not None
                        else None
                    ),
                    skip_layer_strategy,
                    **ckpt_kwargs,
                )
            else:
                hidden_states = block(
                    hidden_states,
                    freqs_cis=freqs_cis,
                    attention_mask=attention_mask,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=encoder_attention_mask,
                    timestep=timestep,
                    cross_attention_kwargs=cross_attention_kwargs,
                    class_labels=class_labels,
                    skip_layer_mask=(
                        skip_layer_mask[block_idx]
                        if skip_layer_mask is not None
                        else None
                    ),
                    skip_layer_strategy=skip_layer_strategy,
                )

        # 3. Output
        scale_shift_values = (
            self.scale_shift_table[None, None] + embedded_timestep[:, :, None]
+
)
|
| 499 |
+
shift, scale = scale_shift_values[:, :, 0], scale_shift_values[:, :, 1]
|
| 500 |
+
hidden_states = self.norm_out(hidden_states)
|
| 501 |
+
# Modulation
|
| 502 |
+
hidden_states = hidden_states * (1 + scale) + shift
|
| 503 |
+
hidden_states = self.proj_out(hidden_states)
|
| 504 |
+
if not return_dict:
|
| 505 |
+
return (hidden_states,)
|
| 506 |
+
|
| 507 |
+
return Transformer3DModelOutput(sample=hidden_states)
|
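The forward pass above turns a 2-D keep/discard mask into an additive attention bias before the transformer blocks run. The following standalone sketch only illustrates that conversion outside the model; the tensor values are made up for the example and are not part of this commit.

```python
import torch

# Illustrative keep/discard mask: batch of 2 sequences, 4 key tokens each
# (1 = keep, 0 = discard), matching the docstring's 2-D mask format.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 0, 0, 0]])

# Same conversion the forward pass applies when attention_mask.ndim == 2:
# kept tokens become 0.0, discarded tokens become -10000.0, and a singleton
# query_tokens dimension is added so the bias broadcasts over
# [batch, heads, query_tokens, key_tokens] attention scores.
bias = (1 - attention_mask.to(torch.float32)) * -10000.0
bias = bias.unsqueeze(1)

print(bias.shape)  # torch.Size([2, 1, 4])
```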
ltx_video/pipelines/__init__.py
ADDED
File without changes
ltx_video/pipelines/crf_compressor.py
ADDED
@@ -0,0 +1,50 @@
import av
import torch
import io
import numpy as np


def _encode_single_frame(output_file, image_array: np.ndarray, crf):
    container = av.open(output_file, "w", format="mp4")
    try:
        stream = container.add_stream(
            "libx264", rate=1, options={"crf": str(crf), "preset": "veryfast"}
        )
        stream.height = image_array.shape[0]
        stream.width = image_array.shape[1]
        av_frame = av.VideoFrame.from_ndarray(image_array, format="rgb24").reformat(
            format="yuv420p"
        )
        container.mux(stream.encode(av_frame))
        container.mux(stream.encode())
    finally:
        container.close()


def _decode_single_frame(video_file):
    container = av.open(video_file)
    try:
        stream = next(s for s in container.streams if s.type == "video")
        frame = next(container.decode(stream))
    finally:
        container.close()
    return frame.to_ndarray(format="rgb24")


def compress(image: torch.Tensor, crf=29):
    if crf == 0:
        return image

    image_array = (
        (image[: (image.shape[0] // 2) * 2, : (image.shape[1] // 2) * 2] * 255.0)
        .byte()
        .cpu()
        .numpy()
    )
    with io.BytesIO() as output_file:
        _encode_single_frame(output_file, image_array, crf)
        video_bytes = output_file.getvalue()
    with io.BytesIO(video_bytes) as video_file:
        image_array = _decode_single_frame(video_file)
    tensor = torch.tensor(image_array, dtype=image.dtype, device=image.device) / 255.0
    return tensor
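A brief usage sketch for the compressor added above: `compress` takes an H×W×3 float tensor in [0, 1], round-trips it through a single-frame H.264 encode/decode at the requested CRF, and returns a tensor with the same dtype and device (height and width are first cropped to even values). The input below is synthetic, purely for illustration.

```python
import torch
from ltx_video.pipelines.crf_compressor import compress

# Synthetic RGB frame: 256x256, float values in [0, 1].
frame = torch.rand(256, 256, 3)

# Round-trip through libx264 at CRF 29 to inject codec artifacts.
degraded = compress(frame, crf=29)
print(degraded.shape, degraded.dtype)  # torch.Size([256, 256, 3]) torch.float32

# crf=0 skips the round-trip and returns the input unchanged.
assert compress(frame, crf=0) is frame
```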