FastVLM-0.5B Video Analysis and Captioning
This Colab notebook demonstrates how to use the Apple FastVLM-0.5B model from Hugging Face (apple/FastVLM-0.5B) to perform video analysis and generate captions for video frames.
The notebook covers the following steps:
- Model Loading: Loading the FastVLM-0.5B model and its processor using the Hugging Face transformers library (a minimal loading-and-captioning sketch follows this list).
- Image Captioning: Testing the model on sample images.
- Video Processing: Reading a video file (specifically /content/drive/MyDrive/VLMs/vlm_warehouse.mp4 in this case) and extracting frames.
- Inference on Video Frames: Running the FastVLM model on selected video frames to generate descriptions.
- Caption Overlay and Video Generation: Creating a new video file where the original video frames are displayed with the generated captions overlaid or stacked below. The captions update based on the inference performed on key frames.
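The loading and single-frame captioning steps can be sketched roughly as below. This follows the usage pattern published on the apple/FastVLM-0.5B model card (AutoTokenizer and AutoModelForCausalLM with trust_remote_code=True, plus splicing an image-token id into the prompt); the image path, prompt text, and IMAGE_TOKEN_INDEX handling are assumptions taken from that pattern, so treat this as a sketch and defer to the notebook cells and the model card for the authoritative version.

```python
# Minimal sketch: load FastVLM-0.5B and caption one image/frame.
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200  # placeholder id the model's remote code expects (assumption from the model card)

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Render a chat prompt containing the <image> placeholder as plain text.
messages = [{"role": "user", "content": "<image>\nDescribe this image in detail."}]
rendered = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
pre, post = rendered.split("<image>", 1)

# Tokenize the text around the placeholder and splice in the image token id.
pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
attention_mask = torch.ones_like(input_ids)

# Preprocess the image with the model's own vision-tower image processor.
image = Image.open("sample.jpg").convert("RGB")  # hypothetical test image
pixel_values = model.get_vision_tower().image_processor(
    images=image, return_tensors="pt"
)["pixel_values"].to(model.device, dtype=model.dtype)

with torch.no_grad():
    output_ids = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        images=pixel_values,
        max_new_tokens=128,
    )
print(tok.decode(output_ids[0], skip_special_tokens=True))
```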
Usage
You can open this notebook directly in Google Colab by clicking the "Open in Colab" badge on the repository page.
To run the video analysis section, make sure you have a video file available in your Google Drive at the path specified in the notebook (currently set to /content/drive/MyDrive/VLMs/vlm_warehouse.mp4).
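If you are adapting the notebook to your own footage, the Drive mount and frame-sampling step can be approximated as follows. The sampling interval, variable names, and print statement are illustrative, not the notebook's exact code; only the video path matches the one the notebook expects.

```python
# Mount Google Drive in Colab and sample frames from the video with OpenCV.
import cv2
from google.colab import drive

drive.mount("/content/drive")
VIDEO_PATH = "/content/drive/MyDrive/VLMs/vlm_warehouse.mp4"

cap = cv2.VideoCapture(VIDEO_PATH)
if not cap.isOpened():
    raise FileNotFoundError(f"Could not open {VIDEO_PATH}")

fps = cap.get(cv2.CAP_PROP_FPS)
SAMPLE_EVERY_N = 30  # hypothetical interval: roughly one key frame per second at 30 fps
frames = []
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % SAMPLE_EVERY_N == 0:
        # OpenCV reads BGR; convert to RGB before handing frames to the model.
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    idx += 1
cap.release()
print(f"Sampled {len(frames)} key frames at {fps:.1f} fps")
```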
Model Details
- Model ID: apple/FastVLM-0.5B
- Model Type: Vision-Language Model
- Library: Hugging Face transformers
Datasets Used
- Conceptual Captions (used for initial model testing)
- Custom video file (vlm_warehouse.mp4 from Google Drive)
Example Output
A stacked video in which each original frame is shown with its generated caption in a band beneath it; the caption updates whenever a new key frame is run through the model. A sketch of this stacking step follows.
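The sketch below shows one way to produce such a stacked output with OpenCV. It assumes a hypothetical captions dict mapping sampled key-frame indices to the generated text, plus the VIDEO_PATH from the frame-extraction step; the band height, font, and output filename are illustrative choices, not the notebook's exact settings.

```python
# Sketch: write a new video with a caption band stacked below every frame.
import cv2
import numpy as np

def stack_caption(frame_bgr, text, band_height=60):
    """Return the frame with a black band below it containing the caption."""
    h, w = frame_bgr.shape[:2]
    band = np.zeros((band_height, w, 3), dtype=np.uint8)
    cv2.putText(band, text[:120], (10, band_height // 2 + 8),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1, cv2.LINE_AA)
    return np.vstack([frame_bgr, band])

cap = cv2.VideoCapture(VIDEO_PATH)
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("captioned.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                      fps, (w, h + 60))

current_caption = ""
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # The caption only changes on key frames that were sent to the model.
    current_caption = captions.get(idx, current_caption)
    out.write(stack_caption(frame, current_caption))
    idx += 1
cap.release()
out.release()
```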
Acknowledgements
- The developers of the FastVLM-0.5B model.
- The Hugging Face team for the transformers and huggingface_hub libraries.
- Google Colab for providing the environment.
Feel free to explore and adapt this notebook for your own video analysis tasks!