Chat with Phi 3.5 Vision

Overview

Phi-3.5-vision is Microsoft's lightweight state-of-the-art open multimodal model, capable of multi-frame image understanding, image comparison, and video summarization.

This project wraps it in a production-ready stack:

LitServe for fast, scalable inference serving
Streamlit for an interactive chat UI
Flash Attention for optimized GPU throughput

Stack

microsoft/Phi-3.5-vision-instruct via HuggingFace Transformers
LitServe inference server
Streamlit chat interface

Get Started

pip install -r requirements.txt
python server.py        # start LitServe API
streamlit run app.py   # launch UI