All skills
hf_tasks.ml
auto-discovered
0 agents
Image-Text-to-Text
hf_tasks.image_text_to_text
Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.