Mistral likely does “prompt enhancement,” aka feeding your prompt to an LLM first and asking it to expand it with more words.
So internally, a Mistral text LLM is probably writing out “Sure! Here’s a long prompt with no dog: …” and then that part is fed to the image generator.
Other “LLMs” are truly multimodal and generate image output, hence they still get the word “dog” in the input.
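
To make the difference concrete, here is a toy sketch of the two paths. It is only a guess at the general shape, not Mistral's actual pipeline: `toy_rewriter` and `toy_image_model` are hypothetical stand-ins (a regex instead of a real text LLM, and a stub that just reports what it was conditioned on instead of a real image generator).

```python
import re


def toy_rewriter(user_prompt: str) -> str:
    """Hypothetical stand-in for the text LLM doing prompt enhancement.

    Drops negated phrases such as "with no dog" and pads the rest with
    extra descriptive words, so the negated noun never reaches the
    image model.
    """
    cleaned = re.sub(
        r"\b(?:with no|without(?: an?)?|no)\s+\w+",
        "",
        user_prompt,
        flags=re.IGNORECASE,
    )
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" ,.")
    return f"{cleaned}, highly detailed, soft natural lighting"


def toy_image_model(prompt: str) -> dict:
    """Hypothetical stand-in for the image generator.

    Returns the text it was conditioned on instead of an image.
    """
    return {"conditioned_on": prompt}


def enhanced_pipeline(user_prompt: str) -> dict:
    # Two stages: the text LLM rewrites the prompt, then only the rewrite
    # is passed to the image generator.
    return toy_image_model(toy_rewriter(user_prompt))


def direct_pipeline(user_prompt: str) -> dict:
    # Single stage: the model that produces the image sees the raw prompt,
    # including the word "dog".
    return toy_image_model(user_prompt)


if __name__ == "__main__":
    prompt = "a park with no dog"
    print(enhanced_pipeline(prompt))
    print(direct_pipeline(prompt))
```

In the first path the image model never sees “dog” at all; in the second it does, which is the behaviour described above.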