Whatever is the max that'll fit into your GPU memory is good enough.
That being said, I chose the q4 quant over q6 so that instead of having to type `ollama run phi3:14b-medium-4k-instruct-q6_K` every time I want to run the model, I can just run `ollama run phi3:medium`. And from my tests and understanding, the difference in quality is negligible, unless you're generating code.
In that case, I recommend copying the bigger model:
ollama pull phi3:14b-medium-4k-instruct-q6_K
ollama cp phi3:14b-medium-4k-instruct-q6_K phi3
# And then, when you want to run it, you do this:
ollama run phi3
Hey! By the way, I tried to run Phi3 Medium GGUF using text-generation-webui and I get an error when loading the model:
shared.tokenizer = load_model(selected_model, loader)
Doesn't text-generation-webui support Phi3 Medium yet?
Well, good to know.
Because every time a new model comes out I have to wonder: is it my version of text-generation-webui, or a setting, or a broken file?
I had to wait a month before I could run c4ai-command-r-v01, just because its default context was set to something crazy, like 100k, and that was the default option in text-generation-webui.
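For context on why a 100k default context is so punishing: the fp16 KV cache grows linearly with context length. A rough sketch, using illustrative layer/head counts rather than command-r's actual config:

```shell
# Rough fp16 KV-cache size: 2 (K+V) * layers * kv_heads * head_dim * tokens * 2 bytes.
# The layer/head numbers below are illustrative, not any specific model's config.
awk 'BEGIN {
  layers = 40; kv_heads = 8; head_dim = 128; bytes = 2
  for (ctx = 4096; ctx <= 102400; ctx *= 5) {
    gb = 2 * layers * kv_heads * head_dim * ctx * bytes / 1e9
    printf "context %6d tokens: KV cache ~%.1f GB\n", ctx, gb
  }
}'
```

With these assumed dimensions, the cache goes from well under 1 GB at 4k context to roughly 16–17 GB at 100k, which on its own can exceed a consumer GPU before any weights are loaded.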
I usually use the quant that still fits into my GPU's VRAM, but I don't go above Q6_K, because past that I don't notice a difference.
They haven't implemented the latest version of llama.cpp yet, so no.
How can we say anything without knowing your system?
12GB? If you go for Q6_K on Llama 3, then you probably can't go above Q4/Q5 on Phi3 Medium with full offload.
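As a rough sanity check, you can estimate weight size from approximate llama.cpp bits-per-weight figures (the exact bpw varies slightly with the quant mix, so treat these as ballpark numbers):

```shell
# Ballpark weight size for a 14B model at common llama.cpp quants.
# bpw values are approximate; KV cache and runtime overhead are not included.
awk 'BEGIN {
  bpw["Q4_K_M"] = 4.85; bpw["Q5_K_M"] = 5.69; bpw["Q6_K"] = 6.56
  for (q in bpw)
    printf "14B at %s: ~%.1f GB\n", q, 14 * bpw[q] / 8
}'
```

Under these assumptions a 14B Q6_K is around 11.5 GB of weights alone, so on a 12GB card there is effectively no room left for the KV cache, while Q4/Q5 (~8.5–10 GB) leaves some headroom.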