martinus

I usually use the highest quantization level that still fits into my GPU's RAM, but I don't go above Q6_K because I don't notice a difference beyond that.
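
A rough way to sanity-check that "whatever still fits" rule, sketched in Python. The bits-per-weight figures are approximate rules of thumb for llama.cpp K-quants (not exact values for any particular GGUF), and the headroom constant is a guess; KV cache grows with context, so leave more room for long contexts.

```python
# Back-of-envelope check: does a given quant of a model fit in VRAM?
# Bits-per-weight values are rough llama.cpp K-quant estimates, not exact.

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def weights_gib(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / 2**30

if __name__ == "__main__":
    vram_gib = 12     # adjust to your card
    params_b = 14     # e.g. a 14B model like Phi3 Medium
    headroom = 1.5    # guessed allowance for KV cache and runtime overhead
    for quant in BITS_PER_WEIGHT:
        size = weights_gib(params_b, quant)
        verdict = "fits" if size + headroom < vram_gib else "too big"
        print(f"{quant:7s} ~{size:5.1f} GiB  -> {verdict}")
```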


StopwatchGod

Whatever is the max that'll fit into your GPU memory is good enough. That being said, I chose the q4 quant over q6 so that instead of having to type `ollama run phi3:14b-medium-4k-instruct-q6_K` every time I want to run the model, I can just run `ollama run phi3:medium`. And from my tests and understanding, the difference in quality is negligible unless you're generating code.


happybydefault

In that case, I recommend copying the bigger model:

`ollama pull phi3:14b-medium-4k-instruct-q6_K`
`ollama cp phi3:14b-medium-4k-instruct-q6_K phi3`

And then, when you want to run it, you just do:

`ollama run phi3`


uti24

Hey! By the way, I tried to run the Phi3 Medium GGUF using text-generation-webui and I get an error when loading the model: `shared.tokenizer = load_model(selected_model, loader)`. Doesn't text-generation-webui support Phi3 Medium yet?


Flopsinator

They haven't updated to the latest version of llama.cpp yet, so no.


uti24

Well, good to know. Every time a new model comes out I have to wonder: is it my version of text-generation-webui, a setting, or a broken file? I had to wait a month before I could run c4ai-command-r-v01, just because its default context was set to something crazy, like 100k, and that was the default option in text-generation-webui.


Primary-Ad2848

How can we say anything without knowing your system?


IndicationUnfair7961

12GB? If you go with Q6_K for Llama 3, then you probably can't go above Q4/Q5 on Phi3 Medium with full offload.
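
For concreteness, the weights-only arithmetic (rough llama.cpp bits-per-weight figures, ignoring KV cache and runtime overhead):

```python
# Approximate weight sizes in GB (decimal); bits-per-weight values are rough estimates.
for name, params_b, bpw in [("Llama 3 8B  Q6_K", 8, 6.6),
                            ("Phi3 Medium Q6_K", 14, 6.6),
                            ("Phi3 Medium Q5_K_M", 14, 5.7),
                            ("Phi3 Medium Q4_K_M", 14, 4.85)]:
    print(f"{name:20s} ~{params_b * bpw / 8:4.1f} GB")
```

So an 8B model at Q6_K is around 6.6 GB, while a 14B model at Q6_K is around 11.6 GB for the weights alone, which is why Q4/Q5 is the practical ceiling for full offload on a 12GB card.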