visualdata

If you are just trying to understand transformers by building, I would start with Andrej Karpathy's Let's build GPT: [https://www.youtube.com/watch?v=kCc8FmEb1nY](https://www.youtube.com/watch?v=kCc8FmEb1nY)
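
The core piece that video builds up to is causal self-attention; here is a from-memory sketch in the same style (PyTorch, not a verbatim excerpt from the video):

```python
import torch
import torch.nn.functional as F

# Sketch of single-head causal self-attention, roughly what the video
# builds toward. x: (batch, time, channels); key/query/value are
# nn.Linear(C, head_size, bias=False) layers supplied by the caller.
def causal_self_attention(x, key, query, value):
    B, T, C = x.shape
    k, q, v = key(x), query(x), value(x)                  # (B, T, head_size)
    # scaled dot-product affinities between all pairs of positions
    wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
    # mask the future so position t only attends to positions <= t
    tril = torch.tril(torch.ones(T, T, device=x.device))
    wei = wei.masked_fill(tril == 0, float("-inf"))
    wei = F.softmax(wei, dim=-1)
    return wei @ v                                        # (B, T, head_size)
```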


CygnusX1

This is an incredible series, even if you don't have any plans to follow along.


antoine-ross

Can vouch for this. I believe all of Dr Andrej's tutorials are really intuitive and relatively easy to follow along with. I learned a lot from watching all of them.


redditfov

Thanks!


Tacx79

Around a year ago (very shortly before pygmalion-6b and c.ai started getting really popular) I wrote a simple GPT from scratch with 100-600M params. As usual, I wrote the dataloader so it wouldn't just feed the data into the model randomly. I had ~5GB of text (not sure if that was compressed or after tokenizing). The model started to form somewhat logical but still very stupid short sentences after 100k-300k steps (maybe 30k-100k with another architecture), and I calculated it would take 200 years on my PC to do just 1 epoch over that 5GB of text. All the models I trained were useless, but I learned a lot of useful stuff about the 'text' part of AI. It was fun after all.
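
A minimal sketch of what that kind of sequential (non-random) dataloader can look like, assuming a pre-tokenized corpus held in one tensor; the class and names are illustrative, not the commenter's actual code:

```python
import torch

# Sketch: walk a tokenized corpus in order rather than sampling
# windows at random (names and structure are illustrative).
class SequentialTextLoader:
    def __init__(self, tokens: torch.Tensor, block_size: int, batch_size: int):
        self.tokens = tokens          # 1-D LongTensor of token ids
        self.block_size = block_size  # context length
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        B, T = self.batch_size, self.block_size
        span = B * T + 1              # +1 so targets can be shifted by one
        if self.pos + span > len(self.tokens):
            self.pos = 0              # wrap around at the end of the corpus
        chunk = self.tokens[self.pos : self.pos + span]
        x = chunk[:-1].view(B, T)     # inputs
        y = chunk[1:].view(B, T)      # next-token targets
        self.pos += B * T
        return x, y
```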


timschwartz

Were you training with a GPU or on your CPU?


KvAk_AKPlaysYT

I'm currently in the process of doing so by watching this video; keep in mind that I'm just doing it for the experience. https://youtu.be/UU1WVnMk4E8?si=EAWK-cTAOJQe7Z6W


[deleted]

Would love to hear your experiences after you're done.


[deleted]

Not an LLM (that's too expensive), but I have trained a transformer that outputs random "Florida man" meme news titles lol. I used Colab to train with PyTorch and wrote the entire transformer from scratch. Since it was the free version of Colab, after the training I was banned from using a GPU for about a month.


[deleted]

That's pretty funny. Good ol' florida man.


Wonderful-Camp2553

"Florida man melts GPUs in Google's data center, gets banned"


[deleted]

LMFAO.


stddealer

I've trained very small (a few thousand parameters) LMs based on HMMs, able to generate gibberish that might look like English to non-English speakers, but their actual working use case is to determine whether some text is English or not. I did the same thing for French and German.
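
The same idea can be sketched with plain character bigram models (a degenerate case of an HMM); this is an assumed implementation, not the commenter's, and the corpus filenames are placeholders:

```python
import math
from collections import Counter

# Sketch: one character-level bigram model per language; classify text
# by which model assigns it the highest log-likelihood.
def train_bigram(text):
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    # P(b|a) with add-one smoothing over an assumed ~64-symbol alphabet
    return lambda a, b: (pairs[(a, b)] + 1) / (unigrams[a] + 64)

def log_likelihood(model, text):
    return sum(math.log(model(a, b)) for a, b in zip(text, text[1:]))

# placeholder corpora, one plain-text file per language
models = {
    "en": train_bigram(open("english_corpus.txt").read().lower()),
    "fr": train_bigram(open("french_corpus.txt").read().lower()),
    "de": train_bigram(open("german_corpus.txt").read().lower()),
}

def detect(text):
    text = text.lower()
    return max(models, key=lambda lang: log_likelihood(models[lang], text))
```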


[deleted]

That's a cool project!


[deleted]

[removed]


[deleted]

I have 90k in Google Cloud Credits. I will give them to anyone that wants to try to train their own model.


[deleted]

They run out in February: first come first serve!


Key-Morning-4712

I hope we can make it a unified effort by this sub and train one model that's actually competitive with other 7B models. That would be cool.


[deleted]

We have a lot of brain power in this sub to do such a thing. I've got the credits if we want to collab.


Key-Morning-4712

Let's do it. It would be great if you could create a new GitHub org and a new Reddit post inviting everyone in this sub. Thanks for doing this, btw.


[deleted]

We have a few folks who signed up for credits here: https://join.slack.com/t/halyai/shared_invite/zt-23euqlj0i-kM68jyXT_o__cx_1DkLYpA Join the #gcp channel. We will divvy up the credits with whoever joins by end of day. Update: we have too many people. Join and you can be on the waitlist.


[deleted]

Another update. We made the good people who got in board members (10 people so far), who vote on funding new projects with the Google credits. It's like a communist VC firm. You can pitch your ideas and projects; higher chance of getting approved if you solve a real societal problem. I'll work with Google to get more credits for this communist endeavor. I'm not on the board, so I have no say in what gets funded.


Blonkist

Is this still going on? I would be curious to hop in as an observer.


[deleted]

No, this program has ended.


waxbolt

How many FLOPs is that equivalent to?


[deleted]

No idea


Caffeine_Monster

A fair bit. A smidge under 10k A100 hours, or 1/20th of a Llama-2 7B. Probably better off doing some ambitious finetuning rather than undertraining a small model from scratch.
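
The arithmetic behind that estimate, assuming the roughly $9 all-in rate per A100-hour that it implies, and the 184,320 A100-hours reported for Llama-2 7B further down the thread:

```latex
\frac{\$90{,}000}{\$9/\text{A100-hr}} \approx 10{,}000\ \text{A100-hrs},
\qquad
\frac{10{,}000}{184{,}320} \approx \frac{1}{18},\ \text{i.e. roughly } \tfrac{1}{20}\ \text{of a Llama-2 7B run}
```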


Smallpaul

I'm curious why you would use your GPU time on this rather than on doing something new.


[deleted]

The research project is understanding long term memory for LLMs. https://docs.google.com/document/d/1MY-GSRDR3wt9bIBikUZLyJ1USDWVTr7zcIvDvDAhWQI/edit?usp=drivesdk


Smallpaul

There is no need at all to train an LLM from scratch to execute on that plan and I’m completely confused about why you would want to give away the 90k to someone who wants to.


[deleted]

I'm porting off google cloud so might as well let someone have fun. No skin off my back.


Smallpaul

Why wouldn't you use the tokens to actually explore/deliver the project you linked?


[deleted]

By the time we hear back if the grant was approved the credits are gone.


Smallpaul

So the grant really has nothing to do with the tokens and you are just confusing things by referencing it when I asked you why you want to train an LLM from scratch. And we are back to the original question of why DO you want to train an LLM from scratch?


BackgroundAmoebaNine

/u/Smallpaul, is there a reason you're going so hard on OP right now? Would you rather see them executed than to share 90K cloud credits that they do not have use for and are expiring in February?


[deleted]

Sorry for the confusion. I read your comment wrong. I was just showing that we are trying to get deep understanding about context and context windows.


[deleted]

I see you meant tokens as credits, I thought you meant tokens in LLM context.


Smallpaul

Sorry. Jumping between threads and mixing up my terminology.


[deleted]

All good. I bet you don't get confused as often as I do 😂


johnkapolos

PM'd you :)


mgranin

sent a PM to you


LoadingALIAS

I’m interested. Check your DMs.


[deleted]

😬😬😬


Extraltodeus

Total cumulative A100 hours for all the Llama-2 models was around 3 million, IIRC.


sexybokononist

Training this on just one A100 would take 342 years. If they started training in 1681 they’d be finishing up this year.


Gov_CockPic

How many guys on stationary bikes would it take to produce the electricity needed for the compute of 1 hour of A100 compute training?


m18coppola

I trained a language model on a single copy of the King James Bible. It's hilariously incoherent but surprisingly structured.


Dyonizius

interesting!! some historians believe the bible was written by psyop agents


the_ham_man4

Historians = some uneducated Reddit users who believe anything on YouTube 


Evening_Ad6637

This is my experience from June this year with llama.cpp -> train-from-scratch: https://www.reddit.com/r/LocalLLaMA/comments/14dstqm/tutorial_train_your_own_llamacpp_miniggmlmodel/


Fun_Tangerine_1086

Yes, working on 2k- and 4k-context versions of gpt2-medium and gpt2-large(ish) sized models.

- With care, you can actually do useful work on a 12 GB GPU (RTX 3060 12GB here)
- Using 4- and 8-bit optimizers, and other non-AdamW optimizers, with batch_size=1 (you can actually do gpt2-medium w/ 4k context, w/ 4-bit AdamW, on a 12GB GPU... with 262 MB of VRAM to spare; see the sketch below)
- Datasets: using subsets (10%-30%) of SlimPajama and some openwebtext; also longer-context material (radio transcripts, books, transcripts of old PDF reports). Switching subsets of SlimPajama occasionally does seem to work!
- It's pretty easy for training to get "stuck" or for the loss to explode; keep frequent checkpoints, and be willing to stop and resume training w/ different learning rates. (In ye olde ML days, cyclic learning rates were in vogue; practically, I've been doing that w/ the stops/starts w/ different rates)
- Check your work occasionally vs. some of the open evaluations (ex: hellaswag). It can save a lot of effort when, say, you mistokenize both the training and validation datasets... also sample some output from your models regularly!
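
A minimal sketch of the low-bit optimizer swap mentioned in the second bullet, using bitsandbytes' 8-bit AdamW (the 4-bit case depends on the library; the model and learning rate here are placeholders):

```python
import bitsandbytes as bnb
from transformers import GPT2LMHeadModel

# Sketch: swap AdamW for an 8-bit variant to shrink optimizer-state VRAM.
# Full AdamW keeps two fp32 moment tensors per parameter, so quantizing
# that state is a large part of what makes this fit on a 12GB card.
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").cuda()
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4)
# ...then a standard loop: loss.backward(); optimizer.step(); optimizer.zero_grad()
```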


Gov_CockPic

What's your power utility bill been like since you started?


SlowSmarts

I trained a small gpt2 model about a year ago and it was just gibberish. I then started training a model from llama.cpp when I first saw it was possible, about half a year ago. This has been more successful, and it has recently learned to stop itself.

The llama model takes ~750GB of RAM to train. I've been training it on and off, whenever I have CPU time not being used up by other projects. I've tried various methods of CPU clustering, but nothing so far has performed well enough to persist with. I've also tried other training acceleration methods like cuBLAS, but my K80 GPUs are now old enough that getting them working without crashing becomes a Python library nightmare.

So, the llama model has mostly been trained on an average of 80 CPU threads, using most of the 768GB of system RAM, for about 3 months combined... and it just now learned to stop itself, occasionally.


masc98

I've trained a good old GPT2 model on some WhatsApp conversations, a simple dumb project that I honestly suggest to you as well. It's simple to collect the data and you'll get some good laughs, guaranteed. Jokes aside, the important thing you soon realise is that CLM pretraining is SO important if you need good zero-shot performance and common world knowledge in your model. If your model is meant for a narrower context, I'd suggest a lightweight pretraining on domain knowledge and then finetuning on instructions. Lately I've used the [xLLM](https://github.com/BobaZooba/xllm) library, a pretty neat experience.
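
For anyone unfamiliar, CLM (causal language modeling) pretraining just means next-token prediction; a minimal sketch of the objective, with hypothetical tensor names:

```python
import torch
import torch.nn.functional as F

# Sketch of the causal LM objective: predict token t+1 from tokens <= t.
# logits:    (batch, seq_len, vocab) from any decoder-only model
# input_ids: (batch, seq_len) token ids of the training text
def clm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the next tokens 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```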


Imaginary_Bench_7294

Unfortunately, this requires a lot of time and effort. You need to create a dataset in the format you want the model to work with. If you want a good dataset, this entails curating it, reading through each entry for spelling or grammatical errors. That in itself takes a lot of work. If you use datasets that have been provided free of charge, you should still check the data for accuracy and appropriate content.

Then comes the compute expense. LoRA training is based on already-trained models, so I don't know exactly how it compares in some aspects. However, for proper training from scratch, you need to use the full-sized models, which is hardware-prohibitive depending on the size of the model. Of course, while small models are convenient for testing and lower hardware requirements, larger models generalize better, since they can develop more intricate relationships between words and concepts.

There is also a fine balance between overfitting the model and the desired results. Overfit the data, and you're likely to have it spit out exact copies of the input texts. Undertrain the model, and it might string together unrelated things. One of the easier, but costlier, ways to handle this is by increasing the epochs (how many times the data is fed in) while decreasing the amount the training alters the relationships per epoch, aka the learning rate. Making the model learn slower, and thus allowing more checkpoints to be saved, lets you select the point at which the training has reached optimal status for your needs. That also means that to reach the final epoch, you're looking at much more compute time. Then you've got batch sizes, input string lengths, noise injection, etc., etc.

Finding the right balance for what you want the model to do is not a simple matter. That's one of the major reasons most of the models are based on pretrained Llama. The fine-tuning of a model can be done relatively quickly in comparison to the initial base model training, as you're only adjusting the internal relationships, not creating them. For the most part.
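
For concreteness, here is how those knobs (epochs, learning rate, batch size, checkpoint frequency) typically appear in a Hugging Face training config; the values are illustrative placeholders, not recommendations:

```python
from transformers import TrainingArguments

# Sketch: the trade-offs described above as concrete settings.
# All values are placeholders, not tuned recommendations.
args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=10,            # more passes over the data...
    learning_rate=5e-5,             # ...with a smaller step per pass
    per_device_train_batch_size=1,  # batch size, bounded by VRAM
    save_steps=500,                 # frequent checkpoints, so you can
    save_total_limit=20,            # pick the best stopping point later
)
```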


[deleted]

Can you use AI to do that work?


Imaginary_Bench_7294

For some things, sure. Curating the datasets, for example: you could probably use AI for that. Spell-check and grammar-check systems could handle making sure the text isn't full of mistakes, and AI could determine whether it is applicable to what you want the data to contain. The issue would come mostly from fact-checking the data if it is not roleplay content.

Edit: hit post too early. The parts that would require a human touch, such as determining whether your model has reached the desired level of training, would be iffy. You can have some metrics, such as loss, cross-entropy, or other stats, that tell you how closely the model's output matches the training data, but that is a loose representation. For coding or mathematics, that works pretty well. For creativity, not so much, as a higher loss means the model is less likely to reproduce the input data and is therefore more creative.


[deleted]

I've read papers saying most models are actually undertrained.


Imaginary_Bench_7294

I'd have to read the papers you're referencing to really discuss them; however, it depends on the goal of the model. Task-specific models, such as coding- or math-centric models, might not be. Generalist models, such as for chatting, RP, etc., are probably not so much of a concern. Overtraining on wildly varying data such as chat logs will be detrimental to creativity and will also increase the potential of the model spitting out exact copies of the training data. In fact, this can even happen when the model isn't overtrained on the data: [https://www.theregister.com/2023/12/01/chatgpt\_poetry\_ai/](https://www.theregister.com/2023/12/01/chatgpt_poetry_ai/)


CKtalon

Yes, even at 1.5T tokens, a 7B LLM wouldn't have reached convergence. (Chinchilla (20x parameters) is not to be used as a rule of thumb for 'sufficient training'.) Not sure how you are going to train from scratch, though. Even a 1-2B model will require thousands of dollars.
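
For scale, the Chinchilla heuristic being waved off here would call a 7B model "compute-optimal" at roughly

```latex
20\ \text{tokens/param} \times 7\times 10^{9}\ \text{params} = 1.4\times 10^{11} = 140\text{B tokens}
```

yet Llama-2 7B was trained on about 2T tokens, roughly 14x that, and the claim above is that even ~1.5T still leaves a 7B model short of convergence.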


[deleted]

I have 90k in Google Cloud credits that expire in February. Need to use them. Happy to have others help me use them up (no crypto mining because that is against TOS).


artificial_simpleton

No one can possibly read through the entire dataset used for pretraining a large language model, partly because it would take much longer than a human lifetime to do so. You need to curate the data you are using, but you don't do it manually, and knowing what heuristics to use is, of course, critical (some basic ones can be found in, e.g., the RedPajama repo). Overfitting is also largely not a problem for LLM pretraining, simply because you usually have a lot more data than your compute budget allows for. Also, injecting noise during LLM pretraining is something no one does these days.


a_beautiful_rhind

Wasn't someone trying to reproduce phi here?


[deleted]

I'm interested to know if home grown LLMs also suffer from context loss on long prompts.


[deleted]

I'm working with UCSB on a research project and would love to interview anyone who has experience in this.


[deleted]

[removed]


[deleted]

Why'd you drop out?


[deleted]

[removed]


[deleted]

Sounds like you at least had a good time in IV 😁


[deleted]

I went there for 10 years. I was the Van Wilder of UCSB. They couldn't get rid of me.


MindOrbits

Check out Santa Barbara Hacker Space. I have a feeling a few members have been working with AI.


[deleted]

Is Steve still with them? Love that guy.


MindOrbits

I escaped CA a while ago, so I haven't been there in person for some time, and even when I was, who you'd see really depended on the day and time. They had a Slack channel; that's probably the best way to find out.


a_beautiful_rhind

I'm assuming they do. Nobody can train anything substantial though because $$$$.


Sartilas

Hard


LoathsomeNeanderthal

Old article but you get the idea: https://www.mosaicml.com/blog/gpt-3-quality-for-500k


fab_space

I did it from scratch, with the goal of making it able to produce valid words just by generating letter after letter, giving it a score at each generation and using that feedback to adjust the weights. In the other terminal, the generator shows me real-time results, generating a bunch of text (up to 256 chars, spaces and punctuation included). Doing this will make you aware of how hard it is to achieve a general LM based on words instead of a use-specific one based on chars. I'll try to serve this as a web app; then the reinforcement will be done by multiple users, improving the overall generation results faster than just me, but I'm sure it will be hacked by lamers very soon.


randomqhacker

Not an LLM, but I used bi-grams and tri-grams from a large corpus of the Internet, ranked by frequency, to generate likely next words. I also added some variance (think temperature) to make it not always pick the most likely words. It was fun to watch it babble in a way that almost worked grammatically, but otherwise it was pretty useless, unless you want to reinvent next word prediction for a virtual keyboard or something. [https://en.wikipedia.org/wiki/Word\_n-gram\_language\_model](https://en.wikipedia.org/wiki/Word_n-gram_language_model)
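
A minimal sketch of that kind of frequency-ranked trigram generator with a temperature-like knob (not the commenter's code; the corpus path and constants are placeholders):

```python
import random
from collections import Counter, defaultdict

# Sketch: count trigrams from a corpus, then sample the next word with
# some variance instead of always taking the most frequent continuation.
words = open("corpus.txt").read().split()   # placeholder corpus

trigrams = defaultdict(Counter)
for a, b, c in zip(words, words[1:], words[2:]):
    trigrams[(a, b)][c] += 1

def generate(seed, n=30, temperature=1.0):
    out = list(seed)                         # seed: two starting words
    for _ in range(n):
        counts = trigrams.get(tuple(out[-2:]))
        if not counts:
            break
        cands, freqs = zip(*counts.items())
        # temperature > 1 flattens the distribution, adding variance
        weights = [f ** (1.0 / temperature) for f in freqs]
        out.append(random.choices(cands, weights=weights)[0])
    return " ".join(out)

print(generate(("the", "quick"), temperature=1.5))
```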


richhoods

I think people forget what the B stands for in these LLMs. Training these models, even on cloud machines, is many times more expensive than what most people can afford.


Internet--Traveller

It's quite technical; you need to create your own datasets in JSON to train it. I watched a video of it and decided not to try it.


chibop1

Unless you're training a really tiny model like GPT-1 with 117M, no individual can train from scratch. Most people mean finetuning. For full-parameter finetuning, you can get it done with 8x A100 80GB in about 30 hours, depending on the size of the dataset.

As far as training from scratch: according to [this](https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/), the training costs for GPT-4 were around $63 million. For [Llama-2](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md), here are the GPU hours spent:

* 7B: 184,320
* 13B: 368,640
* 70B: 1,720,320
* Total: 3,311,616

If you were to rent an A100 80GB at $1.6/hr, that's $294,912 USD to train the 7B model. This only includes GPU cost; it does not include obtaining a quality dataset, extra hardware, and so on.


Revolutionalredstone

I've created a few from absolute scratch. I'm not using transformers, backprop, or even connectionism. Instead I've got a drag-net system where millions of tiny programs are generated and individually graded based on their contribution to successful prediction (collectivism). The technique is incredibly simple and doesn't even use math (no divide or anything even as complicated as that in the programs). It's also extremely fast at inference time. I've got a bunch of other ideas as well; I want to combine ideas carefully to see what's important.


minecraft_simon

why would anyone do that?


Business-Lead2679

Oh, definitely! Here is my little side project with Mistral-7b; I trained it to respond in a more readable way haha https://preview.redd.it/cb965m0go87c1.png?width=682&format=png&auto=webp&s=d53d0cded865d2d2fbf87918a87601221faed1b9


Business-Lead2679

PS: ignore the params at the top. I'm running the model in a Jupyter notebook and copy-pasting its responses into bettergpt.chat so I can see how the responses look in the classic UI.


[deleted]

Awesome! I tried mistral out but the results were really poor. Not sure how they got so much funding from A16Z with an LLM that barely works. This was a month ago so maybe it's better now.


Business-Lead2679

Did you use the instruct version with the correct prompt template? Or perhaps you used the base model (which of course won't respond correctly, as it's not instruction-tuned). I fine-tuned the base model on my dataset, and it works really well. I love how it breaks the problems down into small pieces so you really understand what they're about: https://preview.redd.it/5k7x0ah0r87c1.png?width=682&format=png&auto=webp&s=b545772f2863cd3a3ba87e4c73a2a4beb10b75f2


Business-Lead2679

And I fine-tuned it in such a way that it will address you by the name you set!


[deleted]

It must have been the base model since it was so bad. I need to try the one that actually works.


Mac-Wac-1

lol ya if you have money. Like a min of 500k

