pxp121kr

In my opinion, it would be more impressive if it solved a harder task, with multiple iterations, and showed how the code changes over time until the task is accomplished. It's not clear whether it solved it on the first try and just made a file and opened it, or whether it reached the final output after multiple iterations.


nnet42

See my other comment here, but it did get it on the first try in this example. I could follow up with more requests, like adding a color or shape picker for the model, a model loader, or a shader to give the model fur, and it'll get it correct most of the time with Claude. GPT-4o will usually get the init code wrong, see the error in the debug output, and then correct the issue and move on.

All of my API requests are queued first in a database, so I can go back and see what happened. In this example, which used Claude, it created the project directory with PowerShell, used the `save_file` tool twice to save the HTML page and accompanying .js file, then used another tool to open it in Chrome and return the JS console output for examination. It also used a `task_analysis` tool, a `send_response` tool to return messages to the user, and a `complete_task` tool to end the session. I will be sure to demo more iterative development functionality extensively in the future.
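
For anyone curious what that looks like on the wire, here's a minimal sketch of declaring tools like these against the Anthropic Messages API. The tool names match the ones above, but the schemas, prompt, and wiring are illustrative guesses, not the actual implementation:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

// Tool names from the comment above; the schemas are illustrative guesses.
const tools = [
  {
    name: "save_file",
    description: "Save text content to a file, creating directories as needed.",
    input_schema: {
      type: "object" as const,
      properties: {
        path: { type: "string", description: "Path to write, e.g. project/index.html" },
        content: { type: "string", description: "Full file contents" },
      },
      required: ["path", "content"],
    },
  },
  {
    name: "complete_task",
    description: "Signal that the current task is finished and end the session.",
    input_schema: { type: "object" as const, properties: {}, required: [] },
  },
];

const response = await client.messages.create({
  model: "claude-3-opus-20240229",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "Make an HTML page that renders a spinning 3D model." }],
});

// The model replies with tool_use blocks; the agent executes each tool and
// feeds results back as tool_result messages until complete_task fires.
console.log(response.content);
```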


__Loot__

How is this impressive? Need to see more examples of tasks that are actually useful


arionem

Totally agree. This is almost like "hello world"


nnet42

Yeah, that is exactly what this is: a simple example with a visual that non-coders can understand, showing multiple tool usage. Another example task I use a lot for testing is "Create two C# programs that can send each other messages," and then I ask it to replace the communication protocol with something more secure, or to use Windows Message Queuing (MSMQ). It has access to a persistent PowerShell console and two Ubuntu servers.

This program is the brains for a 3D-printed astromech droid. I'm using programming tasks to flesh out its cognitive abilities, and I've been throwing a variety of general programming tasks at it to fix issues that the different LLMs run into. Extremely large projects, like making a full side-scroller game, can get expensive quickly and often require stakeholder meetings or lots of human input. I've been working on getting it to figure stuff out on its own without too much direct help. If you have anything specific you'd like to see, I'm all ears.


__Loot__

How about something easy: code a website that uses the NASA picture of the day API to display the picture of the day, and code a website that displays the 10 latest TV shows from Rotten Tomatoes with their audience and critic scores, with images. For the NASA one you have to register an email to get the API key. I don't expect that to be automated (but if it can, wow); just tell it "hey, here's my API key." Rotten Tomatoes can be scraped easily. I'll be impressed if it can pull it off, and just know I know it will happen one day and I'm rooting for it. It can be a simple website with simple styles.


nnet42

This is doable; I have had success with asking for multiple projects at the same time. An email tool is on my to-do list, as are GUI integration tools. I've found that the image recognition models do great with debugging using screenshots. I'll figure out something so keys aren't exposed in the conversation or web page, maybe have it build a Node.js endpoint and tell it the key is an environment variable. I will try to get this done and recorded on Monday. Thanks!
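
Something like this tiny proxy would keep the key out of the page source. It's only a sketch of the env-variable approach, not what the agent would necessarily produce: the route name and port are made up, and it assumes Node 18+ for the built-in fetch (NASA's real APOD endpoint and its DEMO_KEY are used here):

```ts
import express from "express";

const app = express();
// The key lives in the server's environment, never in the page source.
// NASA's DEMO_KEY works for light testing if no real key is set.
const NASA_API_KEY = process.env.NASA_API_KEY ?? "DEMO_KEY";

// Route name and port are illustrative.
app.get("/apod", async (_req, res) => {
  const upstream = await fetch(
    `https://api.nasa.gov/planetary/apod?api_key=${NASA_API_KEY}`
  );
  res.status(upstream.status).json(await upstream.json());
});

app.listen(3000, () => console.log("APOD proxy listening on :3000"));
```

The front end then fetches `/apod` with no key at all, and the server forwards the response untouched.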


__Loot__

Oh, I didn't mean at the same time 🙃 and .env is a good idea


nnet42

haha, I can also ask for documentation and have it upload everything to GitHub.


__Loot__

I wonder what it costs to run per month? And how much better at coding is it than GPT-4o? I wonder if I should try it.


nnet42

It can get expensive; it depends on your workload, but fortunately inference prices are going down as new models are released. On Claude 3 Opus, if I tell it to examine its own code and implement an improvement like adding a tool, it would usually cost a couple of dollars. I also have multiple layers of stuff I have turned off due to cost that can be helpful and give it an intelligence boost, like reflecting on the conversation for ethical considerations, but yeah, it can add up. I forgot I left it running once and it ate through $40 in a few hours after getting stuck in a loop. If it hadn't spent most of its time waiting on API rate limits it would have been worse. I added some protection for that after it happened.
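
For anyone building something similar, the loop protection can be as simple as a per-session spend cap checked before every model call. A minimal sketch of the idea, not my exact code; it assumes Claude 3 Opus pricing at the time ($15 / $75 per million input / output tokens), and the cap is arbitrary:

```ts
// Per-session spend cap, checked before every model call.
const MAX_SESSION_USD = 10;
let sessionSpendUsd = 0;

// Claude 3 Opus: $15 per million input tokens, $75 per million output tokens.
function recordUsage(inputTokens: number, outputTokens: number): void {
  sessionSpendUsd += (inputTokens * 15 + outputTokens * 75) / 1_000_000;
}

function assertBudget(): void {
  if (sessionSpendUsd >= MAX_SESSION_USD) {
    // Halting here stops a stuck agent loop before it burns $40.
    throw new Error(`session budget of $${MAX_SESSION_USD} exhausted`);
  }
}
```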


__Loot__

I'm waiting for it to be cheap enough that it's a non-issue.


nnet42

See my other comments here, but I would say its tool use is impressive. It would certainly take a human longer than a couple of minutes to complete this same task. I am definitely open to further discussion, and to any specific task examples you might have that would blow your socks off.


geepytee

Agents are all the hype right now


Longjumping-Stay7151

I see they used Claude 3 Opus. Have they tried Claude 3.5 Sonnet?


nnet42

Yes, I got Sonnet 3.5 hooked up within an hour of its release. It is much faster, and cheaper! I haven't noticed any real difference in output quality. GPT-4o is also able to do pretty well, but it has nothing on Claude. I have different task complexity levels set up that can point to different models. With the larger models you can pack a lot of requests into a single prompt, but I'm also targeting 7B models, so I have things split up fairly granularly. So far I support all of the Anthropic models, the OpenAI models, all of the models on [Groq](https://groq.com/), or I can point it to llama.cpp where I can run my own (smaller) models. I have some failover stuff in there too, so, for example, if one service is down it'll switch to the next best model.
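
The routing itself is conceptually simple. Here's a stripped-down sketch of the tier-plus-failover idea, with real model IDs but an invented tier mapping and a stubbed-out dispatch function, so it's an illustration rather than my actual router:

```ts
type Tier = "simple" | "standard" | "complex";

// Real model IDs; the tier mapping is an invented example.
const MODEL_CHAIN: Record<Tier, string[]> = {
  simple: ["llama3-8b-8192", "claude-3-haiku-20240307"], // Groq, then Anthropic
  standard: ["claude-3-5-sonnet-20240620", "gpt-4o"],
  complex: ["claude-3-opus-20240229", "claude-3-5-sonnet-20240620"],
};

async function completeWithFailover(tier: Tier, prompt: string): Promise<string> {
  for (const model of MODEL_CHAIN[tier]) {
    try {
      return await callModel(model, prompt);
    } catch (err) {
      console.warn(`${model} failed, falling back to the next model:`, err);
    }
  }
  throw new Error(`all models for tier "${tier}" are unavailable`);
}

// In the real system this would dispatch to the Anthropic, OpenAI, Groq,
// or llama.cpp client based on the model ID; stubbed here.
async function callModel(model: string, prompt: string): Promise<string> {
  throw new Error(`no client wired up for ${model} in this sketch`);
}
```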


Ok_Elderberry_6727

That’s too cool. Wait for openai to steamroll this. Just saying.


Cryptizard

It's not GPT, it says claude-3-opus right in the console.


nnet42

It can use OpenAI models as well as all of the Anthropic models, or Groq, or llama.cpp. I've been using "GPT Agent" so people who have only heard of ChatGPT will know what I'm talking about.


Akimbo333

Hm


Arcturus_Labelle

Thanks for posting, but the problem with demos like this is they are always tiny, toy projects which have loads of examples in the training data. When I've used AI models to assist me with programming, they always, and I mean always, fall down once you get to a certain size of project. They can't keep more than a handful of abstractions in mind at once, nor can they seem to respect detailed specifications I prompt them with. This is a fancy wrapper on top of a dum dum not-even-junior-software-engineer programmer. GPT-5 / Claude 4 / Gemini 2 with something like Q* might change that.


nnet42

I actually have an extensive memory / RAG system I've built with rolling conversation summaries and an internal "Robot Context" that tracks project management tasks, directives, short and long-term memories, and anything else the robot would like to explicitly remember. You can give it directions that will persist across sessions. The current conversation is reflected upon using an array of unique perspectives that are fed back into context by my Context Attention Engine. One of my first directives was to say "Azule" whenever I mention the color blue. It is able to remember and follow the instruction no matter how long ago the instruction was given, or how deep we are in conversation talking about unrelated topics.
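
As a toy illustration of the directive part only (not my actual Robot Context code), the mechanic boils down to persisting each directive and re-injecting it into the system prompt on every turn; the file name and format here are made up:

```ts
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";

// File name and location are made up for the sketch.
const DIRECTIVES_PATH = "robot_context/directives.json";

function loadDirectives(): string[] {
  return existsSync(DIRECTIVES_PATH)
    ? JSON.parse(readFileSync(DIRECTIVES_PATH, "utf8"))
    : [];
}

function addDirective(directive: string): void {
  mkdirSync("robot_context", { recursive: true });
  writeFileSync(
    DIRECTIVES_PATH,
    JSON.stringify([...loadDirectives(), directive], null, 2)
  );
}

// Re-injected on every request, so the instruction survives however deep
// the conversation gets into unrelated topics.
function buildSystemPrompt(base: string): string {
  const directives = loadDirectives();
  return directives.length
    ? `${base}\n\nStanding directives:\n- ${directives.join("\n- ")}`
    : base;
}

addDirective('Say "Azule" whenever the user mentions the color blue.');
```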


geepytee

The memory stuff you've built sounds cool; honestly, it would be great if you could explain it in a video or on a website