72 hours of vibe coding and more
Learn By Doing With Steven
Mar 20, 2025
For 72 hours, give or take, I have been swimming in the misty ocean of agentic IDEs and LLMs, testing how long it would take to find one island, one destination, where I could enjoy a completed program and still find there is much to learn.
In this article you will see my 72-hour agentic IDE experience, plus three additional single-model assisted coding sessions, vibe coding tips, and some further thoughts.
After all these tests, my success rate was 5/11. If I get the UI template but the results are not meaningful, I call it a half success. If I have both the UI and a completed workflow with reasonable results, I call it a full success (a template success counts as 0.5 project completions, a full success as 1).
Well, the whole experience is summarized in the hand-written image below. I prefer doing some handwriting now, because the longer I have all the agents and LLMs doing the work for me, the sooner I will lose those human skills. At least, I find that writing down ideas is always a good start for most tasks.
In short, I used TRAE and Cursor with different models, trying to complete 5 main projects (during the iterations, if I changed the project requirements, additional projects could emerge, which I call side projects, with the aim of getting something running with measurable results without waiting for comprehensive code completion).
And below are the project output screenshots. No perfect project was produced. But there are several projects I think are worth sharing, putting on GitHub later, and improving further.
First, LSTM simulation and metrics dashboard.
The project is intended to simulate LSTM model training and its variations based on changes to hyperparameters. It took about 5 rounds of chat with Cursor using the Claude 3.7 Sonnet thinking model. It already represents most of the hyperparameter tuning choices and their comparisons, although some details could be further improved.
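To give a flavor of what such a simulation compares, here is a minimal numpy sketch (a made-up stand-in, not the generated project's code): toy training-loss curves whose decay depends on the learning rate, the kind of data a hyperparameter-comparison dashboard would plot.

```python
import numpy as np

def simulated_loss_curve(lr, epochs=50, seed=0):
    """Toy stand-in for an LSTM training-loss curve: exponential decay
    whose speed depends on the learning rate, plus a little noise."""
    rng = np.random.default_rng(seed)
    steps = np.arange(epochs)
    decay = np.exp(-lr * steps)               # larger lr -> faster decay (in this toy)
    noise = 0.02 * rng.standard_normal(epochs)
    return decay + 0.05 + noise               # 0.05 models an irreducible floor

# One curve per hyperparameter choice, ready to plot side by side.
curves = {lr: simulated_loss_curve(lr) for lr in (0.01, 0.05, 0.2)}
for lr, losses in curves.items():
    print(f"lr={lr}: final loss ~ {losses[-1]:.3f}")
```

A real version would replace the exponential with actual LSTM training runs, but the comparison layout stays the same.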
Second, stock data and technical trading signal dashboard with price prediction.
I have done similar projects before: visualizing stock data with Streamlit, simulating random picks of S&P 500 portfolios, backtesting historical performance following specific technical trading strategies, etc. Some of them were shared on my YouTube channel. But this time, I did not use Colab or Streamlit (at least I did not arbitrarily choose the tools).
After 10 rounds of iteration, I decided to focus on data display instead of model training and prediction, because it was hard to reach a workable checkpoint. The tricky part is that during the iterations the model usually gets confused about sequential priority. For example, if we build the algorithm and the interface separately, we might get results faster; but if we ask the model to do both, most of the tests failed. Eventually I got the visuals below, while implementing the machine learning algorithms could be a future enhancement.
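The "algorithm" half of such a dashboard is often simple on its own; a pandas sketch of a classic SMA crossover signal on synthetic prices (an illustration, not the project's code) shows how little is needed when the signal logic is built separately from the interface:

```python
import numpy as np
import pandas as pd

# Synthetic daily closes as a stand-in for real stock data.
rng = np.random.default_rng(42)
close = pd.Series(100 + np.cumsum(rng.standard_normal(250)), name="close")

# Classic SMA crossover: long (1) when the fast average is above the slow one.
fast = close.rolling(20).mean()
slow = close.rolling(50).mean()
signal = (fast > slow).astype(int)

# Crossover points are where the signal changes.
crossovers = signal.diff().fillna(0)
print("buy signals:", int((crossovers == 1).sum()),
      "| sell signals:", int((crossovers == -1).sum()))
```

Feeding a precomputed `signal` column into the UI layer is exactly the "build them separately" approach that worked better in my tests.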
Third, weather forecast using multiple machine learning models with a data display dashboard.
Similar to the previous project, the aim is to display multiple time-series weather variables and implement machine learning algorithms to predict future values.
The general layout of the dashboard and the raw data display are well designed. But the data engineering, feature selection, and model training parts kept giving errors, so for the moment I stopped at the data visualization step.
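One recurring source of errors in that training step is splitting time series the wrong way. A minimal numpy sketch on synthetic temperature data (assumed for illustration, not the project's code) of the chronological split the pipeline needs:

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)
# Synthetic daily temperature: seasonal cycle plus noise.
temp = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.standard_normal(365)

# Chronological split: never shuffle a time series before splitting,
# or the model trains on the future it is supposed to predict.
split = int(0.8 * len(days))
train_x, test_x = days[:split], days[split:]
train_y, test_y = temp[:split], temp[split:]

# Tiny baseline model: fit a low-degree polynomial on the past only.
coeffs = np.polyfit(train_x, train_y, deg=4)
pred = np.polyval(coeffs, test_x)
rmse = np.sqrt(np.mean((pred - test_y) ** 2))
print(f"test RMSE: {rmse:.2f} deg C")
```

Being explicit about this split in the prompt is one way to keep the agent from silently randomizing it.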
Fourth, sentiment analysis with training, testing, and a dashboard.
Using the Gemini 2.0 Flash Thinking experimental model to build the first version, then asking Cursor with the Claude 3.7 Sonnet thinking model and Claude 3.5 (due to high usage of 3.7) to test and debug the code, I got one program that evaluates sentiment using three approaches and another that trains and compares all the algorithms/models. Before this, I went through around 40 iterations using Claude with Cursor and TRAE; none of the programs could run to completion, and I aborted those projects.
Failed version:
Successful version (although the evaluation plots could be improved):
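The comparison-of-approaches pattern itself is simple to sketch in pure Python. The toy lexicons and rules below are made up for illustration, not the project's actual models, but they show the shape of scoring the same inputs with several methods side by side:

```python
# Toy lexicons for the sketch only.
POS = {"good", "great", "love", "excellent", "happy"}
NEG = {"bad", "terrible", "hate", "awful", "sad"}

def lexicon_score(text):
    """Approach 1: count positive minus negative words."""
    words = text.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

def rule_score(text):
    """Approach 2: same lexicon with crude negation handling."""
    score, flip = 0, 1
    for w in text.lower().split():
        if w in {"not", "never"}:
            flip = -1
            continue
        if w in POS:
            score += flip
        if w in NEG:
            score -= flip
        flip = 1
    return score

def baseline_score(text):
    """Approach 3: always-neutral baseline, for comparison."""
    return 0

texts = ["I love this great phone", "not good at all", "terrible awful day"]
for t in texts:
    print(t, "->", {"lexicon": lexicon_score(t),
                    "rules": rule_score(t),
                    "baseline": baseline_score(t)})
```

In the real project each function would be a trained model, but the evaluation loop comparing them stays this simple.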
Fifth, interactive elaboration of transformer model.
Surprisingly, the zero-shot completion (not interactive, in HTML format, but with comprehensive plots showing step-wise results) simulating the transformer's internal mechanism step by step seems to work well. With just one more iteration, the interactive step-wise Plotly plots were already good enough.
And with 5 more iterations, I got the result below, with a process replay of token generation. In general the quality is acceptable as a base for further improvements.
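The core computation such a visualization replays, scaled dot-product attention, fits in a few lines of numpy (the textbook formula, not the generated project's code); each attention-weight matrix is one frame of the replay:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # (seq, seq) map, one row per token
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 4, 8
Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
out, weights = attention(Q, K, V)
print(weights.round(2))  # each row sums to 1: this is what the heatmap shows
```

Plotting `weights` per layer and per generated token is essentially what the interactive replay animates.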
Other tests without agentic IDE after the 72 hours:
After the tests above, I further tested single-model interaction and ran the code in the Colab environment.
With the Gemini 2.0 Flash series, I could get a direct simulation of a commercial banking process: no fancy UI, but practical procedure control, a ledger, and database management. I ran the code in Colab.
Colab with Gemini can conduct comprehensive data analysis tasks, with a detailed design of the whole procedure. There is no need for coding; you just upload the dataset and add/confirm prompts.
ChatGPT is here to stay, and I found, again, that the o3-mini-high model can generate well-structured single-block code. After around 10 iterations, the model generated a single code block for me to run in Colab, giving me a consolidated HTML dashboard (all generated in Colab and downloaded to my local environment) showing 6 algorithms on 2 split strategies (train/val/test) for weather data prediction, with each variable taking a turn as the target.
The display includes a metrics comparison, train/validation/test splits of actual versus predicted values, radar plots comparing the algorithms on specific metrics for each target variable, and an overfitting check. The total number of documents produced reached 265.
This version does not treat the data as a time series, so a TSA version could be extended from the current setup.
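An overfitting check of the kind that dashboard includes usually boils down to comparing train and test errors across model capacities. A minimal numpy version (an assumed sketch, not the generated dashboard code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 80)
y = np.sin(x) + 0.1 * rng.standard_normal(80)

# Random 70/30 split (fine here since this toy data is not a time series).
idx = rng.permutation(80)
tr, te = idx[:56], idx[56:]

def fit_rmse(deg):
    """Fit a degree-`deg` polynomial on train, return (train, test) RMSE."""
    coeffs = np.polyfit(x[tr], y[tr], deg)
    err = lambda ix: np.sqrt(np.mean((np.polyval(coeffs, x[ix]) - y[ix]) ** 2))
    return err(tr), err(te)

for deg in (1, 5, 12):
    train_e, test_e = fit_rmse(deg)
    flag = "overfit?" if test_e > 1.5 * train_e else "ok"
    print(f"degree {deg}: train {train_e:.3f} test {test_e:.3f} {flag}")
```

A widening gap between the two errors as capacity grows is the signal the dashboard's check flags.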
Advice for vibe coders:
A mixture of models can solve issues that are hard for the current model or state.
Consider running multiple independent tests to find the best model/algorithm/UI, then mark the choice as a learned skill to be reflected in future iterations.
As in my previous tests with snake, brick, Super Mario, and 3D driving game simulations in TRAE, the intuition holds: the models are most capable on easy tasks (though for Claude 3.7, sometimes not even then) for which the open web offers many examples. For complicated tasks, divide them and build block by block.
In case I do not use the agentic IDE:
The Gemini 2.0 series can be useful for generating one single code block or for formulating a basis for agentic workflow iterations.
Gemini with Colab now offers a data analysis service. The initial test is impressive, and it could be tested further.
o3-mini-high is the most capable model I tried for forming complex model comparison tasks.
Thoughts about GenAI, AGI, etc.
I find an analogy amusing: any quantitative trading strategy generated with DeepSeek cannot beat its creator, the DeepSeek research team.
And my thoughts: the DeepSeek team has been doing quantitative trading for years with machine learning and all sorts of algorithms (most assume they have achieved high returns); now, with spare compute, they have trained and released the DeepSeek models.
Somehow people try to extract quant algorithms from the DeepSeek models to achieve above-market trading gains. But they do not realize that these talents, with high mathematical proficiency and working on ever-evolving real-world challenges, will not be simulated well by any model, at least for the next years (I do not like assumptions without grounding, so I leave the number as NA). Current architectures of LLMs and agentic workflows are simply not enough to simulate many real-world scenarios. Their computational essence still lies in statistical inference, or reality compressed into multi-dimensional vector spaces.
Simulating a trading strategy with a small feature space (as we simulate some algorithms in universities) is easy, but with a large space (in the real world, the scaled data pipelines and model complexity keep evolving) it requires efficiency considerations and advanced data engineering, where LLMs become less useful. LLMs rely on their training data (golden rule number one), and hence their coding capability is limited to what that data covers.
And I would further argue that no DeepSeek-generated result will cause market turbulence surpassing DeepSeek's own impact on the NVIDIA stock this year (although NVIDIA is here to stay, and the market sometimes does not grasp the compute demand on the inference side). That could be the most significant market impact of any model release, something the LLMs themselves could never imagine.
Dario from Anthropic is high on his thoughts about coding, AGI, etc., but in reality, if you try any of the coding products, even the newly released computer use or Manus, there are so many gaps across use cases. Matrices and vector representations with online data can only do part of the job in the real world, and are still far from superintelligence. So from Silicon Valley VCs to GenAI founders, the marketing pitches are not always well grounded. Users should have solid, reasonable expectations of any GenAI tool they adopt for their use cases.
Personally, I never watch benchmarks that much, because I have limited options and my general use cases can be solved or assisted with the current tech stack. I would suggest you build your own tech stack, find the balance between the different metrics, and choose the best option for each use case.
2025, the year of agents, continues, but there is more to come and the job is not yet finished. Don't get hyped; just stay on track and use the tools better.
Support my work:
https://www.paypal.com/paypalme/wangjiansuper?country.x=IT&locale.x=en_US
https://buymeacoffee.com/learnbydoing
Follow me for more tutorials, thoughts, about:
LLM, models🖥️ Exploration, experiments, learning🕹️ AI driven coding ✍️ AI regulation 📖 Data science 📊
https://linktr.ee/learnbydoingwithsteven
https://github.com/learnbydoingwithsteven/Youtube/