Tackling AI's Unsolved Problem: Validating LLM Output Accuracy
Update: I am working on one of the bigger unsolved problems in AI research, apparently. I was talking with a friend - an AI researcher and part-time computer science professor - who mentioned that one of the great unsolved problems in ML/AI is getting LLMs to generate reliably accurate outputs. It's frontier research, he said. He mentioned this in passing, while explaining a project he's working on. In my head, I'm thinking: "Wait a sec, that's exactly what I'm trying to do right now!"

Step 1 to fix TalkTastic is to actually understand the codebase and map out how it all works. My idea is to compress the codebase for our macOS app by a factor of ~45x (1.8M tokens → ~40k tokens) without any loss of accuracy, so I can use Claude to reason over the whole thing and start shipping at warp speed. To do that, I need to create a highly accurate intermediate representation of the codebase - one that's incredibly detailed yet compact enough to fit within Claude's context window.

Sure, AI can generate a summary of anything, but can you rely on that summary to be 100% accurate and not have missed anything? Nope. The core challenge is: when an LLM generates output, how do you validate its accuracy? For me, manually validating every claim isn't feasible. So apparently, to solve my little "how do I fix my busted codebase" problem, I need to crack an unsolved problem in AI. Strangely exhilarating.

Haven't proven it yet, but I think I'm onto something. Stay tuned.
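P.S. To make "validate accuracy" a little more concrete: one tiny, mechanically checkable slice of the problem is verifying that every symbol a generated summary mentions actually exists in the source tree. Here's a rough sketch in Python of what that kind of spot-check could look like - purely illustrative, not my actual approach, and the claim list, symbol names, and repo path are all made up.

```python
# Rough sketch: spot-check an LLM-generated codebase summary by verifying that
# every symbol it claims to describe actually exists in the Swift sources.
# The claims and repo path below are hypothetical placeholders.
import re
from pathlib import Path

# Crude regex for top-level Swift declarations (good enough for a sanity check).
SYMBOL_PATTERN = re.compile(r"\b(?:func|class|struct|enum|protocol)\s+(\w+)")

def index_symbols(repo_root: str) -> set[str]:
    """Collect declared symbol names from every .swift file under repo_root."""
    symbols: set[str] = set()
    for path in Path(repo_root).rglob("*.swift"):
        symbols.update(SYMBOL_PATTERN.findall(path.read_text(errors="ignore")))
    return symbols

def unverified_claims(claims: list[dict], symbols: set[str]) -> list[dict]:
    """Return the claims whose referenced symbol never appears in the index."""
    return [claim for claim in claims if claim["symbol"] not in symbols]

if __name__ == "__main__":
    # Hypothetical claims extracted from a generated summary.
    SUMMARY_CLAIMS = [
        {"symbol": "AudioCaptureManager", "text": "AudioCaptureManager owns the mic session."},
        {"symbol": "TranscriptRouter", "text": "TranscriptRouter routes transcripts to the UI."},
    ]
    for claim in unverified_claims(SUMMARY_CLAIMS, index_symbols("path/to/TalkTastic")):
        print(f"UNVERIFIED: {claim['text']}")
```

A check like this obviously doesn't prove a summary is complete or that its reasoning is right - it just catches the most blatant hallucinations cheaply. The hard part, and the actual unsolved problem, is everything the regex can't see.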