
Four LLM Lessons Building "AI for Local"


Our cities are full of experiences that delight, inspire, and connect us, yet it has never been technologically feasible to organize the millions of choices they offer. At Outgoing, our mission is to get people off their screens and out exploring the real world, and thanks to the latest crop of large language models, we’ve made several key technological advances that have allowed us to build the world’s best concierge for IRL experiences.

One of the false dichotomies of the past few months is whether AI is “all hype” or “the end of civilization.” In the months Outgoing has been building our global knowledge graph of real-world experiences, we’ve found these are amazingly powerful tools, but not in the ways many people think! Our team has busted many of the popular myths around large language models: embeddings and vector databases are math, not magic; the smartest models aren’t necessarily the best; and natural language is much worse than code for expressing complex ideas.

We know that “AI for Local” is the killer use-case for LLMs. But since so many AI deployments have stalled after the prototype stage, we’re sharing a few hopefully-helpful lessons we learned along our journey towards a reliable, scalable, hallucination-free system for going out in the real world.

LLMs for Language

When ChatGPT entered the scene, many people assumed you could “ask the model” and use the pre-trained knowledge of the foundation models to answer questions directly. This was quickly dispelled, and RAG became the cause célèbre: vector databases magically retrieve all the potential matches, and then a generative model can craft the perfect answer!

While this oversimplified approach can make a fine weekend prototype (or on-stage demo!), these models’ combination of low-precision retrieval and slow, imprecise generation is far too loose for our IRL applications: at Outgoing, we can’t accept models that hallucinate proximity, showtimes, or descriptions.

Enter Outgoing’s structured knowledge graph (OGKG). From our team’s decade-plus experience working on ML and NLP in cybersecurity, we know that what LLMs truly excel at is translation, whether from French to English or from HTML to a proprietary, domain-specific language, and summarization.

We’ve found that even small, fine-tuned models have impressive translation, extraction, and summarization abilities that accurately preserve dates, locations, multimedia, and suitability. By leaning on these perceptual language-to-language mappings over last-generation’s template parsers and named-entity-recognition heuristics, we’ve gotten tremendous accuracy converting thousands of varied sources into a retrieval-optimized graph structure. This approach also offloads much of the semantic matching work to a pre-computation step, giving us fast, geospatially accurate responses without the slow in-the-loop LLM that a typical GraphRAG approach requires.
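
To make the idea concrete, here is a minimal sketch of that translation step: a small model (served through any OpenAI-compatible API) converts scraped listing HTML into one validated node. The EventNode fields, prompt, and model name are illustrative assumptions, not Outgoing’s actual OGKG schema.

```python
from datetime import datetime
from openai import OpenAI
from pydantic import BaseModel, ValidationError

# Illustrative node shape -- not Outgoing's actual OGKG schema.
class EventNode(BaseModel):
    title: str
    venue: str
    lat: float | None = None
    lon: float | None = None
    start: datetime
    end: datetime | None = None
    summary: str  # short, faithful summary of the source text

client = OpenAI()  # or any OpenAI-compatible endpoint

EXTRACTION_PROMPT = (
    "Translate the following event-listing HTML into a JSON object with keys "
    "title, venue, lat, lon, start, end, summary. Copy dates and coordinates "
    "exactly as written; use null for anything missing instead of guessing."
)

def html_to_node(html: str, model: str = "small-finetuned-extractor") -> EventNode | None:
    """Treat extraction as translation: listing HTML in, one validated graph node out."""
    resp = client.chat.completions.create(
        model=model,  # hypothetical small, fine-tuned model
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": html},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    try:
        return EventNode.model_validate_json(resp.choices[0].message.content)
    except ValidationError:
        return None  # reject rather than store a malformed node
```

In a pipeline like this, anything that fails validation is dropped rather than written into the graph, which is one way to keep malformed dates and coordinates out of retrieval.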

Change is the Only Constant

Our team’s decades of experience in behavioral and text classification at Google, Yahoo, and others cemented an important lesson: Any parsers and classifiers will need to be revisited frequently, and the more people who touch them, the more likely they’ll end up with contradictions and accuracy paradoxes. Unlike in spam fighting, the parsing at Outgoing is not adversarial, but given the breadth of sources we pull from, we knew we needed extreme agility in evolving the ML platform.

To maximize agility, we’ve avoided the bevy of LLM middleware frameworks and kept a clear separation between prompts and code, storing the former in standalone modules (we like Jinja2) with regression tests. We’ve also been careful not to overload models with unnecessary decision logic: in the zeal around “agents,” many teams seem to be using English to express conditional logic (“If the user input is foo, then apply these rules”), which introduces ambiguity, bloats context, complicates testing, and can be handled far more efficiently directly in code. As a rule of thumb, if any of your prompts use the word “if,” it’s probably better to use an actual if() statement, as in the sketch below.
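
As a rough illustration of that separation (the file names and template contents here are hypothetical), the prompt lives in a version-controlled Jinja2 template while the branching stays in ordinary Python:

```python
from pathlib import Path
from jinja2 import Environment, FileSystemLoader, StrictUndefined

# Prompt templates live beside the code as plain .j2 files (paths are hypothetical),
# so they can be diffed, reviewed, and regression-tested like any other module.
env = Environment(
    loader=FileSystemLoader(Path("prompts")),
    undefined=StrictUndefined,  # fail loudly if a template variable is missing
)

def build_summary_prompt(listing: dict) -> str:
    # The branching happens here, in code, not as English "if ..." rules in the prompt.
    if listing.get("is_family_event"):
        template = env.get_template("summarize_family.j2")
    else:
        template = env.get_template("summarize_general.j2")
    return template.render(listing=listing)
```

Each template can then carry its own regression test that renders it against fixture inputs and compares the result to a stored snapshot.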

You Had One Job

The popular consensus is to use the biggest, “smartest” model you can afford so you don’t need to optimize your prompts, and the foundation models are all locked in a battle royale over which ones can solve the most complex tasks. At Outgoing, we’ve found four big drawbacks to this approach:

  1. Throughput: Beyond the cost, the biggest models often come with restrictive and unpredictable rate limits
  2. Domain Specificity: Leaderboards are misleading; most of our use cases couldn’t care less about the International Math Olympiad, and a model that’s trained on all human knowledge can actually be “too smart” and get stuck overthinking. We saw one trace where a “smart” model claimed that museums in Paris often have crêperies nearby, so the Louvre is a “great spot for dessert” 😂 🥞
  3. Context Distraction: Large context windows, including the tokens used for complex prompts, dramatically increase variability in results, so even if the large model can usually handle it, random errors will pop up
  4. Debuggability: Complex tasks are, by definition, much more difficult to debug and improve, and brittle when someone other than the original author needs to make tweaks

In overall systems engineering, our team is a big proponent of Uncle Bob’s Single Responsibility Principle, i.e. the goal to “isolate your modules from the complexities of the organization” such that you don’t need to call a meeting to align around a software change. When you have one big LLM prompt full of rules, few-shot examples, counterexamples, and tangled branching logic, small changes can easily have ripple effects throughout both the codebase and the team.

We’ve found it’s far better to chunk our prompts into simple, discrete tasks, such as a vision module that decides only whether an image is an interior, an exterior, or a poster, followed by a second discrete module that assesses quality and aesthetics (sketched below). Doing so lets a given LLM module care only about changes coming from the Editorial team, the Ranking team, or the Partnerships team, and vice versa.
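
A sketch of what that split can look like, with hypothetical names and prompts rather than our production modules; each function owns exactly one question, so each team only ever touches its own module:

```python
from enum import Enum
from typing import Callable

# A "vision call" is just (prompt, image_url) -> short text answer; whether it is
# served by a hosted API or a small self-hosted model is an implementation detail.
VisionCall = Callable[[str, str], str]

class ImageKind(str, Enum):
    INTERIOR = "interior"
    EXTERIOR = "exterior"
    POSTER = "poster"

def classify_image_kind(image_url: str, ask: VisionCall) -> ImageKind:
    """Module 1: decide interior vs. exterior vs. poster -- and nothing else."""
    answer = ask("Answer with exactly one word: interior, exterior, or poster.", image_url)
    return ImageKind(answer.strip().lower())

def score_image_quality(image_url: str, ask: VisionCall) -> float:
    """Module 2 (owned by a different team): aesthetic quality score in [0, 1]."""
    answer = ask("Rate this photo's quality from 0.0 to 1.0. Reply with the number only.", image_url)
    return float(answer.strip())
```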

This modularization also allows offloading these tasks to small self-hosted models, which puts bandwidth/throughput back into our control, lowers cost (and energy consumption), and reduces complexity and unintended consequences when evolving tasks.
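
Because servers like vLLM and Ollama expose OpenAI-compatible endpoints, pointing one of these single-purpose modules at a self-hosted model can be little more than a configuration change. The base URL and model name below are placeholders, not our production setup:

```python
from openai import OpenAI

# An OpenAI-compatible client pointed at a self-hosted server (vLLM, Ollama, etc.).
# The base_url and model name are placeholders.
local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask_local_vision(prompt: str, image_url: str) -> str:
    resp = local.chat.completions.create(
        model="qwen2.5-vl-7b-instruct",  # example small vision-language model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        temperature=0,
    )
    return resp.choices[0].message.content
```

A function like this can stand in for the ask callable in the previous sketch, so swapping providers never touches the modules themselves.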

A Necessary Eval

The world is gradually catching up to something we learned long ago at Google and Impermium across all our ML and AI work: the quality of an intelligent system is directly tied to the availability of reliable, balanced eval sets.

Once tasks have been broken down into smaller modules, the evals themselves become simpler and more manageable. Tons of tools have popped up recently, but we’ve found the pairing of Langfuse and Braintrust (in large part due to the focus and hands-on responsiveness of their founders, Marc Klingen and Ankur Goyal) has greatly improved our ability to rapidly evolve ML tasks without surprise regressions. It’s also helped us bring more of the team into directly solving problems and building features without the chaos of vibe coding on the production platform itself: our non-engineering team members are just as productive on ML tuning as the core developers.
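
In spirit, each module-level eval is just a golden set plus a pass/fail threshold. Below is a minimal, framework-agnostic sketch (the file format and threshold are assumptions); in practice the runs, scores, and traces would be logged to a tool like Langfuse or Braintrust rather than asserted in-process:

```python
import json
from pathlib import Path
from typing import Callable

def eval_module(task: Callable[[str], str], golden_path: Path, min_accuracy: float = 0.95) -> float:
    """Run one LLM module over its golden set and fail loudly on regression.

    Each JSONL line looks like: {"input": "...", "expected": "poster"}.
    """
    examples = [json.loads(line) for line in golden_path.read_text().splitlines() if line.strip()]
    correct = sum(1 for ex in examples if task(ex["input"]).strip().lower() == ex["expected"])
    accuracy = correct / len(examples)
    assert accuracy >= min_accuracy, (
        f"{golden_path.name}: accuracy {accuracy:.2%} fell below {min_accuracy:.0%}"
    )
    return accuracy
```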

The biggest remaining challenge with evals for us has been how to choose the best, most representative examples, since numerous studies have found that example quality matters more than quantity. Our latest research on this front includes Genetic-Pareto (GEPA) prompt optimization and similar techniques through the brilliant DSPy framework, which has been yielding exciting results.
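
For a flavor of what that looks like, here is a minimal DSPy sketch using its GEPA optimizer. The signature, metric, models, and optimizer arguments are illustrative and follow recent DSPy releases, so treat the exact names as assumptions rather than our production setup:

```python
import dspy

# Models, the signature, and the metric below are illustrative; GEPA's class and
# argument names follow recent DSPy releases and may differ in yours.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SummarizeListing(dspy.Signature):
    """Summarize an event listing in one faithful sentence."""
    listing_text: str = dspy.InputField()
    summary: str = dspy.OutputField()

program = dspy.Predict(SummarizeListing)

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Toy check: the venue name must survive summarization verbatim.
    return float(gold.venue.lower() in pred.summary.lower())

trainset = [
    dspy.Example(
        listing_text="Jazz night at The Blue Room, Friday 8pm, $15 at the door.",
        venue="The Blue Room",
    ).with_inputs("listing_text"),
    # ...a small set of carefully chosen, representative examples
]

optimizer = dspy.GEPA(metric=metric, auto="light", reflection_lm=dspy.LM("openai/gpt-4o"))
optimized_program = optimizer.compile(program, trainset=trainset, valset=trainset)
```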

AI for the Real World

The result of all these lessons has been a renewed appreciation for transformers, LLMs, and machine learning: not as a magical, alchemical force that reduces all feature work to a vibe, but as a powerful, transformative technology for making the real world more accessible and available to all of us. At Outgoing, these tools have allowed us to tackle the longstanding goal of reliably aggregating all our cities’ ephemeral events and activities, bringing us one step closer to getting people off their screens and out into the real world.

If this kind of problem appeals to you (helping people get off their screens and outside, connecting with one another), please reach out to us at [email protected]; we’d love to talk!