Wednesday, October 15, 2025

Monsters & Magicians: Testing the Illusions of Generative AI with Ben Simo (a PNSQC Live Blog)

Day 2 of the main program is underway (Day 3 includes the workshop presentations). It's crazy to feel how quickly this event goes by. As always, I've had a great time at this event and enjoyed my interactions with everyone. It's especially neat to realize how many people I know in this space, and it's even better when the keynote speaker is a literal friend, as is the case today with Ben Simo.

Machines that do things beyond our physical or mental powers have existed for thousands of years. We can go back to the Antikythera mechanism for what may quite possibly be the world's first "artificial intelligence," depending on how you want to interpret that term. Over time, as we have come to grips with the rules, laws, and repeatability of these activities, what was once magical has become commonplace in our everyday use.


PNSQChronicles: Brief interview with Ben Simo about "Monsters and Magicians" on YouTube

We now see Large Language Models and predictive text generation as the current amazeballs part of our reality. Many people are excited about these technologies, but at the same time, many risks are surfacing, with reports of organizations suffering actual harm or damage because of AI tools. We have heard of apps that jacked up rates arbitrarily, legal documents published with no basis in law, fact, or reality, and "virtual people" models that learned from the biases in their interactions and inputs and became incredibly racist and hateful.

These situations point to an interesting set of questions: how specifically can we as testers benefit from this wild new world of seemingly random query-and-response systems? How do we test software that produces inexplicable, fuzzy outputs? At the end of the day, software deals with patterns and algorithms. We have technologies such as machine learning, clustering, and ways that data can be grouped and sorted. If we give an LLM a closed data set and ask it to work with just that information, it does a remarkably good job of transforming or "creating" work and assets. The key is that we have given it a known, bounded set of information, so it can be guided specifically as to what to do with it. As we open it up to the outside world and give it fewer controls or restrictions, we open the model up to having to look at vague clusters of data with potentially dubious provenance.

A great example of this that I saw in practice during one of the workshops was the idea of creating spec documents in markdown that reside at the base of your document tree. By making the spec document the oracle of choice, and instructing the model that the spec document is the arbiter of what it should do, we limit the chance of hallucinations and odd reactions considerably (a rough sketch of the idea follows below). Not completely, but we make it much easier to track what the model is doing.
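Here is a minimal sketch of that "spec as oracle" idea, assuming an OpenAI-style chat completions client; the file path, model name, and prompt wording are my own placeholders, not anything from the workshop itself.

```python
# Load a bounded spec document and make it the only source of truth the
# model is allowed to use. Anything the spec doesn't cover should be
# refused rather than improvised.
from pathlib import Path

from openai import OpenAI

SPEC_PATH = Path("docs/spec.md")  # hypothetical spec at the base of the doc tree


def ask_against_spec(question: str) -> str:
    spec = SPEC_PATH.read_text(encoding="utf-8")
    system = (
        "You are working from a bounded specification. Answer ONLY from the "
        "spec below. If the spec does not cover the question, reply "
        "'Not covered by the spec' instead of guessing.\n\n"
        f"--- SPEC ---\n{spec}\n--- END SPEC ---"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        temperature=0,         # keep the output as repeatable as possible
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask_against_spec("What error should an expired token return?"))
```

This doesn't eliminate hallucination, but because every answer is supposed to trace back to one file, it becomes much easier to check the output against the spec and notice when the model wanders.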

An example I had fun with recently came from a podcast that told the story of a son of Hephaistos who went to Olympus during the waning days of the Greek pantheon's influence and took a remnant of the fire from Olympus (an homage to the myth of Prometheus). However, something about it seemed "off": it was being presented as an ancient myth, but it clearly wasn't. Could we identify where the original story came from? Through various prompts, reviewing the text transcript of the story, and other clues, the LLM determined that it was indeed a modern story told in the manner of an ancient Greek myth, and even noted that the meter and timing of the delivery closely mimicked Hesiod. I was not able to determine who wrote the story or where to find it on the internet, but the details the LLM did provide were interesting. Many of them felt fanciful, but all of them felt plausible. That's the danger with LLM output. Unless we are diligent, the very plausibility of the output can be accepted easily as though it were fact. Closer inspection found numerous inaccuracies (it referenced older myths that I was aware of and had history with, but attributed individuals and characters to them who didn't belong there). People who don't have this knowledge or familiarity might accept what's being presented as fact because it flows so naturally and just "feels right and authoritative".

In these cases, a Boolean PASS-vs-FAIL rationale doesn't work. It's not that the tests simply pass or fail; large elements do pass, yet some of the output is "off". It's not a total failure, but the output is "corrupted" in a way that means we cannot simply rely on it. Additionally, we can run tests multiple times and get slightly different outputs from the exact same inputs. In my own world of testing AI, we use a variety of benchmarks and monitors that help us determine whether the models are behaving in the ways we expect. We have a variety of tests and models that let us look at things like performance drift, comprehensive analysis, bias drift and disparity, the currency of the data, and homoscedasticity (a $10 word that means looking at the variance of the errors in a model and determining whether it is constant/consistent across all of our observations).
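To make the repeated-run idea concrete, here is a small, hedged sketch: collect the answers from several runs of the same prompt, score each against a reference, and look at both average quality and run-to-run spread instead of a single pass/fail bit. The similarity score and thresholds are stand-ins I made up for illustration, not the actual benchmarks we use.

```python
# Score repeated outputs for the same prompt and report quality plus stability.
import difflib
import statistics


def score_answer(answer: str, reference: str) -> float:
    """Crude textual-similarity score in [0, 1]; stands in for a real oracle."""
    return difflib.SequenceMatcher(None, answer.lower(), reference.lower()).ratio()


def repeated_run_report(answers: list[str], reference: str,
                        floor: float = 0.8, max_spread: float = 0.1) -> dict:
    scores = [score_answer(a, reference) for a in answers]
    mean, spread = statistics.mean(scores), statistics.pstdev(scores)
    return {
        "mean": mean,
        "stdev": spread,
        "worst": min(scores),
        # Two different ways to be "off": average quality too low, or outputs
        # too unstable from run to run even when the average looks fine.
        "quality_ok": mean >= floor,
        "stable_ok": spread <= max_spread,
    }


if __name__ == "__main__":
    reference = "A refund is issued within 5 business days."
    answers = [
        "Refunds are issued within 5 business days.",
        "A refund is issued within five business days.",
        "Refunds usually arrive in about a week, sometimes sooner.",
    ]
    print(repeated_run_report(answers, reference))
```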

A neat tool that we have is the "AI Risk Repository", which catalogs risks and the domains of model use where those risks tend to show up. By looking at the areas of potential risk, we can be better informed and consider aspects we can apply to our testing efforts (a toy example of turning risk domains into test charters follows below).
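As a toy illustration of how a risk catalog can feed a test plan, the sketch below maps a few risk domains to exploratory-testing charters. The domain names and test ideas are my own paraphrased examples, not an export of the repository itself.

```python
# Turn a handful of risk domains into exploratory-testing charters for a feature.
RISK_DOMAINS = {
    "Misinformation": "probe for confident, plausible-sounding claims with no support",
    "Discrimination & toxicity": "vary demographic details in prompts and diff the outputs",
    "Privacy": "check whether personal or training-data-like details can be coaxed out",
    "System safety & failures": "re-run identical prompts and measure output drift",
}


def charters_for(feature: str) -> list[str]:
    return [f"[{domain}] {feature}: {idea}" for domain, idea in RISK_DOMAINS.items()]


if __name__ == "__main__":
    for charter in charters_for("support chatbot"):
        print(charter)
```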

Ultimately, one of the key takeaways from this talk is the idea that we are in peril of being beguiled by the magic surrounding us. We want to be responsible with our use of AI, and thus we need to test, consider how best to apply what we learn, and spread that knowledge amongst our colleagues. Magic is often sleight of hand, and it's important that we understand how that sleight of hand is performed.
