AI Testing for Networks: Benchmarking Real Generative AI Experience Across Mobile and Fixed


Generative AI has moved from novelty to everyday digital behavior at exceptional speed. In a very short period, the market has shifted from text-first assistants to multimodal platforms able to interpret images, search the web, reason across documents and increasingly support real-time interactions. That shift matters far beyond the application layer. It is changing what users expect from connectivity, how traffic behaves across networks, and how operators should think about digital service assurance.

For telecom operators, this creates a new challenge. It is no longer enough to monitor network quality through traditional indicators alone, or to assume that AI performance is determined only by the model itself. From the end-user perspective, AI is now a service experience. And like any service experience, it depends on how the model, the application workflow and the network behave together.

That is why MedUX unveiled its new AI Testing capabilities at MWC Barcelona 2026: a new approach designed to benchmark leading AI platforms from the real user perspective across mobile and fixed environments.

From chatbot hype to a new traffic reality

The recent evolution of generative AI helps explain why this topic now deserves a dedicated testing framework.

The first major inflection point was the mainstream launch of conversational GenAI, which made AI interaction part of daily digital life at mass scale. The second was the rapid shift to multimodality, where users no longer only type prompts but also upload images, ask for web-grounded answers, combine text and media, and increasingly expect richer, faster, more contextual outputs. The third is now underway: AI is becoming more continuous, more embedded and more interactive across devices, applications and workflows.

This progression is important because it changes the network profile of AI usage. Traditional digital services such as video streaming are mostly downlink-dominant. Many AI interactions are different. They can be more conversational, more latency-sensitive, and in some cases more dependent on uplink capacity because users are sending prompts, images, screenshots, video snippets or contextual data to the model before receiving an answer.

That shift is already visible in broader industry forecasts. Mobile network data traffic continues to grow strongly, while 5G is carrying an increasing share of it. At the same time, AI is starting to reshape not only traffic volume, but traffic character: more bidirectional flows, more cloud interaction, more dependence on responsiveness, and more pressure on transport and interconnection.

In other words, AI is not just another app category. It is a new class of digital interaction with its own Quality of Experience dynamics.
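A quick calculation shows why uplink capacity can dominate a multimodal prompt. The sketch below assumes a 3 MB photo and two uplink rates (5 and 50 Mbit/s). These figures are illustrative, not measurements. The formula also ignores protocol overhead, TCP ramp-up and server-side processing, so real transfers take longer.

```python
def upload_seconds(payload_bytes: int, uplink_mbps: float) -> float:
    """Ideal time to push a payload upstream, ignoring overhead and latency."""
    # bytes -> bits, then divide by the link rate in bits per second
    return payload_bytes * 8 / (uplink_mbps * 1_000_000)


if __name__ == "__main__":
    photo = 3_000_000  # hypothetical 3 MB image attached to a prompt
    for mbps in (5, 50):  # hypothetical congested vs. good uplink
        print(f"{mbps} Mbit/s uplink: {upload_seconds(photo, mbps):.2f} s "
              "before the model even sees the prompt")
```

With these assumed figures, the script prints 4.80 s and 0.48 s. On the slower uplink, the user waits several seconds before the model has received the image, and none of that delay is visible in a model-only benchmark.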

Why operators need a clearer view of AI performance

For operators, there are two common mistakes to avoid.

The first is to look at AI as a pure cloud or model issue. A model may perform well in benchmark tables, but that does not automatically tell an operator what the end user will actually experience on a live mobile or fixed network. Real performance is shaped by the full delivery chain: access conditions, transport, routing, peering, device behavior, API workflow, CDN dynamics and service stability.

The second is to treat AI like a single use case. In reality, not all AI interactions stress the network in the same way. A visual recognition task, a web-grounded search and a text-analysis request are not equivalent. They have different latency profiles, different payload behavior and different accuracy expectations. The best-performing model for one task may not be the best for another.

That is exactly why testing must become more task-aware and more user-centric.

Operators need to understand questions such as:

Which AI platforms deliver the best end-user experience on my networks?
How do different models behave across mobile and fixed environments?
Is the issue speed, consistency, or answer quality?
Which AI tasks are more sensitive to network conditions?
Where should I optimize if I want to improve user perception of AI services?

This is no longer a theoretical discussion. As AI assistants become more embedded in search, customer care, productivity, commerce and device interfaces, the operator that cannot see AI performance clearly will struggle to understand a growing part of the digital experience it is delivering.

Introducing MedUX AI Testing

MedUX AI Testing is designed to benchmark AI platforms in a way that reflects real interaction experience.

MedUX benchmarks leading AI models across different task types, including visual analysis, web search and text analysis. Tests are launched through the AI platforms themselves and evaluate key metrics such as response time, accuracy and reliability. The framework is designed to replicate the real AI interaction experience and, importantly, to test against the actual platforms. This means the analysis inherently includes network-side factors such as Content Delivery Networks (CDNs) and other delivery configurations.

This is a critical point. For operators, what matters is not only theoretical model capability, but how the service behaves in practice when it reaches the end user over a real network.

Because MedUX applies this testing approach across both mobile and fixed environments, it enables a consistent end-user-centric view of AI performance across the full connectivity landscape. That makes it especially relevant for convergent operators that need to assess experience quality across both domains with a common methodology.
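The minimal Python sketch below illustrates one way to capture these metrics. It times each request from the client side, so the figure naturally includes access, transport and CDN effects. It then rolls individual trials up into reliability, accuracy and response time. Everything in it is an assumption for illustration. `send_prompt` stands in for a real platform client, and `is_correct` stands in for a task-specific checker. The metric definitions (accuracy over returned answers, median response time) are our own choices, not MedUX's documented method.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TrialResult:
    """Outcome of one prompt sent to one AI platform (hypothetical record layout)."""
    completed: bool                   # did the platform return an answer at all?
    correct: bool                     # did the answer match the expected result?
    response_time_s: Optional[float]  # client-side seconds, None if the request failed


def run_trial(send_prompt: Callable[[str], str], prompt: str,
              is_correct: Callable[[str], bool]) -> TrialResult:
    """Send one prompt and time it from the client side.

    Timing on the client means the figure includes everything between user and
    model: access network, transport, peering, CDN and the model itself.
    """
    start = time.perf_counter()
    try:
        answer = send_prompt(prompt)  # hypothetical stand-in for a real platform client
    except Exception:
        # A failed or aborted request counts against reliability.
        return TrialResult(completed=False, correct=False, response_time_s=None)
    elapsed = time.perf_counter() - start
    return TrialResult(completed=True, correct=is_correct(answer), response_time_s=elapsed)


def summarize(results: List[TrialResult]) -> dict:
    """Aggregate trials into reliability, accuracy and median response time."""
    done = [r for r in results if r.completed]
    times = [r.response_time_s for r in done]
    return {
        # share of requests that returned an answer at all
        "reliability_pct": 100.0 * len(done) / len(results),
        # share of returned answers that were correct
        "accuracy_pct": 100.0 * sum(r.correct for r in done) / len(done) if done else 0.0,
        # the median is less sensitive than the mean to a single slow outlier
        "median_response_s": statistics.median(times) if times else float("nan"),
    }


if __name__ == "__main__":
    def fake_client(prompt: str) -> str:  # hypothetical stub in place of a real AI platform
        return "Barcelona"

    trials = [run_trial(fake_client, "Which city is shown in this image?",
                        lambda a: "barcelona" in a.lower()) for _ in range(5)]
    print(summarize(trials))
```

You can run the same harness unchanged over a mobile connection and a fixed connection. Because only the access path differs between the two runs, the results are directly comparable.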

AI benchmarking in practice: what the first comparative tests show

The following three benchmark examples illustrate why AI testing should not be reduced to a single winner-takes-all ranking.

Vision benchmark

In the vision benchmark, the AI is asked to identify the city corresponding to an image, effectively testing the ability to interpret visual content.

The result is already instructive. OpenAI GPT 5.2 in Thinking mode achieved the lowest response time in this example benchmark. In accuracy, OpenAI GPT 5.2 and Gemini 3 Pro both reached 100% correct responses. Reliability was perfect across all tested models, with all of them successfully completing the benchmark.

The implication is clear: for image-based reasoning and visual recognition tasks, both speed and precision matter. The strongest result is not simply the model that answers first, but the one that combines fast response with top accuracy and full reliability. On that basis, the MedUX presentation identifies GPT 5.2 Thinking as the best-performing option for visual content analysis.

AI Models Benchmark | Vision – comparative performance across leading AI platforms in visual analysis tasks.
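The ranking logic described above can be expressed as a simple ordered rule. The winner must first combine full reliability and top accuracy, and only then does speed count. This is a sketch under our own assumptions. The exact tie-breaking order is our reading of the text, the model names are placeholders, and the figures in the example are illustrative rather than MedUX results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelScore:
    """Headline metrics for one model on one task (placeholder values only)."""
    name: str
    reliability_pct: float
    accuracy_pct: float
    median_response_s: float


def pick_best(scores: List[ModelScore]) -> ModelScore:
    """Assumed ranking rule: reliability first, then accuracy, then speed."""
    top_reliability = max(s.reliability_pct for s in scores)
    reliable = [s for s in scores if s.reliability_pct == top_reliability]
    top_accuracy = max(s.accuracy_pct for s in reliable)
    accurate = [s for s in reliable if s.accuracy_pct == top_accuracy]
    # Only among equally reliable, equally accurate models does speed decide.
    return min(accurate, key=lambda s: s.median_response_s)


if __name__ == "__main__":
    candidates = [
        ModelScore("model_a", 100.0, 80.0, 4.0),   # fastest, but misses some answers
        ModelScore("model_b", 100.0, 100.0, 9.0),
        ModelScore("model_c", 100.0, 100.0, 7.0),
    ]
    print(pick_best(candidates).name)  # model_c
```

The same ordering could be written in one line as `max(scores, key=lambda s: (s.reliability_pct, s.accuracy_pct, -s.median_response_s))`. The explicit filters are kept here so that each step of the reasoning stays visible.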

Search benchmark

In the search benchmark, the models are asked to search the internet to find data about a specific event.

Here, OpenAI GPT 5.2 in Instant mode delivered the lowest response time, while OpenAI GPT 5.2 and Gemini 3 Pro reached the highest accuracy with 100% correct responses. Again, reliability was perfect across all models.

This is particularly relevant for operators because web-grounded AI use cases are becoming more common in consumer and enterprise journeys. As AI assistants evolve from pure text generators into answer engines connected to live information, the perceived quality of the service depends heavily on how quickly and how accurately that answer arrives. In this benchmark, the MedUX presentation highlights GPT 5.2 in both Thinking and Instant modes as the best-performing models for searching specific content on the internet.

AI Models Benchmark | Search – comparative performance for web-grounded AI requests.

Text benchmark

The text benchmark asks the AI to analyze text by performing a specific task: counting characters in a text string.

This case is especially useful because it shows why AI benchmarking should always be contextual. OpenAI GPT 5.2 in Instant mode obtained the lowest response time. However, the highest accuracy was achieved by OpenAI GPT 4 Thinking and by all Gemini models, each reaching 100% correct responses. Reliability was again perfect across the board.

The takeaway is that the fastest model is not automatically the best model for every task. In text analysis, response time remains important, but accuracy becomes decisive when the task is deterministic and easy to verify: a character count is either right or wrong. That is why the MedUX presentation identifies GPT 4 Thinking as the best model for this benchmark.

AI Models Benchmark | Text – comparative performance for text analysis tasks.
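The correct answer to a character-counting prompt can be computed exactly, so scoring can be fully automated. The sketch below shows one assumed way to do it. It treats a "character" as a Unicode code point including spaces, which is what Python's `len` counts. It reads the model's answer as the last integer in its reply. Both choices are illustrative assumptions. A real benchmark would need to fix the definition in the prompt itself, since, for example, an accented letter can be stored as one or two code points.

```python
import re
from typing import Optional


def expected_count(text: str) -> int:
    """Ground truth, assuming 'character' means a Unicode code point, spaces included."""
    return len(text)


def extract_count(answer: str) -> Optional[int]:
    """Heuristic: take the last integer in the model's free-text reply.

    Thousands separators are stripped first so '1,024' reads as 1024.
    """
    numbers = re.findall(r"\d+", answer.replace(",", ""))
    return int(numbers[-1]) if numbers else None


def score_answer(text: str, answer: str) -> bool:
    """True only when the stated count matches the exact count."""
    return extract_count(answer) == expected_count(text)


if __name__ == "__main__":
    sample = "hello world"  # 11 code points, including the space
    print(score_answer(sample, "The string contains 11 characters."))  # True
    print(score_answer(sample, "I count 10 characters."))              # False
```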

What these first results mean for telecom operators

The three benchmark examples point to a broader strategic message.

First, AI performance is task-dependent. Operators should not assume that one model, one execution mode or one benchmark result will represent the full AI experience. Vision, search and text analysis each tell a different story.

Second, end-user perception depends on more than raw model capability. The relevant question for operators is not simply which model scores highest in isolated lab conditions. It is which combination of model, delivery path and network conditions produces the best real experience for the task the user is trying to complete.

Third, AI testing should become part of service assurance and competitive intelligence. If AI is increasingly part of how users search, interact, create and consume information, then operators need visibility into how those services actually perform across their own network and versus competitors.

This opens several high-value use cases:

AI service assurance: detect where AI-related customer experience degrades in real conditions.
Benchmarking by network or ISP: compare how leading AI platforms behave across fixed and mobile providers.
Optimization prioritization: identify whether the bottleneck is latency, instability or model-specific accuracy behavior.
Product and marketing claims: support a more evidence-based view of digital experience leadership.
Convergent experience analysis: compare how the same AI service behaves on mobile versus fixed access.

For customer experience, this matters even more than with many conventional over-the-top (OTT) services. When a user asks an AI model to interpret an image, retrieve information or analyze a piece of text, delays and low-quality answers are especially visible. AI is inherently interactive. The user is waiting for the system to think, respond and often guide the next step. That makes responsiveness and accuracy central to satisfaction.
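The optimization prioritization and convergent experience analysis use cases listed above both involve comparing the same AI service across access types. The sketch below outlines a hypothetical heuristic for that comparison. If accuracy is low on both mobile and fixed, the model is the likelier cause. If reliability drops, or if one access is markedly slower, the network path is the likelier cause. The thresholds (a 25% response-time gap, 99% reliability, 90% accuracy) and the labels are illustrative assumptions, not part of the MedUX methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessResult:
    """Headline metrics for one model/task pair on one access type (hypothetical layout)."""
    reliability_pct: float
    accuracy_pct: float
    median_response_s: float


def likely_bottleneck(mobile: AccessResult, fixed: AccessResult,
                      latency_gap: float = 0.25, min_reliability: float = 99.0,
                      min_accuracy: float = 90.0) -> str:
    """Label where to look first. All thresholds are illustrative assumptions."""
    # Low accuracy on BOTH accesses suggests the model, not the network.
    if mobile.accuracy_pct < min_accuracy and fixed.accuracy_pct < min_accuracy:
        return "model-specific accuracy"
    # Failed or aborted requests on either access suggest an unstable delivery path.
    if min(mobile.reliability_pct, fixed.reliability_pct) < min_reliability:
        return "instability"
    # A large relative gap in response time between accesses points at latency.
    slower, faster = sorted([mobile.median_response_s, fixed.median_response_s], reverse=True)
    if faster > 0 and (slower - faster) / faster > latency_gap:
        return "latency"
    return "no clear bottleneck"


if __name__ == "__main__":
    mobile = AccessResult(100.0, 100.0, 6.0)  # hypothetical figures
    fixed = AccessResult(100.0, 100.0, 4.0)
    print(likely_bottleneck(mobile, fixed))  # latency
```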

Why the end-user-centric perspective matters now

The market is moving quickly toward a reality where AI becomes part of mainstream digital behavior, not just an innovation showcase.

As traffic grows and AI interactions become richer, the operator view needs to evolve accordingly. It is not enough to know that the network is available. It is not enough to know that peak throughput is high. And it is not enough to know that a model performs well in abstract benchmark rankings.

Operators need to know what users actually experience when they use AI on real networks.

That means measuring AI the same way MedUX has long argued other digital services should be measured: from the perspective of the end user, in real conditions, with a methodology that connects technical metrics to perceived experience.

This is where MedUX AI Testing becomes strategically important. It extends the logic of QoE benchmarking into one of the fastest-growing categories of digital interaction. And it does so in a way that is directly relevant for mobile and fixed operators alike.

Conclusion

Generative AI is becoming part of the daily experience layer of connectivity. It is creating new traffic patterns, new expectations and new service assurance needs.

For operators, this means AI can no longer be treated as a black box happening somewhere in the cloud. It must be monitored, benchmarked and understood from the real user perspective.

MedUX AI Testing responds to that need by bringing structured, end-user-centric benchmarking to leading AI platforms across mobile and fixed networks. By measuring response time, accuracy and reliability across concrete AI tasks such as vision, search and text analysis, it helps operators move from assumptions to evidence.

And that is the real value of AI testing for telecoms: not simply knowing which model is powerful in theory, but understanding which AI experience your customers are actually getting in practice.

About MedUX

At MedUX, we provide tools that help telecommunications regulators ensure that operators comply with Quality of Experience (QoE) and Quality of Service (QoS) standards for fixed, mobile, and digital services—based on real end-user data and insights. MedUX delivers a comprehensive view of the state and quality of digital services, as well as how they are perceived by end users.

MedUX offers innovative solutions for the telecommunications industry to tackle new challenges, enabling our clients to assess the quality of services provided, empower users, and meet regulatory requirements. If you'd like to learn more about our solutions, feel free to contact us at hello@medux.com.
