THE LOCAL STACK
(Local) AI  ·  Leuven
events  ·  meetups  ·  talks  ·  conferences

Build a Visual Agent for GUI Navigation

Hands-on Workshop — Teaching AI to use interfaces like humans
9 April 2026  ·  18:00 – 19:30 CEST  ·  Belgium AI ML & CV Meetup  ·  Online  ·  Free

About the Workshop

A hands-on, code-along workshop on one of the more technically interesting problems in applied AI right now: building systems that can navigate graphical user interfaces the way humans do — not by hard-coding button positions or using accessibility APIs, but by seeing the screen and deciding where to click based on visual understanding and language instructions.

Visual agents that operate GUIs combine computer vision, natural language understanding, and spatial reasoning. They are the foundation of a class of automation tools — browser agents, OS-level copilots, automated testers — that could fundamentally change how software is operated and tested. Building them well requires careful work on dataset quality, model evaluation, and iterative debugging.

This workshop walks through the full pipeline using FiftyOne for dataset curation and GUI-Actor for inference — a model from a NeurIPS 2025 Microsoft Research paper that takes a coordinate-free approach to GUI grounding.

331 attendees from 48 groups across the global AI ML & CV meetup network. Hosted by Jimmy Guerrero.

What You Will Build

01 Dataset Creation & Management
Structure, annotate, and load GUI interaction datasets using the COCO4GUI standardised format.
02 Data Exploration & Analysis
Use FiftyOne’s interactive interface to visualise datasets, analyse action distributions, and understand annotation patterns.
03 Multimodal Embeddings
Compute embeddings for screenshots and UI element patches to enable similarity search and retrieval across a GUI dataset.
04 Model Inference with GUI-Actor
Run Microsoft’s GUI-Actor model to predict interaction points from natural language instructions — without generating screen coordinates.
05 Performance Evaluation
Measure model accuracy using standard metrics and normalised click distance to assess localisation precision.
06 Failure Analysis
Investigate model failures through attention maps, error pattern analysis, and systematic debugging — distinguishing attention misalignment from localisation errors.
07 Synthetic Data Generation
Use FiftyOne plugins to augment training data with synthetic task descriptions and variations to improve model robustness.
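Steps 05 and 06 hinge on the evaluation metrics. Here is a minimal sketch of the two measures named above — grounding accuracy (did the predicted click land inside the target element's box?) and normalised click distance — assuming Euclidean distance normalised by the screen diagonal; the exact normalisation used in the workshop materials may differ:

```python
import math

def normalised_click_distance(pred, target, width, height):
    """Euclidean distance between predicted and ground-truth click points,
    normalised by the screen diagonal so the score is resolution-independent
    (0 = perfect; larger = worse)."""
    diagonal = math.hypot(width, height)
    return math.hypot(pred[0] - target[0], pred[1] - target[1]) / diagonal

def hit_rate(preds, boxes):
    """Fraction of predicted clicks that land inside the target element's
    bounding box (x, y, w, h) -- the usual GUI-grounding accuracy metric."""
    hits = 0
    for (px, py), (x, y, w, h) in zip(preds, boxes):
        if x <= px <= x + w and y <= py <= y + h:
            hits += 1
    return hits / len(preds)
```

Distance-based scores complement hit rate: two models with the same hit rate can differ sharply in how far their misses land from the target.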

Speaker

Harpreet Sahota

Hacker-in-Residence & Machine Learning Engineer — Voxel51

Harpreet Sahota is a machine learning engineer and hacker-in-residence at Voxel51, the company behind FiftyOne. His work sits at the intersection of deep learning, generative AI, and applied research — with a particular focus on RAG (Retrieval-Augmented Generation), autonomous agents, and multimodal AI systems that combine vision and language.

As hacker-in-residence, his role is hands-on: building real systems with the tools he teaches, identifying where those tools break, and developing workflows that practitioners can actually use in production. For this workshop he brings together two of the more interesting open-source pieces in visual AI right now — FiftyOne for dataset management and evaluation infrastructure, and GUI-Actor for the inference layer — and walks through the full pipeline from raw data to failure analysis.

Machine Learning · Multimodal AI · RAG · Autonomous Agents · Computer Vision · Generative AI

Topics

GUI Agents — AI that Operates Interfaces ↗

A GUI agent is an AI system that can interact with software interfaces autonomously — navigating menus, filling forms, clicking buttons, reading screen content — the way a human operator would. The potential applications are significant: automated software testing, OS-level automation, browser agents that can complete multi-step tasks on the web, and accessibility tools for users who cannot operate interfaces manually. The technical challenge is that interfaces are visually complex, highly variable across applications, and change with every software update. A hard-coded automation script breaks the moment the UI changes. A visual agent that understands the semantics of what it sees — “this is a submit button” rather than “this is a grey rectangle at coordinates (412, 308)” — is far more robust.

GUI-Actor — Coordinate-Free Visual Grounding ↗

Most current GUI AI models predict the target interaction point by generating screen coordinates as text tokens — outputting something like x=0.47, y=0.23. This has fundamental problems: the model’s language head is not well-suited to precise spatial reasoning, the supervision signal is ambiguous (a mis-click by 3 pixels incurs the same loss as one by 300), and the approach doesn’t naturally handle cases where multiple valid targets exist. GUI-Actor is a NeurIPS 2025 paper from Microsoft Research that rethinks this. The key insight: humans don’t calculate coordinates before clicking — they perceive the target element and act directly. GUI-Actor adds an attention-based “action head” to a vision-language model (Qwen2-VL / Qwen2.5-VL backbone) that attends to the relevant visual region and produces a grounding directly from that attention map. The result: GUI-Actor-7B surpasses models with 10× more parameters on the ScreenSpot-Pro benchmark, achieving a score of 44.6 with a Qwen2.5-VL backbone. Available on Hugging Face.
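The attention-to-action idea can be illustrated with a toy sketch. This is not GUI-Actor's actual architecture — just the core intuition: given attention weights over the image patch grid, act on the most-attended patch instead of decoding coordinate tokens:

```python
def attention_to_click(attn, patch_size, img_w, img_h):
    """Coordinate-free grounding sketch: `attn` is a 2D list [rows][cols]
    of attention weights over the screenshot's patch grid. Instead of
    generating coordinate tokens, pick the most-attended patch and return
    its centre in pixel coordinates."""
    best = max(
        ((r, c) for r in range(len(attn)) for c in range(len(attn[0]))),
        key=lambda rc: attn[rc[0]][rc[1]],
    )
    r, c = best
    # Centre of the winning patch, clamped to the image bounds.
    x = min(c * patch_size + patch_size / 2, img_w)
    y = min(r * patch_size + patch_size / 2, img_h)
    return x, y
```

Because the prediction is a distribution over patches rather than a single token sequence, the same attention map can also surface multiple candidate targets — one of the ambiguity cases coordinate generation handles poorly.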

Vision-Language Models (VLMs) ↗

Vision-Language Models are neural networks that jointly understand images and text. Where a language model like GPT processes only text, a VLM can take a screenshot as input alongside a natural language instruction and reason about both together. This makes them the natural architecture for GUI agents: given a screenshot and the instruction “click the login button”, a VLM can identify which element on screen corresponds to that instruction. Modern VLMs — GPT-4o, Gemini, Qwen2-VL, Claude — have strong visual understanding out of the box, but applying them to precise GUI interaction (which requires pixel-accurate click targets) requires additional training and specialised architectures like GUI-Actor’s action head.

Dataset Curation for Visual AI ↗

Building a visual AI model is not just about the model architecture — the quality of the training and evaluation data determines whether the model generalises to real-world conditions. For GUI agents this is particularly tricky: interfaces vary enormously across operating systems, applications, themes, screen resolutions, and languages. A dataset curated from one application may produce a model that fails completely on another. Good dataset curation for GUI agents involves capturing diverse screenshot-instruction-action triples, standardising annotation formats (COCO4GUI in this workshop), auditing for label errors, computing embeddings to identify redundant or underrepresented samples, and generating synthetic variations to cover edge cases. FiftyOne is the open-source toolkit used throughout this workflow — its interactive interface makes it practical to visualise, filter, and correct large visual datasets without writing bespoke tooling.
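One curation step mentioned above — using embeddings to flag redundant samples — can be sketched without any framework. In practice FiftyOne provides this on top of its dataset tooling; the underlying idea is cosine similarity over screenshot embeddings (the function names below are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(embeddings, threshold=0.98):
    """Flag index pairs of screenshots whose embeddings are nearly
    identical -- candidates for deduplication before training."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The same similarity scores, inverted, identify underrepresented regions of the dataset — samples with no close neighbours, which are prime candidates for the synthetic-variation step.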

Event Details

Date: Thursday, 9 April 2026

Time: 18:00 – 19:30 CEST (09:00 – 10:30 Pacific)

Format: Hands-on workshop — bring a laptop, follow along

Location: Online — Zoom (link visible after registration)

Network: 331 attendees from 48 groups

Organiser: Belgium AI ML & CV Meetup / Voxel51

Event type: Workshop

Target audience: ML engineers, computer vision practitioners, and developers interested in AI automation and agent systems. Basic Python and familiarity with machine learning concepts recommended.

Attend

Price: Free
Format: Online — Zoom
Registration: Required — Zoom link sent on signup
↗  Register on Meetup

Organised by

Belgium AI ML & CV Meetup

One of Belgium’s most active applied AI communities and part of a global network of 48 AI, ML, and computer vision meetup groups. The Belgium chapter covers machine learning, computer vision, and practical AI. Hosted by Jimmy Guerrero.

↗ meetup.com/belgium-ai-machine-learning

Voxel51

The company behind FiftyOne, the open-source toolkit for building high-quality datasets and evaluating computer vision and AI models (10,500+ GitHub stars). This global workshop series is part of their developer community programme.

↗ voxel51.com

Resources