About the Workshop
A hands-on, code-along workshop on one of the more technically interesting problems in applied AI right now: building systems that can navigate graphical user interfaces the way humans do — not by hard-coding button positions or using accessibility APIs, but by seeing the screen and deciding where to click based on visual understanding and language instructions.
Visual agents that operate GUIs combine computer vision, natural language understanding, and spatial reasoning. They are the foundation of a class of automation tools — browser agents, OS-level copilots, automated testers — that could fundamentally change how software is operated and tested. Building them well requires careful work on dataset quality, model evaluation, and iterative debugging.
This workshop walks through the full pipeline using FiftyOne for dataset curation and Microsoft’s GUI-Actor model for inference — a NeurIPS 2025 paper from Microsoft Research that takes a coordinate-free approach to GUI grounding.
331 attendees from 48 groups across the global AI, ML & Computer Vision Meetup network, of which Belgium is one chapter. Hosted by Jimmy Guerrero.
Speaker
Harpreet Sahota
Hacker-in-Residence & Machine Learning Engineer — Voxel51
Harpreet Sahota is a machine learning engineer and hacker-in-residence at Voxel51, the company behind FiftyOne. His work sits at the intersection of deep learning, generative AI, and applied research — with a particular focus on RAG (Retrieval-Augmented Generation), autonomous agents, and multimodal AI systems that combine vision and language.
As hacker-in-residence, his role is hands-on: building real systems with the tools he teaches, identifying where those tools break, and developing workflows that practitioners can actually use in production. For this workshop he brings together two of the more interesting open-source pieces in visual AI right now — FiftyOne for dataset management and evaluation infrastructure, and GUI-Actor for the inference layer — and walks through the full pipeline from raw data to failure analysis.
Topics
GUI Agents — AI that Operates Interfaces ↗
A GUI agent is an AI system that can interact with software interfaces autonomously — navigating menus, filling forms, clicking buttons, reading screen content — the way a human operator would. The potential applications are significant: automated software testing, OS-level automation, browser agents that can complete multi-step tasks on the web, and accessibility tools for users who cannot operate interfaces manually. The technical challenge is that interfaces are visually complex, highly variable across applications, and change with every software update. A hard-coded automation script breaks the moment the UI changes. A visual agent that understands the semantics of what it sees — “this is a submit button” rather than “this is a grey rectangle at coordinates (412, 308)” — is far more robust.
GUI-Actor — Coordinate-Free Visual Grounding ↗
Most current GUI AI models predict the target interaction point by generating screen coordinates as text tokens — outputting something like x=0.47, y=0.23. This has fundamental problems: the model’s language head is not well-suited to precise spatial reasoning, the supervision signal is ambiguous (a mis-click by 3 pixels is the same loss as one by 300), and the approach doesn’t naturally handle cases where multiple valid targets exist. GUI-Actor is a NeurIPS 2025 paper from Microsoft Research that rethinks this. The key insight: humans don’t calculate coordinates before clicking — they perceive the target element and act directly. GUI-Actor adds an attention-based “action head” to a vision-language model (Qwen2-VL / Qwen2.5-VL backbone) that attends to the relevant visual region and produces a grounding directly from that attention map. The result: GUI-Actor-7B surpasses models with 10× more parameters on the ScreenSpot-Pro benchmark, achieving 44.6 with Qwen2.5-VL. Available on Hugging Face.
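The core mechanism can be sketched in a few lines: instead of decoding coordinate text, take an attention distribution over image patches and read a click point directly off it. This is a simplified, hypothetical illustration of the idea, not GUI-Actor's actual implementation (the function name, grid layout, and centroid readout here are assumptions; the real model grounds via a dedicated action head over visual tokens).

```python
import numpy as np

def attention_to_click(attn, grid_w, grid_h):
    """Toy coordinate-free grounding: turn an attention distribution
    over image patches into a normalised (x, y) click point.

    attn: 1-D array of attention weights, one per patch, row-major.
    Returns the attention-weighted centroid of patch centres in [0, 1].
    """
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()                      # normalise to a distribution
    ys, xs = np.divmod(np.arange(grid_w * grid_h), grid_w)
    # Patch centres in normalised screen coordinates.
    cx = (xs + 0.5) / grid_w
    cy = (ys + 0.5) / grid_h
    return float(attn @ cx), float(attn @ cy)

# A 4x4 patch grid where all attention mass sits on one patch (row 1, col 1):
attn = np.zeros(16)
attn[5] = 1.0
x, y = attention_to_click(attn, grid_w=4, grid_h=4)
# The click lands at that patch's centre: (1.5/4, 1.5/4) = (0.375, 0.375)
```

Note how the supervision problem from the text disappears: attention mass spread over several patches yields a point between them, and multiple valid targets simply show up as multiple modes in the attention map.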
Vision-Language Models (VLMs) ↗
Vision-Language Models are neural networks that jointly understand images and text. Where a text-only language model processes words alone, a VLM can take a screenshot as input alongside a natural language instruction and reason about both together. This makes them the natural architecture for GUI agents: given a screenshot and the instruction “click the login button”, a VLM can identify which element on screen corresponds to that instruction. Modern VLMs — GPT-4o, Gemini, Qwen2-VL, Claude — have strong visual understanding out of the box, but applying them to precise GUI interaction (which requires pixel-accurate click targets) requires additional training and specialised architectures like GUI-Actor’s action head.
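Concretely, "a screenshot alongside a natural language instruction" means a single multimodal request. The sketch below builds one in the common chat-message shape that most VLM libraries accept; the exact field names vary by model and SDK, so treat the schema and the prompt wording as illustrative assumptions rather than a specific API.

```python
def build_grounding_request(screenshot_path, instruction):
    """Combine a screenshot and an instruction into one multimodal
    chat message. The dict layout mirrors the widely used chat format
    (role + mixed image/text content parts); field names are
    illustrative and differ slightly between VLM libraries.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": screenshot_path},
                {
                    "type": "text",
                    "text": f"Locate the element for this instruction: {instruction}",
                },
            ],
        }
    ]

messages = build_grounding_request("screenshots/login.png", "click the login button")
```

The key point is that image and text travel in the same turn, so the model can attend across both when deciding which on-screen element the instruction refers to.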
Dataset Curation for Visual AI ↗
Building a visual AI model is not just about the model architecture — the quality of the training and evaluation data determines whether the model generalises to real-world conditions. For GUI agents this is particularly tricky: interfaces vary enormously across operating systems, applications, themes, screen resolutions, and languages. A dataset curated from one application may produce a model that fails completely on another. Good dataset curation for GUI agents involves capturing diverse screenshot-instruction-action triples, standardising annotation formats (COCO4GUI in this workshop), auditing for label errors, computing embeddings to identify redundant or underrepresented samples, and generating synthetic variations to cover edge cases. FiftyOne is the open-source toolkit used throughout this workflow — its interactive interface makes it practical to visualise, filter, and correct large visual datasets without writing bespoke tooling.
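Auditing for label errors, the step above, often starts with simple programmatic sanity checks before any tooling is opened. The sketch below flags screenshot-instruction-action triples whose click point falls outside the screenshot or outside the target element's bounding box; the record fields (`click`, `bbox`) are assumptions for illustration and may be named differently in the workshop's COCO4GUI format, and in practice FiftyOne is then used to visualise and correct the flagged samples.

```python
def audit_triples(triples, img_w, img_h):
    """Flag common label errors in screenshot-instruction-action triples.

    Each triple is assumed to carry a pixel-space click point and a
    COCO-style [x, y, width, height] bounding box for the target element
    (field names are illustrative). Returns (index, reason) pairs.
    """
    errors = []
    for i, t in enumerate(triples):
        x, y = t["click"]                  # pixel coordinates of the action
        x0, y0, w, h = t["bbox"]           # COCO-style [x, y, width, height]
        if not (0 <= x < img_w and 0 <= y < img_h):
            errors.append((i, "click outside screenshot"))
        elif not (x0 <= x <= x0 + w and y0 <= y <= y0 + h):
            errors.append((i, "click outside target element"))
    return errors

triples = [
    {"instruction": "click the login button", "click": (430, 310), "bbox": [400, 290, 80, 40]},
    {"instruction": "open settings",          "click": (1990, 50), "bbox": [1200, 40, 60, 30]},
]
# The second triple's click lies outside a 1920x1080 screenshot.
errors = audit_triples(triples, img_w=1920, img_h=1080)
```

Checks like this catch the mechanical errors cheaply; the harder curation work (redundant samples, underrepresented UI styles) is where embeddings and interactive filtering come in.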
Event Details
Date: Thursday, 9 April 2026
Time: 18:00 – 19:30 CEST (09:00 – 10:30 Pacific)
Format: Hands-on workshop — bring a laptop, follow along
Location: Online — Zoom (link visible after registration)
Network: 331 attendees from 48 groups
Organiser: Belgium AI ML & CV Meetup / Voxel51
Event type: Workshop
Target audience: ML engineers, computer vision practitioners, and developers interested in AI automation and agent systems. Basic Python and familiarity with machine learning concepts recommended.
Attend
Organised by
Belgium AI ML & CV Meetup
One of Belgium’s most active applied AI communities and part of a global network of 48 AI, ML, and computer vision meetup groups. The Belgium chapter covers machine learning, computer vision, and practical AI. Hosted by Jimmy Guerrero.
↗ meetup.com/belgium-ai-machine-learning
Voxel51
The company behind FiftyOne, the open-source toolkit for building high-quality datasets and evaluating computer vision and AI models (10,500+ GitHub stars). This global workshop series is part of their developer community programme.
↗ voxel51.com