NVIDIA LocateAnything: An Open-Source Visual Grounding Model That Finds Objects in Images 10× Faster

NVIDIA, with Hong Kong Polytechnic University and Nanjing University, has released LocateAnything, an open-source visual grounding model: give it one image and one sentence, and it draws a box around whatever you asked for. Thanks to a new "draw the whole box at once" parallel-decoding method, it locates nearly 13 targets per second on a single top-end GPU — outpacing comparable models on both speed and accuracy — and it's free to download.
Advertisement

Today's AI can hold a conversation, write code, and generate images. Yet one task that sounds almost trivial has stayed slow and clumsy: pointing precisely to where a given object sits inside an image. In the research world, this is called visual grounding.

It looks simple, but it's exactly where robots and "AI that operates your phone or computer for you" tend to choke. A robot reaching for a cup first needs the cup's precise location in the frame; an AI placing your takeout order first has to find the "Order Now" button in a screenshot. Slow localization stalls the entire chain of action.

What LocateAnything actually does

The visual grounding model LocateAnything does precisely this: hand it an image, describe in one sentence what you're looking for, and it marks the matching spot with a bounding box.

The range of things it can recognize is broad — real-world objects, buttons and icons on a phone screen, even text printed inside a picture. One model now handles work that used to require several separate tools.

Why it's so much faster

Advertisement

The key is a technique the team calls parallel box decoding.

Older models report a box's position the way you'd read out a long ID number — emitting the four coordinates one digit at a time, slowly and with plenty of room for error. LocateAnything instead treats an entire box as something said in a single breath, producing all four coordinates at once.

It also ships with three modes: a fast mode tuned for speed, a slow mode tuned for precision, and a hybrid mode that runs fast by default and automatically falls back to slow on the hard cases.

The benchmark scorecard

To train the model, the team assembled a massive dataset: 12 million images, 138 million text descriptions, and 785 million annotated boxes.

On speed, running on the same H100 GPU, it processes 12.7 boxes per second — versus 1.1 and 5.0 for the two comparable models used as baselines. On the strictest localization-accuracy benchmark, its score beats those rivals by a wide margin as well.

More importantly, the model is already public and free for anyone to download and use. For developers building robots or phone-automation tools, the bar for getting AI to "see fast and point accurately" just dropped a long way.

Advertisement

Related