How to use from the MLX library
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("mlx-community/LocateAnything-3B-8bit")
config = load_config("mlx-community/LocateAnything-3B-8bit")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)

mlx-community/LocateAnything-3B-8bit

MLX 8-bit (~9.4 bits/weight) conversion of nvidia/LocateAnything-3B, a vision-language model for fast, high-quality visual grounding (object detection, referring-expression grounding, pointing, GUI/text localization). Converted with mlx-vlm for Apple Silicon.

Grounding output is byte-identical to the bf16 model in our tests.

Requirements

Note: LocateAnything support in mlx-vlm currently lives in a pull request and is not yet part of a released version of mlx-vlm. Until it merges, install from the branch that adds the locateanything model:

pip install "git+https://github.com/beshkenadze/mlx-vlm@feat/locateanything-3b"

Usage

python -m mlx_vlm.generate --model mlx-community/LocateAnything-3B-8bit \
  --image http://images.cocodataset.org/val2017/000000039769.jpg \
  --prompt "Detect all objects in the image." --max-tokens 128 --temperature 0.0

The output consists of structured coordinate tokens, e.g. <ref>remote</ref><box><64><152><273><244></box>, with coordinates quantized to <0>..<1000> (normalized). Two decoding modes are available via generation_mode: autoregressive (slow, default) and Parallel Box Decoding (fast/hybrid, ~2x faster).
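To work with the boxes downstream, the coordinate tokens have to be parsed and rescaled to pixels. The following is a minimal sketch, not part of mlx-vlm: the helper name parse_boxes and the Box type are hypothetical, and it assumes the four values are ordered x1, y1, x2, y2 and that 1000 maps to the full image width or height. Neither assumption is confirmed by this card, so check them against the upstream model documentation before relying on the result.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """One grounded region in pixel coordinates (hypothetical helper type)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


# Matches <ref>label</ref><box><a><b><c><d></box>, tolerating whitespace.
_BOX_PATTERN = re.compile(
    r"<ref>(.*?)</ref>\s*<box>\s*<(\d+)><(\d+)><(\d+)><(\d+)>\s*</box>"
)


def parse_boxes(text: str, width: int, height: int, scale: int = 1000) -> List[Box]:
    """Extract labelled boxes from model output and rescale them to pixels.

    ASSUMPTION: the values are ordered x1, y1, x2, y2 and `scale` (1000)
    corresponds to the full image width/height.
    """
    boxes = []
    for match in _BOX_PATTERN.finditer(text):
        label = match.group(1)
        a, b, c, d = (int(v) for v in match.groups()[1:])
        boxes.append(
            Box(
                label,
                a * width / scale,
                b * height / scale,
                c * width / scale,
                d * height / scale,
            )
        )
    return boxes


if __name__ == "__main__":
    sample = "<ref>remote</ref><box><64><152><273><244></box>"
    # Image size is an assumption for illustration; pass the real dimensions.
    for box in parse_boxes(sample, width=640, height=480):
        print(box)
```

Text that contains no matching tokens yields an empty list, so the helper can be applied to any generation output without extra checks.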

Attribution & license

  • Derived from nvidia/LocateAnything-3B, released under the NVIDIA License: non-commercial, research/academic use only (commercial use is not permitted except by NVIDIA). Redistribution must retain this license and attribution.
  • Vision encoder: MoonViT-SO-400M (MIT). Language model: Qwen2.5-3B-Instruct (Qwen Research License). Part of the Eagle VLM family.

The LICENSE file from the source model is included in this repo.

