Early Access — Scores are preliminary

BloxBench

The benchmark for measuring how well AI models understand Roblox — from Luau scripting and the Roblox API to game architecture and Studio workflows.

Rankings

Leaderboard

Overall scores across all Roblox knowledge categories. Scored out of 100. Higher is better.

GPT-5.3 Codex (High)

OpenAI

65.53

Claude Opus 4.6 (Thinking)

Anthropic

64.17

Sonnet 4.6

Anthropic

62.8

GLM-5

Zhipu AI

61.9

Gemini 3 Pro

Google

56.7

Gemini 3 Flash

Google

49.58

—

Grok 4.2

xAI

Pending

—

Kimi K2.5

Moonshot AI

Pending

—

MiniMax-M2.5

MiniMax

Pending

Detailed Results

Model Breakdown

Strengths, weaknesses, and key findings from each evaluated model.

GPT-5.3 Codex (High)

OpenAI

65.53

▼

Strengths

Strong instruction following and thoroughness in most task implementations.
Consistently solid server-authoritative design with clear validation boundaries.
Strong coverage across networking, datastore, UI, and security patterns.

Weaknesses

No significant weaknesses identified in this evaluation round.

Claude Opus 4.6 (Thinking)

Anthropic

64.17

▼

Strengths

Strong typed, deterministic core systems.
Strong server-authoritative combat/network logic.
Good registry and state lifecycle design.

Weaknesses

Weak anti-replay/anti-duplication in some flows.
Some cleanup and memory lifecycle issues.

Sonnet 4.6

Anthropic

62.8

GLM-5

Zhipu AI

61.9

Gemini 3 Pro

Google

56.7

Gemini 3 Flash

Google

49.58

▼

Strengths

Decent structure on simpler tasks.

Weaknesses

Weak on complex state machines and concurrency.
Several serious security/trust-boundary failures.
Weak edge-case and recovery handling.
Inconsistent performance/scalability control.

—

Grok 4.2

xAI

Pending

—

Kimi K2.5

Moonshot AI

Pending

—

MiniMax-M2.5

MiniMax

Pending

Evaluation Areas

What We Test

Each model is evaluated across six core areas of Roblox development knowledge.

Luau + API Correctness

Luau syntax, type annotations, idiomatic patterns, and correct usage of Roblox services, instances, methods, and events.

Client/Server Networking

RemoteEvents, RemoteFunctions, client-server architecture, replication, and secure communication patterns.

Debugging & Fault Fixing

Identifying bugs, reading error output, diagnosing common Roblox issues, and producing correct fixes.

DataStore Persistence

DataStoreService reliability, session locking, retry logic, data migration, and preventing data loss.

UI Engineering Quality

ScreenGui structure, UIListLayout, responsive scaling, tween animations, and polished player-facing interfaces.

Performance & Scalability

Optimization patterns, memory management, efficient loops, throttling, and building games that scale with player count.

How It Works

Methodology

A structured approach to evaluating Roblox-specific AI capabilities.

Prompt Construction

Questions crafted by experienced Roblox developers covering real-world scenarios and edge cases.

Zero-Shot Evaluation

Each model receives identical prompts with no prior context. Responses collected via official APIs.

Expert Grading

Responses scored by Roblox developers on correctness, completeness, and best practices.

This is an early alpha version. All scores are preliminary placeholders pending full evaluation. The question set, scoring, and model evaluations are actively being developed. Not affiliated with Roblox Corporation.