Early Access — Scores are preliminary

BloxBench

The benchmark for measuring how well AI models understand Roblox — from Luau scripting and the Roblox API to game architecture and Studio workflows.

Leaderboard

Overall scores across all Roblox knowledge categories. Scored out of 100. Higher is better.

Rank  Model                       Provider     Score
1     GPT-5.3 Codex (High)        OpenAI       65.53
2     Claude Opus 4.6 (Thinking)  Anthropic    64.17
3     Sonnet 4.6                  Anthropic    62.8
4     GLM-5                       Zhipu AI     61.9
5     Gemini 3 Pro                Google       56.7
6     Gemini 3 Flash              Google       49.58
-     Grok 4.2                    xAI          Pending
-     Kimi K2.5                   Moonshot AI  Pending
-     MiniMax-M2.5                MiniMax      Pending

Model Breakdown

Strengths, weaknesses, and key findings from each evaluated model.

1. GPT-5.3 Codex (High), OpenAI, 65.53

Strengths

  • Strong instruction following and thoroughness in most task implementations.
  • Consistently solid server-authoritative design with clear validation boundaries.
  • Strong coverage across networking, datastore, UI, and security patterns.

Weaknesses

  • No significant weaknesses identified in this evaluation round.

2. Claude Opus 4.6 (Thinking), Anthropic, 64.17

Strengths

  • Strong typed, deterministic core systems.
  • Strong server-authoritative combat/network logic.
  • Good registry and state lifecycle design.

Weaknesses

  • Weak anti-replay/anti-duplication in some flows.
  • Some cleanup and memory lifecycle issues.

3. Sonnet 4.6, Anthropic, 62.8
4. GLM-5, Zhipu AI, 61.9
5. Gemini 3 Pro, Google, 56.7
6. Gemini 3 Flash, Google, 49.58

Strengths

  • Decent structure on simpler tasks.

Weaknesses

  • Weak on complex state machines and concurrency.
  • Several serious security/trust-boundary failures.
  • Weak edge-case and recovery handling.
  • Inconsistent performance/scalability control.

Grok 4.2, xAI, Pending
Kimi K2.5, Moonshot AI, Pending
MiniMax-M2.5, MiniMax, Pending

What We Test

Each model is evaluated across six core areas of Roblox development knowledge.

Luau + API Correctness

Luau syntax, type annotations, idiomatic patterns, and correct usage of Roblox services, instances, methods, and events.

Client/Server Networking

RemoteEvents, RemoteFunctions, client-server architecture, replication, and secure communication patterns.
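
To illustrate what a "secure communication pattern" means here, the following is a minimal, language-neutral sketch in Python rather than Luau, and it is not Roblox API code. It assumes a hypothetical purchase request arriving from a client; the names `PRICES`, `PlayerState`, and `handle_purchase` are invented for the example. The idea it shows is that the server treats every client payload as untrusted and derives prices from its own data.

```python
from dataclasses import dataclass

# Hypothetical server-side catalog; the client never supplies prices.
PRICES = {"sword": 100, "shield": 75}


@dataclass
class PlayerState:
    coins: int


def handle_purchase(state: PlayerState, payload: object) -> bool:
    """Validate an untrusted client request before mutating server state."""
    # The client can send any value, so check the shape first.
    if not isinstance(payload, dict):
        return False
    item = payload.get("item")
    if not isinstance(item, str) or item not in PRICES:
        return False
    # Look the price up on the server; ignore any price the client sent.
    price = PRICES[item]
    if state.coins < price:
        return False
    state.coins -= price
    return True
```

In a Roblox game the same checks would typically sit inside the server-side handler for a remote call, before any player data changes.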

Debugging & Fault Fixing

Identifying bugs, reading error output, diagnosing common Roblox issues, and producing correct fixes.

DataStore Persistence

DataStoreService reliability, session locking, retry logic, data migration, and preventing data loss.
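
As a rough illustration of the retry logic this category looks for, here is a generic sketch in Python, not Luau and not the DataStoreService API. The helper name `with_retries` and its parameters are assumptions made for the example. It retries a failing operation with exponential backoff and re-raises the last error instead of silently losing data.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    op: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying on failure with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            # Out of attempts: surface the error so the caller can react.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the backoff schedule can be checked without real waiting; a production version would likely add jitter and cap the delay.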

UI Engineering Quality

ScreenGui structure, UIListLayout, responsive scaling, tween animations, and polished player-facing interfaces.

Performance & Scalability

Optimization patterns, memory management, efficient loops, throttling, and building games that scale with player count.
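
One common way to implement the throttling mentioned above is a token bucket. The sketch below is in Python for illustration only and is not tied to any Roblox API; the class name `TokenBucket` and its parameters are assumptions for the example. Each caller may burst up to `capacity` actions, after which actions are allowed at `refill_per_sec`.

```python
class TokenBucket:
    """Per-caller rate limiter: allow short bursts, then a steady rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if an action at time `now` (seconds) is permitted."""
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing the clock value in explicitly keeps the limiter deterministic; a server would typically keep one bucket per player and per action type.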

Methodology

A structured approach to evaluating Roblox-specific AI capabilities.

01. Prompt Construction

Questions are crafted by experienced Roblox developers and cover real-world scenarios and edge cases.

02. Zero-Shot Evaluation

Each model receives identical prompts with no prior context. Responses are collected via official APIs.

03. Expert Grading

Responses are scored by Roblox developers on correctness, completeness, and best practices.

This is an early alpha version. All scores are preliminary placeholders pending full evaluation. The question set, scoring, and model evaluations are actively being developed. Not affiliated with Roblox Corporation.