AUDHD-Bench & FrieND-LLM: What It Is, Why Edge, and a First LFM Result

Tags: llm, benchmark, edge, neurodivergent, lfm
A 130-question benchmark for neurodivergent adult life management tasks, built for small on-device models. Architecture, reasoning behind the design, and the first data point from LFM2.5-350M.
Author

Laith Zumot

Published

April 19, 2026

Why this benchmark

Mainstream LLM benchmarks measure things I do not use an LLM for. MMLU, HumanEval, GPQA, SWE-Bench: they all probe an imagined worker with a clear task and a verifiable answer. That is a narrow slice of what I actually reach for a model to help with. My real usage looks like a disciplinary dispute with a school I do not want to burn bridges with, a performance conversation where tone and framing matter more than content, a child whose sensory day just fell apart at 4pm, and institutional negotiations where precision in language has material consequences.

Those situations have two properties that frontier evals do not touch. They are genuinely ambiguous, and the cost of a bad response is relational rather than factual. The model has to read implicit rules, hold multiple goals in tension, and produce something a tired adult can send or say without causing collateral damage.

AUDHD-Bench is 130 questions of exactly that. Ten categories, five binary criteria per question, a public set for sharing results and a private set that stays on my own hardware. Scoring is done by an LLM judge, each criterion gets a 0 or a 1, no partial credit.
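To make the scoring concrete, here is a minimal sketch of what one question plus its binary criteria could look like. The field names, prompt, and criteria text are hypothetical, not the benchmark's actual schema; only the scoring rule (five binary criteria, no partial credit) is from the design above.

```python
# Hypothetical shape of one AUDHD-Bench record; field names and criteria
# text are illustrative, not the benchmark's actual schema.
QUESTION = {
    "id": "workplace_strategy_01",
    "category": "workplace_strategy",
    "prompt": "Draft an email to the school that requests a meeting "
              "without escalating tone or mischaracterising facts.",
    "criteria": {
        "1a": "Positions the sender as seeking clarification, not complaining",
        "1b": "References specific facts rather than generalisations",
        "1c": "Holds a collaborative tone",
        "1d": "Creates a written record suitable for institutional reference",
        "1e": "Answers the literal request (an email, sendable as-is)",
    },
}

def score_question(judge_scores: dict) -> float:
    """Each criterion is 0 or 1, no partial credit; a question's score
    is the fraction of criteria the judge marked as met."""
    assert all(v in (0, 1) for v in judge_scores.values())
    return sum(judge_scores.values()) / len(judge_scores)
```

So a judge verdict of `{"1a": 1, "1b": 0, "1c": 1, "1d": 0, "1e": 1}` scores 0.6 on that question.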

The name comes from autism plus ADHD, since those are the lenses I live inside. The use cases are broader. Anyone who has to navigate an institution while regulated differently than the default will recognise most of the questions. The categories cover workplace strategy, parenting, pragmatic communication under institutional ambiguity, writing craft, learning, psychology and self-reflection, and career transition, to name a few.

Why only edge devices and quantizations

Two reasons.

The first is privacy. A benchmark of neurodivergent adult life management contains the most sensitive text I could write. My real questions contain real names, real calibration write-ups, real parenting specifics, real negotiation positions. Sending that to a frontier cloud model is the opposite of what I want the final product to be. The public set scrubs and fictionalises the identifying details. The private set never leaves my machine. The target model for evaluation has to be something I can run on hardware I control.

The second is where the field is actually going. Apple Intelligence already runs a roughly 3B-parameter on-device model and escalates only when a query genuinely needs cloud-scale compute. Gemini Nano on Pixel does similar. Liquid’s LFM2 family is architecturally designed for the edge, with hybrid convolution plus grouped query attention tuned under hardware-in-the-loop constraints. If the efficiency trajectory continues, and the 2026 releases suggest it will, then a sub-4B model handling most of the tool calls and reasoning locally, with cloud escalation reserved for genuinely hard problems, becomes the default architecture. Benchmarks that only measure 100B+ cloud models are testing the wrong thing for the world we are heading into.

Quantization sweeps matter because the same model at Q4, Q5, Q6, and Q8 behaves differently on judgment-heavy tasks. Nuance degrades before logic does under quantization. That is exactly the property I want to measure. If a model holds on workplace_strategy at Q4 but collapses on pragmatic_communication, that is information I can act on, and information anyone deploying a quantized model at the edge should have. We should benchmark quantized versions by default. Most local model users run Q4 or Q5 quantizations, not BF16.
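That "holds here, collapses there" comparison reduces to a small helper. The threshold and the scores below are made up for illustration, not real sweep numbers.

```python
# Sketch of the per-category comparison a quantization sweep enables.
# The drop threshold and the example scores are illustrative only.
def degraded_categories(ref: dict, quant: dict, drop: float = 0.10) -> list:
    """Categories whose score falls by more than `drop` (absolute)
    going from a reference quant (say Q8) to a lower one (say Q4)."""
    return sorted(c for c in ref if ref[c] - quant.get(c, 0.0) > drop)

q8 = {"workplace_strategy": 0.42, "pragmatic_communication": 0.40}
q4 = {"workplace_strategy": 0.41, "pragmatic_communication": 0.22}
# workplace_strategy holds at Q4; pragmatic_communication collapses
```

Here `degraded_categories(q8, q4)` returns `["pragmatic_communication"]`: exactly the actionable signal described above.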

How the sweep works

The architecture is deliberately dumb.

Target models serve an OpenAI-compatible endpoint via llama.cpp on a single small GPU. The eval script sends one prompt per question, captures the response, and hands it to the judge, Haiku, along with the five binary criteria for that question. Haiku returns a JSON object of ones and zeroes, which the script parses and stores. Results save incrementally with a resume flag, so a crash at question 80 does not cost the first 79.
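The incremental-save and resume behaviour can be sketched like this. `ask` and `judge` stand in for the real llama.cpp and judge calls, and the results-file layout is illustrative, not the repo's actual format.

```python
import json
from pathlib import Path

# Sketch of the eval loop's incremental-save and resume behaviour.
# `ask` and `judge` stand in for the real llama.cpp and judge calls.
def run_eval(questions, ask, judge, results_path: Path) -> dict:
    # Resume: load any scores already on disk and skip those questions.
    results = json.loads(results_path.read_text()) if results_path.exists() else {}
    for q in questions:
        if q["id"] in results:
            continue  # scored on a previous run; do not re-ask or re-judge
        response = ask(q["prompt"])
        results[q["id"]] = judge(q, response)  # e.g. {"1a": 1, "1b": 0, ...}
        # Write after every question, so a crash at question 80
        # costs nothing but question 80 itself.
        results_path.write_text(json.dumps(results, indent=2))
    return results
```

Rewriting the whole file per question is wasteful on paper and completely fine at 130 questions; boring beats clever here.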

A separate sweep script cycles through quantization levels for a given model family. It kills and restarts llama-server between runs, health-checks the new server, runs the eval, and moves on. A comparison script then aggregates the result JSONs into one summary per model and per category, which is what feeds into posts like this one.
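The health-check step between kill and eval is just a polling loop. In this sketch `probe` is injected so the logic is testable without a real server; in the actual sweep it would be an HTTP request against the restarted llama-server.

```python
import time

# Sketch of the sweep's health check: after restarting llama-server for
# the next quant, poll until it answers before starting the eval.
# `probe` is any zero-argument callable returning True once the server
# is up (in practice, an HTTP GET that succeeded).
def wait_healthy(probe, timeout: float = 60.0, interval: float = 2.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

If the server never comes up within the timeout, the sweep can skip or retry that quant instead of hanging on a dead endpoint.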

target (llama.cpp, AMD, OpenAI-compatible /v1)
     |
     | prompt -> response
     v
eval_audhd_bench.py
     |
     | question + response + criteria
     v
judge (constrained generation)
     |
     | {"scores": {"1a":1,"1b":0,...}}
     v
results/{model}.json
     |
     v
compare_results_sweep.py
     |
     v
sweep_summary.json  ->  blog
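The last step of the diagram, aggregation into a summary, is a small fold over the result files. The input shapes here are illustrative, not the repo's actual JSON.

```python
from collections import defaultdict

# Sketch of the comparison step: fold per-question judge scores into
# one per-category fraction, plus an overall row.
def summarise(results: dict, categories: dict) -> dict:
    """results: question id -> {criterion id: 0 or 1};
    categories: question id -> category name."""
    hit, total = defaultdict(int), defaultdict(int)
    for qid, scores in results.items():
        cat = categories[qid]
        hit[cat] += sum(scores.values())
        total[cat] += len(scores)
    summary = {c: hit[c] / total[c] for c in total}
    summary["overall"] = sum(hit.values()) / sum(total.values())
    return summary
```

Note the overall row is criterion-weighted (total criteria met over total criteria asked), not a mean of category means, so categories with more questions count for more.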

The whole setup is intentionally boring. No litellm, no async, no orchestration frameworks. The OpenAI SDK for requests, bash for the loop. If it breaks, I know where.

The entire pipeline was built by me, not generated. I used my ND-first AI Rubber Duck approach: pair programming with a code model where the model read, listened, and played Socrates while I drove the design and implementation. This pattern was inspired by Answer.ai’s Solve It methodology, adapted for my own workflow.

First LFM run

The first target is LFM2.5-350M at BF16. It is the smallest model in the lineup by a wide margin, and the point of running it first was to confirm that the pipeline works end to end. The score was a secondary concern.

Initial results on the first batch of workplace_strategy questions land in the 20 to 30 percent range per question. For a 350-million-parameter model answering prompts that ask for an email without escalating tone and without mischaracterising facts, scoring that low on binary relational criteria is roughly what I expected.

What the model misses is the interesting part. It gets the literal request right most of the time, meaning it usually produces an email with reasonable structure. The relational criteria are where it falls down. Positioning the sender as seeking clarification rather than filing a complaint. Referencing specific facts rather than generalisations. Holding a collaborative tone while still creating a written record suitable for institutional reference. Those are exactly the skills a neurodivergent adult often reaches for help with. At 350M parameters, BF16, zero-shot, the model cannot do them yet.

Below are the per-category scores for LFM2.5-350M with default parameters. Some categories are marked private_category to protect sensitive evaluation domains.

Category                     Score
pragmatic_communication        42%
private_category               36%
parenting_neurodivergent       34%
writing                        32%
psychology_reflection          27%
private_category               24%
workplace_strategy             22%
learning                       20%
private_category               18%
career_transition              16%
private_category                6%
overall                        25%

A quantization sweep on a larger base model in the 4B to 8B range is next.

What comes next

Four threads, ordered by how close I am to each one.

The nearest is a WASM build of a small model running entirely in the browser, with a pared-down set of AUDHD-Bench questions the visitor can run against it live. If the argument of this benchmark is that neurodivergent-support reasoning belongs on edge hardware, the blog post should let you feel that rather than just read about it. The visitor types into their own browser tab, the model runs locally, and the page shows the response plus the judge’s binary scores.

The second thread is fine-tuning. The eventual goal of AUDHD-Bench is a training dataset. The eval harness is scaffolding to get there. Once the questions stabilise and the criteria are calibrated across several models, the next move is to turn high-scoring responses into a preference dataset and fine-tune a small base against them. The product worth building is a sub-4B model that genuinely handles pragmatic communication under institutional ambiguity.

The third thread is the on-device experience. A phone-side assistant that knows your actual inbox, your actual calendar, your actual parenting history, without any of that leaving the device. Apple and Google have both shown the architecture works in narrow domains. Extending it to the reasoning this benchmark measures is a harder problem, but it is the one that matters.

The fourth thread is audio. The modality mismatch for a lot of neurodivergent adults is that the moment you most need help is the moment you least want to type. A voice interface to a small on-device reasoner, with the bench acting as the evaluation layer for voice-first interactions, is where the personal and technical threads I have been pulling on for a while actually meet. This one is the furthest out.

More posts as runs come in. The repo is private, and the questions will stay private until further notice. The scoring methodology, details on the category list, and the overall numbers will land here first.