@tpmjs/tools-tool-call-accuracy-score
Scores the accuracy of actual tool calls against expected tool calls in agent workflows. Returns a score (0-1), lists of correct/incorrect/missed/extra calls, and a detailed summary. Useful for testing and evaluating agent behavior.
Install this tool and use it with the AI SDK
npm install @tpmjs/tools-tool-call-accuracy-score
pnpm add @tpmjs/tools-tool-call-accuracy-score
yarn add @tpmjs/tools-tool-call-accuracy-score
bun add @tpmjs/tools-tool-call-accuracy-score
deno add npm:@tpmjs/tools-tool-call-accuracy-score

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';

const result = await generateText({
  model: openai('gpt-4o'),
  tools: { toolCallAccuracyScoreTool },
  prompt: 'Your prompt here...',
});

console.log(result.text);

Available configuration options
expected (array): Expected tool calls, each with a tool name and arguments
actual (array): Actual tool calls made, each with a tool name and arguments
Scores the accuracy of actual tool calls against expected tool calls in agent workflows. Useful for testing and evaluating agent behavior.
npm install @tpmjs/tools-tool-call-accuracy-score
import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';
import { generateText } from 'ai';

const result = await generateText({
  model: yourModel,
  tools: {
    scoreToolCalls: toolCallAccuracyScoreTool,
  },
  prompt: 'Score these tool calls...',
});
import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';

const result = await toolCallAccuracyScoreTool.execute({
  expected: [
    { tool: 'searchWeb', args: { query: 'AI news' } },
    { tool: 'summarize', args: { text: 'long article...' } },
  ],
  actual: [
    { tool: 'searchWeb', args: { query: 'AI news' } },
    { tool: 'summarize', args: { text: 'different text' } },
    { tool: 'translateText', args: { text: 'hello', to: 'es' } },
  ],
});

console.log(result);
// {
//   score: 0.667,
//   totalExpected: 2,
//   totalActual: 3,
//   correctCalls: [{ expected: {...}, actual: {...}, status: 'correct', argsMatch: true }],
//   incorrectCalls: [{ expected: {...}, actual: {...}, status: 'incorrect', argsMatch: false }],
//   missedCalls: [],
//   extraCalls: [{ tool: 'translateText', args: {...} }],
//   summary: 'Accuracy Score: 66.7% | Correct: 1/2 | Incorrect: 1 | Missed: 0 | Extra: 1'
// }
{
  expected: Array<{
    tool: string;  // Name of the tool
    args: object;  // Arguments passed to the tool
  }>;
  actual: Array<{
    tool: string;  // Name of the tool
    args: object;  // Arguments passed to the tool
  }>;
}
{
  score: number;           // F1 score (0-1) based on precision and recall
  totalExpected: number;   // Number of expected tool calls
  totalActual: number;     // Number of actual tool calls made
  correctCalls: Array<{    // Calls that matched perfectly
    expected: ToolCall;
    actual: ToolCall;
    status: 'correct';
    argsMatch: true;
  }>;
  incorrectCalls: Array<{  // Calls with the correct tool but wrong args
    expected: ToolCall;
    actual: ToolCall;
    status: 'incorrect';
    argsMatch: false;
    details: string;
  }>;
  missedCalls: Array<{     // Expected calls that weren't made
    expected: ToolCall;
    status: 'missed';
    argsMatch: false;
    details: string;
  }>;
  extraCalls: ToolCall[];  // Unexpected calls that were made
  summary: string;         // Human-readable summary
}
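As a rough illustration of how calls could end up in these buckets, here is a sketch of one possible matching strategy. This is not the package's source: names like bucketCalls are hypothetical, the real pairing of duplicate calls may differ, and JSON.stringify is used here as a shortcut for the deep argument comparison described below.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Illustrative bucketing: pair each expected call with the first unused actual
// call that has the same tool name, then compare arguments.
function bucketCalls(expected: ToolCall[], actual: ToolCall[]) {
  const used = new Set<number>();
  const correct: ToolCall[] = [];
  const incorrect: ToolCall[] = [];
  const missed: ToolCall[] = [];

  for (const exp of expected) {
    const idx = actual.findIndex((a, i) => !used.has(i) && a.tool === exp.tool);
    if (idx === -1) {
      missed.push(exp); // no actual call used this tool
      continue;
    }
    used.add(idx);
    // Shortcut comparison; the real tool uses deep structural equality.
    const argsMatch =
      JSON.stringify(actual[idx].args) === JSON.stringify(exp.args);
    (argsMatch ? correct : incorrect).push(exp);
  }

  // Actual calls never paired with an expected call are "extra".
  const extra = actual.filter((_, i) => !used.has(i));
  return { correct, incorrect, missed, extra };
}

// Example: one correct, one missed, one extra call.
const buckets = bucketCalls(
  [{ tool: 'searchWeb', args: { q: 'x' } }, { tool: 'summarize', args: {} }],
  [{ tool: 'searchWeb', args: { q: 'x' } }, { tool: 'translateText', args: {} }],
);
console.log(buckets.correct.length, buckets.missed.length, buckets.extra.length); // 1 1 1
```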
The tool uses an F1 score approach:
Precision: correctCalls / totalActual (how many actual calls were correct?)
Recall: correctCalls / totalExpected (how many expected calls were made?)
F1: 2 * (precision * recall) / (precision + recall) (harmonic mean)

This balances making the right calls against avoiding extra or incorrect calls.
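The precision/recall/F1 arithmetic above can be sketched directly. This is an illustrative sketch, not the package's source; the package's exact handling of partially-matched calls may differ.

```typescript
// F1 score from counts of correct, actual, and expected tool calls
// (sketch of the formula described above).
function f1Score(
  correct: number,
  totalActual: number,
  totalExpected: number,
): number {
  if (totalActual === 0 || totalExpected === 0) return 0;
  const precision = correct / totalActual; // fraction of actual calls that were correct
  const recall = correct / totalExpected;  // fraction of expected calls that were made
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall); // harmonic mean
}

// Matches the missed-call example below: 2 correct of 2 actual, 3 expected.
console.log(f1Score(2, 2, 3).toFixed(1)); // "0.8"
```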
Each call is classified into one of four buckets: correctCalls, incorrectCalls, missedCalls, or extraCalls. Arguments are compared using deep equality (recursive object comparison).
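Deep equality over JSON-like argument objects can be sketched as follows. This is illustrative only; the package's implementation may handle additional cases.

```typescript
// Recursive structural equality over plain JSON-like values:
// primitives compare with ===, objects/arrays compare key by key.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  if (keysA.length !== keysB.length) return false;
  return keysA.every((key) => deepEqual(objA[key], objB[key]));
}

console.log(deepEqual({ query: 'AI news' }, { query: 'AI news' })); // true
console.log(deepEqual({ query: 'AI news' }, { query: 'other' }));   // false
```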
const expected = [
  { tool: 'searchWeb', args: { query: 'latest AI research 2024' } },
  { tool: 'fetchUrl', args: { url: 'https://arxiv.org/...' } },
  { tool: 'summarize', args: { maxLength: 500 } },
];

const actual = [
  { tool: 'searchWeb', args: { query: 'latest AI research 2024' } },
  { tool: 'fetchUrl', args: { url: 'https://arxiv.org/...' } },
  // Agent forgot to call summarize
];

const score = await toolCallAccuracyScoreTool.execute({ expected, actual });
// score.score = 0.8 (missed one expected call)
// score.missedCalls.length = 1
MIT