toolCallAccuracyScoreTool

@tpmjs/tools-tool-call-accuracy-score

Scores the accuracy of actual tool calls against expected tool calls in agent workflows. Returns a score (0-1), lists of correct/incorrect/missed/extra calls, and a detailed summary. Useful for testing and evaluating agent behavior.

Official · agent · v0.2.0 · MIT

Installation & Usage

Install this tool and use it with the AI SDK

1. Install the package

npm install @tpmjs/tools-tool-call-accuracy-score
pnpm add @tpmjs/tools-tool-call-accuracy-score
yarn add @tpmjs/tools-tool-call-accuracy-score
bun add @tpmjs/tools-tool-call-accuracy-score
deno add npm:@tpmjs/tools-tool-call-accuracy-score

2. Import the tool

import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';

3. Use with AI SDK

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';

const result = await generateText({
  model: openai('gpt-4o'),
  tools: { toolCallAccuracyScoreTool },
  prompt: 'Your prompt here...',
});

console.log(result.text);
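
The accuracy report itself comes back through the model's tool call rather than through result.text alone; the standard AI SDK result fields expose it:

console.log(result.toolCalls);   // the calls the model made
console.log(result.toolResults); // the report returned by toolCallAccuracyScoreTool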

Parameters

Available configuration options

expected (array, required)
Array of expected tool calls, each with a tool name and arguments.

actual (array, required)
Array of actual tool calls made, each with a tool name and arguments.
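
Both arrays share the same element shape, for example:

{ tool: 'searchWeb', args: { query: 'AI news' } }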


README

Tool Call Accuracy Score

Scores the accuracy of actual tool calls against expected tool calls in agent workflows. Useful for testing and evaluating agent behavior.

Installation

npm install @tpmjs/tools-tool-call-accuracy-score

Usage

import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';
import { generateText } from 'ai';

const result = await generateText({
  model: yourModel,
  tools: {
    scoreToolCalls: toolCallAccuracyScoreTool,
  },
  prompt: 'Score these tool calls...',
});

Direct Usage

import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';

const result = await toolCallAccuracyScoreTool.execute({
  expected: [
    { tool: 'searchWeb', args: { query: 'AI news' } },
    { tool: 'summarize', args: { text: 'long article...' } },
  ],
  actual: [
    { tool: 'searchWeb', args: { query: 'AI news' } },
    { tool: 'summarize', args: { text: 'different text' } },
    { tool: 'translateText', args: { text: 'hello', to: 'es' } },
  ],
});

console.log(result);
// {
//   score: 0.667,
//   totalExpected: 2,
//   totalActual: 3,
//   correctCalls: [{ expected: {...}, actual: {...}, status: 'correct', argsMatch: true }],
//   incorrectCalls: [{ expected: {...}, actual: {...}, status: 'incorrect', argsMatch: false }],
//   missedCalls: [],
//   extraCalls: [{ tool: 'translateText', args: {...} }],
//   summary: 'Accuracy Score: 66.7% | Correct: 1/2 | Incorrect: 1 | Missed: 0 | Extra: 1'
// }

Input Schema

{
  expected: Array<{
    tool: string;      // Name of the tool
    args: object;      // Arguments passed to the tool
  }>;
  actual: Array<{
    tool: string;      // Name of the tool
    args: object;      // Arguments passed to the tool
  }>;
}
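
The package's own schema definition is not shown here; as a hypothetical sketch, the same shape expressed with Zod (the validation library typically used for AI SDK tool schemas) might look like:

import { z } from 'zod';

// Hypothetical Zod equivalent of the input schema above
const toolCall = z.object({
  tool: z.string(),            // Name of the tool
  args: z.record(z.unknown()), // Arguments passed to the tool
});

const inputSchema = z.object({
  expected: z.array(toolCall),
  actual: z.array(toolCall),
});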

Output Schema

{
  score: number;                    // F1 score (0-1) based on precision and recall
  totalExpected: number;            // Number of expected tool calls
  totalActual: number;              // Number of actual tool calls made
  correctCalls: Array<{             // Calls that matched perfectly
    expected: ToolCall;
    actual: ToolCall;
    status: 'correct';
    argsMatch: true;
  }>;
  incorrectCalls: Array<{           // Calls with correct tool but wrong args
    expected: ToolCall;
    actual: ToolCall;
    status: 'incorrect';
    argsMatch: false;
    details: string;
  }>;
  missedCalls: Array<{              // Expected calls that weren't made
    expected: ToolCall;
    status: 'missed';
    argsMatch: false;
    details: string;
  }>;
  extraCalls: ToolCall[];           // Unexpected calls that were made
  summary: string;                  // Human-readable summary
}

Scoring Algorithm

The tool uses an F1 score approach:

  • Precision: correctCalls / totalActual - How many actual calls were correct?
  • Recall: correctCalls / totalExpected - How many expected calls were made?
  • F1 Score: 2 * (precision * recall) / (precision + recall) - Harmonic mean

This balances both making the right calls and avoiding extra/incorrect calls.
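
A minimal sketch of that formula (illustrative only, not the package's internal code):

function f1Score(correct: number, totalExpected: number, totalActual: number): number {
  if (totalExpected === 0 || totalActual === 0) return 0;
  const precision = correct / totalActual;
  const recall = correct / totalExpected;
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

// e.g. 2 correct calls against 3 expected and 2 actual:
// precision = 1, recall = 0.667, F1 = 0.8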

Matching Logic

  1. Perfect Match: Tool name and arguments match exactly → correctCalls
  2. Partial Match: Tool name matches but arguments differ → incorrectCalls
  3. No Match: Expected call not found in actual → missedCalls
  4. Extra: Actual call not matched to any expected → extraCalls

Arguments are compared using deep equality (recursive object comparison).
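
A recursive comparison along these lines (a sketch of the described behavior, not the package's exact implementation):

function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  // Every key in `a` must exist in `b` with a deeply equal value
  return keysA.every((key) =>
    deepEqual((a as Record<string, unknown>)[key], (b as Record<string, unknown>)[key])
  );
}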

Use Cases

  • Agent Testing: Validate that agents make the correct tool calls
  • Workflow Evaluation: Score agent workflows against expected behavior
  • Regression Testing: Ensure agent behavior doesn't degrade over time
  • A/B Testing: Compare different agent configurations
  • Quality Metrics: Track agent accuracy over time

Example: Testing a Research Agent

const expected = [
  { tool: 'searchWeb', args: { query: 'latest AI research 2024' } },
  { tool: 'fetchUrl', args: { url: 'https://arxiv.org/...' } },
  { tool: 'summarize', args: { maxLength: 500 } },
];

const actual = [
  { tool: 'searchWeb', args: { query: 'latest AI research 2024' } },
  { tool: 'fetchUrl', args: { url: 'https://arxiv.org/...' } },
  // Agent forgot to call summarize
];

const score = await toolCallAccuracyScoreTool.execute({ expected, actual });
// score.score = 0.8 (missed one expected call)
// score.missedCalls.length = 1
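
This pairs naturally with the regression-testing use case above: assert the score against a baseline so a degraded agent fails the suite. A sketch with Vitest (assuming Vitest as the test runner):

import { expect, test } from 'vitest';
import { toolCallAccuracyScoreTool } from '@tpmjs/tools-tool-call-accuracy-score';

test('research agent stays above the accuracy baseline', async () => {
  const expected = [{ tool: 'searchWeb', args: { query: 'latest AI research 2024' } }];
  const actual = [{ tool: 'searchWeb', args: { query: 'latest AI research 2024' } }];

  const result = await toolCallAccuracyScoreTool.execute({ expected, actual });

  // Fail the suite if accuracy drops below the current baseline
  expect(result.score).toBeGreaterThanOrEqual(0.8);
  expect(result.missedCalls).toHaveLength(0);
});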

License

MIT

Statistics

Downloads/month: 46
Quality Score: 78%

NPM Keywords

tpmjs
agent
ai
testing
accuracy

Maintainers

thomasdavis (thomasalwyndavis@gmail.com)

Frameworks

vercel-ai