We put 5 AI models to the KPI test: Here's what happened


AI is everywhere these days, seamlessly becoming part of our daily lives. Just look at how many apps use AI - from streaming platforms giving personalised recommendations, to voice assistants that understand and carry out our commands, to smart home devices taking care of household chores. Beyond entertainment and convenience, AI is making waves in crucial areas like healthcare, where it helps diagnose diseases, and finance, where it detects fraud and manages risk. This isn't just a fad - AI is changing the game across industries and revolutionising our lives.

AI helps us work smarter in the business world by providing insights and tools that enhance decision-making, streamline operations, and drive better results. A key area where AI can make a big difference is in setting and achieving key performance indicators (KPIs). AI helps businesses set more accurate and meaningful KPIs aligned with their specific goals. By using AI to improve KPI setting, companies can focus on the right metrics, optimise their performance, and stay agile. 

We decided to run an experiment to see which of five AI models - ChatGPT, Claude, Gemini, Perplexity, and Copilot - is the best at helping set customer service KPIs. Read on for all the details and results…

Methodology

To ensure a thorough and unbiased evaluation of AI models for determining and setting KPIs, we used the following methodology: 

1. Selection of prompts

Develop a set of clear and concise prompts related to customer service. For this experiment, these prompts will be as follows: 

  • Test 1 - Identifying KPIs: "What are 10 KPIs I should be tracking for customer service in my business?" 
  • Test 2 - Clarifying a KPI definition: "I don’t understand [select metric]. Explain what this is and why it’s important." 
  • Test 3 - Identifying tools for tracking KPIs: "What tools can I use to track my KPIs effectively against my goals?" 
  • Test 4 - KPI benchmarks and targets: "Provide a list of benchmarks and realistic targets for each of these KPIs, including sources/citations."

2. Testing and analysis

Input the same prompts into each AI model, ensuring consistency across all tests. The first prompt will be entered into a new chat for each AI model, and the subsequent prompts will be entered into the same chat to maintain context clarity, keep conversations focused, and avoid information spillover. Record and analyse the responses generated and compare the performance of each AI model.

3. Evaluation and scoring

Evaluation criteria will vary for each test and will be outlined in the results below. Numerical scores will be given for each criterion and tallied for an overall score in each test. The overall scores for each test will be added up to give a total score and ranking. The AI model with the highest total score will be the winner.
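The tallying is straightforward, but for concreteness here is a minimal sketch of the scheme. The Test 1 criterion scores are taken from the results below; the code structure itself is illustrative only:

```python
# Criterion scores per model per test (Test 1 figures from the results below).
# In the full experiment, each model would have entries for all four tests.
test_scores = {
    "ChatGPT":    {"Test 1": [4, 4, 3]},   # = 11/15
    "Claude":     {"Test 1": [4, 4, 4]},   # = 12/15
    "Gemini":     {"Test 1": [4, 5, 5]},   # = 14/15
    "Perplexity": {"Test 1": [4, 4, 3]},   # = 11/15
    "Copilot":    {"Test 1": [4, 5, 4]},   # = 13/15
}

def total_score(tests):
    """Sum criterion scores within each test, then sum the test totals."""
    return sum(sum(criteria) for criteria in tests.values())

# Highest total score wins.
ranking = sorted(test_scores, key=lambda m: total_score(test_scores[m]), reverse=True)
```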

Test 1 - Identifying KPIs

Prompt: What are 10 KPIs I should be tracking for customer service in my business?

Purpose

This prompt is fundamental for establishing a baseline understanding of which KPIs are essential for monitoring the effectiveness and efficiency of customer service operations.

Insights

Consistencies and differences in KPIs

All AI models agreed on core KPIs like First Response Time, Average Resolution Time, CSAT, and NPS, suggesting these metrics are widely regarded as essential for evaluating customer service performance.

ChatGPT and Claude kept it general, while Gemini broke it down into resolution rates, response times, customer effort, efficiency and volume, and loyalty and advocacy. Perplexity added in Call Abandonment Rate and Knowledge Base Views, focusing on call centre stats and self-service. Copilot stressed Consistent Resolutions across channels, showing the importance of a consistent customer experience.
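The two timing metrics every model agreed on can be computed directly from ticket timestamps. A minimal sketch, using hypothetical ticket records (the field names are our own, not from any particular helpdesk tool):

```python
from datetime import datetime

# Hypothetical ticket records; field names are illustrative only.
tickets = [
    {"created": datetime(2024, 5, 1, 9, 0),
     "first_reply": datetime(2024, 5, 1, 9, 45),
     "resolved": datetime(2024, 5, 1, 12, 0)},
    {"created": datetime(2024, 5, 1, 10, 0),
     "first_reply": datetime(2024, 5, 1, 10, 15),
     "resolved": datetime(2024, 5, 2, 10, 0)},
]

def avg_hours(deltas):
    """Average a list of timedeltas, expressed in hours."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600

# First Response Time: created -> first agent reply.
first_response_time = avg_hours([t["first_reply"] - t["created"] for t in tickets])

# Average Resolution Time: created -> resolved.
average_resolution_time = avg_hours([t["resolved"] - t["created"] for t in tickets])
```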

Unique KPIs 

ChatGPT: Service Level Agreement (SLA) Compliance, Quality Assurance (QA) Score

Claude: Ticket Volume, Ticket Backlog, Employee Engagement

Gemini: Average Handle Time (AHT)

Perplexity: Cost per Resolution, Agent Utilisation Rate

Copilot: Consistent Resolutions, Cost Per Conversation

Level of detail and explanation

ChatGPT and Claude give short descriptions of each KPI without adding much context. Gemini goes into more detail, explaining how to understand the KPIs and what they mean for customer service. Perplexity backs up its KPI suggestions with citations and references. Copilot uses practical examples to show why certain KPIs matter.

Evaluation criteria

Understanding of the prompt: 

How well does the AI comprehend the nuances and context of the queries? [Poor understanding (1), Limited (2), Moderate (3), Good (4), Excellent (5)]

Accuracy of insights and recommendations: 

How closely do the AI-generated outputs align with established best practices and industry standards? [Highly inaccurate (1), Moderately inaccurate (2), Neutral (3), Mostly accurate (4), Highly accurate (5)]

Effectiveness in guiding KPI setting: 

How well does the AI assist the user in defining relevant and actionable KPIs? [Ineffective (1), Somewhat effective (2), Moderately effective (3), Effective (4), Highly effective (5)]

Outputs and scores

ChatGPT

  • Understanding of the prompt: 4 (Good)

ChatGPT demonstrates a good understanding of the prompt by providing a comprehensive list of customer service KPIs.

  • Accuracy of insights and recommendations: 4 (Mostly accurate)

The KPIs mentioned by ChatGPT align well with industry standards and best practices.

  • Effectiveness in guiding KPI setting: 3 (Moderately effective)

While ChatGPT provides a solid list of KPIs, it lacks detailed explanations or guidance on how to prioritise and implement them effectively.

Total score = 11/15

Claude

  • Understanding of the prompt: 4 (Good)

Claude understands the prompt well, delivering a well-structured list of customer service KPIs.

  • Accuracy of insights and recommendations: 4 (Mostly accurate)

The KPIs provided by Claude are mostly accurate and in line with industry expectations.

  • Effectiveness in guiding KPI setting: 4 (Effective)

Claude offers helpful context on how to use the KPIs to gain insights and make data-driven decisions, enhancing its effectiveness in guiding KPI setting.

Total score = 12/15

Gemini

  • Understanding of the prompt: 4 (Good)

Gemini demonstrates a good understanding of the prompt, categorising the KPIs and providing detailed explanations.

  • Accuracy of insights and recommendations: 5 (Highly accurate)

The KPIs and insights provided by Gemini are highly accurate and closely align with industry best practices.

  • Effectiveness in guiding KPI setting: 5 (Highly effective)

Gemini's categorised approach and detailed explanations make it highly effective in guiding users to set relevant and actionable KPIs.

Total score = 14/15

Perplexity

  • Understanding of the prompt: 4 (Good)

Perplexity demonstrates a good understanding of the prompt by providing a diverse set of customer service KPIs.

  • Accuracy of insights and recommendations: 4 (Mostly accurate)

The KPIs mentioned by Perplexity are mostly accurate and supported by citations from reputable sources.

  • Effectiveness in guiding KPI setting: 3 (Moderately effective)

While Perplexity offers a solid list of KPIs, it lacks guidance on how to prioritise and implement them effectively in a specific business context.

Total score = 11/15

Copilot

  • Understanding of the prompt: 4 (Good)

Copilot exhibits a good understanding of the prompt, providing a mix of common and unique KPIs with practical examples.

  • Accuracy of insights and recommendations: 5 (Highly accurate)

The KPIs and insights provided by Copilot are highly accurate and closely align with industry standards and best practices.

  • Effectiveness in guiding KPI setting: 4 (Effective)

Copilot's use of practical examples and emphasis on consistency across channels make it effective in guiding users to set meaningful KPIs.

Total score = 13/15

Test 1 chart

Test 1 WINNER = Gemini 🏆

Test 2 - Clarifying a KPI definition

Since all AI models listed Net Promoter Score (NPS) as a key metric, we will use this for our second test. 

Prompt: I don’t understand Net Promoter Score (NPS). Explain what this is and why it’s important.

Purpose

Understanding each KPI is crucial for proper implementation and analysis. This prompt tests the AI’s capability to break down complex concepts into understandable terms.

Insights

Consistency in NPS definition and calculation

All AI models provide a clear and consistent definition of NPS. They emphasise that NPS measures customer loyalty and the likelihood of recommending a company's products or services. The models unanimously explain how NPS is calculated by subtracting the percentage of detractors from the percentage of promoters.

All models consistently categorise NPS responses into three groups: Promoters (score 9-10), Passives (score 7-8), and Detractors (score 0-6). The descriptions of each category are similar across the models, highlighting the potential impact of each group on business growth and reputation.
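The calculation the models describe is simple enough to express in a few lines. A minimal sketch with hypothetical survey responses:

```python
def net_promoter_score(responses):
    """NPS = % promoters (9-10) minus % detractors (0-6), on a -100 to +100 scale.

    Passives (7-8) count toward the total but neither add nor subtract.
    """
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return round(100 * (promoters - detractors) / len(responses))

# Hypothetical survey responses: 4 promoters, 3 passives, 3 detractors.
scores = [10, 9, 8, 7, 6, 10, 3, 9, 8, 5]
nps = net_promoter_score(scores)  # 40% promoters - 30% detractors = 10
```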

Importance of NPS

The AI models collectively stress the significance of NPS in measuring customer loyalty, identifying brand advocates, and driving business growth through word-of-mouth referrals. They also mention the role of NPS in benchmarking against industry standards or competitors and providing insights for improving the customer experience.

Simplicity and actionability

Gemini and Copilot specifically highlight the simplicity of NPS, as it is based on a single question, making it easy for customers to answer and businesses to track. Claude and Perplexity emphasise the actionability of NPS, as it enables businesses to identify areas for improvement and make data-driven decisions to enhance customer satisfaction.

Additional observations

ChatGPT and Gemini mention the potential range of NPS scores from -100 to +100, providing context for interpreting the results. Perplexity includes citations and references to support its explanations, adding credibility to the information. Copilot uses a mathematical formula to illustrate the NPS calculation, enhancing clarity for users.

Consistency in key takeaways

All models conclude that a higher NPS indicates stronger customer loyalty, while a lower score suggests potential issues that must be addressed. They consistently emphasise the importance of tracking NPS over time and using the insights gained to improve customer experience and business performance.

Evaluation criteria

Accuracy of NPS definition and calculation:

Does the AI model provide an accurate definition of NPS and correctly explain the calculation of NPS by subtracting the percentage of detractors from the percentage of promoters? Are the NPS response categories (Promoters, Passives, Detractors) accurately described? [Inaccurate (1), Partially accurate (2), Mostly accurate (3), Fully accurate (4)]

Clarity and comprehensiveness of NPS explanation:

Is the explanation of NPS clear, concise, and easy to understand? Does the AI model provide a comprehensive overview of NPS, including its purpose and significance for businesses? Are the key concepts and implications of NPS thoroughly explained? [Unclear and incomplete (1), Somewhat clear but lacks depth (2), Clear and moderately comprehensive (3), Exceptionally clear and comprehensive (4)]

Relevance and usefulness of NPS insights:

Does the AI model provide relevant insights into how NPS can be used to drive business success? Are the insights and recommendations practical and actionable for businesses looking to implement NPS as a key performance indicator? Does the model offer valuable information on interpreting and acting upon NPS results? [Irrelevant and impractical (1), Somewhat relevant but limited usefulness (2), Relevant and moderately useful (3), Highly relevant and extremely useful (4)]

Credibility and supportiveness of information

Does the AI model provide credible information about NPS, backed by industry knowledge or relevant citations? Are the claims and recommendations made by the model well-supported and trustworthy? Does the model include additional resources or references to enhance the credibility of its responses? [Lacks credibility and support (1), Somewhat credible but limited support (2), Credible and moderately supported (3), Highly credible and well-supported (4)]

Outputs and scores

ChatGPT

  • Accuracy of NPS definition and calculation: 4 (Fully accurate)

ChatGPT provides an accurate definition of NPS and correctly explains the calculation process, including categorising responses.

  • Clarity and comprehensiveness of NPS explanation: 4 (Exceptionally clear and comprehensive)

The explanation of NPS is clear, concise, and easy to understand. It covers its purpose and significance for businesses comprehensively.

  • Relevance and usefulness of NPS insights: 3 (Relevant and moderately useful)

ChatGPT offers relevant insights into how NPS can drive business success, but the recommendations could be more practical and actionable.

  • Credibility and supportiveness of information: 3 (Credible and moderately supported)

The information provided is credible and aligned with industry knowledge, but there are no direct citations or additional resources to enhance credibility.

Total score = 14/16

Claude

  • Accuracy of NPS definition and calculation: 4 (Fully accurate)

Claude accurately defines NPS, explains the calculation process, and describes the response categories correctly.

  • Clarity and comprehensiveness of NPS explanation: 4 (Exceptionally clear and comprehensive)

The explanation of NPS is clear, concise, and comprehensive, covering its purpose, significance, and key concepts thoroughly.

  • Relevance and usefulness of NPS insights: 4 (Highly relevant and extremely useful)

Claude provides highly relevant insights into how NPS can drive business success, with practical and actionable recommendations for implementing NPS as a KPI.

  • Credibility and supportiveness of information: 3 (Credible and moderately supported)

The information provided is credible and aligns with industry knowledge, but there are no direct citations or additional resources to support the claims.

Total score = 15/16

Gemini

  • Accuracy of NPS definition and calculation: 4 (Fully accurate)

Gemini accurately defines NPS, explains the calculation process, and describes the response categories correctly.

  • Clarity and comprehensiveness of NPS explanation: 3 (Clear and moderately comprehensive)

The explanation of NPS is clear and easy to understand, but it could provide more depth in exploring the key concepts and implications.

  • Relevance and usefulness of NPS insights: 3 (Relevant and moderately useful)

Gemini offers relevant insights into how NPS can be used to understand customer sentiment and identify areas for improvement, but the recommendations could be more actionable.

  • Credibility and supportiveness of information: 3 (Credible and moderately supported)

The information provided is credible and aligns with industry knowledge, but there are no direct citations or additional resources to enhance credibility.

Total score = 13/16

Perplexity

  • Accuracy of NPS definition and calculation: 4 (Fully accurate)

Perplexity accurately defines NPS, explains the calculation process, and describes the response categories correctly.

  • Clarity and comprehensiveness of NPS explanation: 4 (Exceptionally clear and comprehensive)

The explanation of NPS is clear, comprehensive, and easy to understand. It thoroughly covers its purpose, significance, and key concepts.

  • Relevance and usefulness of NPS insights: 4 (Highly relevant and extremely useful)

Perplexity provides highly relevant insights into how NPS can track customer loyalty, benchmark against competitors, and drive business success with practical recommendations.

  • Credibility and supportiveness of information: 4 (Highly credible and well-supported)

The information provided is highly credible, backed by relevant citations and additional resources, enhancing the trustworthiness of the responses.

Total score = 16/16

Copilot

  • Accuracy of NPS definition and calculation: 4 (Fully accurate)

Copilot accurately defines NPS, explains the calculation process, and describes the response categories correctly.

  • Clarity and comprehensiveness of NPS explanation: 3 (Clear and moderately comprehensive)

The explanation of NPS is clear and easy to understand, but it could provide more depth in exploring the key concepts and implications.

  • Relevance and usefulness of NPS insights: 3 (Relevant and moderately useful)

Copilot offers relevant insights into how NPS can be used to gauge customer loyalty and identify areas for improvement, but the recommendations could be more actionable.

  • Credibility and supportiveness of information: 3 (Credible and moderately supported)

The information provided is credible and aligns with industry knowledge, but there are no direct citations or additional resources to enhance credibility.

Total score = 13/16

Test 2 chart

Test 2 WINNER = Perplexity 🏆

Test 3 - Identifying tools for tracking KPIs

Prompt: What tools can I use to track my KPIs effectively against my goals?

Purpose

Tools for tracking KPIs are essential for data collection, analysis, and reporting. The right tools can significantly enhance your ability to monitor performance, identify trends, and make data-driven decisions. This prompt evaluates the AI’s knowledge of the available tools and their functionalities.

Insights

Tools

The AI models recommend different tools for tracking KPIs against goals, such as goal-tracking platforms, business intelligence (BI) tools, spreadsheet software, project management tools, and specialised KPI tracking software. While some models mention common tools like Tableau, Power BI, and spreadsheet software, each model also introduces unique tools, providing a comprehensive overview of available options.

Gemini categorises the tools into basic, intermediate, and advanced levels based on their functionalities and complexity, making it easier for users to select the most appropriate tool. Perplexity groups the tools into categories such as dashboarding tools, spreadsheet software, and dedicated KPI tracking software, focusing on their primary functionalities.

Claude and Copilot emphasise the importance of goal-tracking platforms that are specifically designed for setting, tracking, and managing goals and KPIs across organisations. Gemini and ChatGPT also mention goal-tracking platforms but provide more specific examples like Tability and Weekdone, highlighting their unique features and benefits.

Integration and customisation

All models stress the importance of selecting tools that integrate well with existing data sources and systems to ensure seamless data collection and analysis. Claude and Gemini mention the need for customisation options to adapt the tools to unique business requirements and effectively visualise progress against goals.

Collaboration and communication

Claude, Gemini, and Perplexity highlight the importance of collaboration and communication features in goal-tracking tools to facilitate teamwork, accountability, and sharing of insights. Copilot mentions Hive as a tool that allows collaboration on tasks, chat updates, and meeting notes, emphasising the importance of collaborative features in goal-tracking platforms.

Best practices and considerations

Perplexity provides a list of best practices for effective KPI tracking, such as selecting relevant KPIs tied to business goals, using visual dashboards, regularly reviewing and updating KPIs, fostering a culture of accountability, and balancing simplicity and depth in reporting. Claude and Gemini also mention factors to consider when selecting tools, such as alignment with goal-setting requirements, ease of use, and scalability.

Evaluation criteria

Comprehensiveness:

Does the AI model provide a wide range of tools and platforms, covering various categories such as goal-tracking platforms, BI tools, spreadsheet software, project management tools, and specialised KPI tracking software? Are the recommended tools diverse enough to cater to different business needs and preferences? [Limited (1), Moderate (2), Comprehensive (3)]

Relevance and specificity:

Are the recommended tools and platforms relevant to the task of tracking KPIs against goals? Does the AI model provide specific examples of tools and platforms, mentioning their unique features and benefits? Does the model offer insights into how these tools can be effectively utilised for goal tracking? [Irrelevant (1), Somewhat relevant (2), Highly relevant and specific (3)]

Categorisation and organisation:

Does the AI model categorise or organise the recommended tools and platforms in a logical and easily understandable manner? Are the categories or groupings meaningful and helpful for users to select the most appropriate tool based on their needs? [Poorly organised (1), Moderately organised (2), Well-structured and organised (3)]

Factors and considerations:

Does the AI model discuss important factors and considerations for selecting goal-tracking tools, such as integration with existing systems, customisation options, collaboration features, and ease of use? Are the mentioned factors and considerations relevant and helpful for users to make informed decisions? [No factors mentioned (1), Some factors mentioned (2), Comprehensive factors and considerations (3)]

Best practices and insights:

Does the AI model provide valuable best practices and insights for effective KPI tracking and goal management? Are the best practices and insights practical, actionable, and aligned with industry standards? Do the insights go beyond just recommending tools and offer guidance on how to leverage them effectively? [No best practices or insights (1), Some best practices and insights (2), Comprehensive and valuable best practices and insights (3)]

Outputs and scores

ChatGPT

  • Comprehensiveness: 2 (Moderate)

ChatGPT provides a range of tools, including goal-tracking platforms, BI platforms, dashboarding tools, project management software, and integrated business systems. However, the list is not as extensive as some other models.

  • Relevance and specificity: 2 (Somewhat relevant)

The recommended tools are relevant to tracking KPIs against goals, but ChatGPT does not provide many specific examples or insights into how these tools can be effectively utilised for goal tracking.

  • Categorisation and organisation: 2 (Moderately organised)

ChatGPT categorises the tools into different types, but the categories are not as well-defined or structured as in some other models.

  • Factors and considerations: 2 (Some factors mentioned)

ChatGPT briefly mentions factors like ease of use, customisation options, integration capabilities, and scalability, but does not provide a comprehensive discussion of these factors.

  • Best practices and insights: 1 (No best practices or insights)

ChatGPT does not provide any best practices or insights for effective KPI tracking and goal management.

Total score = 9/15

Claude

  • Comprehensiveness: 3 (Comprehensive)

Claude provides a wide range of tools, including goal-tracking platforms, BI and analytics tools, spreadsheet software, project and task management tools, and specialised KPI tracking software.

  • Relevance and specificity: 3 (Highly relevant and specific)

The recommended tools are highly relevant to tracking KPIs against goals, and Claude provides specific examples of tools and their key features.

  • Categorisation and organisation: 3 (Well-structured and organised)

Claude organises the tools into clear categories, making it easy for users to understand and select the most appropriate tool based on their needs.

  • Factors and considerations: 3 (Comprehensive factors and considerations)

Claude discusses important factors and considerations for selecting goal-tracking tools, such as alignment with goal-setting requirements, integration with existing systems, ease of use, customisation options, and collaboration features.

  • Best practices and insights: 2 (Some best practices and insights)

Claude provides some insights into how goal-tracking platforms can help ensure transparency, accountability, and alignment throughout the organisation, but does not offer a comprehensive set of best practices.

Total score = 14/15

Gemini

  • Comprehensiveness: 3 (Comprehensive)

Gemini provides a diverse range of tools, including basic tools like spreadsheets and project management tools, intermediate tools like BI tools, goal-tracking platforms, and advanced tools like CRM and performance management software.

  • Relevance and specificity: 3 (Highly relevant and specific)

The recommended tools are highly relevant to tracking KPIs against goals, and Gemini provides specific examples of tools and their functionalities.

  • Categorisation and organisation: 3 (Well-structured and organised)

Gemini categorises the tools into basic, intermediate, and advanced levels, making it easy for users to select the most appropriate tool based on their needs and capabilities.

  • Factors and considerations: 3 (Comprehensive factors and considerations)

Gemini discusses important factors and considerations for selecting goal-tracking tools, such as data integration, customisation, and collaboration.

  • Best practices and insights: 1 (No best practices or insights)

Gemini does not provide any best practices or insights for effective KPI tracking and goal management.

Total score = 13/15

Perplexity

  • Comprehensiveness: 2 (Moderate)

Perplexity provides a range of tools, including dashboarding tools, spreadsheet software, and dedicated KPI tracking software. However, the list is not as extensive as some other models.

  • Relevance and specificity: 3 (Highly relevant and specific)

The recommended tools are highly relevant to tracking KPIs against goals, and Perplexity provides specific examples of tools and their functionalities.

  • Categorisation and organisation: 2 (Moderately organised)

Perplexity categorises the tools into different types, but the categories are not as well-defined or structured as in some other models.

  • Factors and considerations: 2 (Some factors mentioned)

Perplexity briefly mentions factors like ease of use, integration with existing systems, collaboration features, customisation options, and pricing model, but does not provide a comprehensive discussion of these factors.

  • Best practices and insights: 3 (Comprehensive and valuable best practices and insights)

Perplexity provides a comprehensive list of best practices for effective KPI tracking, including selecting relevant KPIs tied to business goals, using visual dashboards, regularly reviewing and updating KPIs, fostering a culture of accountability, and balancing simplicity and depth in reporting.

Total score = 12/15

Copilot

  • Comprehensiveness: 2 (Moderate)

Copilot provides a range of goal-tracking platforms and tools, but the list is primarily focused on productivity and habit-building apps rather than comprehensive business tools.

  • Relevance and specificity: 2 (Somewhat relevant)

While the recommended tools are related to goal tracking, they are not as specifically relevant to tracking KPIs in a business context.

  • Categorisation and organisation: 1 (Poorly organised)

Copilot does not provide any clear categorisation or organisation of the recommended tools.

  • Factors and considerations: 1 (No factors mentioned)

Copilot does not discuss any factors or considerations for selecting goal-tracking tools.

  • Best practices and insights: 1 (No best practices or insights)

Copilot does not provide any best practices or insights for effective KPI tracking and goal management.

Total score = 7/15

Test 3 chart

Test 3 WINNER = Claude 🏆

Test 4 - KPI benchmarks and targets

Prompt: Provide a list of benchmarks and realistic targets for each of these KPIs, including sources/citations.

Purpose

Setting benchmarks and realistic targets is crucial for performance management and goal setting. It helps in evaluating where your business stands in comparison to industry standards and in setting achievable objectives. This prompt assesses the AI’s ability to provide not only the benchmarks and targets but also credible sources for this information. It tests the AI's capability to deliver well-researched, accurate, and actionable data.

Insights

Benchmarks and targets

The AI models offer various benchmarks and realistic targets for each KPI, providing a general guideline for businesses to evaluate their performance. However, the specific benchmarks and targets may vary slightly between models, possibly due to differences in data sources and industry focus. All models stress the importance of tailoring benchmarks and targets to specific industries, company sizes, and business goals.
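The distinction the models draw between a benchmark (the industry reference point) and a target (the goal you set against it) lends itself to a simple comparison. A sketch with entirely hypothetical figures, noting that some KPIs improve as they go up (e.g. CSAT) and others as they go down (e.g. response time):

```python
# Hypothetical benchmarks, targets, and actuals; figures are illustrative only.
kpis = {
    "CSAT (%)": {
        "benchmark": 80, "target": 85, "actual": 78, "higher_is_better": True,
    },
    "First Response Time (h)": {
        "benchmark": 12, "target": 8, "actual": 7, "higher_is_better": False,
    },
}

def gap_to_target(kpi):
    """Positive when the actual value meets or beats the target."""
    sign = 1 if kpi["higher_is_better"] else -1
    return sign * (kpi["actual"] - kpi["target"])

status = {
    name: ("on track" if gap_to_target(k) >= 0 else "behind target")
    for name, k in kpis.items()
}
```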

Sources and citations

ChatGPT, Claude, and Copilot provide specific sources for their benchmarks, such as Zendesk Benchmark Reports, HubSpot Research, and industry-specific studies. Perplexity includes citations but does not mention the specific sources directly in the response. Gemini mentions sources but provides invalid URLs, making it difficult to verify the information.

Level of detail

Gemini and Claude provide a comprehensive list of KPIs along with benchmarks, targets, and sources, giving a well-rounded overview. Gemini takes it a step further by categorising the KPIs into resolution rates, response times, customer effort, efficiency and volume, and loyalty and advocacy, offering a structured approach to understanding the metrics. Perplexity and Copilot offer a more concise list of KPIs with benchmarks and targets, focusing on the essential metrics.

Additional insights

Gemini emphasises the importance of regularly monitoring KPIs, analysing trends, and adjusting goals as needed for continuous improvement. Copilot highlights the need to optimise costs while maintaining quality when setting targets for the cost per conversation. Perplexity stresses the importance of setting realistic yet challenging targets based on industry averages and benchmarks of top performers.

Formatting and presentation

ChatGPT, Claude, and Gemini use bullet points and headings to organise the information, making it easier to read and understand. Perplexity presents the information in a numbered list format, which is clear but lacks visual separation between KPIs. Copilot uses a combination of bullet points and bold text to highlight key information.

Evaluation criteria

Comprehensiveness:

Does the AI model cover a wide range of relevant customer service KPIs? Are the KPIs well-defined and explained clearly? [Limited (1), Moderate (2), Comprehensive (3)]

Benchmark and target quality:

Are the provided benchmarks and targets realistic and aligned with industry standards? Does the AI model offer a clear distinction between benchmarks and realistic targets? Are the benchmarks and targets specific and measurable? [Poor (1), Average (2), High (3)]

Sources and credibility:

Does the AI model provide credible sources to support the benchmarks and targets? Are the sources relevant and authoritative in the customer service industry? [No sources (1), Some sources but lacking credibility (2), Credible and relevant sources (3)]

Clarity and organisation:

Is the information presented in a clear, concise, and well-organised manner? Does the AI model use formatting techniques (e.g., bullet points, headings) to enhance readability? [Unclear and disorganised (1), Somewhat clear and organised (2), Very clear and well-organised (3)]

Actionable insights:

Does the AI model offer actionable insights or recommendations beyond just providing benchmarks and targets? Are the insights valuable for businesses looking to improve their customer service performance? [No actionable insights (1), Some actionable insights (2), Highly actionable and valuable insights (3)]

Outputs and scores

ChatGPT

  • Comprehensiveness: 3 (Comprehensive)

ChatGPT covers a wide range of relevant customer service KPIs, including CSAT (customer satisfaction), NPS (Net Promoter Score), CES (customer effort score), FRT (first response time), ART (average resolution time), FCR (first contact resolution), SLA compliance, customer retention rate, churn rate, and CLV (customer lifetime value). The KPIs are well-defined and explained clearly.

  • Benchmark and target quality: 3 (High)

The provided benchmarks and targets are realistic and aligned with industry standards. ChatGPT offers a clear distinction between benchmarks and realistic targets for each KPI. The benchmarks and targets are specific and measurable.

  • Sources and credibility: 3 (Credible and relevant sources)

ChatGPT provides credible sources to support the benchmarks and targets, such as Zendesk Benchmark Reports, Net Promoter Network, and Harvard Business Review. The sources are relevant and authoritative in the customer service industry.

  • Clarity and organisation: 3 (Very clear and well-organised)

The information is presented in a clear, concise, and well-organised manner. ChatGPT uses formatting techniques like bullet points and bold text to enhance readability.

  • Actionable insights: 2 (Some actionable insights)

While ChatGPT provides some insights on improving customer experiences and loyalty, the actionable recommendations are limited.

Total score = 14/15

Claude

  • Comprehensiveness: 3 (Comprehensive)

Claude covers a wide range of relevant customer service KPIs, including FRT, ART, CSAT, NPS, ticket volume, ticket backlog, FCR, customer retention rate, churn rate, and employee engagement. The KPIs are well-defined and explained clearly.

  • Benchmark and target quality: 3 (High)

The provided benchmarks and targets are realistic and aligned with industry standards. Claude offers a clear distinction between benchmarks and realistic targets for each KPI. The benchmarks and targets are specific and measurable.

  • Sources and credibility: 3 (Credible and relevant sources)

Claude provides credible sources to support the benchmarks and targets, such as HubSpot Research, Zendesk Benchmark, and Gallup State of the Global Workplace. The sources are relevant and authoritative in the customer service industry.

  • Clarity and organisation: 3 (Very clear and well-organised)

The information is presented in a clear, concise, and well-organised manner. Claude uses formatting techniques like bullet points and numbering to enhance readability.

  • Actionable insights: 2 (Some actionable insights)

Claude provides some guidance on tailoring the benchmarks and targets to specific business contexts but lacks in-depth actionable insights.

Total score = 14/15

Gemini

  • Comprehensiveness: 3 (Comprehensive)

Gemini covers a wide range of relevant customer service KPIs, categorised into resolution rates, response times, customer effort, efficiency and volume, and loyalty and advocacy. The KPIs are well-defined and explained clearly.

  • Benchmark and target quality: 2 (Average)

While Gemini provides benchmarks and targets for most KPIs, some benchmarks are missing or not clearly defined. The targets are mostly specific and measurable, but some lack clarity.

  • Sources and credibility: 1 (No sources)

Gemini cites sources, but the URLs it provides are invalid, so the information cannot be verified — effectively no better than providing no sources at all.

  • Clarity and organisation: 3 (Very clear and well-organised)

The information is presented in a clear, concise, and well-organised manner. Gemini uses formatting techniques like headings, bullet points, and bold text to enhance readability.

  • Actionable insights: 3 (Highly actionable and valuable insights)

Gemini provides highly actionable insights and recommendations, such as analysing trends, identifying areas for improvement, and setting achievable targets. The insights are valuable for businesses looking to improve their customer service performance.

Total score = 12/15

Perplexity

  • Comprehensiveness: 3 (Comprehensive)

Perplexity covers a wide range of relevant customer service KPIs, including first response time, average resolution time, FCR, CSAT, NPS, call abandonment rate, agent utilisation rate, cost per resolution, knowledge base views, and churn rate. The KPIs are well-defined and explained clearly.

  • Benchmark and target quality: 3 (High)

The provided benchmarks and targets are realistic and aligned with industry standards. Perplexity offers a clear distinction between benchmarks and realistic targets for each KPI. The benchmarks and targets are specific and measurable.

  • Sources and credibility: 2 (Some sources but lacking credibility)

Perplexity includes numbered citations but does not name the underlying sources in the response itself, making their credibility difficult to assess.

  • Clarity and organisation: 2 (Somewhat clear and organised)

The information is presented in a numbered list format, which is clear but lacks visual separation between KPIs. The formatting could be improved to enhance readability.

  • Actionable insights: 2 (Some actionable insights)

Perplexity provides some guidance on adjusting targets based on specific business contexts but lacks in-depth actionable insights.

Total score = 12/15

Copilot

  • Comprehensiveness: 2 (Moderate)

Copilot covers a range of customer service KPIs but misses some important ones like customer retention rate and churn rate. The KPIs are well-defined and explained clearly.

  • Benchmark and target quality: 3 (High)

The provided benchmarks and targets are realistic and aligned with industry standards. Copilot offers a clear distinction between benchmarks and realistic targets for each KPI. The benchmarks and targets are specific and measurable.

  • Sources and credibility: 3 (Credible and relevant sources)

Copilot provides credible sources to support the benchmarks and targets, such as Zendesk Benchmark Report, Freshdesk Benchmark Report, and HubSpot Customer Service Benchmark. The sources are relevant and authoritative in the customer service industry.

  • Clarity and organisation: 3 (Very clear and well-organised)

The information is presented in a clear, concise, and well-organised manner. Copilot uses formatting techniques like bullet points and bold text to enhance readability.

  • Actionable insights: 2 (Some actionable insights)

Copilot provides some guidance on continuously optimising costs while maintaining quality but lacks in-depth actionable insights.

Total score = 13/15

Test 4 chart

Test 4 WINNERS = ChatGPT & Claude 🏆

Final results

High-level takeaways

ChatGPT

ChatGPT did a great job giving thorough and accurate info, backed up with credible sources and clear organisation. It explained NPS well. To improve, it could offer more actionable insights and practical tips for putting KPIs and goal-tracking tools into practice.

Claude

Claude consistently aced all the tests, scoring high in comprehensiveness, accuracy, relevance, and clarity. It excelled in giving specific examples and actionable recommendations. To step it up, it could dive deeper into best practices and insights for effective KPI tracking and goal management.

Gemini

Gemini shone in organising and structuring information, making it user-friendly and practical. It offered actionable insights and detailed explanations. However, it could improve by including credible sources to back up the information and ensuring consistency in benchmark and target quality.

Perplexity

Perplexity did exceptionally well in explaining NPS, offering relevant insights, and citing credible sources. It also provided comprehensive best practices for effective KPI tracking. To enhance its performance, it could improve the clarity and organisation of the information presented and offer more specific recommendations for goal-tracking tools.

Copilot

Copilot did well in providing accurate and relevant information, backed by credible sources. It excelled in offering clear and well-organised responses. However, it lacked comprehensiveness in covering all relevant KPIs and goal-tracking tools, and it provided limited actionable insights and best practices.

Overall results chart

Claude emerged as the top performer, consistently providing comprehensive, accurate, and actionable information across all tests. ChatGPT followed closely, with strong performance in most areas. Gemini and Perplexity had mixed results, excelling in some aspects but lacking in others. Copilot, while providing clear and accurate information, had the lowest overall score due to limitations in comprehensiveness and actionable insights.

The wrap

From our experiment, we observed significant variability in the performance of the AI models we tested. While Claude emerged as the top performer overall, each tool provided valuable insights and was helpful in its own way.

It's important to remember that the quality of an AI's output largely depends on the specificity, clarity, and context of the prompts it receives. When using AI to help establish KPIs, well-defined and contextually appropriate prompts are crucial to getting relevant, actionable output. Get the input right, and you can fully leverage AI to establish meaningful, data-driven KPIs that accurately reflect your business goals and drive performance improvements.

Generate OKRs and SMART goals with AI

Say hello to a smarter way to set and track your goals. With Tability’s AI-powered goals generator, you can create meaningful goals that align with your business objectives in seconds. This feature is available inside Tability, or you can try out our free tool. Start achieving your goals with confidence 👉 try Tability today for free.

Enjoy this post?

You might also like:

🎯 SMART marketing goals: 50 ChatGPT prompts to try now

🎯 How to leverage generative AI to set awesome OKRs

🎯 100+ examples of KPIs and success metrics for every business and function


Jeremy Yancey

Head of Content, Tability
