Blog - Lucy Davis

Article: The impact of AI on the luxury market

Testing tone detection and guidance in large language models

I led a project to explore whether it’s possible to use large language models (LLMs) to give tone guidance in customer support.
The experiments focused on tone detection and enforcement.
Fast turnaround (3 days).
This is a dipstick set of tests, not a rigorous study, owing to the timeframe.
The results were promising, both from the 175B parameter model used, and from the 70B model.
- Both could consistently differentiate between professional and unprofessional language.
- Both could also use lists of rules to give feedback on acceptable and unacceptable agent utterances.
The models’ attempts to identify differences in tone in a more nuanced way (e.g. eager vs helpful) were less consistent, perhaps reflecting the subjectivity in interpreting tone and/or the complexity of the task.
The project won an award (“Creative Trailblazer”) as part of an LLM sprint.

Have you ever spoken to a customer service agent who has really got on your nerves?

Perhaps they are too informal and chatty when you'd prefer something more transactional. Or perhaps you are from a culture where you are accustomed to informality and friendliness.

Maybe you just encountered someone who seemed rude or hostile?

It can be hard to control the quality of customer support interactions. Companies want to give agents freedom to be human and do their job with empathy. But sometimes personal or cultural differences lead to messaging that is off-brand or tonally wrong.

Global outsourcing of customer support means customers could be chatting with someone with a very different perception of tone. What might seem empathetic in one culture can be jarring in another.

Compile

Label

Experiment

Iterate

Red flags
- Exposing legal risk
- Sexist
- Flattering
- Inappropriate emojis

Lack of empathy
- Apportioning blame
- Hostile
- Uncaring
- Judgemental

Wrong tone
- Using jargon
- Too colloquial
- Flowery language
- Over-familiar
- Over-enthusiastic

LIST OF RULES

Give the model a list of rules to follow, and then a number of good or bad examples of utterances to analyse and rewrite.
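A minimal sketch of how a prompt for this kind of experiment could be assembled. The function name `build_rules_prompt`, the prompt layout and the example rules are illustrative assumptions, not the prompts used in the project, and no particular model API is assumed.

```python
def build_rules_prompt(rules, utterances, tone="friendly but professional"):
    """Assemble a plain-text prompt: numbered rules first, then utterances to review."""
    lines = [
        f"You are reviewing customer service messages. The desired tone is {tone}.",
        "Rules:",
    ]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    lines.append("For each utterance, say which rules (if any) it breaks, then rewrite it.")
    lines.append("Utterances:")
    lines += [f"- {u}" for u in utterances]
    return "\n".join(lines)


# Hypothetical rules and utterances, for illustration only.
prompt = build_rules_prompt(
    rules=["Never blame the customer.", "Do not use jargon."],
    utterances=["You should have read the returns policy."],
)
print(prompt)
```

Keeping the rules numbered makes it easier to compare the rule the model says was broken with the rule that was actually broken.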

PERSONA TEST

Tell the model it is an agent with a persona, then ask it to rewrite example utterances.

TONE IDENTIFICATION

List out some utterances in the prompt and ask the model to identify the tone.

LINGUISTIC INSTRUCTION

Can the model perform simple content changes like changing phrases from passive to active voice?

Both models could usually detect when a phrase wasn't right. However, they didn't always correctly identify exactly what rules had been broken.

Rewrites were, more often than not, an improvement on the original version.

The most successful prompts specified the desired tone of voice, e.g. "friendly but professional", or specified that these were rules for "customer service". When either of these was specified in the prompt, the responses were usually close to the mark, with acceptable rewrites even when the models' explanation of what was wrong with the original was off the mark.

One of the surprises from this experiment, for me, was that the models could handle quite complex, multi-step instructions.

Another surprise was that the 70B model struggled with more black and white rules. An example of one that I tested was "never say sorry or apologise". The model failed on multiple occasions to get this right, either failing to identify the apologetic agent, or going too far in the other direction and identifying this as the rule that was broken every time, even when it wasn't.
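For a black and white rule like this one, a simple deterministic check could act as a cross-check on the model's verdict. The sketch below is an assumption about how that might look, not something that was done in the project. The word list in `APOLOGY_PATTERN` is illustrative and far from exhaustive.

```python
import re

# Illustrative pattern only: "sorry", "apology/apologies", "apologise/apologize" and inflections.
APOLOGY_PATTERN = re.compile(
    r"\b(sorry|apolog(?:y|ies|i[sz]e[sd]?|i[sz]ing))\b",
    re.IGNORECASE,
)


def breaks_no_apology_rule(utterance: str) -> bool:
    """Return True if the utterance contains an obvious apology word."""
    return APOLOGY_PATTERN.search(utterance) is not None
```

A check like this only catches explicit wording. An implied apology ("we regret the delay") would still need the model, or a human, to spot it.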

The models often correctly identified the tone. However, they did not label the tones consistently (especially the 70B model). Thanks to the wonderfully huge number of synonyms we have in the English language, the same phrase could be labelled completely differently each time, without being wrong.

And let’s face it, tone is also quite subjective. In that sense, the model accurately reflected humanity: one person's description of the same phrase could be completely different to another's. Each time, the response could be valid.

So was this tone identification experiment successful? Well, although the labels applied to different tones varied considerably, they were consistent in terms of where you might put them on a positive to negative spectrum. The model might label the same phrase as ‘hostile’, ‘rude’ or ‘uncaring’, but you’d class all of these as being undesirable and something to correct.
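The idea of judging labels by where they sit on a positive to negative spectrum could be sketched like this. The `POLARITY` mapping is a hypothetical, hand-written lookup for illustration. It is not a list produced in the experiments.

```python
# Hypothetical mapping from free-text tone labels to a coarse polarity bucket.
POLARITY = {
    "hostile": "negative",
    "rude": "negative",
    "uncaring": "negative",
    "helpful": "positive",
    "friendly": "positive",
    "professional": "positive",
}


def polarity_of(label: str) -> str:
    """Normalise a model-produced tone label and look up its bucket."""
    return POLARITY.get(label.strip().lower(), "unknown")


def agree_on_polarity(labels) -> bool:
    """True if every label for the same phrase falls in the same known bucket."""
    buckets = {polarity_of(label) for label in labels}
    return len(buckets) == 1 and "unknown" not in buckets
```

With this mapping, `agree_on_polarity(["hostile", "rude", "uncaring"])` is `True` even though the three labels differ, which is the sense in which the varied labels were still consistent.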

I also observed that the model made a distinction between professional and unprofessional tones with reassuring consistency.

The experiments were carried out over just 2 days so the results are interesting but not robust. More methodical testing needs to be done. With that caveat in mind:

1. Overall the LLMs tested did well in rephrasing unprofessional-sounding phrases into more professional ones.

2. Sometimes the rephrased sentences were a bit formal, although still an improvement on the worst messages.

3. Results were surprisingly nuanced when it came to instructions around tone.

4. Both models could handle multi-step instructions.

5. Tone labelling was not always consistent. Perhaps this is not surprising given how subjective tone is.

These results were interesting, but I’d like to test them more quantitatively. I’d work with a developer to see how we could do this programmatically.
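One possible starting point for a quantitative test is to run the same prompt several times and measure how often the model returns its most common label. `label_consistency` is a hypothetical helper, and the labels in the usage example are made up for illustration. They are not results from the experiments.

```python
from collections import Counter


def label_consistency(labels):
    """Return (most common label, share of runs that produced it)."""
    if not labels:
        raise ValueError("need at least one label")
    normalised = [label.strip().lower() for label in labels]
    top_label, count = Counter(normalised).most_common(1)[0]
    return top_label, count / len(normalised)


# Made-up labels from five hypothetical runs on one utterance.
runs = ["Hostile", "rude", "hostile", "uncaring", "hostile"]
top, share = label_consistency(runs)
print(top, share)  # prints: hostile 0.6
```

A share close to 1.0 would suggest stable labelling for that utterance. Lower values would flag phrases where the label varies from run to run.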

I only tested on two models. I’d be curious to test on others e.g. Bard.

I’d also work on a prototype to look at how we could use this within customer service interfaces.

Designing the Meta Verified chatbot to offer customers quicker resolution

Analysis

Identify

Design

Iteration

Handoff

This was a rapid turnaround project, so it was essential to quickly build a picture of the user with existing resources.

To understand the users’ problems without conducting any fresh research, there was one source of data: agent transcripts. Analysing these would be the key to understanding what the most common issues were and why.

A colleague had already shared a quantitative analysis but deep diving into the transcripts allowed me to understand:

- User issues
- What was causing them
- The resolution they sought

I designed a simple chatbot flow by wireframing in a Figma flowchart.

I took into account:

- Instagram style guidelines
- Failure modes
- Order of menu options
- Ease of access to human handover
- Providing links to longer help articles
- UI choices - pros and cons of various options
- UI limitations
- Chunking for readability
- Flesch reading score
- Writing the correct responses based on user region
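For reference, the Flesch Reading Ease score mentioned in the list above is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below is an approximation: the syllable counter is a rough heuristic of my own choosing, so scores will differ slightly from those of dedicated tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))
```

Short sentences made of short words score highest, which is why chunking bot messages tends to improve the score as well as the on-screen readability.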

Although the bot flow itself is quite simple, there were plenty of considerations to work through!

Creating individual chatbots for clients using a scalable system

I led a project to create a system that would enable clients to customise their chatbots.
This had previously been done on a consultancy (one to one) basis and needed to be systematised.
I used user and competitive data to suggest a way to balance customisation needs and ability to scale.
I created a large set of utterances in two popular tones, while still allowing exceptions.
The ‘out of the box’ model was taken up by clients for 70%+ of utterances and led to a more efficient use of samples for the training set.
In order to make the process even more efficient, I created a spreadsheet that generated the client file for the NLP model.

HelloDone make chatbots powered by a proprietary natural language processing (NLP) model.

The bots are used by retailers to help their customers track and manage their deliveries in WhatsApp and Messenger, along with answering any questions they might have about the order, the delivery, returns or shopping in general.

The chatbots use NLP to understand what people are asking for and give pre-designed responses in return.

The problem with pre-written responses is that clients all have different voices and tones, but creating individual bespoke bots for hundreds of clients is not scalable.

The team needed a system for customisation.

Understand user needs

Competitive analysis

Design

Create a system

We had a fantastic source of data about retailers' customers: their conversations. They were already telling us what they wanted to know. I could look both quantitatively and qualitatively at what the most common queries were and what people might then ask after they had the answer.

I could also look at the language people used and how they asked questions or followed up with responses, so that sample design could reflect real users' phraseology.

I'm a firm believer in not reinventing the wheel: our clients already had a deep knowledge of their customers. Working with stakeholders across multiple businesses was another route into understanding what their customers want to know in the "final mile" of the delivery process.

It wasn't enough to just look at our existing client base. If we were to expand, who would our dream clients be? How did they talk to their customers? What tone would their chatbot have?

A deep dive into retailer help centres and customer support systems showed that, while brands had their own voices and tones, these could be narrowed down to two styles of communication, at least for an MVP.

These styles were really just degrees of formality. I classified them as "Polite and friendly" or "Chatty and informal".

OPT IN

Choice of whether to include a question or not (for example an online business would not need to answer many questions about bricks and mortar stores).

CHOICE OF TONE

Polite or chatty pre-designed utterance, or client’s own customised response.
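The opt-in and tone choices could be resolved per client along these lines. This is a sketch under assumptions: the data shapes, intent names and response wording are hypothetical, and the real spreadsheet and NLP file format are not described here.

```python
# Hypothetical pre-designed responses, keyed by intent and then by tone.
DEFAULT_RESPONSES = {
    "store_opening_hours": {
        "polite": "Our stores are open from 9am to 5pm, Monday to Saturday.",
        "chatty": "We're open 9 till 5, Monday to Saturday. Pop in!",
    },
    "track_order": {
        "polite": "You can track your order using the link below.",
        "chatty": "Here's your tracking link. Not long now!",
    },
}


def build_client_responses(client_config, defaults=DEFAULT_RESPONSES):
    """Resolve one response per opted-in intent for a single client.

    client_config maps intent -> {"opt_in": bool, "tone": "polite" | "chatty" | "custom",
    "custom_text": str (only needed when tone is "custom")}.
    """
    resolved = {}
    for intent, choice in client_config.items():
        if not choice.get("opt_in", False):
            continue  # opted out, e.g. an online-only retailer skipping store questions
        if choice["tone"] == "custom":
            resolved[intent] = choice["custom_text"]
        else:
            resolved[intent] = defaults[intent][choice["tone"]]
    return resolved


# Hypothetical online-only client: no store questions, chatty tracking message.
online_only = {
    "store_opening_hours": {"opt_in": False},
    "track_order": {"opt_in": True, "tone": "chatty"},
}
print(build_client_responses(online_only))
```

Keeping the custom response as an explicit third option means exceptions stay possible without every client needing a bespoke set of utterances.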

Develop a user interface for clients to set up their bot.

Check whether the two tones hypothesis works outside the UK for international expansion.

Continuous research to check that bot is answering user needs.

Just a minute! Designing more accessible deliveries

People with temporary or permanent disabilities sometimes have difficulties with home deliveries because the driver doesn’t wait long enough for them to get to the door.
I designed a chatbot flow that allows people to ask the driver to give them more time to answer the door, while being sensitive to courier companies’ need to deliver as many parcels as efficiently as possible.
This functionality was only offered with one delivery carrier, so I also had to design to handle expectations when a parcel was not eligible for the service.
This was the first time this service was offered for online deliveries.

Planning a train journey with a chatbot (and other commuter delights)

I worked in a two-person design team on a chatbot using conversational AI to help commuters with their everyday transport needs.
We worked to understand commuter pain points.
The most complex part of the design was a fully conversational journey planner.
The most delightful part of the design was a function that told passengers where to stand so that they could board the train near their seat.
I was responsible for writing the tests for a proprietary program to check we were providing the best possible customer service.
Commuters using the chatbot (on WhatsApp) expressed greater satisfaction with the train brand and perceived that they could carry out their desired action more quickly than by phone or face-to-face contact.


If you'd like to talk, I'd love to hear from you.

MORE WORK

Just a minute!

Designing more accessible deliveries

KEY INFORMATION

FULL CASE STUDY COMING SOON