KEY INFORMATION
- Article published in Fox Quarterly Review
- https://www.foxcomms.com/foxquarterly/how-artificial-intelligence-is-transforming-the-luxury-sector/
Have you ever spoken to a customer service agent who has really got on your nerves?
Perhaps they were too informal and chatty when you'd have preferred something more transactional. Or perhaps you're from a culture where you're accustomed to informality and friendliness, and a more formal agent came across as cold.
Maybe you just encountered someone who seemed rude or hostile?
It can be hard to control the quality of customer support interactions. Companies want to give agents freedom to be human and do their job with empathy. But sometimes personal or cultural differences lead to messaging that is off-brand or tonally wrong.
Global outsourcing of customer support means customers could be chatting with someone with a very different perception of tone. What might seem empathetic in one culture can be jarring in another.
Even with extensive training, agents are human and make mistakes. What if in-product guidance could pick up potentially annoying messaging and suggest improvements before agents hit the send button?
Could large language models power this? Are they sophisticated enough to pick up nuances in tone or breaches of quality and then rephrase messages accordingly?
That's what this set of experiments aimed to find out.
The first step was to compile a set of real agent utterances with a broad range of messages to test various tones.
Standards across different customer service teams vary considerably, so the collection covered a wide spectrum of utterances.
The ones that needed guidance fell broadly into three categories: red flags, lack of empathy and wrong tone.
Exposing legal risk
Sexist
Flattering
Inappropriate emojis
Apportioning blame
Hostile
Uncaring
Judgemental
Using jargon
Too colloquial
Flowery language
Over-familiar
Over-enthusiastic
The 50-60 utterances were each labelled to show what was wrong and paired with a corrected version, for use in 4 prompt experiments.
The experiments used different styles of prompt (around 50-100 prompt variations and regenerations per experiment) in two LLMs: one 70B parameter model and one 175B parameter model.
Give the model a list of rules to follow, and then a number of good or bad examples of utterances to analyse and rewrite.
Tell the model it is an agent with a persona and then ask it to rewrite examples of utterances.
List out some utterances in the prompt and ask the model to identify the tone.
Can the model perform simple content changes like changing phrases from passive to active voice?
I tried various prompts with variations on the following structure:
Here is a list of 5-10 content guidance rules {list of rules}; here is a list of agent utterances {list of utterances}; identify which ones have broken the rules and why, and rewrite them.
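A minimal sketch of how a prompt with this structure might be assembled programmatically. The helper name `build_rules_prompt` and the sample rules and utterances are illustrative assumptions, not the ones used in the experiments, and no particular model API is assumed.

```python
def build_rules_prompt(rules, utterances):
    """Assemble a rules-plus-utterances prompt (hypothetical helper)."""
    rule_lines = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    utterance_lines = "\n".join(
        f"{i}. {text}" for i, text in enumerate(utterances, start=1)
    )
    return (
        "Here is a list of content guidance rules for customer service:\n"
        f"{rule_lines}\n\n"
        "Here is a list of agent utterances:\n"
        f"{utterance_lines}\n\n"
        "Identify which utterances have broken the rules and why, "
        "then rewrite them in a friendly but professional tone."
    )


# Illustrative inputs only; not taken from the experiments.
example_rules = ["Never blame the customer.", "Avoid jargon."]
example_utterances = ["You entered the wrong address, so that's on you."]
prompt = build_rules_prompt(example_rules, example_utterances)
print(prompt)
```

Numbering both lists makes it easier to match the model's answer back to a specific rule and utterance.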
Both models could usually detect when a phrase wasn't right. However, they didn't always correctly identify exactly what rules had been broken.
Rewrites were, more often than not, an improvement on the original version.
The most successful prompts specified the desired tone of voice, e.g. "friendly but professional", or specified that these were rules for "customer service". When either was specified in the prompt, the rewrites were usually acceptable, even if the models' explanation of what was wrong with the original was off the mark.
One of the surprises for me from this experiment was that the models could handle quite complex, multi-step instructions.
Another surprise was that the 70B model struggled with more black-and-white rules. One example I tested was "never say sorry or apologise". The model got this wrong on multiple occasions, either failing to identify the apologetic agent or going too far in the other direction and flagging this rule as broken every time, even when it wasn't.
For this prompt experiment, the models were given a personality type and asked to rewrite phrases that were tonally wrong.
You are a customer service agent for {brand + characteristics}. Rewrite the following phrases {list of phrases}.
I experimented with various personae, including "a hip technology company employee", and named various brands (Apple, BT and gov.uk, for example) and characteristics (including "hip", "trendy", "progressive" and "polite").
Interestingly, this form of prompt tended to result in high-quality rewrites, particularly for the 70B model, despite being much less prescriptive.
I was interested in getting to an informal but professional voice. The most effective persona for this was the 'hip technology company customer support agent'.
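Prompt variations like these could be generated rather than typed by hand. The sketch below is one possible approach, not how the experiments were run: the template wording, the function name `persona_prompts` and the sample phrase are assumptions, while the brands and characteristics are the ones named above.

```python
from itertools import product

# Assumed template wording; the experiments used variations on this idea.
TEMPLATE = (
    "You are a {trait} customer service agent for {brand}. "
    "Rewrite the following phrases:\n{phrases}"
)


def persona_prompts(brands, traits, phrases):
    """Build one prompt per (brand, trait) combination."""
    phrase_block = "\n".join(f"- {p}" for p in phrases)
    return [
        TEMPLATE.format(trait=trait, brand=brand, phrases=phrase_block)
        for brand, trait in product(brands, traits)
    ]


prompts = persona_prompts(
    brands=["Apple", "BT", "gov.uk"],
    traits=["hip", "trendy", "progressive", "polite"],
    phrases=["Calm down, we'll get to it when we can."],  # invented example
)
print(len(prompts))  # 3 brands x 4 traits = 12 prompts
print(prompts[0])
```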
This prompt listed varying numbers of customer service messages and asked the models to identify the tone of each message.
The models often correctly identified the tone. However, they did not label the tones consistently (especially the 70B model). Thanks to the wonderfully huge number of synonyms we have in the English language, the same phrase could be labelled completely differently each time, without the label being wrong.
And let’s face it, tone is also quite subjective. In that sense, the model accurately reflected humanity: one person's description of the same phrase could be completely different to another's. Each time, the response could be valid.
So was this tone identification experiment successful? Well, although the labels applied to different tones varied considerably, they were consistent in terms of where you might put them on a positive to negative spectrum. The model might label the same phrase as ‘hostile’, ‘rude’ or ‘uncaring’, but you’d class all of these as being undesirable and something to correct.
I also observed that the model made a distinction between professional and unprofessional tones with reassuring consistency.
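One way to make inconsistent labels usable is to collapse them onto the positive-to-negative spectrum described above. This is a rough sketch under stated assumptions: only "hostile", "rude" and "uncaring" come from the experiment, and the positive labels, function names and majority rule are hypothetical.

```python
# 'hostile', 'rude' and 'uncaring' are from the write-up; the rest are assumed.
NEGATIVE = {"hostile", "rude", "uncaring"}
POSITIVE = {"friendly", "professional", "polite"}


def polarity(label):
    """Map a free-text tone label to a coarse polarity."""
    key = label.strip().lower()
    if key in NEGATIVE:
        return "negative"
    if key in POSITIVE:
        return "positive"
    return "unknown"


def needs_correction(labels):
    """True if more than half the labels given to one utterance are negative."""
    polarities = [polarity(label) for label in labels]
    return polarities.count("negative") > len(polarities) / 2


# Labels the model might give the same phrase across three regenerations.
runs = ["Hostile", "rude", "uncaring"]
print([polarity(label) for label in runs])
print(needs_correction(runs))
```

A lookup table like this would need to grow as new synonyms appear, so unknown labels are reported rather than guessed.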
This was a very quick experiment to start to look at whether LLMs can act on simple linguistic instructions.
The application for this would be to help customer support agents stick to a brand's style guide.
For example, would an LLM be able to prompt an agent to reword their sentence from the passive voice to the active voice?
Both models coped well with this task.
The experiments were carried out over just 2 days, so the results are interesting but not robust. More methodical testing needs to be done. With that caveat:
1. Overall the LLMs tested did well in rephrasing unprofessional-sounding phrases into more professional ones.
2. Sometimes the rephrased sentences were a bit formal, although still an improvement on the worst messages.
3. Results were surprisingly nuanced when it came to instructions around tone.
4. Both models could handle multi-step instructions.
5. Tone labelling was not always consistent. Perhaps this is not surprising given how subjective tone is.
These results were interesting, but I’d like to test them more quantitatively. I’d work with a developer to see how we could do this programmatically.
I only tested on two models. I’d be curious to test on others e.g. Bard.
Work on a prototype to look at how we could use this within customer service interfaces.
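As a starting point for the quantitative testing mentioned in these next steps, the hand-labelled utterances could be compared with model output. This is a sketch only: the function names, utterance IDs and labels are hypothetical, and exact string matching is a simplifying assumption given how many synonyms the models used.

```python
from collections import Counter


def score_run(expected, predicted):
    """Fraction of utterances whose predicted label matches the hand label."""
    matches = sum(
        1 for utterance_id, label in expected.items()
        if predicted.get(utterance_id) == label
    )
    return matches / len(expected)


def label_agreement(labels_per_run):
    """Share of regenerations that agree with the most common label."""
    counts = Counter(labels_per_run)
    return counts.most_common(1)[0][1] / len(labels_per_run)


# Invented data for illustration.
expected = {"u1": "hostile", "u2": "using jargon"}
predicted = {"u1": "hostile", "u2": "too colloquial"}
print(score_run(expected, predicted))
print(label_agreement(["hostile", "rude", "hostile", "hostile"]))
```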
Meta Verified subscribers couldn’t find the answers to simple customer support queries, so were contacting customer support agents to resolve their problems.
The proposed solution was to rapidly design and build a chatbot to answer the most common questions.
Customers would still have human support one click away.
But if their questions were answered faster, it would be a good outcome for everyone.
This was a rapid turnaround project, so it was essential to quickly build a picture of the user with existing resources.
To understand the users’ problems without conducting any fresh research there was one source of data: agent transcripts. Analysing these would be the key to understanding what the most common issues were and why.
A colleague had already shared a quantitative analysis but deep diving into the transcripts allowed me to understand:
I designed a simple chatbot flow by wireframing in a Figma flowchart.
I took into account:
Although the bot flow itself is quite simple, there were plenty of considerations to work through!
Most common user issues first.
Work within character limits, reword for localization.
Separate long text into individual message bubbles.
Inline, help articles or human handover.
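To illustrate the character-limit and message-bubble considerations, here is a rough sketch of splitting a long answer into separate bubbles. The function name, the 40-character limit and the sample copy are assumptions for illustration, not details of the shipped chatbot.

```python
import re


def split_into_bubbles(text, max_chars=120):
    """Greedily pack whole sentences into bubbles of at most max_chars.

    A single sentence longer than max_chars is kept whole, not cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    bubbles, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            bubbles.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        bubbles.append(current)
    return bubbles


# Invented copy for illustration.
answer = (
    "Your subscription renews monthly. You can cancel at any time. "
    "Need more help? Talk to an agent."
)
bubbles = split_into_bubbles(answer, max_chars=40)
for bubble in bubbles:
    print(bubble)
```

Splitting on sentence boundaries keeps each bubble readable on its own, which also helps when the copy is reworded for localization.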
The chatbot was shipped and is currently shown to Meta Verified subscribers seeking support on Instagram.
While I can't share details of the impact, my work on it led to me winning an internal award for moving metrics.
HelloDone make chatbots powered by a proprietary natural language processing (NLP) model.
The bots are used by retailers to help their customers track and manage their deliveries in WhatsApp and Messenger, along with answering any questions they might have about the order, the delivery, returns or shopping in general.
The chatbots use NLP to understand what people are asking for and give pre-designed responses in return.
The problem with pre-written responses is that clients all have different voices and tones, but creating individual bespoke bots for hundreds of clients is not scalable.
The team needed a system for customisation.
We had a fantastic source of data about retailers' customers: their conversations. They were already telling us what they wanted to know. I could look both quantitatively and qualitatively at what the most common queries were and what people might then ask after they had the answer.
I could also look at the language people used and how they asked questions or followed up with responses, so that sample design could reflect real users' phraseology.
I'm a firm believer in not reinventing the wheel: our clients already had a deep knowledge of their customers. Working with stakeholders across multiple businesses was another route into understanding what their customers want to know in the "final mile" of the delivery process.
It wasn't enough to just look at our existing client base. If we were to expand, who would our dream clients be? How did they talk to their customers? What tone would their chatbot have?
A deep dive into retailer help centres and customer support systems showed that, while brands had their own voices and tones, they could be narrowed down to two styles of communication, at least for an MVP.
These styles were really just degrees of formality. I classified them as "Polite and friendly" or "Chatty and informal".
Professional
Clear
Pleasant
Straightforward
Upbeat
Friendly
Emojis
Idiomatic
Having two sets of "out of the box" utterances was the first part of the scalable system.
The second part was to make it easy to set up a new bot.
Before the team invested time in creating a user interface specifically for this job, we needed to test an MVP.
Using macros and formulae in Excel, I created a spreadsheet that generated the client file for the NLP model, meaning all the bot utterances could be defined and code-ready in one session just by choosing options in the sheet.
Choice of whether to include a question or not (for example an online business would not need to answer many questions about bricks and mortar stores).
Polite or chatty pre-designed utterance, or client’s own customised response.
Change variables like phone numbers or spelling according to region.
Add URLs for more information or to take action (for example, return a product).
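The options above could be sketched in code roughly as follows. This is not the spreadsheet's actual logic or the real client file format: the dictionary layout, the function name `build_client_file`, the sample utterances and the URL are all assumptions for illustration.

```python
import json

# Hypothetical pre-designed utterances; wording and keys are invented.
PREDESIGNED = {
    "returns": {
        "polite": "You can return your item within {days} days. Start here: {url}",
        "chatty": "No worries! Pop it back to us within {days} days 📦 {url}",
    },
    "store_hours": {
        "polite": "Our stores are open {hours}.",
        "chatty": "We're open {hours}, come say hi!",
    },
}


def build_client_file(choices, variables):
    """Turn per-question choices into a dict of final bot utterances."""
    responses = {}
    for question, choice in choices.items():
        if not choice["include"]:
            continue  # e.g. an online-only retailer skips store questions
        # A client's own wording takes priority over the pre-designed tone.
        template = choice.get("custom") or PREDESIGNED[question][choice["tone"]]
        responses[question] = template.format(**variables)
    return responses


choices = {
    "returns": {"include": True, "tone": "chatty"},
    "store_hours": {"include": False},
}
variables = {"days": 28, "url": "https://example.com/returns", "hours": "9am to 5pm"}
client_file = build_client_file(choices, variables)
print(json.dumps(client_file, indent=2, ensure_ascii=False))
```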
New clients using the new system selected 75% of their answers from the pre-designed utterances.
Onboarding time was dramatically reduced from weeks to hours.
Training sets could be re-used, leading to greater efficiency in the code and all clients sharing the benefits of being in the same niche.
Develop a user interface for clients to set up their bot.
Check whether the two tones hypothesis works outside the UK for international expansion.
Continuous research to check that the bot is answering user needs.