Use AI to Cleanse Salesforce Job Title Data in Treasure Data

Bring up the word “AI” and people think of science fiction, of extraordinary, near-magical feats of advanced data processing. But some of the most remarkable displays of AI power in marketing and business operations come from doing seemingly mundane things. An example came up recently when we used Treasure Data to analyze customer data in Salesforce.

When we humans read a list of job titles, we have no problem understanding that “Vice President of Product Marketing,” “VP Pct. Mktg.” and “Prod. Mkt VP” are the same. But if we want a computer to treat these different phrases as the same, we first have to put them through a process called normalization. This turns out to be an extremely challenging real-world problem, and one where Treasure Machine Learning (ML) delivers great results. Let’s take a look at how it works.

A Job Title Jumble

Take a look at this customer record from Salesforce:

 John Doe Company X US 000 MAIN ST SEATTLE WA 99999 VP Marketing (555)555-5555

Notice that most of the attributes except title follow a more or less standardized format. Analyzing such formatted data is relatively easy for both humans and machines. Title, however, is completely human-generated; its value can take countless forms depending on who entered it:

  • VP of Marketing
  • Eng. Mng.
  • Marketing Manager
  • Software Engineer and Entrepreneur
  • Founder and CTO
  • Chief Technology Officer

This is bad. Why? Because the machine needs to understand job titles in order to, for instance, find similar contacts and create customer segments. How can the computer tell which of these pairs mean the same thing and which do not:

  • “VP of Marketing” vs. “Marketing Manager”
  • “Eng.” vs. “Engineer”
  • “Mng.” vs. “Manager”
  • “Engineer” vs. “Engineering”
  • “CTO” vs. “Chief Technology Officer”
  • …?

Handling this text information is not trivial for AI. Thankfully, we have a powerful tool to help us out: Treasure Machine Learning.

AI solves a wide variety of real-world problems, but there’s often a lot of tedious preprocessing that needs to be done first. Treasure Data provides a slew of end-to-end solution templates that take the sting out of gathering and preprocessing your data so you can get to the fun stuff, fast.

From Preprocessing to Prediction: Treasure Workflow

Since data is continuously generated as time passes, AI-based analytic systems need to be automated. This is another great strength of Treasure Data: Treasure Workflow, our state-of-the-art data automation system, enables you to easily create and schedule end-to-end data processing flows, from preprocessing to prediction.

Here’s what our SFDC contact data analysis workflow looks like:

  1. Import SFDC raw data to Treasure Data
  2. Extract partial data into auxiliary tables
  3. Transform job titles into AI-friendly format
  4. Integrate transformed job titles with other attributes
  5. Launch AI in our Treasure ML service, and identify potential customers for the sales team

Let’s unpack step #3 and find out how we transformed those unwieldy job titles.

Job Title Transformation at Treasure Data

Our goal is to find role and job categories for each job title. More concretely, here’s what we want the input and output to look like:


 id   title
 1    VP of Marketing
 2    Eng. Mng.
 3    Marketing Manager
 4    Software Engineer and Entrepreneur
 5    Founder and CTO
 6    Chief Technology Officer
 …    …

 id   category: role   category: job
 1    executive        marketing
 2    manager          engineering
 3    manager          marketing
 4    entrepreneur     engineering
 5    executive        engineering
 6    executive        engineering
 …    …                …

Fortunately, the number of possible categories is limited. That means we can readily incorporate the title categories into further analysis.

In order to map job titles to the categories, our workflow undergoes two sub-steps.

Step 1: Normalize and categorize job titles

The first thing our workflow needs to do is to create a title-category mapping table. Here are the steps:

  1. Normalize the title texts based on a predefined set of rules
    • Remove meaningless words/characters
      • of, and, &
    • Expand clipped words
      • Mng., CEO, VP, … to Manager, Chief Executive Officer, Vice President, …
    • Convert to lower case
      • Manager and manager should be treated exactly the same way
  2. Categorize the normalized titles
    • role: all Chief XXX Officers should be in the same role category “executive”
    • job: both VP of Engineering and Software Engineer are in the same job category “engineering”
  3. Create an intermediate title-category mapping table
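
The three sub-steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the actual Treasure Data implementation: the rule tables (STOP_WORDS, ABBREVIATIONS, ROLE_RULES, JOB_RULES) and the function names are hypothetical, and the rules were chosen only so that the six sample titles shown earlier land in the categories listed for them.

```python
import re

# Hypothetical rule tables; the real rule set is not given in the article.
STOP_WORDS = {"of", "and", "&"}
ABBREVIATIONS = {
    "mng": "manager",
    "eng": "engineering",
    "vp": "vice president",
    "ceo": "chief executive officer",
    "cto": "chief technology officer",
}

# Ordered (pattern, category) pairs; the first matching pattern wins.
ROLE_RULES = [
    (r"\bchief \w+ officer\b|\bvice president\b", "executive"),
    (r"\bmanager\b", "manager"),
    (r"\bentrepreneur\b", "entrepreneur"),
]
JOB_RULES = [
    (r"\bengineer(ing)?\b|\btechnology\b", "engineering"),
    (r"\bmarketing\b", "marketing"),
]


def normalize_title(title):
    """Lower-case, drop filler words, and expand clipped words."""
    tokens = []
    for token in title.lower().split():
        token = token.strip(".,")  # "Mng." -> "mng"
        if not token or token in STOP_WORDS:
            continue
        tokens.append(ABBREVIATIONS.get(token, token))
    return " ".join(tokens)


def first_match(rules, text):
    """Return the category of the first rule whose pattern matches, else None."""
    for pattern, category in rules:
        if re.search(pattern, text):
            return category
    return None


def categorize(normalized_title):
    """Return a (role, job) pair for an already normalized title."""
    return (
        first_match(ROLE_RULES, normalized_title),
        first_match(JOB_RULES, normalized_title),
    )


def build_mapping(titles):
    """Build the intermediate table: original title -> (role, job)."""
    return {title: categorize(normalize_title(title)) for title in titles}
```

For instance, normalize_title("Eng. Mng.") yields "engineering manager", which categorize then maps to the role "manager" and the job "engineering". Titles that match no rule get None for that category, so they are easy to spot and handle separately.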

Here’s what the mapping looks like:

 (original) title      category: role   category: job
 CEO                   executive        marketing
 VP of Engineering     manager          engineering
 Software Engineer     manager          marketing
 Sales Manager         entrepreneur     engineering
 Co-founder and COO    executive        engineering
 CEO                   executive        engineering
 …                     …                …

Step 2: Estimate Category of New Contacts’ Job Title

Once the title-category mapping table has been created, our workflow tries to estimate the category of each new contact’s title. This step measures how different two texts are using a well-known technique called Levenshtein distance[1] (isn’t that fun to say?): the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.

For example, if a new contact’s title is “Software Architect”, our AI looks up the mapping table and picks the single entry with the most similar title, e.g. “Software Engineer.” As a result, we can directly use the categories associated with “Software Engineer” as the estimated categories of the new title “Software Architect.”
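
To make the lookup concrete, here is a small Python sketch of a nearest-title search by Levenshtein distance. It is an illustration only: the function names, the sample mapping, and in particular the role assigned to “Software Engineer” are placeholders invented for the example, and the production workflow may compute the distance differently.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def estimate_categories(new_title, mapping):
    """Return the most similar known title and its categories."""
    nearest = min(
        mapping,
        key=lambda known: levenshtein(new_title.lower(), known.lower()),
    )
    return nearest, mapping[nearest]


# Hypothetical mapping table; the category values are placeholders.
mapping = {
    "Software Engineer": ("individual contributor", "engineering"),
    "Sales Manager": ("manager", "sales"),
    "VP of Engineering": ("executive", "engineering"),
}

nearest, (role, job) = estimate_categories("Software Architect", mapping)
print(nearest, role, job)
```

With this sample table, “Software Architect” is matched to “Software Engineer” because the two share the long prefix “Software ”, so far fewer edits are needed than for any other entry. Note that a nearest-neighbor lookup always returns some entry, so in practice one might also want a maximum distance beyond which a title is left uncategorized.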

Neat, huh?

What’s next?

Preprocessing (i.e., job title transformation) is one of the most challenging parts of our data analytics workflow, but now that it’s done, it opens the way for all sorts of tantalizing possibilities:

  • Create micro-segments of target accounts for account-based marketing (ABM)
  • Traverse the decision hierarchy and find the right influencers to accelerate your sales cycle
  • Reverse-engineer attribution by job title in order to find out what content is most relevant to your most important customers

All of this is made possible because the Treasure Workflow and Treasure Machine Learning services can transform complex human-generated data into a simple, easy-to-understand, AI-friendly format.

If you are interested in the technical details of this article or AI-based real-world data analysis, or if you are interested in collaborating to improve and utilize our SFDC analytic system, please give us a shout!

Takuya Kitazawa
Takuya is a Data Science Engineer who works on Research and Development at Treasure Data.