
Wednesday, August 22, 2018

Python for Salesforce Administrators - "In Both Tables?" Dummy-Check Script

I was asked how to use Python to "dummy check" that every transaction in a Salesforce log also appeared in a payment processor's log, and vice versa.

The following are a few simple scripts that do just that. We're going to work with our "sample1" & "sample2" files, which have some people who overlap by name and email, and others who don't.
In your case, instead of "name & email," you might want to match on "transaction ID" or on "product ID, customer ID, and timestamp" -- whatever appears in both datasets.

Note that "Albert Howard" is going to show up as "not in the other file" for both files because he has a different email address in each file.

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". Second, make sure you've created a 2nd ".CSV" file like the one described in "Combining Multiple Tables". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)


Code for one big "error log" output file

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True)

uniquedf = mergedf[mergedf['_merge'].str.endswith('_only')]
uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)

Output:

"Id","First","Last","Email","Company","PersonId","FirstName","LastName","Em","FavoriteFood","_merge"
"5829","Jimmy","Buffet","jb@example.com","RCA","","","","","","left_only"
"2894","Shirley","Chisholm","sc@example.com","United States Congress","","","","","","left_only"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers","","","","","","left_only"
"724","Albert","Howard","ah@example.com","Imperial College of Science","","","","","","left_only"
"","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","right_only"
"","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","right_only"
"","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","right_only"

Code for two separate "error log" output files

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True)

uniquedf1 = mergedf[mergedf['_merge']=='left_only'][df1.columns]
uniquedf2 = mergedf[mergedf['_merge']=='right_only'][df2.columns]
uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1)
uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)

Output 1:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Output 2:

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"


The code above only works if you have "Pandas version 0.17" or greater installed.

Pandas is up to version "0.23" as of this blog post, so that's a pretty safe bet.
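If you're not sure which version you have, Python can tell you:

```python
import pandas

# Every pandas install reports its own version string
print(pandas.__version__)
```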

Except that the "no admin rights, even with old installations of Windows" programming environment I walked you through installing is many years old, and it comes with "Pandas version 0.16." So here are two scripts that do the equivalent job in older versions, if you don't feel like upgrading everything.

"Old Pandas" code for one big "error log" output file

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=df1matchfields, right_on=df2matchfields, how='outer')
uniquedf = mergedf[mergedf['SourceDF_x'].isnull() | mergedf['SourceDF_y'].isnull()]

uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)

Output:

"Id","First","Last","Email","Company","SourceDF_x","PersonId","FirstName","LastName","Em","FavoriteFood","SourceDF_y"
"5829","Jimmy","Buffet","jb@example.com","RCA","df1","","","","","",""
"2894","Shirley","Chisholm","sc@example.com","United States Congress","df1","","","","","",""
"30829","Cesar","Chavez","cc@example.com","United Farm Workers","df1","","","","","",""
"724","Albert","Howard","ah@example.com","Imperial College of Science","df1","","","","","",""
"","","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","df2"
"","","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","df2"
"","","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","df2"

"Old Pandas" code for two separate "error log" output files

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
 
df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']
 
mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=df1matchfields, right_on=df2matchfields, how='outer')
 
uniquedf1 = mergedf[mergedf['SourceDF_y'].isnull()][df1.columns]
uniquedf2 = mergedf[mergedf['SourceDF_x'].isnull()][df2.columns]
uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1)
uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)

Output 1:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Output 2:

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"


Monday, August 20, 2018

Python for Salesforce Administrators - A "Combining Multiple Tables" (VLOOKUP) Example with the "Simple Salesforce" plugin

Today I used Python to facilitate a quick UPDATE of 60 Contact records in Salesforce.

In Salesforce, there were 60-ish Contact records where I had instructed an end user to populate a field called "RegSys_External_Id_B__c" with the "ID" that they could see while logged into their registration system.

However, what I really wanted to do was make sure that "RegSys_External_Id_A__c" was populated with the "real external ID" (that my end users couldn't see, but that held everything together in the back-end of their registration system) of each person they had hand-matched.

I asked them to fill "RegSys_External_Id_A__c" in with dummy values -- "XXXX01," "XXXX02," "XXXX03," etc. -- so that I could easily find the records they had already "hand-checked" later.

I then used Python as follows:

First, I used Python's "Simple Salesforce" plugin to log into our org and download the 60 Contact records into Python's "Pandas" plugin. I saved that data into a variable "cstofixdf" (as in "Contacts To Fix DataFrame").

Note that the data that comes back from Simple Salesforce has to have "['records']" appended to it to become something that "Pandas" can read and turn into a "DataFrame" (2-dimensional table). Also, I drop the "attributes" column that comes back with "Simple Salesforce" data because I think it's ugly and takes up unnecessary space on my screen when I preview what my "DataFrame" looks like.

Finally, I printed out all the "RegSys_External_Id_B__c" values as a comma-separated, quote-surrounded list onto my screen so that I could easily copy them to my clipboard.

import pandas
pandas.set_option('expand_frame_repr', False)

from simple_salesforce import Salesforce
sf = Salesforce(username='myemail@example.com', password='mypassword', security_token='mysalesforcesecuritytoken')

cs = sf.query("SELECT Id, RegSys_External_Id_A__c, RegSys_External_Id_B__c FROM Contact WHERE RegSys_External_Id_B__c <> NULL AND RegSys_External_Id_A__c LIKE 'X%'")['records']
cstofixdf = pandas.DataFrame(cs)
cstofixdf.drop('attributes', axis='columns', inplace=True)

print(list(cstofixdf['RegSys_External_Id_B__c']))

My output looked like this -- only with more like 60 items in the list instead of 4:

['8294', '29842', '8482', '2081']

Once I had copied the list between the square-brackets onto my clipboard, I pasted it into the tool by which I query the back-end of my end user's course registration system. It uses normal "SQL," so my code looked like this:

SELECT IdA, IdB
FROM ExternalIdMappingTable
WHERE IdB IN ('8294', '29842', '8482', '2081')

I exported the 60 rows of output from THAT database onto my hard drive at "c:\examples\otherdb.csv".

I then ran the following code -- note that I've "commented out" the "print(list(...))" line of my code with a "#" since I no longer need to print that information to my screen.

import pandas
pandas.set_option('expand_frame_repr', False)

from simple_salesforce import Salesforce
sf = Salesforce(username='myemail@example.com', password='mypassword', security_token='mysalesforcesecuritytoken')

cs = sf.query("SELECT Id, RegSys_External_Id_A__c, RegSys_External_Id_B__c FROM Contact WHERE RegSys_External_Id_B__c <> NULL AND RegSys_External_Id_A__c LIKE 'X%'")['records']
cstofixdf = pandas.DataFrame(cs)
cstofixdf.drop('attributes', axis='columns', inplace=True)

#print(list(cstofixdf['RegSys_External_Id_B__c']))

regdf = pandas.read_csv('c:\\examples\\otherdb.csv', dtype=object)

mergedf = cstofixdf.merge(regdf, how='inner', left_on='RegSys_External_Id_B__c', right_on='IdB')
mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis='columns', inplace=True)
mergedf.rename(columns={'IdA':'RegSys_External_Id_A__c'}, inplace=True)

mergedf.to_csv('c:\\examples\\uploadme.csv', index=0, quoting=1)

The last 5 lines of code are the new part.

First, I imported "otherdb.csv".

Next, I "inner-merged" (like a VLOOKUP, with "inner" specifying that the "ID B" value must show up in both datasets) the two "DataFrames" -- the one I downloaded from Salesforce and the one I just loaded in from CSV -- and saved the output into a new "DataFrame" called "mergedf."
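In miniature, "inner" means any row whose "ID B" is missing from either side simply disappears from the output (the names and values here are made up):

```python
import pandas

# Stand-ins for the Salesforce download and the SQL export
sfdf = pandas.DataFrame({'Id': ['003A', '003B'],
                         'RegSys_External_Id_B__c': ['8294', '9999']})
regdf = pandas.DataFrame({'IdA': ['real-01'], 'IdB': ['8294']})

# "inner" keeps only rows whose key appears in BOTH DataFrames
merged = sfdf.merge(regdf, how='inner', left_on='RegSys_External_Id_B__c', right_on='IdB')
print(len(merged))  # 1 -- the '9999' row had no match in regdf and was dropped
```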

Not shown here, I did a bit of quality-checking after I wrote the line of code that did the merge, before getting rid of "excess" columns. For example, I did:

print(mergedf)

and

print(len(cstofixdf))

and

print(len(regdf))

and

print(len(mergedf))

I double-checked that all the data-sets were the same length (there shouldn't have been any duplicates or missing values), and I hand-skimmed the results of "mergedf" to make sure I hadn't written the ".merge()" code wrong.
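Those eyeball checks can also be written as "assert" statements, so a script stops with an error instead of silently producing a bad file. A sketch with stand-in data (in the real script, the three "DataFrames" already exist; the values here are hypothetical):

```python
import pandas

# Stand-ins for the real frames (hypothetical values)
cstofixdf = pandas.DataFrame({'Id': ['003A', '003B'],
                              'RegSys_External_Id_B__c': ['8294', '29842']})
regdf = pandas.DataFrame({'IdA': ['real-01', 'real-02'], 'IdB': ['8294', '29842']})
mergedf = cstofixdf.merge(regdf, how='inner',
                          left_on='RegSys_External_Id_B__c', right_on='IdB')

# If every row matched exactly once, all three frames are the same length...
assert len(mergedf) == len(cstofixdf) == len(regdf), 'merge changed the row count'
# ...and no Salesforce Id appears twice in the merged output
assert not mergedf['Id'].duplicated().any(), 'duplicate Contact Id after merge'
print('quality checks passed')
```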

After that, I got rid of a few columns I didn't want in my output CSV file: the old "A" & "B" values from Salesforce (I just needed the Salesforce ID), as well as the "B" value from the SQL download.

Then I renamed the SQL-downloaded "IdA" column to "RegSys_External_Id_A__c" to be explicit that this was the Salesforce Contact-table field I intended to update this data into.

Finally, I dumped the "mergedf" "DataFrame" to CSV on my hard drive.

From there, I was just a Data Load away from having proper data in Salesforce!


P.S. You may note that some of my code reads like this:

mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis='columns', inplace=True)

Whereas older examples in my blog read like this:

mergedf = mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis=1)

They do the same thing.

Many Pandas functions produce an altered copy of your "DataFrame" or "Series," rather than altering the actual data you have stored into a given "variable." Therefore, if you actually want to store the "altered" version into that "variable name," you have to explicitly do so. This is actually pretty common in programming -- you'll see things like "x = x + 1" all the time.

The "inplace=True" option on certain Pandas operations is a shorthand that keeps you from having to type all that. Sometimes, it gets buggy on me, and I didn't want your first lessons to fail, so I wrote my introductory examples the "long way."

You may also notice a difference between "axis='columns'" and "axis=1". They do the same thing. "1" is faster to type than "'columns'", but obviously "'columns'" is easier to read when you come back and need to figure out what you did later. Your choice!
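Here's the difference in a throwaway example:

```python
import pandas

df = pandas.DataFrame({'A': [1], 'B': [2]})

# Style 1: .drop() hands back an altered copy; you store it where you like
smaller = df.drop(['B'], axis=1)
print(list(df.columns))       # ['A', 'B'] -- df itself is untouched
print(list(smaller.columns))  # ['A']

# Style 2: inplace=True alters df directly; nothing comes back to store
df.drop(['B'], axis='columns', inplace=True)
print(list(df.columns))       # ['A']
```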

