I was asked how to use Python to "dummy check" that every transaction in a Salesforce log also appeared in a payment processor's log, and vice-versa.
The following are a few simple scripts that do just that. We're going to work with our "sample1" & "sample2" files, which have some people who overlap by name and email, and others who don't.
In your case, instead of "name & email" you might want to match on "transaction ID" or "product ID, customer ID, and timestamp." Whatever appears in both datasets.
Note that "Albert Howard" is going to show up as "not in the other file" for both files because he has a different email address in each file.
(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". Second, make sure you've created a 2nd ".CSV" file like the one described in "Combining Multiple Tables". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)
Code for one big "error log" output file
Code:
import pandas pandas.set_option('expand_frame_repr', False) df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object) df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object) df1matchfields=['First','Last','Email'] df2matchfields=['FirstName','LastName','Em'] mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True) uniquedf = mergedf[mergedf['_merge'].str.endswith('_only')] uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)
Output:
"Id","First","Last","Email","Company","PersonId","FirstName","LastName","Em","FavoriteFood","_merge" "5829","Jimmy","Buffet","jb@example.com","RCA","","","","","","left_only" "2894","Shirley","Chisholm","sc@example.com","United States Congress","","","","","","left_only" "30829","Cesar","Chavez","cc@example.com","United Farm Workers","","","","","","left_only" "724","Albert","Howard","ah@example.com","Imperial College of Science","","","","","","left_only" "","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","right_only" "","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","right_only" "","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","right_only"
Code for two separate "error log" output files
Code:
import pandas pandas.set_option('expand_frame_repr', False) df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object) df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object) df1matchfields=['First','Last','Email'] df2matchfields=['FirstName','LastName','Em'] mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True) uniquedf1 = mergedf[mergedf['_merge']=='left_only'][df1.columns] uniquedf2 = mergedf[mergedf['_merge']=='right_only'][df2.columns] uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1) uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)
Output 1:
"Id","First","Last","Email","Company" "5829","Jimmy","Buffet","jb@example.com","RCA" "2894","Shirley","Chisholm","sc@example.com","United States Congress" "30829","Cesar","Chavez","cc@example.com","United Farm Workers" "724","Albert","Howard","ah@example.com","Imperial College of Science"
Output 2:
"PersonId","FirstName","LastName","Em","FavoriteFood" "983mv","Shirley","Temple","st@example.com","Lollipops" "k28fo","Donald","Duck","dd@example.com","Pancakes" "8xi","Albert","Howard","ahotherem@example.com","Potatoes"
The code above only works if you have "Pandas version 0.17" or greater installed.
Pandas is up to version "0.23" as of this blog post, so that's a pretty safe bet.
Except that the "no admin rights, even with old installations of Windows" programming environment I walked you through installing is many years old, and it comes with "Pandas version 0.16." So here are two scripts that do the equivalent job in older versions, if you don't feel like upgrading everything.
"Old Pandas" code for one big "error log" output file
Code:
import pandas pandas.set_option('expand_frame_repr', False) df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object) df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object) df1matchfields=['First','Last','Email'] df2matchfields=['FirstName','LastName','Em'] mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=['First','Last','Email'], right_on=['FirstName','LastName','Em'], how='outer') uniquedf = mergedf[mergedf['SourceDF_x'].isnull() | mergedf['SourceDF_y'].isnull()] uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)
Output:
"Id","First","Last","Email","Company","SourceDF_x","PersonId","FirstName","LastName","Em","FavoriteFood","SourceDF_y" "5829","Jimmy","Buffet","jb@example.com","RCA","df1","","","","","","" "2894","Shirley","Chisholm","sc@example.com","United States Congress","df1","","","","","","" "30829","Cesar","Chavez","cc@example.com","United Farm Workers","df1","","","","","","" "724","Albert","Howard","ah@example.com","Imperial College of Science","df1","","","","","","" "","","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","df2" "","","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","df2" "","","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","df2"
"Old Pandas" code for two separate "error log" output files
Code:
import pandas pandas.set_option('expand_frame_repr', False) df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object) df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object) df1matchfields=['First','Last','Email'] df2matchfields=['FirstName','LastName','Em'] mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=['First','Last','Email'], right_on=['FirstName','LastName','Em'], how='outer') uniquedf1 = mergedf[mergedf['SourceDF_y'].isnull()][df1.columns] uniquedf2 = mergedf[mergedf['SourceDF_x'].isnull()][df2.columns] uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1) uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)
Output 1:
"Id","First","Last","Email","Company" "5829","Jimmy","Buffet","jb@example.com","RCA" "2894","Shirley","Chisholm","sc@example.com","United States Congress" "30829","Cesar","Chavez","cc@example.com","United Farm Workers" "724","Albert","Howard","ah@example.com","Imperial College of Science"
Output 2:
"PersonId","FirstName","LastName","Em","FavoriteFood" "983mv","Shirley","Temple","st@example.com","Lollipops" "k28fo","Donald","Duck","dd@example.com","Pancakes" "8xi","Albert","Howard","ahotherem@example.com","Potatoes"
Table of Contents
- Introduction & Table of Contents
- Before You Begin Programming
- Installing Python 3 on Windows
- First Scripts - Import/Export CSVs, Filter Out Rows
- Rename/Drop/Add Columns
- Recognizing Pandas DataFrames vs. Series (Important!)
- Fancier Row-Filtering and Data-Editing
- Combining Multiple Tables
- Filtering Rows By Maximum Date Per Group
- Introduction to XML and JSON