Pages

Tuesday, October 23, 2018

Python for CSV Files - Deduplicating "Some States, But Not Others"

(Version française en dessous)

I encountered a lovely "Excel to Python" business problem this morning. Many thanks to my colleague for explaining it to me, and apologies for snooping.

He had a spreadsheet containing 20,000 American high schools (column B), what state they were in (column C), and which admissions counselor was assigned to the school (column A).

His university assigns individual admissions counselors to schools within the states of Delaware (DE) and Maryland (MD), but elsewhere, each counselor covers an entire state. Nevertheless, this particular spreadsheet had every single high school listed out, with its state and counselor, no matter where it was located.

What my colleague wanted to do was reduce the size of the spreadsheet so that Delaware and Maryland schools would be listed individually, but so that the rest of the spreadsheet would simply contain a list of the remaining 48 U.S. states and the counselors assigned to them.

The input CSV file looked like this:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Mo,Denver HS,CO
Ana,Howard HS,DE
Mo,Boulder HS,CO
Mo,Boise HS,ID
Leia,Fairfax HS,VA
Mo,Excel Academy,ID
Mo,Mann HS,ID
Leia,Roanoke HS,VA

He wanted the output CSV file to look like this:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Ana,Howard HS,DE
Mo,STATE-CO,CO
Mo,STATE-ID,ID
Leia,STATE-VA,VA

Because we know exactly which states are treated "one way" versus "another way," this is a very straightforward "split the dataset horizontally, do something to one part of the split, and concatenate it back together" operation.

In fact, we don't even really need Python for this. In Excel, in 2 minutes, we could:

  1. Filter the spreadsheet to "everything but DE & MD"
  2. Extract those rows to a new tab of the spreadsheet
  3. Do "Remove Duplicates" on column C
    • (or, if we wanted to dummy-check for accidental territory splits in the "other 48" states, on columns A & C)
  4. Alter the contents of column B to be a concatenation of the word "STATE-" and the contents of column C
  5. Un-filter our original (now shorter) worksheet
  6. Cut-paste our "temporary" worksheet back onto the end of it

But if we did want to use Python, the script would be pretty straightforward. It would imitate those same steps in Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
dfdemd = df[df['State'].isin(['DE','MD'])]
dfoth48 = df[~df['State'].isin(['DE','MD'])]
dfoth48 = dfoth48.drop_duplicates(subset=['State'], keep='first')
dfoth48['School'] = 'STATE-'+dfoth48['State']
df = pandas.concat([dfdemd,dfoth48])
df.to_csv('c:\\example\\FilteredHighSchools.csv', index=0, quoting=1)

Note that if this were an Excel XLSX file, we would use read_excel and to_excel instead (eliminating the ", quoting=1" option).


Where the problem gets really interesting is that we can let the computer tell us which states have "just 1 counselor for the whole state" and which states "should be listed school-by-school, because they seem to be split."

In Excel, we would:

  1. Add a "column D" counting the number of unique "column A" values per value in "column C."
    • (There are various tips and tricks available on the internet to accomplish this in Excel.)
  2. Then we would do the same "split, alter part of the data, and re-concatenate" trick we did earlier, only we would "split" based on "column D" instead of based on "column C" (looking for "1" vs. "2 & up").

Again, in Python, we'll imitate our steps in Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
df['Counts'] = df['State'].map(df[['Counselor','State']].drop_duplicates(keep='first')['State'].value_counts())
df_mult_couns = df[df['Counts'] > 1]
df_1_couns = df[df['Counts'] == 1]
df_1_couns = df_1_couns.drop_duplicates(subset=['State'], keep='first')
df_1_couns['School'] = 'STATE-'+df_1_couns['State']
df = pandas.concat([df_mult_couns,df_1_couns])
df = df.drop(['Counts'], axis=1)
df.to_csv('c:\\example\\SmartFilteredHighSchools.csv', index=0, quoting=1)

This script is a lot like the previous one, only I renamed "dfdemd" to "df_mult_couns" because we are presuming that we don't yet know it's Delaware and Maryland that have multiple counselors per state. Similarly, I renamed "dfoth48" to "df_1_couns."

The new lines of code are the 3rd line and the 2nd-to-last line. Also, the 4th & 5th lines have been altered to filter on "Counts" instead of filtering on "State." Still, once we have split our data, we no longer need the "Counts" column (we could have "dropped" it from each fragment of our dataset here instead of from the re-concatenated dataset right before exporting to CSV), and we do our "dropping of duplicates" in the "1-counselor-per-state" states as usual, by "State."

Line 3 is our most complicated line of code. Let's work from the middle, starting inside the ".map()" command.

  1. "df[['Counselor','State']]" produces a new DataFrame that is a copy of our original DataFrame, but with data from only those 2 columns.
  2. Appending "drop_duplicates(keep='first')" to it produces a new DataFrame that is a copy of it, only using "Counselor" and "State" together (that is, all 2 columns of our input DataFrame) as a matching key.
  3. Appending "['State'].value_counts()" to that "deduplicated" DataFrame produces a new "Series" (fancy list-like thing) where the "item IDs" are the states from our data set, and the values are the number of unique "Counselor" values for that "State" value.

Now that we know what's inside the ".map()" command (a "Series" of states and their "unique Counselor value counts"), let's take a look at Line 3 of our code, starting after the "=."

  1. "df['State'].map(ourSeriesOfStatesWithCounselorCounts)" produces a new "Series," with the same set of "item IDs" as the "row IDs" of our original DataFrame we imported from CSV, and with values populated by numbers indicating the number of "distinct counselor values" for the "State" found in that row of our data.

Its output looks like this:

0     2
1     2
2     2
3     2
4     1
5     2
6     1
7     1
8     1
9     1
10    1
11    1
Name: State, dtype: int64
  1. Finally, all we do by putting this to the right of "df['Counts'] = " is to add a new "Counts" column to our original DataFrame and fill it in with these numbers.

Voila! We've added "column D!"

It was a little fancy getting there, but save this script to your computer -- or bookmark this blog post -- and you can copy/paste/tweak it any time it seems useful, whether or not you understand it enough to write it from scratch. (Don't worry -- I had to Google it and copy/paste/modify other people's code myself. 😊)



🇫🇷 - en français

Aujourd'hui j'ai rencontré un beau problème Excel bien adapté à résoudre avec Python.

Un collègue travaillait sur une feuille de calcul contenant 20 000 lycées américains (colonne B), l'état des E-U où chacun se trouve (colonne C) et le nom du recruteur qui se charge de l'école (colonne A).

L'université où travaille mon collègue divise la charge de travail du Maryland ("MD") et du Delaware ("DE"). Aux autres 48 états, un seul recruteur gère tous les lycées de l'état.

Mon collègue voulait réduire la taille de cette feuille de calcul. Il ne voulait changer aucun enregistrement du Delaware ni du Maryland. Pour les 48 autres états, il voulait éliminer la liste des écoles et garder uniquement 48 enregistrements, contenant chaque état et le recruteur qui s'en charge.

Sa feuille de calcul ressemblait à celle-ci:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Mo,Denver HS,CO
Ana,Howard HS,DE
Mo,Boulder HS,CO
Mo,Boise HS,ID
Leia,Fairfax HS,VA
Mo,Excel Academy,ID
Mo,Mann HS,ID
Leia,Roanoke HS,VA

Il souhaitait une feulle de calcul de sortie comme celle-ci:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Ana,Howard HS,DE
Mo,STATE-CO,CO
Mo,STATE-ID,ID
Leia,STATE-VA,VA

Comme on sait déjà comment diviser la feuille pour traiter les données de telle ou telle manière, le travail est très simple. Cela consiste à diviser les données en "DE + MD" et "ailleurs", à dédupliquer la partie "ailleurs", et à concaténer les deux divisions.

En fait, on n'a pas vraiment besoin de Python. C'est un travail de 2 minutes avec Excel.

  1. filtrer la feuille de calcul pour extraire les enregistrements "ailleurs"
  2. extraire ces lignes dans un nouvel onglet
  3. dédupliquer sur la colonne "C"
    • (ou, pour vérifier qu'il n'y ait pas d'erreurs de saisie de données, sur "A" et "C" ensemble, cherchant plus que 2 enregistrements par état)
  4. modifier le contenu de la colonne "B" pour qu'il soit une concaténation du mot "STATE-" et du contenu de la colonne "C"
  5. supprimer le filtre de notre onglet de feuille de calcul d'origine (où rien ne reste que les données "Maryland" et "Delaware")
  6. couper-coller les contenus du nouvel onglet à la fin de l'originale

Mais le code Python ne serait pas trop compliqué. Le script imiterait nos étapes Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
dfdemd = df[df['State'].isin(['DE','MD'])]
dfoth48 = df[~df['State'].isin(['DE','MD'])]
dfoth48 = dfoth48.drop_duplicates(subset=['State'], keep='first')
dfoth48['School'] = 'STATE-'+dfoth48['State']
df = pandas.concat([dfdemd,dfoth48])
df.to_csv('c:\\example\\FilteredHighSchools.csv', index=0, quoting=1)

(Pour un fichier XLSX et non CSV, on se servirait des commandes "read_excel" et "to_excel" et on supprimerait le code ", quoting=1" de la fin du script.)


Mais notre problème devient vraiment intéressant car nous pouvons laisser Python calculer quels états n'ont qu'un seul recruteur et quels états sont la charge de plusieurs recruteurs.

Avec Excel, il s'agit de:

  1. Ajouter une "colonne D" qui compte le nombre de valeurs uniques du "colonne A" pour chaque valeur dans "colonne C."
    • (Il y a des astuces en ligne pour y arriver.)
  2. Répéter notre opération de "scinder, modifier une partie, concaténer" de plus tôt. Sauf que nous divisons par la colonne "D" ("1" et "2 ou plus").

Encore une fois, le code Python va imiter nos étapes Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
df['Counts'] = df['State'].map(df[['Counselor','State']].drop_duplicates(keep='first')['State'].value_counts())
df_mult_couns = df[df['Counts'] > 1]
df_1_couns = df[df['Counts'] == 1]
df_1_couns = df_1_couns.drop_duplicates(subset=['State'], keep='first')
df_1_couns['School'] = 'STATE-'+df_1_couns['State']
df = pandas.concat([df_mult_couns,df_1_couns])
df = df.drop(['Counts'], axis=1)
df.to_csv('c:\\example\\SmartFilteredHighSchools.csv', index=0, quoting=1)

Ce script ressemble beaucoup au précédent. J'ai renommé "dfdemd" en "df_mult_couns" car dans ce cas, nous ne savons pas encore que "Delaware" et "Maryland" auront plusieurs recruteurs par état. De même, j'ai renommé "dfoth48" en "df_1_couns".

Les nouvelles lignes de code sont la 3ème ligne et l'avant-dernière ligne.

En outre, les lignes 4 et 5 ont été modifiées pour filtrer sur "Counts" au lieu de filtrer sur "State".

Apres les lignes 4 et 5, nous n'avons plus besoin de la colonne "Counts". (Nous aurions pu le supprimer de chaque scinde après les lignes 4 et 5, mais j'ai décidé plutôt de supprimer cette colonne des données reconcaténées dans l'avant-dernière ligne).

On fait toujours notre déduplication principale par "State."

La ligne 3 du code est la ligne la plus compliquée. Inspectons-la en commençant au milieu, à l'intérieur de la commande ".map()".

  1. "df[['Counselor','State']]" génère un nouveau "DataFrame" qui est une copie de notre DataFrame d'origine, mais avec des données provenant uniquement de ces 2 colonnes.
  2. L'ajout de ".drop_duplicates(keep = 'first')" le déduplique par recruteur et état, pour générer encore un nouveau "DataFrame".
  3. L'ajout de "['State'].value_counts()" à ce DataFrame "dédupliqué" génère une nouvelle "Série" (élément de type liste ... amélioré ... plutôt tableau associatif) où les clés sont nos états et les valeurs sont le nombre de recruteurs uniques pour chaque état.

Maintenant que nous comprenons l'intérieur de ".map()" ("Série" avec les états comme clé), examinons la partie de la ligne 3 qui suit le "=".

  1. "df['State'].map(notre_Tableau_Associatif)" produit une nouvelle "Série" avec le même ensemble de clés que sont les clés des enregistrements de notre "DataFrame" d'origine (de notre fichier CSV) -- c'est-à-dire les numéros qui identifie les lignes du fichier. Les valeurs de cette nouvelle "Série" sont les nombres de recruteurs qui correspondent à l'état trouvé sous "State" pour chaque enregistrement.

La sortie de cette commande ressemble à:

0     2
1     2
2     2
3     2
4     1
5     2
6     1
7     1
8     1
9     1
10    1
11    1
Name: State, dtype: int64
  1. Finalement, "df['Counts'] = " ne fait qu'ajouter cette liste nombres de recruteurs au DataFrame d'origine comme nouvelle colonne (nommée "Counts").

Et voilà! On a ajouté la "colonne D" ! Après, ce n'est qu'une question de finir les filtres.

Comme j'ai dit, la ligne 3 était peut-être un peu compliqué, mais enregistrez ce script dans un fichier sur votre ordinateur -- ou ajoutez cet article à vos favoris -- et vous pouvez le copier-coller-modifier au moment qu'il vous convient (que vous le compreniez bien ou non). (Ne vous inquiétez pas, moi aussi j'ai Googlé pour écrire cet article ! 😊)



"Python for Salesforce Administrators" Table of Contents

Thursday, October 18, 2018

Salesforce Custom Metadata vs. Custom Objects: Where Should I Put My Configuration and Validation Settings?

(Version française en dessous)

Yesterday, a colleague asked me the difference between "custom metadata" and "custom objects" for storing "configuration and validation" information in Salesforce.

My answer is:

  • Custom metadata if you can. Salesforce says so.
    And it survives sandbox refreshes!
  • Data tables (custom objects) if you truly need it to be part of your data
    (e.g. people will want to include it, hyperlinked to other data, in reports).

When I wrote an Apex trigger that determined which User should "own" each "Admissions Application" in our org, I ended up splitting the configuration data in two.

Here's why:

Custom Metadata

Used for a table that helps Apex code ask, "Who does this?"

  • key: the code for a graduate degree we offer
  • value: a username responsible for recruiting people to that graduate degree

Data (Custom Objects)

Used for a list of U.S. states, their full spellings, and fields with "lookup" links to the User table indicating who does what work for that state.

  • There was strong demand to generate native Salesforce reports per User showing all the states they're responsible for and the high schools in those states. It made sense to ensure that the "high school" table could have a "lookup" field to this "states" table.
  • Custom metadata can have "master-detail" and "lookup" relational links to other custom metadata, but it can't link to ordinary data.
    • This meant we needed to store the the "states" as data (custom objects), even though we would also be using it as configuration information for Apex triggers.

UI & Editability Considerations

I'll let you in on a dirty little secret about another reason I used data tables ("custom objects") for most of the Undergraduate Admissions counselor assignment configuration data.

Undergraduate Admissions tweaks their "recruiter assignment" business rules several times a year. The real-world configuration data for their business is a lot more complex than a simple "list of U.S. states."

I'll be honest: Salesforce's user interfaces for hand-editing, viewing, and bulk-editing data are a lot more end-user-friendly than their user interfaces for the same operations on custom metadata, and setting granular "edit" permissions is a lot more sysadmin-friendly for data. I wanted to make sure end users, not sysadmins, were the ones whose time was spent tweaking the configuration several times a year!

I was thoroughly scolded at Dreamforce. Actually, I stand by my decision to use "data" tables, because there truly is a business need to report on the configuration data alongside normal data. But ... building your own user interfaces (typically using Lightning Components) to help end users edit custom metadata was a big theme. You have been warned:

  1. Dan Appleman's "Build Awesome Configuration Pages with Lightning Components & Custom Metadata"
  2. Gustavo Melendez & Krystian Charubin's "Crafting Flexible APIs in Apex Using Custom Metadata"
  3. Beth Breisness & Randi Wilson's "Create Guided User Experiences for Managing ISV Custom Metadata"


🇫🇷 - en français

Une collègue, peu familière avec Salesforce, m'a demandée quelle était la différence entre « métadonnées personnalisées » et « objets personnalisés » pour stocker des données de configuration et de validation dans une organisation Salesforce.

J'ai répondu:

  • Selon Salesforce, s'il est possible, utilisez des métadonnées personnalisées.
    (Elles survivent l'actualisation d'un environnement sandbox.)

  • Stockez vos informations aux tables de bases de données (objets personnalisés) s'il faut créer des relations aux autres objets de votre organisation.
    (Par exemple, s'il y a des utilisateurs qui vont générer des rapports sur ces données.)

Mon approche

Lorsque j'ai écrit un déclencheur Apex pour attribuer les enregistrements de notre objet « demande d'admission » aux utilisateurs corrects, j'ai fini par utiliser les deux approches:

Métadonnées personnalisées

J'ai choisi une type de métadonnées personnalisées pour stocker des données simples concernant « qui gère chaque spécialisation ? »

  • clé: le code qui indique un diplôme d'études supérieures que l'université offre
  • valeur: un nom d'utilisateur chargé de recruter des étudiants au diplôme

Objets personnalisés

J'ai choisi un objet personalisé pour stocker une liste d'états américains (abréviation et nom), avec quelques champs de référence indiquant les utilisateurs qui y gèrent divers aspects du recrutement d'étudiants en license.

  • Nos utilisateurs ont besoin de générer des rapports, regroupés par utilisateur, avec les états qu'ils gèrent et les lycées qui y sont situés. Donc il fallait pouvoir créer une relation de l'objet « lycée » à l'objet « état ».
  • On peut créer des relations entre métadonnées personnalisées, mais pas entre les types de métadonnées personnalisées et les objets personalisés.
    • Donc j'ai choisi un modèle « objet personnalisé » pour stocker les informations à propos des états, même si ces informations servent aussi comme données de configuration pour un déclencheur Apex.

Considérations au sujet de l'interface utilisateur et au contrôle d'accès

Je vais vous révéler un secret : ce n'est pas qu‘à propos des besoins relationnels que j'ai choisi le modèle « objet personnalisé » pour les informations à propos de la gestion du recrutement en licence.

Le service admission en license changent les règles de « qui gère quoi » plusieurs fois par an. Et les données de configuration sont, en réalité, beaucoup plus complexes qu'une seule liste d'états.

Actuellement, les interfaces utilisateur de Salesforce pour voir et enregistrer des données aux objets personnalisés sont supérieures à celles pour les métadonnées personnalisées – surtout si l'on est utilisateur ordinaire. Le contrôle d'accès, aussi, est plus facile à gérer pour les administrateurs. Je voulais m'assurer que ça soit des utilisateurs ordinaires qui font plusieurs fois par an cette saisie de données, et non pas les administrateurs !

On m'a fait honte de cette position à Dreamforce. Ben, je maintiens ma choix d'objet personnalisé pour raisons de relations entre objets. Mais … on a beaucoup répété qu'il est de bonne pratique de construire ses propres interfaces utilisateur pour autoriser des utilisateurs ordinaires à modifier les métadonnées personnalisées en toute sécurité. Je vous aurai prévenus !

Vidéos pertinentes de Dreamforce

  1. « Build Awesome Configuration Pages with Lightning Components & Custom Metadata » de Dan Appleman
  2. « Crafting Flexible APIs in Apex Using Custom Metadata » de Gustavo Melendez & Krystian Charubin
  3. « Create Guided User Experiences for Managing ISV Custom Metadata » de Beth Breisness & Randi Wilson

Wednesday, August 22, 2018

Python for Salesforce Administrators - "In Both Tables?" Dummy-Check Script

I was asked how to use Python to "dummy check" that every transaction in a Salesforce log also appeared in a payment processor's log, and vice-versa.

The following are a few simple scripts that do just that. We're going to work with our "sample1" & "sample2" files, which have some people who overlap by name and email, and others who don't.
In your case, instead of "name & email" you might want to match on "transaction ID" or "product ID, customer ID, and timestamp." Whatever appears in both datasets.

Note that "Albert Howard" is going to show up as "not in the other file" for both files because he has a different email address in each file.

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". Second, make sure you've created a 2nd ".CSV" file like the one described in "Combining Multiple Tables". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)


Code for one big "error log" output file

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True)

uniquedf = mergedf[mergedf['_merge'].str.endswith('_only')]
uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)

Output:

"Id","First","Last","Email","Company","PersonId","FirstName","LastName","Em","FavoriteFood","_merge"
"5829","Jimmy","Buffet","jb@example.com","RCA","","","","","","left_only"
"2894","Shirley","Chisholm","sc@example.com","United States Congress","","","","","","left_only"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers","","","","","","left_only"
"724","Albert","Howard","ah@example.com","Imperial College of Science","","","","","","left_only"
"","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","right_only"
"","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","right_only"
"","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","right_only"

Code for two separate "error log" output files

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True)

uniquedf1 = mergedf[mergedf['_merge']=='left_only'][df1.columns]
uniquedf2 = mergedf[mergedf['_merge']=='right_only'][df2.columns]
uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1)
uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)

Output 1:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Output 2:

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"


The code above only works if you have "Pandas version 0.17" or greater installed.

Pandas is up to version "0.23" as of this blog post, so that's a pretty safe bet.

Except that the "no admin rights, even with old installations of Windows" programming environment I walked you through installing is many years old, and it comes with "Pandas version 0.16." So here are two scripts that do the equivalent job in older versions, if you don't feel like upgrading everything.

"Old Pandas" code for one big "error log" output file

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=['First','Last','Email'], right_on=['FirstName','LastName','Em'], how='outer')
uniquedf = mergedf[mergedf['SourceDF_x'].isnull() | mergedf['SourceDF_y'].isnull()]

uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)

Output:

"Id","First","Last","Email","Company","SourceDF_x","PersonId","FirstName","LastName","Em","FavoriteFood","SourceDF_y"
"5829","Jimmy","Buffet","jb@example.com","RCA","df1","","","","","",""
"2894","Shirley","Chisholm","sc@example.com","United States Congress","df1","","","","","",""
"30829","Cesar","Chavez","cc@example.com","United Farm Workers","df1","","","","","",""
"724","Albert","Howard","ah@example.com","Imperial College of Science","df1","","","","","",""
"","","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","df2"
"","","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","df2"
"","","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","df2"

"Old Pandas" code for two separate "error log" output files

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
 
df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']
 
mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=['First','Last','Email'], right_on=['FirstName','LastName','Em'], how='outer')
 
uniquedf1 = mergedf[mergedf['SourceDF_y'].isnull()][df1.columns]
uniquedf2 = mergedf[mergedf['SourceDF_x'].isnull()][df2.columns]
uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1)
uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)

Output 1:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Output 2:

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"

Table of Contents

Monday, August 20, 2018

Python for Salesforce Administrators - A "Combining Multiple Tables" (VLOOKUP) Example with the "Simple Salesforce" plugin

Today I used Python to facilitate a quick UPDATE of 60 Contact records in Salesforce.

In Salesforce, there were 60-ish Contact records where I had instructed an end user to populate a field called "RegSys_External_Id_B__c" with the "ID" that they could see while logged into their registration system.

However, what I really wanted to do was make sure that "RegSys_External_Id_A__c" was populated with the "real external ID" (that my end users couldn't see, but that held everything together in the back-end of their registration system) of each person they had hand-matched.

I asked them to simply fill it in with dummy values, "XXXX01," "XXXX02," "XXXX03," etc. so that I could easily find the records they had already "hand-checked" later.

I then used Python as follows:

First, I used Python's "Simple Salesforce" plugin to log into our org and download the 60 Contact records into Python's "Pandas" plugin. I saved that data into a variable "cstofixdf" (as in "Contacts To Fix DataFrame"). Note that the data that comes back from Simple Salesforce has to have "['records']" appended to it to become something that "Pandas" can read and turn into a "DataFrame" (2-dimensional table). Also, I drop the "attributes" column that comes back with "Simple Salesforce" data because I think it's ugly and takes up unnecessary space on my screen when I preview what my "DataFrame" looks like. Anyway, I also printed out all the "RegSys_External_Id_B__c" values as a comma-separated, quote-surrounded list onto my screen so that I could easily copy them to my clipboard.

import pandas
pandas.set_option('expand_frame_repr', False)

from simple_salesforce import Salesforce
sf = Salesforce(username='myemail@example.com', password='mypassword', security_token='mysalesforcesecuritytoken')

cs = sf.query("SELECT Id, RegSys_External_Id_A__c, RegSys_External_Id_B__c FROM Contact WHERE RegSys_External_Id_B__c <> NULL AND RegSys_External_Id_A__c LIKE 'X%'")['records']
cstofixdf = pandas.DataFrame(cs)
cstofixdf.drop('attributes', axis='columns', inplace=True)

print(list(cstofixdf['Flatbridge_Student_ID__c']))

My output looked like this -- only with more like 60 items in the list instead of 4:

['8294', '29842', '8482', '2081']

Once I had copied the list between the square-brackets onto my clipboard, I pasted it into the tool by which I query the back-end of my end user's course registration system. It uses normal "SQL," so my code looked like this:

SELECT IdA, IdB
FROM ExternalIdMappingTable
WHERE IdB IN ('8294', '29842', '8482', '2081')

I exported the 60 rows of output from THAT database onto my hard drive at "c:\examples\otherdb.csv"

I then ran the following code -- note that I've "commented out" the "print(list(...))" line of my code with a "#" since I no longer need to print that information to my screen.

import pandas
pandas.set_option('expand_frame_repr', False)

from simple_salesforce import Salesforce
sf = Salesforce(username='myemail@example.com', password='mypassword', security_token='mysalesforcesecuritytoken')

cs = sf.query("SELECT Id, RegSys_External_Id_A__c, RegSys_External_Id_B__c FROM Contact WHERE RegSys_External_Id_B__c <> NULL AND RegSys_External_Id_A__c LIKE 'X%'")['records']
cstofixdf = pandas.DataFrame(cs)
cstofixdf.drop('attributes', axis='columns', inplace=True)

#print(list(cstofixdf['Flatbridge_Student_ID__c']))

regdf = pandas.read_csv('c:\\examples\\otherdb.csv', dtype=object)

mergedf = cstofixdf.merge(regdf, how='inner', left_on='RegSys_External_Id_B__c', right_on='IdB')
mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis='columns', inplace=True)
mergedf.rename(columns={'IdA':'RegSys_External_Id_A__c'}, inplace=True)

mergedf.to_csv('c:\\examples\\uploadme.csv', index=0, quoting=1)

The last 5 lines of code are the new part.

First, I import "otherdb.csv"

Next, I "inner-merge" (like a VLOOKUP, with "inner" specifying that the "ID B" must show up in both datasets) the two "DataFrames" -- the one I downloaded from Salesforce and the one I just loaded in from CSV, and save the output into a new "DataFrame" called "mergedf."

Not shown here, I did a bit of quality-checking after I wrote the line of code that did the merge, before getting rid of "excess" columns. For example, I did:

print(mergedf)

and

print(len(cstofixdf))

and

print(len(regdf))

and

print(len(mergedf))

I double-checked that all the data-sets were the same length (there shouldn't have been any duplicates or missing values), and I hand-skimmed the results of "mergedf" to make sure I hadn't written the ".merge()" code wrong.

After that, I get rid of a few columns I won't want in my output CSV file: the old "A" & "B" values from Salesforce (I just needed the Salesforce ID), as well as the "B" value from the SQL download.

Then I renamed the SQL-downloaded "IdA" column to "RegSys_External_Id_A__c" to be explicit that this was the Salesforce Contact-table field I intended to update this data into.

Finally, I dumped the "mergedf" "DataFrame" to CSV on my hard drive.

From there, I was just a Data Load away from having proper data in Salesforce!


P.S. You may note that some of my code reads like this:

mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis='columns', inplace=True)

Whereas older examples in my blog read like this:

mergedf = mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis=1)

They do the same thing.

Many Pandas functions produce an altered copy of your "DataFrame" or "Series," rather than altering the actual data you have stored into a given "variable." Therefore, if you actually want to store the "altered" version into that "variable name," you have to explicitly do so. This is actually pretty common in programming -- you'll see things like "x = x + 1" all the time.

The "inplace=True" option on certain Pandas operations is a shorthand that keeps you from having to type all that. Sometimes, it gets buggy on me, and I didn't want your first lessons to fail, so I wrote my introductory examples the "long way."

You may also notice a difference between "axis='columns'" and "axis=1". They do the same thing. "1" is faster to type than "'columns'", but obviously "'columns'" is easier to read when you come back and need to figure out what you did later. Your choice!


Table of Contents

Wednesday, June 13, 2018

Trigger/Process/Workflow or Scheduled Script/Tool?

I often get asked to write triggers to detect the existence of certain types of data in Salesforce and, when such data exists, make a modification to some other data.

The only problem is, writing "a trigger" isn't always easy to do in a well-normalized database (that is, a database that leverages master-detail and lookup relationships to avoid redundant data entry).

Take, for example, an architecture where "App Document" records have a master-detail relationship to a parent "Admissions Application" and the "Admissions Application," in turn, has a master-detail relationship to a parent "Contact" record.

The other day, I was asked to automatically flip the "Status" to "Waived" on any "App Document" records that meet the following criteria:

  • The record is of type "English Proficiency" and is in blank/"Required" status
  • The record's parent "Application" has an "application citizenship category" of "International" and a "level" of "Undergraduate"
  • The record's grandparent "Contact" has a "foreign citizenship country" of "United Kingdom," "Ireland," "Canada," "Australia," "New Zealand," (etc.)

This is ill-suited to a Trigger/Process/Workflow because I would actually need THREE "insert"/"update" automations, each with relatively redundant code:

  1. one for all Contacts
    (in case the citizenship country changes ... then go looking for appropriate "child" apps and "grandchild" documents)
  2. one for all Applications
    (in case the "application citizenship category" or "level" changes ... then double-check the parent Contact's citizenship and go looking for "child" documents)
  3. and one for all App Documents
    (check the parent & grandparent details and change self if appropriate)

Yikes! That's a lot of redundant trigger code, and some of those operations aren't very efficient against Salesforce governor limits.

Especially since just one SOQL query can easily fetch the ID of all "App Document" records whose "Status" needs to be set to "Waived":

SELECT Id
FROM AdmDocument__c
WHERE Type__c='English Proficiency'
AND (Status__c = NULL OR Status__c = 'Required')
AND Application__r.Level__r.Name='Undergraduate'
AND Application__r.Citizenship_Category__c='International'
AND Application.Contact__r.Foreign_Citizenship_Country__c IN 
   ('Australia','Canada','Ireland','New Zealand','United Kingdom')

When someone asks you to "write a trigger" or "write a process builder" or "write a workflow" to automate data migration inside of Salesforce, be sure to ask yourself, "How many tables' values changing could cause a scenario to arise that would make this data need to be modified as requested?"

If the answer is "2 or more," and especially if it's "3 or more," think hard about whether "every 15 minutes" / "daily" is frequent enough for your end users to see "automatic fixes," and whether a SOQL query could extract the IDs of the records that need to be changed.

If you can come up with a SOQL query that represents the "problem records," you should be able to use a schedulable ETL tool or scheduled Apex to "extract and load back" with much lower overhead against governor limits than triggers/processes/workflows would incur. Your future self will also thank you when someone (inevitably) asks for a change to your script.

Tuesday, January 16, 2018

Re-Parented Child Objects Don't Fire Triggers In Parent Record Merges

Today I learned, via the Salesforce forums:

  • In Salesforce, if you merge 2 "Contact" records, and if you have a bunch of "child"-table records pointing to the "about-to-be-deleted" Contact record as a foreign key, Salesforce will automatically change the foreign key cross-reference in the "child" records to the ID of the "surviving" Contact record, and it will update those child records' "last modified" timestamp, BUT it will not allow any "UPDATE" triggers from those child tables to fire.
  • The only thing you can latch onto to detect that a "Contact merge" has just happened is the "AFTER DELETE" trigger-context against the "Contact" table (you can detect a "deleted but merged" Contact in that context from an ordinary "deleted" Contact because it has a "MasterRecordId" value).
    You can't "detect" that a "child" record has just been "re-parented" in the context of a "merge."
  • Annoyingly, once you're that far into the merge ("after contact delete" trigger context), you can no longer tell which "child" records were pointing to the old "parent" because that's long since been "fixed" by Salesforce.
    All you can see is which "child" records are now cross-referencing the surviving "Contact" – which includes all "child" records that were already pointing to the surviving "Contact" in the first place.
  • Finally, if you use that knowledge to kick off DML against surviving Contacts' "child" records, you must be careful, because "after Contact delete" is still in the context of the initial Contact merge, and you have to avoid loops (Salesforce will yell at you if you don't).
    If your DML against the "child" records kicks off triggers that result in more DML against one of the very Contacts involved in the merge, you need to make sure you kick off the DML against the "child" records in a "@future" context to force a brand new "execution context" (roughly like a transaction, but a little different in some ways – see Dan Appleman's Advanced Apex book).

Friday, January 5, 2018

Loving my "Just Push Play" Run-Specified-Tests Script

Power tools are fun.

THIS PYTHON SCRIPT HAS CHANGED MY LIFE.

Whipped it out twice this week trying to get code written under tight deadlines.

Having a nearly-instant, low-clicking-around answer to, "Would it probably work in production?" so I can easily watch how every single change to just 1 line of code behaves, before doing a full-on deploy of code that "seems to work" with a proper deployment tool and "Run All Tests," has been AMAZING.