Oracle 2 Salesforce

Python for CSV Files - Deduplicating "Some States, But Not Others"

2018-10-23T14:20:00.001-05:00

I encountered a lovely "Excel to Python" business problem this morning. Many thanks to my colleague for explaining it to me, and apologies for snooping.

He had a spreadsheet containing 20,000 American high schools (column B), what state they were in (column C), and which admissions counselor was assigned to the school (column A).

His university assigns individual admissions counselors to schools within the states of Delaware (DE) and Maryland (MD), but elsewhere, each counselor covers an entire state. Nevertheless, this particular spreadsheet had every single high school listed out, with its state and counselor, no matter where it was located.

What my colleague wanted to do was reduce the size of the spreadsheet so that Delaware and Maryland schools would be listed individually, but so that the rest of the spreadsheet would simply contain a list of the remaining 48 U.S. states and the counselors assigned to them.

The input CSV file looked like this:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Mo,Denver HS,CO
Ana,Howard HS,DE
Mo,Boulder HS,CO
Mo,Boise HS,ID
Leia,Fairfax HS,VA
Mo,Excel Academy,ID
Mo,Mann HS,ID
Leia,Roanoke HS,VA

He wanted the output CSV file to look like this:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Ana,Howard HS,DE
Mo,STATE-CO,CO
Mo,STATE-ID,ID
Leia,STATE-VA,VA

Because we know exactly which states are treated "one way" versus "another way," this is a very straightforward "split the dataset horizontally, do something to one part of the split, and concatenate it back together" operation.

In fact, we don't even really need Python for this. In Excel, in 2 minutes, we could:

Filter the spreadsheet to "everything but DE & MD"
Extract those rows to a new tab of the spreadsheet
Do "Remove Duplicates" on column C
- (or, if we wanted to dummy-check for accidental territory splits in the "other 48" states, on columns A & C)
Alter the contents of column B to be a concatenation of the word "STATE-" and the contents of column C
Un-filter our original (now shorter) worksheet
Cut-paste our "temporary" worksheet back onto the end of it

But if we did want to use Python, the script would be pretty straightforward. It would imitate those same steps in Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
dfdemd = df[df['State'].isin(['DE','MD'])]
dfoth48 = df[~df['State'].isin(['DE','MD'])]
dfoth48 = dfoth48.drop_duplicates(subset=['State'], keep='first')
dfoth48['School'] = 'STATE-'+dfoth48['State']
df = pandas.concat([dfdemd,dfoth48])
df.to_csv('c:\\example\\FilteredHighSchools.csv', index=0, quoting=1)

Note that if this were an Excel XLSX file, we would use read_excel and to_excel instead (eliminating the ", quoting=1" option).

Where the problem gets really interesting is that we can let the computer tell us which states have "just 1 counselor for the whole state" and which states "should be listed school-by-school, because they seem to be split."

In Excel, we would:

Add a "column D" counting the number of unique "column A" values per value in "column C."
- (There are various tips and tricks available on the internet to accomplish this in Excel.)
Then we would do the same "split, alter part of the data, and re-concatenate" trick we did earlier, only we would "split" based on "column D" instead of based on "column C" (looking for "1" vs. "2 & up").

Again, in Python, we'll imitate our steps in Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
df['Counts'] = df['State'].map(df[['Counselor','State']].drop_duplicates(keep='first')['State'].value_counts())
df_mult_couns = df[df['Counts'] > 1]
df_1_couns = df[df['Counts'] == 1]
df_1_couns = df_1_couns.drop_duplicates(subset=['State'], keep='first')
df_1_couns['School'] = 'STATE-'+df_1_couns['State']
df = pandas.concat([df_mult_couns,df_1_couns])
df = df.drop(['Counts'], axis=1)
df.to_csv('c:\\example\\SmartFilteredHighSchools.csv', index=0, quoting=1)

This script is a lot like the previous one, only I renamed "dfdemd" to "df_mult_couns" because we are presuming that we don't yet know it's Delaware and Maryland that have multiple counselors per state. Similarly, I renamed "dfoth48" to "df_1_couns."

The new lines of code are the 3rd line and the 2nd-to-last line. Also, the 4th & 5th lines have been altered to filter on "Counts" instead of filtering on "State." Still, once we have split our data, we no longer need the "Counts" column (we could have "dropped" it from each fragment of our dataset here instead of from the re-concatenated dataset right before exporting to CSV), and we do our "dropping of duplicates" in the "1-counselor-per-state" states as usual, by "State."

Line 3 is our most complicated line of code. Let's work from the middle, starting inside the ".map()" command.

"df[['Counselor','State']]" produces a new DataFrame that is a copy of our original DataFrame, but with data from only those 2 columns.
Appending "drop_duplicates(keep='first')" to it produces a new DataFrame that is a copy of it, only using "Counselor" and "State" together (that is, all 2 columns of our input DataFrame) as a matching key.
Appending "['State'].value_counts()" to that "deduplicated" DataFrame produces a new "Series" (fancy list-like thing) where the "item IDs" are the states from our data set, and the values are the number of unique "Counselor" values for that "State" value.

Now that we know what's inside the ".map()" command (a "Series" of states and their "unique Counselor value counts"), let's take a look at Line 3 of our code, starting after the "=."

"df['State'].map(ourSeriesOfStatesWithCounselorCounts)" produces a new "Series," with the same set of "item IDs" as the "row IDs" of our original DataFrame we imported from CSV, and with values populated by numbers indicating the number of "distinct counselor values" for the "State" found in that row of our data.

Its output looks like this:

0     2
1     2
2     2
3     2
4     1
5     2
6     1
7     1
8     1
9     1
10    1
11    1
Name: State, dtype: int64

Finally, all we do by putting this to the right of "df['Counts'] = " is to add a new "Counts" column to our original DataFrame and fill it in with these numbers.

Voila! We've added "column D!"

It was a little fancy getting there, but save this script to your computer -- or bookmark this blog post -- and you can copy/paste/tweak it any time it seems useful, whether or not you understand it enough to write it from scratch. (Don't worry -- I had to Google it and copy/paste/modify other people's code myself. 😊)

🇫🇷 - en français

Aujourd'hui j'ai rencontré un beau problème Excel bien adapté à résoudre avec Python.

Un collègue travaillait sur une feuille de calcul contenant 20 000 lycées américains (colonne B), l'état des E-U où chacun se trouve (colonne C) et le nom du recruteur qui se charge de l'école (colonne A).

L'université où travaille mon collègue divise la charge de travail du Maryland ("MD") et du Delaware ("DE"). Aux autres 48 états, un seul recruteur gère tous les lycées de l'état.

Mon collègue voulait réduire la taille de cette feuille de calcul. Il ne voulait changer aucun enregistrement du Delaware ni du Maryland. Pour les 48 autres états, il voulait éliminer la liste des écoles et garder uniquement 48 enregistrements, contenant chaque état et le recruteur qui s'en charge.

Sa feuille de calcul ressemblait à celle-ci:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Mo,Denver HS,CO
Ana,Howard HS,DE
Mo,Boulder HS,CO
Mo,Boise HS,ID
Leia,Fairfax HS,VA
Mo,Excel Academy,ID
Mo,Mann HS,ID
Leia,Roanoke HS,VA

Il souhaitait une feulle de calcul de sortie comme celle-ci:

Counselor,School,State
Ana,Dover HS,DE
Leia,Anapolis HS,MD
Ana,Baltimore HS,MD
Leia,Best Prep,DE
Ana,Howard HS,DE
Mo,STATE-CO,CO
Mo,STATE-ID,ID
Leia,STATE-VA,VA

Comme on sait déjà comment diviser la feuille pour traiter les données de telle ou telle manière, le travail est très simple. Cela consiste à diviser les données en "DE + MD" et "ailleurs", à dédupliquer la partie "ailleurs", et à concaténer les deux divisions.

En fait, on n'a pas vraiment besoin de Python. C'est un travail de 2 minutes avec Excel.

filtrer la feuille de calcul pour extraire les enregistrements "ailleurs"
extraire ces lignes dans un nouvel onglet
dédupliquer sur la colonne "C"
- (ou, pour vérifier qu'il n'y ait pas d'erreurs de saisie de données, sur "A" et "C" ensemble, cherchant plus que 2 enregistrements par état)
modifier le contenu de la colonne "B" pour qu'il soit une concaténation du mot "STATE-" et du contenu de la colonne "C"
supprimer le filtre de notre onglet de feuille de calcul d'origine (où rien ne reste que les données "Maryland" et "Delaware")
couper-coller les contenus du nouvel onglet à la fin de l'originale

Mais le code Python ne serait pas trop compliqué. Le script imiterait nos étapes Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
dfdemd = df[df['State'].isin(['DE','MD'])]
dfoth48 = df[~df['State'].isin(['DE','MD'])]
dfoth48 = dfoth48.drop_duplicates(subset=['State'], keep='first')
dfoth48['School'] = 'STATE-'+dfoth48['State']
df = pandas.concat([dfdemd,dfoth48])
df.to_csv('c:\\example\\FilteredHighSchools.csv', index=0, quoting=1)

(Pour un fichier XLSX et non CSV, on se servirait des commandes "read_excel" et "to_excel" et on supprimerait le code ", quoting=1" de la fin du script.)

Mais notre problème devient vraiment intéressant car nous pouvons laisser Python calculer quels états n'ont qu'un seul recruteur et quels états sont la charge de plusieurs recruteurs.

Avec Excel, il s'agit de:

Ajouter une "colonne D" qui compte le nombre de valeurs uniques du "colonne A" pour chaque valeur dans "colonne C."
- (Il y a des astuces en ligne pour y arriver.)
Répéter notre opération de "scinder, modifier une partie, concaténer" de plus tôt. Sauf que nous divisons par la colonne "D" ("1" et "2 ou plus").

Encore une fois, le code Python va imiter nos étapes Excel:

import pandas
df = pandas.read_csv('c:\\example\\HighSchools.csv', dtype=object)
df['Counts'] = df['State'].map(df[['Counselor','State']].drop_duplicates(keep='first')['State'].value_counts())
df_mult_couns = df[df['Counts'] > 1]
df_1_couns = df[df['Counts'] == 1]
df_1_couns = df_1_couns.drop_duplicates(subset=['State'], keep='first')
df_1_couns['School'] = 'STATE-'+df_1_couns['State']
df = pandas.concat([df_mult_couns,df_1_couns])
df = df.drop(['Counts'], axis=1)
df.to_csv('c:\\example\\SmartFilteredHighSchools.csv', index=0, quoting=1)

Ce script ressemble beaucoup au précédent. J'ai renommé "dfdemd" en "df_mult_couns" car dans ce cas, nous ne savons pas encore que "Delaware" et "Maryland" auront plusieurs recruteurs par état. De même, j'ai renommé "dfoth48" en "df_1_couns".

Les nouvelles lignes de code sont la 3ème ligne et l'avant-dernière ligne.

En outre, les lignes 4 et 5 ont été modifiées pour filtrer sur "Counts" au lieu de filtrer sur "State".

Apres les lignes 4 et 5, nous n'avons plus besoin de la colonne "Counts". (Nous aurions pu le supprimer de chaque scinde après les lignes 4 et 5, mais j'ai décidé plutôt de supprimer cette colonne des données reconcaténées dans l'avant-dernière ligne).

On fait toujours notre déduplication principale par "State."

La ligne 3 du code est la ligne la plus compliquée. Inspectons-la en commençant au milieu, à l'intérieur de la commande ".map()".

"df[['Counselor','State']]" génère un nouveau "DataFrame" qui est une copie de notre DataFrame d'origine, mais avec des données provenant uniquement de ces 2 colonnes.
L'ajout de ".drop_duplicates(keep = 'first')" le déduplique par recruteur et état, pour générer encore un nouveau "DataFrame".
L'ajout de "['State'].value_counts()" à ce DataFrame "dédupliqué" génère une nouvelle "Série" (élément de type liste ... amélioré ... plutôt tableau associatif) où les clés sont nos états et les valeurs sont le nombre de recruteurs uniques pour chaque état.

Maintenant que nous comprenons l'intérieur de ".map()" ("Série" avec les états comme clé), examinons la partie de la ligne 3 qui suit le "=".

"df['State'].map(notre_Tableau_Associatif)" produit une nouvelle "Série" avec le même ensemble de clés que sont les clés des enregistrements de notre "DataFrame" d'origine (de notre fichier CSV) -- c'est-à-dire les numéros qui identifie les lignes du fichier. Les valeurs de cette nouvelle "Série" sont les nombres de recruteurs qui correspondent à l'état trouvé sous "State" pour chaque enregistrement.

La sortie de cette commande ressemble à:

0     2
1     2
2     2
3     2
4     1
5     2
6     1
7     1
8     1
9     1
10    1
11    1
Name: State, dtype: int64

Finalement, "df['Counts'] = " ne fait qu'ajouter cette liste nombres de recruteurs au DataFrame d'origine comme nouvelle colonne (nommée "Counts").

Et voilà! On a ajouté la "colonne D" ! Après, ce n'est qu'une question de finir les filtres.

Comme j'ai dit, la ligne 3 était peut-être un peu compliqué, mais enregistrez ce script dans un fichier sur votre ordinateur -- ou ajoutez cet article à vos favoris -- et vous pouvez le copier-coller-modifier au moment qu'il vous convient (que vous le compreniez bien ou non). (Ne vous inquiétez pas, moi aussi j'ai Googlé pour écrire cet article ! 😊)

"Python for Salesforce Administrators" Table of Contents

Salesforce Custom Metadata vs. Custom Objects: Where Should I Put My Configuration and Validation Settings?

2018-10-18T10:44:00.000-05:00

(Version française en dessous)

Yesterday, a colleague asked me the difference between "custom metadata" and "custom objects" for storing "configuration and validation" information in Salesforce.

My answer is:

Custom metadata if you can. Salesforce says so.
And it survives sandbox refreshes!

Data tables (custom objects) if you truly need it to be part of your data
(e.g. people will want to include it, hyperlinked to other data, in reports).

When I wrote an Apex trigger that determined which User should "own" each "Admissions Application" in our org, I ended up splitting the configuration data in two.

Here's why:

Custom Metadata

Used for a table that helps Apex code ask, "Who does this?"

key: the code for a graduate degree we offer
value: a username responsible for recruiting people to that graduate degree

Data (Custom Objects)

Used for a list of U.S. states, their full spellings, and fields with "lookup" links to the User table indicating who does what work for that state.

There was strong demand to generate native Salesforce reports per User showing all the states they're responsible for and the high schools in those states. It made sense to ensure that the "high school" table could have a "lookup" field to this "states" table.
Custom metadata can have "master-detail" and "lookup" relational links to other custom metadata, but it can't link to ordinary data.
- This meant we needed to store the the "states" as data (custom objects), even though we would also be using it as configuration information for Apex triggers.

UI & Editability Considerations

I'll let you in on a dirty little secret about another reason I used data tables ("custom objects") for most of the Undergraduate Admissions counselor assignment configuration data.

Undergraduate Admissions tweaks their "recruiter assignment" business rules several times a year. The real-world configuration data for their business is a lot more complex than a simple "list of U.S. states."

I'll be honest: Salesforce's user interfaces for hand-editing, viewing, and bulk-editing data are a lot more end-user-friendly than their user interfaces for the same operations on custom metadata, and setting granular "edit" permissions is a lot more sysadmin-friendly for data. I wanted to make sure end users, not sysadmins, were the ones whose time was spent tweaking the configuration several times a year!

I was thoroughly scolded at Dreamforce. Actually, I stand by my decision to use "data" tables, because there truly is a business need to report on the configuration data alongside normal data. But ... building your own user interfaces (typically using Lightning Components) to help end users edit custom metadata was a big theme. You have been warned:

🇫🇷 - en français

Une collègue, peu familière avec Salesforce, m'a demandée quelle était la différence entre « métadonnées personnalisées » et « objets personnalisés » pour stocker des données de configuration et de validation dans une organisation Salesforce.

J'ai répondu:

Selon Salesforce, s'il est possible, utilisez des métadonnées personnalisées.
(Elles survivent l'actualisation d'un environnement sandbox.)

Stockez vos informations aux tables de bases de données (objets personnalisés) s'il faut créer des relations aux autres objets de votre organisation.
(Par exemple, s'il y a des utilisateurs qui vont générer des rapports sur ces données.)

Mon approche

Lorsque j'ai écrit un déclencheur Apex pour attribuer les enregistrements de notre objet « demande d'admission » aux utilisateurs corrects, j'ai fini par utiliser les deux approches:

Métadonnées personnalisées

J'ai choisi une type de métadonnées personnalisées pour stocker des données simples concernant « qui gère chaque spécialisation ? »

clé: le code qui indique un diplôme d'études supérieures que l'université offre
valeur: un nom d'utilisateur chargé de recruter des étudiants au diplôme

Objets personnalisés

J'ai choisi un objet personalisé pour stocker une liste d'états américains (abréviation et nom), avec quelques champs de référence indiquant les utilisateurs qui y gèrent divers aspects du recrutement d'étudiants en license.

Nos utilisateurs ont besoin de générer des rapports, regroupés par utilisateur, avec les états qu'ils gèrent et les lycées qui y sont situés. Donc il fallait pouvoir créer une relation de l'objet « lycée » à l'objet « état ».
On peut créer des relations entre métadonnées personnalisées, mais pas entre les types de métadonnées personnalisées et les objets personalisés.
- Donc j'ai choisi un modèle « objet personnalisé » pour stocker les informations à propos des états, même si ces informations servent aussi comme données de configuration pour un déclencheur Apex.

Considérations au sujet de l'interface utilisateur et au contrôle d'accès

Je vais vous révéler un secret : ce n'est pas qu‘à propos des besoins relationnels que j'ai choisi le modèle « objet personnalisé » pour les informations à propos de la gestion du recrutement en licence.

Le service admission en license changent les règles de « qui gère quoi » plusieurs fois par an. Et les données de configuration sont, en réalité, beaucoup plus complexes qu'une seule liste d'états.

Actuellement, les interfaces utilisateur de Salesforce pour voir et enregistrer des données aux objets personnalisés sont supérieures à celles pour les métadonnées personnalisées – surtout si l'on est utilisateur ordinaire. Le contrôle d'accès, aussi, est plus facile à gérer pour les administrateurs. Je voulais m'assurer que ça soit des utilisateurs ordinaires qui font plusieurs fois par an cette saisie de données, et non pas les administrateurs !

On m'a fait honte de cette position à Dreamforce. Ben, je maintiens ma choix d'objet personnalisé pour raisons de relations entre objets. Mais … on a beaucoup répété qu'il est de bonne pratique de construire ses propres interfaces utilisateur pour autoriser des utilisateurs ordinaires à modifier les métadonnées personnalisées en toute sécurité. Je vous aurai prévenus !

Vidéos pertinentes de Dreamforce

Python for Salesforce Administrators - "In Both Tables?" Dummy-Check Script

2018-08-22T16:33:00.002-05:00

I was asked how to use Python to "dummy check" that every transaction in a Salesforce log also appeared in a payment processor's log, and vice-versa.

The following are a few simple scripts that do just that. We're going to work with our "sample1" & "sample2" files, which have some people who overlap by name and email, and others who don't.
In your case, instead of "name & email" you might want to match on "transaction ID" or "product ID, customer ID, and timestamp." Whatever appears in both datasets.

Note that "Albert Howard" is going to show up as "not in the other file" for both files because he has a different email address in each file.

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". Second, make sure you've created a 2nd ".CSV" file like the one described in "Combining Multiple Tables". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)

Code for one big "error log" output file

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True)

uniquedf = mergedf[mergedf['_merge'].str.endswith('_only')]
uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)

Output:

"Id","First","Last","Email","Company","PersonId","FirstName","LastName","Em","FavoriteFood","_merge"
"5829","Jimmy","Buffet","jb@example.com","RCA","","","","","","left_only"
"2894","Shirley","Chisholm","sc@example.com","United States Congress","","","","","","left_only"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers","","","","","","left_only"
"724","Albert","Howard","ah@example.com","Imperial College of Science","","","","","","left_only"
"","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","right_only"
"","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","right_only"
"","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","right_only"

Code for two separate "error log" output files

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.merge(df2, left_on=df1matchfields, right_on=df2matchfields, how='outer', indicator=True)

uniquedf1 = mergedf[mergedf['_merge']=='left_only'][df1.columns]
uniquedf2 = mergedf[mergedf['_merge']=='right_only'][df2.columns]
uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1)
uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)

Output 1:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Output 2:

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"

The code above only works if you have "Pandas version 0.17" or greater installed.

Pandas is up to version "0.23" as of this blog post, so that's a pretty safe bet.

Except that the "no admin rights, even with old installations of Windows" programming environment I walked you through installing is many years old, and it comes with "Pandas version 0.16." So here are two scripts that do the equivalent job in older versions, if you don't feel like upgrading everything.

"Old Pandas" code for one big "error log" output file

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)

df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']

mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=['First','Last','Email'], right_on=['FirstName','LastName','Em'], how='outer')
uniquedf = mergedf[mergedf['SourceDF_x'].isnull() | mergedf['SourceDF_y'].isnull()]

uniquedf.to_csv('C:\\tempexamples\\output.csv', index=0, quoting=1)

Output:

"Id","First","Last","Email","Company","SourceDF_x","PersonId","FirstName","LastName","Em","FavoriteFood","SourceDF_y"
"5829","Jimmy","Buffet","jb@example.com","RCA","df1","","","","","",""
"2894","Shirley","Chisholm","sc@example.com","United States Congress","df1","","","","","",""
"30829","Cesar","Chavez","cc@example.com","United Farm Workers","df1","","","","","",""
"724","Albert","Howard","ah@example.com","Imperial College of Science","df1","","","","","",""
"","","","","","","983mv","Shirley","Temple","st@example.com","Lollipops","df2"
"","","","","","","k28fo","Donald","Duck","dd@example.com","Pancakes","df2"
"","","","","","","8xi","Albert","Howard","ahotherem@example.com","Potatoes","df2"

"Old Pandas" code for two separate "error log" output files

Code:

import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
 
df1matchfields=['First','Last','Email']
df2matchfields=['FirstName','LastName','Em']
 
mergedf = df1.assign(SourceDF='df1').merge(df2.assign(SourceDF='df2'), left_on=['First','Last','Email'], right_on=['FirstName','LastName','Em'], how='outer')
 
uniquedf1 = mergedf[mergedf['SourceDF_y'].isnull()][df1.columns]
uniquedf2 = mergedf[mergedf['SourceDF_x'].isnull()][df2.columns]
uniquedf1.to_csv('C:\\tempexamples\\output1.csv', index=0, quoting=1)
uniquedf2.to_csv('C:\\tempexamples\\output2.csv', index=0, quoting=1)

Output 1:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Output 2:

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"

Table of Contents

Python for Salesforce Administrators - A "Combining Multiple Tables" (VLOOKUP) Example with the "Simple Salesforce" plugin

2018-08-20T14:09:00.001-05:00

Today I used Python to facilitate a quick UPDATE of 60 Contact records in Salesforce.

In Salesforce, there were 60-ish Contact records where I had instructed an end user to populate a field called "RegSys_External_Id_B__c" with the "ID" that they could see while logged into their registration system.

However, what I really wanted to do was make sure that "RegSys_External_Id_A__c" was populated with the "real external ID" (that my end users couldn't see, but that held everything together in the back-end of their registration system) of each person they had hand-matched.

I asked them to simply fill it in with dummy values, "XXXX01," "XXXX02," "XXXX03," etc. so that I could easily find the records they had already "hand-checked" later.

I then used Python as follows:

First, I used Python's "Simple Salesforce" plugin to log into our org and download the 60 Contact records into Python's "Pandas" plugin. I saved that data into a variable "cstofixdf" (as in "Contacts To Fix DataFrame"). Note that the data that comes back from Simple Salesforce has to have "['records']" appended to it to become something that "Pandas" can read and turn into a "DataFrame" (2-dimensional table). Also, I drop the "attributes" column that comes back with "Simple Salesforce" data because I think it's ugly and takes up unnecessary space on my screen when I preview what my "DataFrame" looks like. Anyway, I also printed out all the "RegSys_External_Id_B__c" values as a comma-separated, quote-surrounded list onto my screen so that I could easily copy them to my clipboard.

import pandas
pandas.set_option('expand_frame_repr', False)

from simple_salesforce import Salesforce
sf = Salesforce(username='myemail@example.com', password='mypassword', security_token='mysalesforcesecuritytoken')

cs = sf.query("SELECT Id, RegSys_External_Id_A__c, RegSys_External_Id_B__c FROM Contact WHERE RegSys_External_Id_B__c <> NULL AND RegSys_External_Id_A__c LIKE 'X%'")['records']
cstofixdf = pandas.DataFrame(cs)
cstofixdf.drop('attributes', axis='columns', inplace=True)

print(list(cstofixdf['Flatbridge_Student_ID__c']))

My output looked like this -- only with more like 60 items in the list instead of 4:

['8294', '29842', '8482', '2081']

Once I had copied the list between the square-brackets onto my clipboard, I pasted it into the tool by which I query the back-end of my end user's course registration system. It uses normal "SQL," so my code looked like this:

SELECT IdA, IdB
FROM ExternalIdMappingTable
WHERE IdB IN ('8294', '29842', '8482', '2081')

I exported the 60 rows of output from THAT database onto my hard drive at "c:\examples\otherdb.csv"

I then ran the following code -- note that I've "commented out" the "print(list(...))" line of my code with a "#" since I no longer need to print that information to my screen.

import pandas
pandas.set_option('expand_frame_repr', False)

from simple_salesforce import Salesforce
sf = Salesforce(username='myemail@example.com', password='mypassword', security_token='mysalesforcesecuritytoken')

cs = sf.query("SELECT Id, RegSys_External_Id_A__c, RegSys_External_Id_B__c FROM Contact WHERE RegSys_External_Id_B__c <> NULL AND RegSys_External_Id_A__c LIKE 'X%'")['records']
cstofixdf = pandas.DataFrame(cs)
cstofixdf.drop('attributes', axis='columns', inplace=True)

#print(list(cstofixdf['Flatbridge_Student_ID__c']))

regdf = pandas.read_csv('c:\\examples\\otherdb.csv', dtype=object)

mergedf = cstofixdf.merge(regdf, how='inner', left_on='RegSys_External_Id_B__c', right_on='IdB')
mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis='columns', inplace=True)
mergedf.rename(columns={'IdA':'RegSys_External_Id_A__c'}, inplace=True)

mergedf.to_csv('c:\\examples\\uploadme.csv', index=0, quoting=1)

The last 5 lines of code are the new part.

First, I import "otherdb.csv"

Next, I "inner-merge" (like a VLOOKUP, with "inner" specifying that the "ID B" must show up in both datasets) the two "DataFrames" -- the one I downloaded from Salesforce and the one I just loaded in from CSV, and save the output into a new "DataFrame" called "mergedf."

Not shown here, I did a bit of quality-checking after I wrote the line of code that did the merge, before getting rid of "excess" columns. For example, I did:

print(mergedf)

and

print(len(cstofixdf))

and

print(len(regdf))

and

print(len(mergedf))

I double-checked that all the data-sets were the same length (there shouldn't have been any duplicates or missing values), and I hand-skimmed the results of "mergedf" to make sure I hadn't written the ".merge()" code wrong.

After that, I get rid of a few columns I won't want in my output CSV file: the old "A" & "B" values from Salesforce (I just needed the Salesforce ID), as well as the "B" value from the SQL download.

Then I renamed the SQL-downloaded "IdA" column to "RegSys_External_Id_A__c" to be explicit that this was the Salesforce Contact-table field I intended to update this data into.

Finally, I dumped the "mergedf" "DataFrame" to CSV on my hard drive.

From there, I was just a Data Load away from having proper data in Salesforce!

P.S. You may note that some of my code reads like this:

mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis='columns', inplace=True)

Whereas older examples in my blog read like this:

mergedf = mergedf.drop(['IdB', 'RegSys_External_Id_A__c', 'RegSys_External_Id_B__c'], axis=1)

They do the same thing.

Many Pandas functions produce an altered copy of your "DataFrame" or "Series," rather than altering the actual data you have stored into a given "variable." Therefore, if you actually want to store the "altered" version into that "variable name," you have to explicitly do so. This is actually pretty common in programming -- you'll see things like "x = x + 1" all the time.

The "inplace=True" option on certain Pandas operations is a shorthand that keeps you from having to type all that. Sometimes, it gets buggy on me, and I didn't want your first lessons to fail, so I wrote my introductory examples the "long way."

You may also notice a difference between "axis='columns'" and "axis=1". They do the same thing. "1" is faster to type than "'columns'", but obviously "'columns'" is easier to read when you come back and need to figure out what you did later. Your choice!

Table of Contents

Trigger/Process/Workflow or Scheduled Script/Tool?

2018-06-13T16:19:00.001-05:00

I often get asked to write triggers to detect the existence of certain types of data in Salesforce and, when such data exists, make a modification to some other data.

The only problem is, writing "a trigger" isn't always easy to do in a well-normalized database (that is, a database that leverages master-detail and lookup relationships to avoid redundant data entry).

Take, for example, an architecture where "App Document" records have a master-detail relationship to a parent "Admissions Application" and the "Admissions Application," in turn, has a master-detail relationship to a parent "Contact" record.

The other day, I was asked to automatically flip the "Status" to "Waived" on any "App Document" records that meet the following criteria:

The record is of type "English Proficiency" and is in blank/"Required" status
The record's parent "Application" has an "application citizenship category" of "International" and a "level" of "Undergraduate"
The record's grandparent "Contact" has a "foreign citizenship country" of "United Kingdom," "Ireland," "Canada," "Australia," "New Zealand," (etc.)

This is ill-suited to a Trigger/Process/Workflow because I would actually need THREE "insert"/"update" automations, each with relatively redundant code:

one for all Contacts
(in case the citizenship country changes ... then go looking for appropriate "child" apps and "grandchild" documents)
one for all Applications
(in case the "application citizenship category" or "level" changes ... then double-check the parent Contact's citizenship and go looking for "child" documents)
and one for all App Documents
(check the parent & grandparent details and change self if appropriate)

Yikes! That's a lot of redundant trigger code, and some of those operations aren't very efficient against Salesforce governor limits.

Especially since just one SOQL query can easily fetch the ID of all "App Document" records whose "Status" needs to be set to "Waived":

SELECT Id
FROM AdmDocument__c
WHERE Type__c='English Proficiency'
AND (Status__c = NULL OR Status__c = 'Required')
AND Application__r.Level__r.Name='Undergraduate'
AND Application__r.Citizenship_Category__c='International'
AND Application.Contact__r.Foreign_Citizenship_Country__c IN 
   ('Australia','Canada','Ireland','New Zealand','United Kingdom')

When someone asks you to "write a trigger" or "write a process builder" or "write a workflow" to automate data migration inside of Salesforce, be sure to ask yourself, "How many tables' values changing could cause a scenario to arise that would make this data need to be modified as requested?"

If the answer is "2 or more," and especially if it's "3 or more," think hard about whether "every 15 minutes" / "daily" is frequent enough for your end users to see "automatic fixes," and whether a SOQL query could extract the IDs of the records that need to be changed.

If you can come up with a SOQL query that represents the "problem records," you should be able to use a schedulable ETL tool or scheduled Apex to "extract and load back" with much lower overhead against governor limits than triggers/processes/workflows would incur. Your future self will also thank you when someone (inevitably) asks for a change to your script.

Re-Parented Child Objects Don't Fire Triggers In Parent Record Merges

2018-01-16T13:22:00.001-06:00

Today I learned, via the Salesforce forums:

In Salesforce, if you merge 2 "Contact" records, and if you have a bunch of "child"-table records pointing to the "about-to-be-deleted" Contact record as a foreign key, Salesforce will automatically change the foreign key cross-reference in the "child" records to the ID of the "surviving" Contact record, and it will update those child records' "last modified" timestamp, BUT it will not allow any "UPDATE" triggers from those child tables to fire.
The only thing you can latch onto to detect that a "Contact merge" has just happened is the "AFTER DELETE" trigger-context against the "Contact" table (you can detect a "deleted but merged" Contact in that context from an ordinary "deleted" Contact because it has a "MasterRecordId" value).
You can't "detect" that a "child" record has just been "re-parented" in the context of a "merge."
Annoyingly, once you're that far into the merge ("after contact delete" trigger context), you can no longer tell which "child" records were pointing to the old "parent" because that's long since been "fixed" by Salesforce.
All you can see is which "child" records are now cross-referencing the surviving "Contact" – which includes all "child" records that were already pointing to the surviving "Contact" in the first place.
Finally, if you use that knowledge to kick off DML against surviving Contacts' "child" records, you must be careful, because "after Contact delete" is still in the context of the initial Contact merge, and you have to avoid loops (Salesforce will yell at you if you don't).
If your DML against the "child" records kicks off triggers that result in more DML against one of the very Contacts involved in the merge, you need to make sure you kick off the DML against the "child" records in a "@future" context to force a brand new "execution context" (roughly like a transaction, but a little different in some ways – see Dan Appleman's Advanced Apex book).

Loving my "Just Push Play" Run-Specified-Tests Script

2018-01-05T10:45:00.000-06:00

Power tools are fun.

THIS PYTHON SCRIPT HAS CHANGED MY LIFE.

Whipped it out twice this week trying to get code written under tight deadlines.

Having a nearly-instant, low-clicking-around answer to, "Would it probably work in production?" so I can easily watch how every single change to just 1 line of code behaves, before doing a full-on deploy of code that "seems to work" with a proper deployment tool and "Run All Tests," has been AMAZING.

Dreamforce 2017 Lessons Learned & Takeaways

2017-11-13T17:29:00.000-06:00

Lucky me, I got to attend Dreamforce again. Here's a summary.

Misc

"Machine learning" tools are getting point-and-click enough that feeding hundreds of pieces of historical student data (e.g. "days before schoolyear applied," "GPA," "time between inquiry & admission," "time of year applied," etc.) into them and letting a computer look for trends (e.g. "likely to submit a confirming deposit?") isn't all that out of reach. Drag-and-drop tools are coming into prominence that make dimensionality reduction & linear regression problems like this intuitive & easy to play with.
There's something called "macros" I should look into more. At a glance, not sure it's anything I can't do a lot faster w/ DemandTools, but worth a glance.

Org configuration change management

One of the concepts behind "DX scratch orgs'" mere 7-day lifespan is to encourage you to put the result of all your coding time & effort into a source code version control system.
Two guiding philosophies behind development with "DX" are that: 1) you can make all the changes you like by hand in them -- just be sure to use the DX command-line tool to pull those changes down to your computer so you can promptly get them into your version control system, and 2) you should never again be making any changes to configuration in production / normal-sandbox orgs by hand (only via deploying from a copy of whatever you did in a "DX scratch org" into one of those orgs via something like the DX command-line tool).
On a related note, I still haven't heard a really admin-geared lecture about DX and "please don't mess around with things by hand in production/staging sandboxes, and please don't deploy anything via change set."
Gearset looks like an absolute miracle for shops where those kinds of "no change sets, please" disciplined version-control practices aren't yet in place (e.g. shops transitioning from Salesforce as an "interesting side project" with a no-code solo admin). It's an enterprise-class-(seeming?) tool that simply lets you log into Salesforce and a cloud Git repository and say, "Hey, synchronize my entire org's metadata once a day -- kthxbye." It also seems to be able to facilitate rollbacks. In a small shop where communication lines are open enough that once you know what changed since things were working, sending out a few emails to figure out who changed it and if they could kindly fix it is easy, this "daily snapshot" seems well worth $3600/year. Even after you have "version control" in place -- when it's this easy, what's the harm in a "side repository" of "CYA snapshots?" (Such Git-stored snapshots could also, potentially, be mined by local scripts that transform them into things like SQL commands for dropping & rebuilding tables & SOQL queries for a local daily cache of Salesforce data.) And that's without all the other things Gearset does. (Please stay just $3600/year, Gearset!) Getting a tour from a staffer makes their product look even cooler than their summary on their web site -- apparently they can give you one over the web. *drools* I want.

Lightning Experience

Although there's a lot that's still missing from "Lightning Experience" (e.g. "bucketing" & "formulas" in native reporting tools), by the Winter '19 edition, there might be enough LE-only functionality in Salesforce's native reporting tools to justify a switch--for example, something called "joined reports," field-to-field filtering, subfoldering, easier permissions management, and bookmarking of reports. (So far, we haven't found it worthwhile.) I'm hoping to get a lot of data integration projects done & stable over the next few months so I have time to play around with Lightning Experience in a sandbox.
Apparently, having a really nice "Lightning Experience" home page for a user doesn't automatically translate into a really nice mobile version of the same page -- that has to be set up separately.
There's a thing called the "Lightning Data Service" that might make Lightning Components require almost as little code as Visualforce for simple "put this SOQL query's results on a page" use cases, unlike last year when I took a hands-on training. It also looks like it has something to do w/ getting Lightning Components on the same page to auto-update when data from the database that one of them just edited changes.

SOQL query performance & data models

OpportunityContactRole isn't anywhere close to becoming a first-class object (a table against which you can write triggers). *sigh*
The "affiliations" package might be replaceable with the "multiple accounts for contacts" feature & a few triggers similar to ones from the "affiliations" package. That said, the "affiliations" package doesn't really seem to be hurting anything. But Pardot and other vendors are working to get on board with the native version of the feature, so it's worth keeping an eye on.
Having a single "Generic Account" record for people whose account you don't know the value of slows down most SOQL queries. It improves performance to give everyone their own accounts named after them, as in the HEDA model.
Salesforce objects/tables aren't really tables. They're sets of tables (e.g. indexes are a table, fields are a table, etc.). That's why Salesforce doesn't call them "tables" and impacts SOQL query performance. We don't seem to have hit issues yet, but there's also something Salesforce can do with their back-end data model and materialized views for storing your org's data called skinny tables if you ask -- it can speed up SOQL queries.
Reminder: "=" is fast in a query that can leverage a highly selective index, "!=" is not -- "!=" always requires a table scan. Query optimization 101, but so easy to overlook when you get busy and don't write queries daily where you have to worry about it.
There's a technology called "big objects," paired with "asynchronous SOQL," coming along, but it doesn't yet look useful for us. You've got to be really sure about data not changing, because right now, you can't even drop such a table once it's created.

Other programming-related notes

Stay on top of the "metadata in Apex" project, but overall, it's on purpose that you can't change your schema & change your data via the same programming language / execution environment, so don't get too excited.
Hopefully coming within a year: the ability to easily make Apex respect CRUD permissions & field-level security permissions (when you don't want your triggers to "play God").
If you're a web developer, you can write your Visualforce/Apex to include a Visualforce Boolean variable as a toggle that switches whether Visualforce pages rely on Salesforce-hosted "static resources" (e.g. CSS, JavaScript) or "localhost"-hosted ones. This can make testing a lot easier & less fragile & faster than actually trying to update the contents of Salesforce-hosted "static resources" when you're not yet even sure if they're correct. Talk to Jon Schleicher for more info about how.
Via the reporting API (warning: the JSON to parse "is a doozy"), you can wrap Reports from Salesforce's native reporting tool in Lightning Components and therefore make them easy to embed in other pages. (Natively, only dashboards can be drag-and-dropped into Lightning Components.)
Google Docs are a form of "cloud-hosted info storage with an API" -- which means that you can include info from them in custom-programmed user interface elements like Lightning Components. Possibly overlooked form of integration of results from a cheap-and-easy aggregate-reporting tool (spreadsheets) with business users' main daily work environment (Salesforce).
Personal goal: blast through data integration projects so I can free up time to work on Trailhead modules and learn to be better able to leverage a lot of the technologies I learned about "tips & tricks" for. There's a lot out there that I'd like to do, but at 8 hours apiece, one really has to work to free up the time. But they're so well-written.

Recursion-Reduction Tips For Apex Trigger Code

2017-10-18T11:03:00.002-05:00

I wanted to share with you some lessons I learned while working on a trigger.

Because a trigger against a "just-saved" record can kick off another "save" operation on that record (typically the point of a trigger), the very same trigger can get re-fired during what's known as the same "execution context."

In Apex trigger programming, it's considered best-practice to make sure that any subsequent re-firings of the same trigger against the same record don't waste valuable CPU cycles when this happens (because Salesforce limits them within an "execution context").

Therefore, when writing a trigger that "does ____ for a just-saved record," it's important to make sure that, at some point, the trigger saves the ID of that record into a "I already did _____ on all of these records" flag that's viewable across all of these recursions (usually a class-level "Static"-flagged "Set<Id>"-typed variable in the trigger handler).

And, of course, you need to program your trigger to pay attention to its own flags and avoid running expensive "consider doing ____" code against records already in that "already-did-____" set of IDs.

The interesting question is: When do you set the "I already did _____" flag?

In certain cases, one can trust that all field-values contributing to a trigger's yes/no decision of "should I do _____ to this just-saved record?" will be set the moment that the record is first saved.

In those cases, the most efficient place in your trigger code to set the "I already did _____ on this record" flag is "as soon as the trigger has seen the record for the first time, no matter whether it ends up qualifying for 'doing ____' or not."

That's how I usually write my triggers if I can, since it's the most efficient way to write the trigger.
However, in certain cases (often discovered when people test your newly-written trigger and tell you that it fails under normal usage circumstances through 3rd-party record-editing environments like external portals), the values contributing to the answer to "should I do ____ to this record?" change so that the answer goes from "no" to "yes" in the middle of the "execution context."

For example, other "triggers" or equivalent pieces of code do some post-processing to the just-saved record, and it's only after those pieces of code re-save the record that the answer flips to "yes."

In those cases, the most efficient place in your trigger code to set the "I already did ______" on this record" flag is "as soon as the trigger has determined that it needs to do ______ to the record."

This, unfortunately, will make the trigger run "Should I do _____?" checks all the way through the execution context for records that remain "no" throughout. That's why it's less efficient.

But sometimes, it's simply necessary in cases where the answer can flip from "no" to "yes" mid-execution-context.

If one is comfortable authoring / editing the triggers/processes/workflows that are responsible for such impactful mid-execution-context value-changes, sometimes it's possible to refactor them so that they're simply part of the same "trigger handler" code as the one you're in the middle of writing.
You could precisely control the order of code execution "after a record is saved" and author these actions in a way that ensure there will never be any "mid-execution-context surprise value-changes."
That might let you use the more efficient recursion-reduction pattern instead.

Sometimes, though, there's nothing you can do but choose the 2nd option.

Python for Salesforce Developers: "Just Push Play" Run-Specified-Tests Code Deployment

2017-10-16T13:18:00.001-05:00

Warning: the attached Python script makes you put a password straight into the code and helps you make an end-run around all sorts of change-management best practices.

Basically, if you're not a little horrified to see this script publicly shared, you probably don't understand what it's capable of enough to be using it -- so please dont!

That said, for circumstances in which you were going to ignore a lot of change-management issues anyway, or just want to do a "check only" deploy of code to a Salesforce org, etc., this code basically lets you type a handful of classes / pages / etc. into a Python script, type in your username & password, say which tests you want to run and how you want to deploy it, and see the results really quickly.

The idea is to be almost as handy, to a developer, for small changes, as right-clicking on code files in Eclipse and "deploying" from there -- the problem with Eclipse being that it doesn't have a "Run Specified Tests" option.

At some point I might create something similar to this that lets you log into two Salesforce orgs, download a smattering of code from one to your local hard drive (instead of already having to have downloaded it with Eclipse), and proceed with the deploy from there. That really feels like playing with laziness/sloppiness fire, though (somehow seems to take out the "Would I really have deployed this just with Eclipse, anyway?" factor).

This code is probably best just for "checkOnly=TRUE" deploys. For real deploys, it's probably still best to Run All Tests, and using a "Change Set" is a lot better for anyone else who might have to stumble into your org a few weeks later and see what's been happening as far as code deploys. (And that's just a bare minimum of version control for simply-maintained orgs.)

A few notes on the code:

Change "#'''" to "'''" around blocks of code to quickly toggle them off (no point, for example, re-logging in if your Python IDE already has your session ID in memory, and you don't want to fire up a new "deploy" just to run the "check deployment status" code again).
"thingsToAdd" is where most of the "what you need to type" exists.
You'll also need to set "username" & "password" values (including your security token appended to your password) in "toOrgLoginEnvelope" -- I recommend remembering to change it back by saving this script to your hard drive (if you do save it) with a filename that includes something like "INCLUDES PASSWORD" so, as you exit the IDE, you remember to change things back.
"inputstr" will need the actual path to where you've used Eclipse to download the files you have "ready to deploy" to on your hard drive.
"outpbase" should be somewhere easy to find and delete later, like a sub-folder on your desktop.
"deployEnvelope" will need "checkOnly" set inside the XML itself, "testLevel" set just above it, and a (brackets-and-comma-delimited) list of tests to run towards the end of the "runtests" parameter inside the parentheses that follow the XML (if n/a, use "[]").

import os
import shutil
import base64
from xml.etree import ElementTree
import xml.dom.minidom
import requests
import re   

def getUniqueElementValueFromXmlString(xmlString, elementName):
    #Extracts an element value from an XML string.
    #For example, invoking getUniqueElementValueFromXmlString('<?xml version="1.0" encoding="UTF-8"?><foo>bar</foo>', 'foo') should return the value 'bar'.
    elementsByName = xml.dom.minidom.parseString(xmlString).getElementsByTagName(elementName)
    elementValue = None
    if len(elementsByName) > 0: elementValue = elementsByName[0].toxml().replace('<' + elementName + '>', '').replace('</' + elementName + '>', '')
    return elementValue

metadataAPIVer = '40.0'

#'''
toOrgLoginEnvelope = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><login xmlns="urn:partner.soap.sforce.com"><username>{username}</username><password>{password}</password></login></soapenv:Body></soapenv:Envelope>""".format(
username='username', password='password')
tor = requests.post('https://login.salesforce.com/services/Soap/u/'+metadataAPIVer, toOrgLoginEnvelope, headers={'content-type':'text/xml','charset':'UTF-8','SOAPAction':'login'})
tosessid = getUniqueElementValueFromXmlString(tor.content, 'sessionId')
tohost = getUniqueElementValueFromXmlString(tor.content, 'serverUrl').replace('http://', '').replace('https://', '').split('/')[0].replace('-api', '')
toorgid = re.sub(r'^.*/([a-zA-Z0-9]{15})$', r'\1', getUniqueElementValueFromXmlString(tor.content, 'serverUrl'))
toapiver = re.sub(r'^.*/([0-9.]+)/[a-zA-Z0-9]{15}$', r'\1', getUniqueElementValueFromXmlString(tor.content, 'serverUrl'))
#'''

# EXAMPLE CODE:  thingsToAdd = {'classes':[''],'pages':['']}
thingsToAdd = {'classes': {'singCaps':'ApexClass','ext':'cls','toUpl':['OpportunityETLHandler','OpportunityETLTest']}, 'pages': {'singCaps':'ApexPage','ext':'page','toUpl':['OpportunityETLPage']}}
inputstr = 'C:\\EXAMPLEFOLDER\\EclipseWorkspace\\My Sandbox\\src\\'
outpbase = 'C:\\EXAMPLETEMPFOLDER\\temppkgtouploadfromeclipse\\'
outpfiles = outpbase + '\\filesbeforezip\\'
outpzip = outpbase + 'uploadme'

#'''
# BEGIN:  Code to create package
if not os.path.exists(outpbase): os.makedirs(outpbase)
if not os.path.exists(outpfiles): os.makedirs(outpfiles)
pkgroot = ElementTree.Element('Package', attrib={'xmlns':'http://soap.sforce.com/2006/04/metadata'})
for folder in thingsToAdd.keys():
    innerDict = thingsToAdd[folder]
    typesElem = ElementTree.Element('types')
    if not os.path.exists(outpfiles+folder+'\\'): os.makedirs(outpfiles+folder+'\\')
    for item in innerDict['toUpl']:
        membersElem = ElementTree.Element('members')
        membersElem.text = item
        typesElem.append(membersElem)
        shutil.copy(inputstr+folder+'\\'+item+'.'+innerDict['ext'], outpfiles+folder+'\\'+item+'.'+innerDict['ext'])
        shutil.copy(inputstr+folder+'\\'+item+'.'+innerDict['ext']+'-meta.xml', outpfiles+folder+'\\'+item+'.'+innerDict['ext']+'-meta.xml')
    namesElem = ElementTree.Element('name')
    namesElem.text = innerDict['singCaps']
    typesElem.append(namesElem)
    pkgroot.append(typesElem)
verElem = ElementTree.Element('version')
verElem.text = metadataAPIVer
pkgroot.append(verElem)
with open(outpfiles+'package.xml', 'w', newline='') as fw:
    dom_string = xml.dom.minidom.parseString(ElementTree.tostring(pkgroot)).toprettyxml(encoding='UTF-8', indent='    ')
    dom_string = '\n'.join([s for s in dom_string.decode('UTF-8').splitlines() if s.strip()]) + '\n'
    fw.write(dom_string)
# END:  Code to create package
#'''

#'''
# BEGIN:  Code to create ZIP
shutil.make_archive(base_name=outpzip, format='zip', root_dir=outpfiles, base_dir='./')
zipString = None
with open(outpzip+'.zip', 'rb') as f: zipString = base64.b64encode(f.read()).decode('UTF-8')
# END:  Code to create ZIP
#'''

# TO DO:  Figure out how to run several tests.  Maybe it's just a comma-separation?  I think it it's that in the point-and-click Change Set UI.

#'''
# BEGIN:  Code to deploy ZIP
testLevel = 'RunSpecifiedTests' # Most common values will be 'RunSpecifiedTests' or 'RunLocalTests'
deployEnvelope = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:met="http://soap.sforce.com/2006/04/metadata">
      <soapenv:Header>
         <met:SessionHeader>
            <met:sessionId>{sessionid}</met:sessionId>
         </met:SessionHeader>
      </soapenv:Header>
      <soapenv:Body>
         <met:deploy>
             <met:zipFile>{zipfile}</met:zipFile>
             <met:deployOptions>
                 <met:checkOnly>true</met:checkOnly>
                 <met:rollbackOnError>true</met:rollbackOnError>
                 {runtests}
                 <met:singlePackage>true</met:singlePackage>
                 <met:testLevel>{testlev}</met:testLevel>
             </met:deployOptions>
         </met:deploy>
      </soapenv:Body>
      </soapenv:Envelope>""".format(sessionid=tosessid, zipfile=zipString, testlev=testLevel, runtests=''.join(['<met:runTests>'+x+'</met:runTests>' for x in ['OpportunityETLTest1','OpportunityETLTest2']]) if testLevel=='RunSpecifiedTests' else '')
deploytor = requests.post('https://'+tohost+'/services/Soap/m/'+toapiver+'/'+toorgid, deployEnvelope, headers={'content-type': 'text/xml', 'charset': 'UTF-8', 'SOAPAction': 'deploy'})
# END:  Code to deploy ZIP
#'''

#'''
# BEGIN:  Code to check deploy
checkDeployStatusEnvelope = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:met="http://soap.sforce.com/2006/04/metadata">
      <soapenv:Header>
         <met:SessionHeader>
            <met:sessionId>{sessionid}</met:sessionId>
         </met:SessionHeader>
      </soapenv:Header>
      <soapenv:Body>
         <met:checkDeployStatus>
             <met:asyncProcessId>{process_id}</met:asyncProcessId>
             <met:includeDetails>true</met:includeDetails>
         </met:checkDeployStatus>
      </soapenv:Body>
      </soapenv:Envelope>""".format(sessionid=tosessid, process_id=getUniqueElementValueFromXmlString(deploytor.content, 'id'))

checkdeploytor = requests.post('https://'+tohost+'/services/Soap/m/'+toapiver+'/'+toorgid, checkDeployStatusEnvelope, headers={'content-type': 'text/xml', 'charset': 'UTF-8', 'SOAPAction': 'checkDeployStatus'})

#print(checkdeploytor.content)
#print()
print('Id: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'id'))
print('Done: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'done'))
print('Success: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'success'))
print('Status: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'status'))
print('Problem: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'problem'))
print('NumberComponentsTotal: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'numberComponentsTotal'))
print('RunTestResult: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'runTestResult'))
print('NumberTestsTotal: ', getUniqueElementValueFromXmlString(checkdeploytor.content, 'numberTestsTotal'))
#print(getUniqueElementValueFromXmlString(checkdeploytor.content, 'details'))
# END:  Code to check deploy
#'''

DemandTools MassImpact equivalent of UPDATE...SET...WHERE SQL

2017-10-09T11:53:00.000-05:00

The CRMFusion company makes powerful Salesforce-record-editing software called DemandTools.

Below are some screenshots of setting up its "MassImpact" (single-table-editing) module to do a job equivalent to running an "UPDATE...SET...WHERE" DML SQL statement against a traditional database, for data cleansing within a specific single table.

For example, to turn all values of table Contact, field Home_Phone__c that are filled with nothing but 10 digits in a row, no punctuation, into a properly formatted US phone number, you might traditionally use the following Oracle-database-friendly DML SQL:

UPDATE Contact
SET Home_Phone__c = REGEXP_REPLACE(Home_Phone__c,'^(\d{3})(\d{3})(\d{4})$','(\1) \2-\3')
WHERE REGEXP_LIKE(Home_Phone__c,'^(\d{3})(\d{3})(\d{4})$')

In DemandTools, you would set up a MassImpact "scenario" as follows:

Step 1: Tell DemandTools that you want to operate on the "Contact" object/table, and say which fields you want to be able to see the values of while you screen go/no-go on potential updates in the 3rd step.
Step 2: Tell DemandTools that you want to potentially-update all records where "Home_Phone__c" isn't null (unfortunately, you can't use a regular expression in the "WHERE" with DemandTools – but step 3 has some dummy-proofing to get around too wide a selection),
and say that you want to propose a parentheses-and-dashes-formatted replacement value for any of the returned values that consist of nothing but a string of 10 bare digits in Home_Phone__c.
Step 3: Ensure that you aren't pushing "UPDATE" calls for any records where there is no change to the value of Home_Phone__c, or where the new value of Home_Phone__c would be blank,
and skim the records to make sure your logic is doing what you thought it would,
and click "Update Records."

P.S. Just for geekiness, and to compare ease of use, here's some "Execute Anonymous" Salesforce Apex code along the same idea.
(Note: not tested at scale. Depending on your trigger/workflow/process builder load, might not actually work since it probably all runs in 1 "execution context" of post-DML CPU usage "governor limits," whereas DemandTools will run in truly separate "execution contexts" per 200 records to UPDATE.)

Map<Integer, List<Contact>> csToUpdate = new Map<Integer, List<Contact>>();
Integer csToUpdateCount = 0;
Integer currentBatch = 0;
List<Contact> cs = [SELECT Phone FROM Contact WHERE Phone <> null];
Pattern p = Pattern.compile('^(\\d{3})(\\d{3})(\\d{4})$');
for (Contact c : cs) {
 Matcher m = p.matcher(c.Phone);
 if(m.matches() == true) {
        csToUpdateCount++;
        currentBatch = (csToUpdateCount/200)+1;
        if (!csToUpdate.containsKey(currentBatch)) { csToUpdate.put(currentBatch, new List<Contact>()); }
        c.Phone = m.replaceFirst('($1) $2-$3');
        (csToUpdate.get(currentBatch)).add(c);
 }
}
if (csToUpdate.size() > 0) {
    for (List<Contact> csToU : csToUpdate.values()) {
        UPDATE csToU;
    }
}

Python To Facilitate Adding A New Field To Multiple Custom Report Types (involves XML)

2017-09-26T18:23:00.001-05:00

Quick code dump: if you download all your "report types" with Eclipse, this can help you put together "fresh" copies of them for use with the Workbench Deploy web site (just zip them up). Very handy if you just created a custom field on an object that's involved in dozens of report types.

This is an example of why it's good to be able to work with XML, too, in Python!

Hopefully I can clean up (or delete & replace) this post at a later date.

import os
from xml.etree import ElementTree
import xml.dom.minidom

def stripNSFromET(etRoot):
    etRoot.tag = etRoot.tag.split('}', 1)[1] # strip all namespaces level 1
    for el in etRoot:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1] # strip all namespaces level 2
            for el3 in el:
                if '}' in el3.tag:
                    el3.tag = el3.tag.split('}', 1)[1] # strip all namespaces level 3
                    for el4 in el3:
                        if '}' in el4.tag:
                            el4.tag = el4.tag.split('}', 1)[1] # strip all namespaces level 4
                            for el5 in el4:
                                if '}' in el5.tag:
                                    el5.tag = el5.tag.split('}', 1)[1] # strip all namespaces level 5
    return etRoot

def isTableOfInterest(tableTagText, fieldsDatabaseTableAPIName):
    if fieldsDatabaseTableAPIName == 'Contact':
        return tableTagText == 'Contact' or tableTagText.endswith('.Contacts') or tableTagText.endswith('OpportunityContactRoles')
    else:
        return tableTagText == fieldsDatabaseTableAPIName or tableTagText.endswith('.'+fieldsDatabaseTableAPIName)

outpstr = 'C:\\example\\temppackagereporttypes\\'
pstr = 'C:\\examplepath\\EclipseWorkspace\\orgfolder\\src\\reportTypes\\'
fstr = 'C:\\examplepath\\EclipseWorkspace\\orgfolder\\src\\reportTypes\\Contact_With_Campaign_Members.reportType'

metadataAPIVer = '40.0'
fieldsDBTableAPIName = 'Contact'
fieldAPINames = ['New_Field_1__c','New_Field_2__c']
fieldAPINames.sort()

# SOME NOTES OF INTEREST:
# Contact is not only referred to in the normal way, but also often as OpportunityContactRole coming off an "Opportunity."  Not sure if ALL such OCRs are actually able to take Contact fields.
# This could make it a bit difficult to decide into what "Section," exactly, to add a new database field when adding it to every ReportType:
    # A given "Section" tag within a ReportType's XML file is NOT limited to containing "Columns" elements with the same database "Table" value.  For example, Graduate_Admissions.reportType, with a base object of Contact, has a section named "Recruit data" with fields from both the "Contact.Program_Interests__r" and "Contact" tables in it.
    # A given database "Table" value can appear in multiple "Section" tags within a ReportType's XML file (e.g. "Contact" fields could be spread across multiple sections).
    # CampaignMember often ends up repeated -- once as a top-level, once below itself -- in a ReportType where it is included (with fields duplicated and everything).  Good thing I'm not yet adding any CampaignMember fields to reports.

# SOME CODE TO REMEMBER:
# The number of tables mentioned in a ReportType file:  len({e.text for e in root.findall('./sections/columns/table')})
# The actual tables mentioned in a ReportType file:  {e.text for e in root.findall('./sections/columns/table')}
# The number of "sections" in a ReportType file:  len(root.findall('./sections'))
# The labels of the actual "sections" in a ReportType file:  {e.find('masterLabel').text for e in root.findall('./sections')}
# Iterating over a dict:  for k, v in d.items():


"""
Algorithm per ReportType file, parsed into an XML-ElementTree node called "root":
- Figure out how many 1st-level nodes called "sections" there are.
- For each 1st-level "section" node, if it's "of interest," set it aside as a key to a dict called "d",
    and a list of all its relevant child 2nd-level "columns" nodes as the values.
- Loop through "d" and set aside any keys that have the most list-items as a value, of all keys in "d";
    set aside a list of any such keys as "topSections"
- If there was just 1 such "top section" key, set that 1st-level "section" node aside as the variable "se"
    (This is, visually to the end user, where we'll be adding the new field.)
- If there was more than 1, arbitrarily pick 1 and set that 1st-level "section" node aside as the variable "se"
- Once we've picked a 1st-level "section" node as "se," cache a dict "uniqueTableNames" 
    of the distinct words appearing among its grandchild (3rd-level) "table" tag values as "key"
    and the count-of-table-tag-per-distinct-word as "value."
- Presuming there were any (I guess this serves as a sort of dummy-check for the section),
    instantiate a new ElementTree "columns" node called "newColumn,"
    append "newColumn" to "se" (make it 2nd-level)
    and flesh out "newColumn" with details of the field we want to add.
    (When fleshing it out, we arbitrarily pick a value for the 3rd-level "table" tag if "uniqueTableNames" had several keys.)

"""

changedThese = []
pkgroot = None
for i in os.listdir(pstr):
    root = None
    if not i.startswith('wrt_'):
        with open(pstr+i, 'r', encoding='utf-8') as f: # Open a ReportType metadata XML file
            root = stripNSFromET(ElementTree.parse(f).getroot()) # Store a reference to the root of the XML file currently open in a variable
            d = {}
            allFieldsForTableOfInterestColsInAllSectionsOfInterest = []
            for sec in root.findall('./sections'):
                columnsOfInterest = [e for e in sec.findall('columns/table/..') if isTableOfInterest(e.find('table').text, fieldsDBTableAPIName)]
                if len(columnsOfInterest) > 0:
                    d[sec] = columnsOfInterest # Add to "d" any "section" and a list of applicable "columns" inside it
            if len(d) < 1:
                continue # This file is not of interest if nothing got added to "d" (if it's not a ReportType that includes any fields of the object of interest) -- move on to next file
            else:
                print('\r\n' + i + ', baseObject:  ' + root.find('./baseObject').text) # Display which file we are working with and what its "Base Object" is
                se = None
                if len(d) > 0: # Why did I have this at ">1" when I found it 6 months later?  Should it be >0?
                    allFieldsForTableOfInterestColsInAllSectionsOfInterest = [item.find('field').text for sublist in d.values() for item in sublist]
                    if all(fldNm in allFieldsForTableOfInterestColsInAllSectionsOfInterest for fldNm in fieldAPINames):
                        continue # This file is not of interest if all fields are already in it
                    topSections = {k for k, v in d.items() if len(v) == max([len(arr) for arr in d.values()])}
                    if len(topSections) >= 1:
                        if len(topSections) == 1:
                            se = (next(iter(topSections))) # There was only 1 top-ranked section -- pick it
                        else:
                            topSecsWithLabelLikeTableNewFieldIsFrom = [s for s in topSections if s.find('masterLabel').text in [fieldsDBTableAPIName, fieldsDBTableAPIName+'s']]
                            if len(topSecsWithLabelLikeTableNewFieldIsFrom) > 0:
                                se = (next(iter(topSecsWithLabelLikeTableNewFieldIsFrom))) # I don't really care which it is, honestly
                            else:
                                se = (next(iter(topSections))) # I tried my best -- moving on.  Just picking a section.
                    else:
                        se = (next(iter(d.keys()))) # It was a 1-section file -- just pick the one section
                if se:
                    uniqueTableNames = {tbstr : [e.text for e in se.findall('columns/table')].count(tbstr) for tbstr in [e.text for e in se.findall('columns/table')]}
                    if len(uniqueTableNames) >= 1: # We can just tack our new column onto the only section that already has other Contact values
                        changedAnything = False
                        for fieldAPIName in fieldAPINames:
                            if fieldAPIName not in allFieldsForTableOfInterestColsInAllSectionsOfInterest:
                                changedAnything = True
                                newColumn = ElementTree.Element('columns')
                                se.append(newColumn)
                                newColumn.append(ElementTree.Element('checkedByDefault'))
                                newColumn.find('checkedByDefault').text = 'false'
                                newColumn.append(ElementTree.Element('field'))
                                newColumn.find('field').text = fieldAPIName
                                newColumn.append(ElementTree.Element('table'))
                                if len(uniqueTableNames) == 1:
                                    newColumn.find('table').text = next(iter(uniqueTableNames))
                                elif len(uniqueTableNames) > 1:
                                    newColumn.find('table').text = max(uniqueTableNames, key=uniqueTableNames.get)
                        if changedAnything:
                            se[:] = sorted(se, key=lambda x: x.tag) # Put masterLabel tag back at the end of the section.  We don't need to re-sort the fields because we might screw up where people were expecting to see them, if they weren't yet in alphabetical order.
                            root.set('xmlns','http://soap.sforce.com/2006/04/metadata')
                            if not os.path.exists(outpstr): os.makedirs(outpstr)
                            if not os.path.exists(outpstr+'reportTypes'): os.makedirs(outpstr+'reportTypes')
                            with open(outpstr+'reportTypes\\'+i, 'w', newline='') as fw:
                                dom_string = xml.dom.minidom.parseString(ElementTree.tostring(root)).toprettyxml(encoding='UTF-8', indent='    ')
                                dom_string = '\n'.join([s for s in dom_string.decode('UTF-8').splitlines() if s.strip()]) + '\n'
                                fw.write(dom_string)
                            changedThese.append(i[:-11])
if len(changedThese) > 0:
    pkgroot = ElementTree.Element('Package', attrib={'xmlns':'http://soap.sforce.com/2006/04/metadata'})
    typesElem = ElementTree.Element('types')
    for x in changedThese:
        membersElem = ElementTree.Element('members')
        membersElem.text = x
        typesElem.append(membersElem)
    namesElem = ElementTree.Element('name')
    namesElem.text = 'ReportType'
    typesElem.append(namesElem)
    pkgroot.append(typesElem)
    verElem = ElementTree.Element('version')
    verElem.text = metadataAPIVer
    pkgroot.append(verElem)
    with open(outpstr+'package.xml', 'w', newline='') as fw:
        dom_string = xml.dom.minidom.parseString(ElementTree.tostring(pkgroot)).toprettyxml(encoding='UTF-8', indent='    ')
        dom_string = '\n'.join([s for s in dom_string.decode('UTF-8').splitlines() if s.strip()]) + '\n'
        fw.write(dom_string)

print('all done')

Vendor-Provided Unit Test Bug "Fix" -- Just Add "SeeAllData=true"!

2017-09-13T16:23:00.000-05:00

Wow, does one of our managed-package vendors have me riled up.

I detected a bug in their code that means an "@isTest"-flagged unit test can't INSERT a new record from various tables (including "User" records of a certain type of "Profile").

In other words, I can't write unit tests that have anything to do with several of our core tables.

I reported the bug and pointed out that:

such INSERT statements work fine against "the real database," executed in the "Execute Anonymous Window" of the Developer Console, and that
they work fine with the "Run Test" button in "@isTest(SeeAllData=True)"-flagged unit tests (a big no-no), but that
as soon as you wrap such an INSERT statement in a test method/class annotated simply with "@isTest," it fails against various closed-source-code triggers in their managed package.

Clearly, their triggers have some weird bugs that they never detected because they "cheated" when writing all their unit tests by adding "(SeeAllData=True)" to "@isTest" (that's the case in all the open-source work they custom-developed for us).
Their codebase doesn't seem to be able to hold up to an empty database the way Salesforce code is supposed to, so it seems they simply make sure they're never showing it off to the customer on an empty database and have all unit tests look at "actual data" in the database.
Lovely.

So what'd they do in response to my problem?

Added "(SeeAllData=True)" to my proof-of-bug code and told me everything was all better.

*glares*

No.

It. most. definitely. is. not.

P.S. In the 29 new lines of code you "added" to my proof-of-bug unit test as well, vendor, you exposed a 2nd broken closed-source trigger when I turned off your "(SeeAllData=True)." So thanks for that, I guess -- saved me opening a 2nd ticket? *cough* #NotThatYouShouldHaveFoundItYourselfOrAnything

Python & Oracle SQL Hash + Regexp Replace Equivalents -- random note to self

2017-07-14T14:51:00.002-05:00

Oracle SQL, for pulling from ERP:

standard_hash(regexp_replace(COLUMNNAMEGOESHERE, '\s', ''), 'SHA1')

Python, for cleaning up old values already loaded from ERP into Salesforce:

[hashlib.sha1(str(val).encode('utf-8')).hexdigest().upper() for val in (PANDASDATAFRAMEVARIABLENAMEGOESHERE['COLUMNNAMEGOESHERE'].str.replace(r'\s', ''))]

Note to self, lest I forget. This strips the whitespace (including line breaks) out of text, then condenses it into a 40-character hash of itself. Used for turning a big chunk of text that I expect to be unique in a table into a shorter Salesforce "unique external ID" text field when the table containing the text doesn't have a standard "primary key."

REST APIs and Salesforce - A Primer

2017-07-06T16:51:00.002-05:00

A non-programmer colleague was getting advice to integrate an external database with Salesforce by building a "REST API" inside of Salesforce, and I chimed in because for a long time, 3 different ways of referring to "REST APIs" and "Salesforce" confused me.

Here's a breakdown of them that I hope will facilitate discussions with developers and help you learn the right kind of programming for your needs, if you're teaching yourself:

When people mention "building REST APIs inside of Salesforce," they mean using the "Apex" language to program a specific Salesforce database ("org") to ACCEPT requests FROM THE OUTSIDE WORLD to engage in a computer-to-computer data-transferring conversation over the web.

The data can potentially flow both ways (more in a second), but only via the outside world knocking on your Salesforce org's door and saying, "Hey, I've got another communications request!" in the exact manner that you've said you want your Salesforce org to be communicated with.

That's what "writing an API" means -- it means using code to define how a computer is willing to have its door knocked on and be talked to.

When you "build a REST API" inside of Salesforce, you're building a door in your fortress and posting a guard by it who's cautious about how she interacts with strangers on the other side.
- You might program your Salesforce org to know what to do with data from the outside world, but only when delivered correctly (e.g. "If you have something for the fortress, bring it in a series of pink 12x12x12 boxes, only paint the first one green and the last one purple, and put your name and return address on all of them -- if we like them, we'll figure out what to do with the stuff inside the boxes from there, and after we're done, we'll post a receipt on the door telling you whether we handled them neatly or burned them").
- Or you might program your Salesforce org to hand out data from inside the fortress to the outside world, if asked just right ("Tap out 'shave-and-a-haircut-two-bits,' then the morse code for your name and return address, and a description of the data you want, and we'll FedEx a list to you").
Or you might set up communications protocols for each, to really facilitate 2-way conversation--yet note that every piece of that conversation is initiated by some "outsider" piece of software knocking on your fortress's door in a specific way, no matter which way the "data you care about" is flowing.
There's ALSO a general-purpose API, built into every Salesforce database ("org") in existence, that uses the "REST" communications protocol. Salesforce defines and maintains how it works. It's a door that Salesforce the corporation cut into your fortress and posted a guard at.

This "API" is what the authors of "ETL" tools like Apex Data Loader & dataloader.io & Jitterbit & Zapier study carefully before writing such tools.

Technically, you can write your own tool taking advantage of this protocol, but most people don't bother, since other companies have done such great work.

Also, if you do option #1, you can sometimes INSERT more data in fewer "API calls" than by using tools that leverage the general-purpose Salesforce "REST" data API.

Both option #1 & option #2 give you the chance to write code inside your Salesforce database that post-processes incoming data according to your business's needs -- although in the case of approach #1, you might bundle that logic into the back end of the API for efficiency, whereas in the case of approach #2, you have to use "triggers" and such.
Finally, you can write Apex code within your Salesforce org that goes knocking on the doors of OTHER fortresses.

Doing so might involve writing Apex that uses the "REST" protocol and someone else's "API," but it's not at all the same thing as the 1st concept we discussed.

When learning how to do this, you'll often hear terms like "writing asynchronous Apex" and "making HTTP requests with Apex."

Thoughts:

Most admins' approaches to connecting an outside database land in the realm of approach #2, using someone else's "ETL" tool -- maybe with some "developer" help to write "triggers" to post-process incoming data according to your business needs. It's the easiest approach by far.

Some "apps" and "packages" and "plugins" might have done #1 as part of the software package they deliver to you (that is, they might cut a hole in your fortress and post a guard).

Personally, I've barely dabbled in "approach #2, roll-your-own-ETL-tool," and I've only been to classes and read books on approaches #1 & #3.

Watching from the outside, it seems to me that "approach #1" & "approach #2, roll-your-own-ETL-tool" achieve pretty similar business-need ends and require similar levels of developer skill.

It's just a question of whether you prefer to be constrained to programming in Salesforce's language, yet have more control over Salesforce database performance (approach #1),
Or whether you prefer to be a little less efficient, but have more control over the language in which you program ("approach #2, roll-your-own-ETL-tool").

(Although even in the case of approach #1, you'll probably be writing some sort of "ETL" software in your favorite language that does the actual knocking on your Salesforce database's ("org's") door -- it'll just probably be a lot simpler to write than whatever you would've written that relies solely upon Salesforce's standard, built-in API.)

Finally, a brief analogy for any Oracle developers following along:

REST-based data transmission : data flow in & out of Salesforce databases :: ODBC/JDBC-based data transmission : data flow in & out of traditional relational databases

Specifically, the built-in "approach #2" to using REST might be most comparable -- not sure how many database management systems let you write your own APIs into them (#1) or make callouts to internet services from them (#3).

Neat PDF About When To Consider Python

2017-06-28T16:01:00.001-05:00

Neat poster here (PDF) by Roxanne Johnson indicating when to consider Python for data projects, when to stick with Excel, how to find self-help resources if you delve into Python, etc.

Large page and small print, so zoom in.

Job-Hunting Advice For Career Changers -- Talking About Transferable Skills

2017-05-24T15:05:00.001-05:00

I'm mentoring a student worker as she searches for jobs after finishing her graduate degree at our university. She came to graduate school full-time to learn skills in a new field, which means she has recently learned a number of new programming languages in an academic setting, rather than on the job. Unsure how to keep recruiters on the phone and find some way to make her several years of work experience before and during graduate school "count" towards a job using these programming languages, she asked me if I had any advice. Here are some tips I gave her.

The written job description is gold to you.
- The bottom half usually describes the technical skills you're afraid of seeming too new at. Don't get too wrapped up in it.
- The top half usually describes the job itself. It's where a career-changer like you will find the most "source material" for talking about why you're qualified to do the job.
  - You wish they would "give you a chance" to "learn on the job," right?
    - Well, they won't. Not if you put it like that.
    - But if you point out how you've done things just like that, only different, but pretty much the same, and HERE'S HOW -- well, now you sound like someone ready to "jump into the deep end of the pool," don't you?
      Welcome to the world of "selling your transferable skills."
  - The "what this position does" "top half" of a job description is where you find out exactly what you need to say your prior experience is "just like," and where you'll figure out the difference between your prior experience and the future job (so that you can explain what other transferable skills you have that address the difference).
- If you don't have a written job description because a recruiter is summarizing a job orally and focuses on the "bottom half," confidently direct them to tell you about about the "top half" so you can direct the rest of the phone call from a position of strength.
  - Example: "Ah, yes, I have worked with those technologies in my graduate coursework, and I spent a year at ___ and three years at ___ doing related work with [related technology or transferable experience/skills] -- that might be a good fit. Can you tell me more about the job responsibilities and day-to-day work?"
Rehearse, rehearse, rehearse. Practice makes perfect.
- Treat this like you would ask a child to treat a role in a "school play."
  1. Memorize the lines
  2. Look in the mirror and practice the lines -- master your projection of confidence and your ability to remember your lines while looking someone in the eyes / get over any sense of disgust at watching yourself.
  3. If you have time, look in the mirror again and practice making your lines your own -- practice being easy and relaxed and you while delivering them.
Of course, you have to write your own lines -- you're the playwright as well. But you already know what the play is going to be about -- it's going to be about the job in the advertisement and about you! So write that script!
- Print a copy of the job description and a copy of your resume. Maybe grab some spare paper if there isn't a lot of room in the margins.
- Look through the "job duties" and TRANSLATE them from the way they're written to "something your grandmother would understand when she asks what you do for a living."
- Look through the "job duties" and find the IMPLIED job duties. (For example, if you have to "deliver" something, implied transferable skills & experiences might be "ahead of schedule" or "to very important people." Another example -- if it talks about "different" people you'll be working with ... implied is, "You'll be pulled in a million different directions and need to be able to handle it politely." If you've done that before ... yay, you have experience in something they need!)
- Now, for all the plain-English and implied real job duties, figure out what you've already done, at school or at work, and write notes to yourself.
- Finally, go through your notes and put together little stories about how great at the job you'd be, mentioning that you see they need the position to do ____, and how that sounds like the time in ____ when you _____ed.
  - Repeat until you're out of such stories -- then start this process over for the next job advertisement!
  - You might want to have a few different versions of any story that's too technical -- the "highly technical" (but still 60 seconds or less) version and the "explain it to a 6-year-old in 15 seconds or less" version. Always start by sharing the "explain it to a 6-year-old, 15 seconds or less" version of a story, only going into the "highly technical" version if pressed for more details.
  - When talking about how you have ___ experience on the job and ___ experience at school, don't leave it to the interviewer to infer that the two can be added together into a competence. This is YOUR JOB and is why you're rehearsing. If you'd like, you can start with phrases like, "In my experience" or "Given my experiences on the job and in my coursework," which have 3 advantages:
    1. You don't have to worry about seeming like you're being contradictory -- it's just your opinion. So it's polite / deferential.
    2. It's hard for someone to challenge the accuracy of. It's just your opinion. So it's confident (because it's hard to challenge).
    3. It brings attention to the fact that you have experience and that you can put all of your experience together into one big package that's useful to them.
This same process can also be used when deciding what to write in a cover letter -- only you have to be much, much more concise and not tell as many stories.
Finally, if you're worried about professional "word choice" when answering difficult questions, read Alison Green's 'Ask A Manger' advice blog religiously. Eventually, you'll pick up on her style, which I think is excellent.

Good luck!

Python CSV Example - Back-Filling Nearby Zip Code Distances

2017-05-24T13:45:00.000-05:00

Today I used the following small Python script to add a bunch of distances to a new field called "Miles_From_Us__c" on 40,000 pre-existing Contact records.

The process was as follows:

Export "SELECT ID, MAILINGPOSTALCODE, MILES_FROM_US__C WHERE MAILINGPOSTALCODE != NULL AND MAILINGPOSTALCODE != '' AND MILES_FROM_US__C = NULL" from our Salesforce org to a CSV file on my hard drive at "c:\tempexample\contactzipextract.csv" using the "Apex Data Loader" software.
Run the Python script in this blog post instead of bashing my head against the wall with Excel.
Import the CSV file on my hard drive at "c:\tempexample\contactzipupdate.csv" back into our Salesforce org as an "UPDATE" operation using the "Apex Data Loader" software.

import pandas

zips = ['11111','22222','33333','44444']

zipstodist = {'11111':0, '22222':1, '33333':9, '44444':21}

df = pandas.read_csv('c:\\tempexample\\contactzipextract.csv', dtype='object')

df = df[df['MAILINGPOSTALCODE'].str[:5].isin(zips)]

df['MILES_FROM_US__C'] = df['MAILINGPOSTALCODE'].str[:5].apply(lambda x : zipstodist[x])

df[['ID','MILES_FROM_US__C']].to_csv('c:\\tempexample\\contactzipupdate.csv', index=0)

Here's an explanation of what my script is doing, line by line, for people new to Python:

Says that I'll be using the "Pandas" plug-in to Python.
Says which U.S. postal codes ("zip codes") I care about as a Python "list" that I give a nickname of "zips."
- I got this from the first column of a CSV file of "zip codes of interest" that a consultant sent one of my users.
Defines the "distance from us," in miles, of each zip code of interest, as a Python "dict" that I give a nickname of "zipstodist."
- I got this from the first and second columns of a CSV file of "zip codes of interest" that a consultant sent one of my users.
- I could have cross-checked the two CSV files in Python, but I had already transformed the CSV data into this list style when writing some Salesforce Apex code for future maintenance, so copying/pasting it into Python was easier in this case.
Reads my CSV file into Python as a Python "Pandas dataframe" that I give a nickname of "df."
Filters out any rows whose first 5 characters of the value in the "MAILINGPOSTALCODE" column aren't among my zip codes of interest.
Sets each row's"MILES_FROM_US__C" to the distance corresponding to the first 5 characters of that row's "MAILINGPOSTALCODE" value.
Exports the "ID" & "MILES_FROM_US__C" columns of my filtered, transformed dataset out to a new CSV file on my hard drive.

A Brief UPDATE Script: Oracle SQL vs. Salesforce Apex+SOQL

2017-04-25T13:26:00.002-05:00

In a copy of Salesforce using EnrollmentRx, we are capturing the details of every submission-from-a-student on a table attached to "Contact" (PK-FK) called "Touch_Point__c."

When such a "Touch_Point__c" record is created, if it is the first created for a given "Contact," a trigger copies its "Lead_Source__c" value over to the corresponding "Contact" record's "LeadSource" field.

Midway through an advertising campaign, a decision was made to change the string used for a certain departmental landing page's "Lead_Source__c" value from "Normal Welcome Page" to "Landing Page."

We'd caught up on back-filling "Lead_Source__c" values on old "Touch_Point__c" table records.

However, we hadn't yet back-filled the corresponding "LeadSource" fields on "Contact" in the case where such "Touch_Point__c" records had been the first in existence for a given "Contact."
(We wanted to leave "Contact" records alone where none of the altered-after-the-fact "Touch_Point__c" were actually the first "Touch_Point__c" record for the "Contact.")

I wrote a little script that's the equivalent of a complex "UPDATE" statement and am sharing it here for colleagues from the Oracle SQL world.
(Please excuse any typos or inefficiencies -- the real data set was small, so I didn't care about performance, and I didn't actually run the Oracle.)

Here's some Oracle SQL that I believe would've done the job, if Salesforce were a normal Oracle database:

UPDATE Contact
SET LeadSource = (
    SELECT
        Lead_Source__c
    FROM Touch_Point__c
    INNER JOIN (
        SELECT Contact__c, MIN(CreatedDate) AS MIN_CR_DT
        FROM Touch_Point__c
        GROUP BY Contact__c
 ) qEarliestTP
    ON Touch_Point__c.Contact__c = qEarliestTP.Contact__c AND Touch_Point__c.CreatedDate = qEarliestTP.MIN_CR_DT 
    WHERE Contact.Id = Touch_Point__c.Contact__c
 AND Lead_Source__c='Landing Page'
 AND Dept_Name__c='Math'
 AND utm_source__c is not null
 AND extract(year from CreatedDate) >= extract(year from current_date)
)
WHERE Contact.Id IN (
    SELECT
        Touch_Point__c.Contact__c
    FROM Touch_Point__c
    INNER JOIN (
        SELECT Contact__c, MIN(CreatedDate) AS MIN_CR_DT
        FROM Touch_Point__c
        GROUP BY Contact__c
 ) qEarliestTP
    ON Touch_Point__c.Contact__c = qEarliestTP.Contact__c AND Touch_Point__c.CreatedDate = qEarliestTP.MIN_CR_DT 
    WHERE Contact.Contact__c = Touch_Point__c.Contact__c
 AND Lead_Source__c='Landing Page'
 AND Dept_Name__c='Math'
 AND utm_source__c is not null
 AND extract(year from CreatedDate) >= extract(year from current_date)
)
AND Contact.LeadSource='Normal Welcome Page'

Here's the Salesforce Apex code (with embedded SOQL) I wrote to do the job instead, since Salesforce doesn't give you a full-on SQL-type language.**

// Loop through every record in the "Touch_Point__c" table, setting it aside in a map, keyed by its foreign key to a the "Contact," if it is the earliest-created "Touch_Point__c" for that Contact
Map<Id, Touch_Point__c> cIDsToEarliestTP = new Map<Id, Touch_Point__c>();
List<Touch_Point__c> allTps = [
  SELECT Id, Contact__c, CreatedDate
  FROM Touch_Point__c
  ORDER BY Contact__c, CreatedDate ASC
 ];
for (Touch_Point__c tp : allTPs) {
 // The "ORDER BY" in allTPs should make this logic short-circuit at the first half of the "IF," but 2nd half will dummy-check if the list is, for some reason, out of order.
    if (!cIDsToEarliestTP.containsKey(tp.Contact__c) || cIDsToEarliestTP.get(tp.Contact__c).CreatedDate > tp.CreatedDate) {
        cIDsToEarliestTP.put(tp.Contact__c, tp);
    }
}

// Loop through every "landing page visit"-typed record in the "Touch_Point__c" table, updating a modified in-memory copy of the record in the "Contact" table it references to a list of "Contact" records called "csToUpdate" ONLY IF the "landing page"-related TouchPoint is also the "earliest-created" TouchPoint for that Contact record -- then call a DML operation on that in-memory list to persist it to the database.
List<Contact> csToUpdate = new List<Contact>();
List<Touch_Point__c> mathLandingTPs = [
  SELECT Id, Contact__c, Lead_Source__c, Contact__r.LeadSource, Dept_Name__c, utm_source__c, CreatedDate
  FROM Touch_Point__c
  WHERE Lead_Source__c='Landing Page'
  AND Contact__r.LeadSource='Normal Welcome Page'
  AND Dept_Name__c='Math'
  AND utm_source__c<>null
  AND CreatedDate>=THIS_YEAR
 ];
for (Touch_Point__c tp : mathLandingTPs ) {
    if (cIDsToEarliestTP.containsKey(tp.Contact__c) && cIDsToEarliestTP.get(tp.Contact__c).Id == tp.Id) {
        csToUpdate.add(new Contact(Id=tp.Contact__c, LeadSource=tp.Lead_Source__c));
    }
}
UPDATE csToUpdate;

Oracle programmers, I imagine your colleagues might yell at you if you used PL/SQL to hand-iterate over smaller SQL queries in Oracle rather than using native SQL do the work for you. In Salesforce, that's simply the way it's done.

**Note that more complex code might be required -- e.g. you might have to run the same code several times with a row-count limit on it -- since Salesforce is pretty picky about the performance of triggers fired in response to a DML statement. (A normal Oracle database has its limits, too, of course, but they're likely far less strict if it's your own in-house database than with Salesforce.)

Python for Salesforce Administrators - Introduction to XML and JSON

2017-04-04T12:33:00.001-05:00

XML and JSON are like each other, but not like CSV

We've talked about how useful Python can be for processing table-style data stored in "CSV" plain-text files.

The key properties of table-style data are that:

The "table" always has a certain number of columns
Every single row in the table has the exact same set of "columns" (the exact same "keys") as every other row and as the table at large
Every single row in the table is capable of having a value in each of these columns ... a "blank" placeholder still needs to be indicated if it doesn't.
Every single row in the table can only have one value in each of these columns. (If you have only one "First Name" column, no row can have two first names.)
Each conceptual "item" in the data is represented by a "row"
Each conceptual "item" (row) can have no more than 1 level of "key-value" properties (a "column header" being the "key" and a given cell beneath a column header, in a specific row, being the "value").
Example: There's no such thing as having the notion of a "Name" that breaks down at a lower level into "First Name" and "Last Name."
Sure, the database you're exporting a "CSV" file from might have an automatically computed / "formula" field called "Name" that is just a space-separated merge of "First Name" and "Last Name." But the exported "CSV" file itself -- a plain-text representation of your data -- will show "Name," "First Name," and "Last Name" as independent "columns" whose values are at an equal "level" to each other.

There are other styles of data that can be stored in plain-text files as well.

The two main problems with table-style data that alternative textual representations of data try to get around are:

Giving each conceptual "item" in the data key-value properties that are "nested" inside each other
Letting each conceptual "item" in the data have "keys" that have nothing to do with the "keys" that other conceptual "items" in the data have

A plain-text file where punctuation indicates the start/end of each conceptual "item" in the data, and where the "keys" (and their values) inside of each conceptual "item" are also indicated by careful use of punctuation, can handle both of these requirements.

The two most common formats today are "XML" and "JSON." Plain-text exports of your current Salesforce configuration are often formatted in either of these styles.

In both formats, conceptual "items" can have more conceptual "items" nested inside of them.
In both formats, there is a way (in XML, 2) of defining the "keys and their values" possessed by each conceptual "item"
In XML, the word for the thing representing a conceptual "item" is an "element."
In JSON, the word for the thing representing a conceptual "item" is an "object."
Despite the "element" vs. "object" linguistic difference, XML and JSON represent the same type of data (nested data where each conceptual "item" gets to define its own "keys" and specify values for them).

We'll have a lot of examples in this post.

To view my XML examples graphically, paste them here and click "Tree View".
To view my JSON examples graphically, paste them here and click "Tree View".

XML

The punctuation that XML uses to define the beginning and end of an "element" is a "tagset." It looks like this:

<Person></Person>

As you can see, it's the same word, surrounded by less-than and greater-than signs, with the one indicating the "end" of the element starting with a forward-slash.

Each piece in the greater-than or less-than signs is considered a "tag," hence "tagset" for the notion of including them both (kind of like "parenthesis" versus "a set of parentheses").

The fact that the tagset exists in your text file means that it exists as a conceptual "item" in your data.

It doesn't matter whether or not there's anything typed between the tags (after the first greater-than, before the last less-than). This is now a conceptual item that "exists" in your data, simply because the tagset exists.

If it doesn't have anything between the tags, you can think of it a little like a row full of nothing but commas in a CSV file. It's there ... it's just blank.

(In fact, there's even a fancy shortcut for typing such tagsets: this single tag is equivalent to a tagset with nothing in the middle like the one above -- note the forward-slash before the greater-than sign:)

<Person/>

Note that already, though, our "blank" element has one big thing different about it than a row in a CSV file does: it has a name! Rows don't have names. We'll come back to this, but this is why XML has two ways of indicating an element's "keys and their values." By giving each element a name, XML allows the element itself to be used as a "key" definition for the larger context inside of which the element finds itself.

Here's an example:

<Shirt>
 <Color></Color>
 <Fabric></Fabric>
</Shirt>

There are 3 conceptual "items," or "elements," in this data, each of which has a name.

All 3 can stand alone as "elements" in the grammar of XML. Analogy:

You can write an English sentence that has multiple complete sentences inside of it; to write a sentence with multiple complete sentences inside, simply separate the two with semicolons.

However, the fact that the elements named "color" and "fabric" are nested between the tags of the element named "shirt" means that they are also indicating that this particular shirt has keys named "color" and "fabric" (the values to both of which are currently blank).

The line breaks and tabs aren't necessary in XML (even for saying where "color" stops and "fabric" begins), but they help humans read XML.

Now might be a good time to show you the other way of indicating that a particular shirt has a "color" and "fabric," but that their values are blank:

<Shirt Color="" Fabric=""></Shirt>

Or, in shortcut notation, since there's now nothing inside the "Shirt" tagset:

<Shirt Color="" Fabric=""/>

Note that this isn't always treated EXACTLY the same as our nested-tag example when it comes to programming languages that read XML. Some software might argue that there's more of a "nothingness" in the nested-tags example (it truly doesn't have a color), and there's more of a "value without any letters in it"-ness in the inside-the-Shirt-opening-tag example. Just sometimes, though, and that's often you, the programmer, deciding to make that distinction.

The big difference, though, is that in this case, "color" and "fabric" are not standalone elements.
They are "attributes" of the element named "Shirt".

You can't put more standalone "elements" between the quotes after the "=" of an "attribute." You're done. Only a plain-text value can go there.

You can't do this:

<Shirt Color="" Fabric="<Washable></Washable>"></Shirt>

But you can do this:

<Shirt>
 <Color></Color>
 <Fabric>
  <Washable></Washable>
 </Fabric>
</Shirt>

You also can't give any element more than one "attribute" of the same name, whereas you can nest as many same-named "elements" inside of an element as you like.

You can't do this:

<Shirt Color="" Color="" Fabric=""></Shirt>

But you can do this:

<Shirt>
 <Color></Color>
 <Color></Color>
 <Fabric>
  <Washable></Washable>
 </Fabric>
</Shirt>

Those are the main differences between the two ways XML gives you to define key-value pairs on an element.

Attributes are, by definition, "the end of the line" when it comes to the "key" definitions attached to an element (and can't conflict with each others' names)
The names of elements nested inside an element also serve as "key" definitions for the outer element, but they're "fuzzier" than attributes.
This is usually considered a good thing.
This desire for "fuzziness" & "repeatability" & "nested-ness" is one of the two reasons people choose a "nested" data format instead of a table-style format to represent their data.

A word of warning: this is also valid code:

<Shirt Color="" Fabric="">
 <Color></Color>
 <Color></Color>
 <Fabric>
  <Washable></Washable>
 </Fabric>
</Shirt>

A human might look at the shirt above and think it has 3 colors and 2 fabrics. It's probably better to think of it the way the computer thinks of it -- that the shirt above has 1 color attribute, 1 fabric attribute, 2 full-on elements nested within it each named "color," and 1 full-on element nested within it named "fabric."

Now let's give our shirt some "values!"

First of all, it's essential to remember that in a way, all these example' elements "keys" already had values. The values for the "keys" were just blank, or they were other elements**.

**(Think about the examples where an element named "washable" was nested inside of an element named "fabric" which was nested inside of an element named "shirt." The value for the "shirt" element's "fabric" key wasn't exactly blank -- the value for that key was more like: "an element called 'washable.'")

But when I say "values" for keys like "color" or "fabric" or "washable," you're probably thinking about things like the word "blue" or the word "red" or the word "leather" or the word "yes" or the word "no." So let's talk about those.

In XML, any given "element" can have exactly 0 or 1 plain-text "value" (like "leather" or "blue") between the tags that show where its boundaries are.
The only other thing that can go "inside" the element besides its (optional) plain-text "value" is more elements.

Here's a really simple element with a plain-text value:

<Shape>
 Rectangle
</Shape>

It's not common for the "outer-most" elements in an XML-formatted piece of data to have values--especially because valid XML always has just 1 outermost element. (If you don't care, you can just make up a name like "RootElement" for the tagset that holds all the "elements" you actually think of as your data.).

But even sometimes "2nd-outer-most" elements don't have values. Particularly when they represent some sort of abstract real-world object with a lot of complexity that you want to capture, they have 0 values but a lot of elements nested inside them.

Although you could describe a fleet of cars like this:

<RootElement>
 <Car>
  First car's Vehicle Identification Number here
 </Car>
 <Car>
  Second car's Vehicle Identification Number here
 </Car>
</RootElement>

The above code implies that the conceptual "items" that you've given names of "car" truly are their Vehicle Identification Numbers. Yet they're really not, are they? They're heavy chunks of steel taking up space in the real world. There isn't really a plain-text value that captures what they are. Therefore, you won't really see a lot of XML like that. Although the word "Car" is, technically, a "key" to each "element's" "value," in this case, it doesn't quite make sense to give "Car" a "value."

Here's a more realistic way of writing the data, using nested elements to show that each car has 1 "key" of "VIN" and that the "value" for that "VIN" key is filled in on both cars:

<RootElement>
 <Car>
  <VIN>
   First car's Vehicle Identification Number here
  </VIN>
 </Car>
 <Car>
  <VIN>
  Second car's Vehicle Identification Number here
  </VIN>
 </Car>
</RootElement>

Here's another realistic way of expressing the same idea, only using attributes to show each car's key & values:

<RootElement>
 <Car VIN="First car's Vehicle Identification Number here">
 </Car>
 <Car VIN="Second car's Vehicle Identification Number here">
 </Car>
</RootElement>

Or, for short (using attributes):

<RootElement>
 <Car VIN="First car's Vehicle Identification Number here"/>
 <Car VIN="Second car's Vehicle Identification Number here"/>
</RootElement>

In this little example, we're actually just dealing with multiple conceptual "items," each of which has the exact same keys as each other, and which has just 1 value per key, so remember that table-style (CSV) data could've easily represented the same data -- in this case, we've just got a 1-column CSV file:

"VIN"
"First car's Vehicle Identification Number here"
"Second car's Vehicle Identification Number here"

I digress -- but it's good to recognize what's going on in your data, and which types of plain-text files are capable of representing it.

Going back to our shirt example, let's say that our data set includes just 1 shirt, that it's "blue and red and green" and made out of "leather and cotton" and that the leather isn't washable but the cotton is.
Our XML representation of our data might look like this:

<Shirt>
 <Color>
  Blue
 </Color>
 <Color>
  Red
 </Color>
 <Color>
  Green
 </Color>
 <Fabric>
  Leather
  <Washable>
  No
  </Washable>
 </Fabric>
 <Fabric>
  Cotton
  <Washable>
  Yes
  </Washable>
 </Fabric>
</Shirt>

There isn't really a good way to represent that concept of what traits the shirt possesses in a single row of a CSV file, is there? This is where XML and JSON shine!

Read the XML above carefully. What you have is:

1 element with a name of "shirt" that has 0 plain-text "values," but has 5 more elements nested inside of it
3 elements with a name of "color" (nested inside the one named "shirt"), none of which have any elements nested inside of them, but each of which have 1 plain-text value
2 elements with a name of "fabric" (nested inside the one named "shirt"), each of which have exactly 1 plain-text value, and each of which also have 1 more element nested inside of them
2 elements with a name of "washable" (nested inside various ones named "fabric"), none of which have any elements nested inside of them, but each of which have 1 plain-text value

There also isn't really a good way to represent this shirt using "attributes" on the "shirt" tagset (because it has multiple colors and multiple fabrics, and because the fabrics have nested elements of their own). However, since each "fabric" only has exactly 1 "washable" key & value, you could use attributes for that as follows:

<Shirt>
 <Color>
  Blue
 </Color>
 <Color>
  Red
 </Color>
 <Color>
  Green
 </Color>
 <Fabric Washable="No">
  Leather
 </Fabric>
 <Fabric Washable="Yes">
  Cotton
 </Fabric>
</Shirt>

The choice is up to you, depending on which way you think it's easier to fetch/modify the values using code and which way you think it's easier for humans to read.

Another choice that's up to you is whether "Leather" is what the fabric truly is (the way "blue" is an adjective and therefore describes what the color truly is), or whether the notion of a fabric is too fuzzy in the real world to capture in a single word (like with our car) and should've been a key-value pair with a key like "name."
It's the same choice we had to make when deciding whether a car was its VIN or whether it had a VIN.
Outer-ward elements representing complex concepts usually just have key-value pairs (like with our car or our shirt examples).
For elements at "deeper" levels of nesting, you'll need to decide whether they "are" something (the optional 1 plain-text value they get) or whether they merely "have" things (nested elements & attributes).
That's a judgment call for you to make based on how your data is going to be used. All organization involves judgment calls trading flexibility against simplicity.
When it comes to writing software to process XML someone else already wrote, it's good to be able to recognize which judgment call they made (because the programming-language commands for extracting the two styles of writing key-value pairs are different).

JSON

The punctuation that JSON uses to define the beginning and end of an "object" is a set of "curly braces." It looks like this:

{}

Q: Hey waiddaminute -- that "object" doesn't have a name! I thought you said JSON "objects" and XML "elements" were pretty equivalent in terms of both being representations of conceptual "items" in the same style of organizing data!
A: Similar. But not the same. Good catch. JSON doesn't give objects a name. Nor do they get an optional "plain-text value" standing apart from anything nested inside of them. JSON objects look pretty different from XML elements.

Also, you can't just put JSON objects back-to-back the way you can put XML elements back-to-back; this isn't valid JSON:

{}
{}

Instead, you have to put them inside square-brackets and separate them with commas (remember not to put a comma after the last one, since nothing comes next--easy copy/paste mistake when you're putting each one on its own line). This is how you show 2 JSON objects at the same level as each other:

[
 {},
 {}
]

Note that the line breaks and tabs, however, are still for human benefit only.

Also, you don't have to include them inside any sort of "RootElement" container. That right there is valid JSON.

But getting back to our "JSON objects don't have names" problem ... what's the equivalent of this XML in JSON?

<Person/>

There isn't an exact translation, but one representation could be:

{
 "type": "Person"
}

In other words, you're making up an "attribute" (a "key") for the JSON object, calling it "type," and giving it a "value" of "Person" (yup, JSON objects have attributes, and like XML element attributes, you can only use a given attribute-name once!) You could have called it anything -- "type" is nothing special.

Similarly, either this XML:

<Shirt>
 <Color></Color>
 <Fabric></Fabric>
</Shirt>

Or this XML:

<Shirt Color="" Fabric=""></Shirt>

Might become this JSON:

{
 "type" : "Shirt",
 "Color" : null,
 "Fabric" : null
}

(Though, getting back to that thing I mentioned earlier about whether an empty tagset is somehow "emptier" than an empty attribute set of quotes, you might argue that the values of this JSON object's "Color" & "Fabric" attributes/keys should be two quotes in a row, rather than the special keyword 'null'. More than I want to get into right now, but software than processes JSON would see the two differently -- null is emptier than the empty-quotes.)

The biggest difference from XML that the lack of names in JSON introduces is this notion of having to make up your own keyword for the name if you really think it needs a name.

Also, as far as how-to-type-JSON, "attribute" names & values on a JSON object are separated from each other with a colon, and there needs to be a comma between attribute name-and-value pairs. (Again, don't forget not to put a comma after the last attribute name-value pair!)

Furthermore, attribute names are often inside quotes in JSON, and the value can be something that doesn't have quotes around it (we haven't gotten there yet).

Sometimes your data might not need names. If you have a bunch of conceptual "items" back-to-back at the same level of nesting, and none of them need a "plain-text value" representing what they truly are, and all of them just have key-value pairs describing what they "have," JSON is a lot shorter to type. (Especially if you take out all the tabs & line breaks I'm putting in to make this blog readable. JSON authors love to take out line breaks & tabs -- if you run into such code, paste it here and click "Beautify" to read it more easily ... just make sure the "code" you're punching in isn't confidential company information!) Consider this example.

Here's some XML representing a fleet of cars, each of which have different sets of key-value traits we care about tracking (we'll use "attribute" style and "tagset-with-nothing-inside shorthand here), but all of which are cars.

<RootElement>
 <Car color="blue" trim="chrome" trunk="hatchback"/>
 <Car appeal="sporty" doors="2"/>
 <Car doors="4" color="red" make="Ford"/>
</RootElement>

Maybe we already know they're all cars based on the context of our data. A JSON equivalent, without forcing each one to have a silly attribute like "type" (with a value of "car"), could be:

[
 {
  "color" : "blue",
  "trim" : "chrome",
  "trunk" : "hatchback"
 },
 {
  "appeal" : "sporty",
  "doors" : "2"
 },
 {
  "doors" : "4",
  "color" : "red",
  "make" : "Ford"
 }
]

So far, JSON doesn't look much more concise. But that note I made about "attribute values in JSON don't have to be in double-quotes" earlier is where its power really lies. Let's take a look at a shirt with two colors (blue, red) and 1 fabric (nylon). Here's the XML:

<Shirt>
 <Color>blue</Color>
 <Color>red</Color>
 <Fabric>nylon</Fabric>
</Shirt>

And here's some similar JSON (note that I didn't bother to force it to be called "shirt" and that I decided that saying "colors" would make more sense than "color"):

{
 "Colors" : ["blue","red"],
 "Fabric" : "nylon"
}

We're using the same square-brackets-and-commas notation to make a list out of "blue" and "red" that we used to put a bunch of JSON objects together into a data set full of cars. The entirety of the set of brackets is the value of this JSON object's attribute/key called "colors."

As you can see, XML and JSON get pretty different when it comes to writing down the fact that a conceptual "item" in your data set has multiple "keys" all with the same name, each with a different value.

XML works like I just described it (multiple key-pair values, where the key names are the same as each other). JSON doesn't allow that. That's something special to XML. If you like that about XML, there you go--use XML. :-)
In JSON, you just give your object one "key" (maybe making the name of that "key" a plural noun) and you shove all those values into a list.

So, to recap the kinds of "value" you can give an attribute/"key" belonging to a JSON "object" (conceptual item):

We've seen that the "value" for a given attribute/"key" of a JSON object can be a piece of plain text (put it between double-quotes).
We've seen it be the keyword 'null' (no quotes around it).
We've seen it be a list (square-brackets, with commas separating multiple values).
Note that you can certainly just have one value inside the square brackets, or they can even be empty if the list is empty. Just keep in mind that it's still a list of values to a computer, even if it's empty or a list of size 1.
We're about to see that it can be another JSON "object" (we're about to nest things, just like we did in XML!)

Let's go back to our data set that includes just 1 shirt, which is "blue and red and green" and made out of "leather and cotton," where the leather isn't washable but the cotton is.
A JSON representation of our data might look like this:

{
 "type" : "Shirt",
 "colors" : ["blue","red","green"],
 "fabrics" :
  [
   {
    "type" : "Leather",
    "washable" : "No"
   },
   {
    "type" : "Cotton",
    "washable" : "Yes"
   }
  ]
}

As a reminder, here was a short XML version of the same shirt:

<Shirt>
 <Color>
  Blue
 </Color>
 <Color>
  Red
 </Color>
 <Color>
  Green
 </Color>
 <Fabric Washable="No">
  Leather
 </Fabric>
 <Fabric Washable="Yes">
  Cotton
 </Fabric>
</Shirt>

What I notice the most is:

JSON is awkward when conceptual "items" in our data need to have attributes and a meaningful name of their own (like "shirt" and "leather" and "cotton")
XML is overly wordy by making conceptual "items" out of things that already were true "attributes" of what we humans really think of as our data's conceptual "items" (like "color" -- we're probably not really thinking of it as an "item" in our data set, because "color" is no more complicated than its value(s) -- it doesn't have any other traits about the color that we need to describe).

What's In It For You?

In the end, as a Salesforce administrator, what matters most is being able to recognize which kind of data the Salesforce servers have given you (or in which format the servers expect to receive data from you).

You don't exactly get to argue with Salesforce about which format they should have picked.

When you're picking apart data Salesforce sent you, you just need to know how to "parse" the file to extract the data of interest to you.
When you're composing a file to send Salesforce, you'll model it after an existing file they sent you (or technical documentation) to figure out exactly how they want you to arrange the details.

In future posts, we'll talk about writing Python code that can do both of these tasks.

Understanding the relationship between the textual representation of the data and what it means will be crucial to those exercises.
Understanding when you're looking at data that can be made "flat" and "consistent from one conceptual item in the data to the next" (that is, when you can interpret it like a table/CSV) and when you're looking at data that really doesn't have those characteristics is crucial as well. Remember how I "diverted" and pointed out that our fleet of cars with nothing but VINs could have just as easily been a 1-column CSV file? Ideally, you want to be able to make that kind of commentary, too, about entire XML/JSON files or about fragments of them.
Remember to play with the graphical viewers (XML, JSON)!

Hopefully, this blog post will help you with both tasks by better understanding what the example data says and how it's shaped when you see it.

Table of Contents

Python Miscellany - Exploring A 1-Million-Row Task Table Export

2016-12-21T11:15:00.001-06:00

Mass-mailing-system e-mail histories have put us over our Salesforce data storage limits.

Various admin tools say it's the task table hogging all the space. Indeed, when I exported the entire tasks table as a CSV via Apex Data Loader, it was a gigabyte large on my hard drive. So this was obviously the table that we needed to delete some records out of (archiving their CSV-export values on a fileserver we own), but it sure wasn't going to be an efficient file to play with in Excel.

This post brings you along on some of the miscellaneous play I did with the file. I'll be describing output, rather than showing it to you, as this is real company data and I'd rather not take the time to clean it up with fake data for public display.

First, I executed the following statements:

import pandas
df = pandas.read_csv('C:\\example\\tasks.csv', encoding='latin1', dtype='object')

Then I commented out those lines (put a "#" sign at the beginning of each row) because that 2nd line of code had taken a LONG time to run, even in Python. I knew that Spyder, my "IDE" that came with "WinPython," picks up where the last code left off when you hit the "Run" button again, so in further experimentation, I was simply careful not to accidentally overwrite the value of "df" (that way, I could keep executing code against it).

The next thing I did was export the first 5 lines of DF to CSV, so they WOULD be openable in Excel, just in case I wanted to.

df.head().to_csv('C:\\example\\tasks-HEAD.csv', index=0, quoting=1)

I also checked the row count in my Task table, which is how I knew the CSV file had 1 million lines (I exported it yesterday and forgot the result from Apex Data Loader by today):

print(len(df.index))

Next, I interested myself in the subject headings, since I knew that Marketo & Pardot always started the Subject line of mass e-mails in distinctive ways. I just couldn't remember what those ways were.

For starters, I was curious how many distinct subject headings were to be found among the 50 million rows.

print(len(df['SUBJECT'].unique()))

7,000, it turns out.

Next, I wanted to skim through those 7,000 values and see if I could remember what the Marketo-related ones were called.

Just like filtering an Excel spreadsheet and clicking the down-arrow-icon that shows you all the unique values in alphabetical order gets a little useless when you have too many long, similar values, a Python output console isn't ideal for reading lots of data at once.

Therefore, I exported the data into a 7,000-line, 1-column CSV file (which would open easily & read well in a text editor like Notepad).

pandas.DataFrame(df['SUBJECT'].sort_values().unique()).to_csv('C:\\example\\task-uniquesubjects.csv', index=0, header=False)

Aha: Pardot used "Pardot List Email:" and Marketo used "Was Sent Email:" at the beginning of all subject lines in records that I considered candidates for deletion if they were over 2 years old.

(Note: if an error message comes up that says, "AttributeError: 'Series' object has no attribute 'sort_values'," your version of the Pandas plugin to Python is too old for the ".sort_values()" command against Series-typed data. Replace that part of the code with ".sort(inplace=False)".

I have two different business units willing to let me delete old e-mail copies, but they have different timeframes. One is willing to let me delete anything of theirs over 2 years old. The other is willing to let me delete anything of theirs over 1 year old. So now I needed to comb through the raw data matching my two subject heading patterns, looking for anything that would indicate business unit.

print(df[(df['SUBJECT'].str.startswith('Was Sent Email:') | df['SUBJECT'].str.startswith('Pardot List Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))].head())

Unfortunately, nothing in any of the other columns gave me a hint which business unit had been responsible for sending the e-mail.

I knew that we were using Marketo long before Pardot ... if I just deleted Marketo outbound e-mails over 2 years old (the least common denominator between the two business units' decisions), how many rows would I be deleting?

print(len(df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]))

Hmmm. 85,000. Out of 1 million. Well ... not a bad start (about 85MB out of 1GB)...but I need to find more.

Come to think of it, how many is "both Pardot and Marketo, more than a year old," period?

print(len(df[(df['SUBJECT'].str.startswith('Was Sent Email:') | df['SUBJECT'].str.startswith('Pardot List Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1)))]))

200,000. Out of 1 million. So now we're talking about getting rid of 200MB of data (the table is about 1GB), which is probably enough to buy us some time with Salesforce.

So our happy medium is somewhere in the middle, and we probably can't do much better than getting rid of 200MB of data given the business units' requests.

Maybe, though, it wouldn't be a bad idea to just start with the low-hanging fruit and get rid of that first 80MB of data or so (Marketo >2 years old).

Let's dive just a little deeper into what I think is "Marketo 2-Year-Old e-mail copies" to make absolutely sure that that's what they are.

First ... I when I printed the "head()" of Marketo e-mails, I noticed that there seemed to be some diversity in a custom field called "ACTIVITY_REPORT_DEPT__C." Let's see what we've got in there:

tempmksbj2yo = df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]
print(tempmksbj2yo['ACTIVITY_REPORT_DEPT__C'].unique())

"ERP API", "Marketo Sync," & "Student Worker

How many do we have of each?

tempmksbj2yo = df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]
print(tempmksbj2yo.groupby('ACTIVITY_REPORT_DEPT__C').size())

Good to know - there are just a few dozen rows that aren't "Marketo Sync." In future logic (as below), I'll further specify that a "Marketo" e-mail has this "Marketo Sync" trait.

Now let's take our records we want to delete from Salesforce and get them ready for doing so by exporting the raw data to CSV (for archiving on an on-premise file server) and by exporting the record IDs to a different CSV (for putting into "Apex Data Loader" as a DELETE operation). Let's also export to CSV what remains, to make it easier to pick up where we left off hunting for more deleteable records (this ".to_csv()" operation takes a LONG time to execute because it's a huge file!). And before doing all that (commented out the last line to keep it from running when first verifying this, since it takes so long to run), let's also verify whether the "remaining" file has the same size as the original file minus our "to delete" file.

mk2yologic = (df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync')

mk2yo = df[mk2yologic]
remaining = df[~mk2yologic]

print('df size is ' + str(len(df.index)))
print('mk2yo size is ' + str(len(mk2yo.index)))
print('df - mk2yo size is ' + str(len(df.index) - len(mk2yo.index)))
print('remaining size is ' + str(len(remaining.index)))

mk2yo.to_csv('C:\\example\\task-marketo-2yo-or-more-deleting.csv', index=0, quoting=1)
mk2yo['ID'].to_csv('C:\\example\\task-marketo-2yo-or-more-idstodelete.csv', index=0, header=True)
remaining.to_csv('C:\\example\\task-remaining-after-removing-marketo-2yo-or-more.csv', index=0, quoting=1)

Next, it's on out of Python-land for a while, and into Apex trigger-editing (there's one from an AppExchange plugin that it helps to disable when deleting this many Tasks) and using the "Apex Data Loader" to execute the DELETE operation.

P.S. That didn't help enough. You know where I found another problem? E-mails that went out to a business unit's entire mailing list. 1 task per recipient per e-mail. Here's how I came up with a list of the (10) Subject-ActivityDate combinations that had the largest number of records:

rmn = df[~((df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync'))]
print(rmn.groupby(['SUBJECT','ACTIVITYDATE']).size().order(ascending=True).reset_index(name='count').query('count>10000'))

Those 10 mailings represent 250,000 of our 1 million "Task" records, as verified here:

rmn = df[~((df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync'))]
print(rmn.groupby(['SUBJECT','ACTIVITYDATE']).size().order(ascending=True).reset_index(name='count').query('count>10000')['count'].sum())

Time to talk with some business users about whether this data really needs to live in Salesforce, even though it's recent...

Python for Salesforce Administrators - Filtering Rows By Maximum Date Per Group

2016-12-08T13:38:00.001-06:00

I had a special request to show off Python for filtering a CSV file to only leave behind the "most recent activity" for a given "Account" record. Happy to oblige!

However, I can't explain as well as I've been doing in other examples exactly what's going on here. It's a bit of black magic to me, but it works. Follow along.

Preparing A CSV

Now I'm going to open Notepad and make myself a CSV file (when I do File->Save As, change the "Save as File Type" from .TXT to "All Files (*.*)") and save the file as "c:\tempexamples\sample3.csv".

Here's what that file contains (representing 5 columns worth of fields & 7 rows worth of data):

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date"
"02848v","Sent email","Costco","James Brown","10/9/2014"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016"

If I run this code (note that I've got a special "read_csv" line that specifies to treat 'Actv Date' as a sortable date):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
print(df)

The output looks like this:

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
3     das90      Lunch appointment  United States Congress       Tamika Page 2015-03-04
4   vad0923             Sent email              Salesforce    Leslie Andrews 2013-04-28
5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
7     fa9s3  Breakfast appointment                  Costco       James Brown 2014-09-03
8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Grouping and Filtering

Firstly, I'm going to show you the final code.

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)

The output looks like this:

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
3  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

(Exporting back to CSV instructions here, under the 5th program, "Writing To CSV.")

Note that "United States Congress" still has 2 records, because two things happened on the same date. This particular script I gave you doesn't "toss a coin" and break the tie between e-mailing Tamika and calling Shirley. It just leaves both "most recent" records in place.
(See the end of this post for tie-breaking.)

I can spot-check that it works because when line 4 ends with "max()" (as in most-recent activity), the result for Costco is Sheryl Larson's activity in May 2016. If I change it to "min()" (as in least-recent activity), the result for Costco is James Brown's activity in September 2014.

Variation - Multi-Column Group

If I change ".groupby('Acct Name')" at the end of line 3 in the previous example to ".groupby(['Acct Name', 'Contact Name'])" and leave line 4 as "max," I get the following output (note 2 Costco rows - one for James, one for Sheryl, but with James's October 2014 activity as his most recent):

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
2  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
3   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
5  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09

What's Really Going On?

Line 3, "df.groupby('Acct Name')," produces data that we can save with a nickname ("groupingByAcctName"), but that doesn't really print well (try it - try adding a "print(groupingByAcctName)" line to the script). However, trying to print it does tell us that it's a "type" of data called a "DataFrameGroupBy." (Just like there are "types" of data called "DataFrames" & "Series.")

Line 4 applies a "lambda" function to our cryptic data that we stored under the nickname "groupingByAcctName."

The output, which we give a nickname of "groupedDataFrame," is just a "DataFrame." (I can tell by running the line "print(type(groupedDataFrame))".)

We've seen the ".apply(lambda ...)" operation before, in "Fancier Row-Filtering and Data-Editing."

It turns out that it can also be tacked onto the end of "DataFrameGroupBy"-typed data, not just "Series"-typed data.

When ".apply(lambda ...)" is tacked onto Series-type data, it does whatever's inside the "lambda ..." to each item in the Series and spits out a new Series of the same length with contents that have been transformed accordingly.

A DataFrameGroupBy can be thought of kind of like a "Series that holds DataFrames" (each mini-DataFrame being a row of the original DataFrame that falls into the "group").
When ".apply(lambda ...)" is tacked onto DataFrameGroupBy-type data, it does whatever's inside the "lambda ..." to each DataFrame in the DataFrameGroupBy and spits out a new DataFrame that jams the results back together (although, as we'll see later, the labeling is a bit complex).

In this case, there were 4 unique 'Acct Name' values. What we did to each mini-DataFrame inside was say, "Just show the rows of this mini-DataFrame that have a maximal 'Actv Date.' One 'Acct Name' had a tie on 'Actv Date,' so we actually ended up with 5 rows in the output DataFrame (which we then stored under the nickname "groupedDataFrame").

Our "CodeThatMakesAValueGoHere" was "x[x['Actv Date'] == x['Actv Date'].max()]"

What's going here is that "x" is a DataFrame.
"lambda x : ..." means, "Use 'x' as a placeholder for each of whatever is in the thing we're running this '.apply(lambda ...)' on."
If we were to use our original straight-from-the-CSV DataFrame and say "print(df[df['Actv Date'] == df['Actv Date'].max()])" we would get 2 rows: the tie for "most recent activity in the whole original DataFrame" (the July 10, 2016 activities).
However, now we're saying "Do the same thing, but within each of the mini-DataFrames that have already been broken up into unique 'Actv Date' clumps."

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
print(groupedDataFrame)

Produces this output:

                           Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name                                                                                                     
Costco                 5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress 2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
                       8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Here we have something with "rows" and "columns," with its rows "numbered" by Pandas - it is indeed a DataFrame, even though it's got some weird groupings and gaps. (Although note how the row-numbers have been preserved from the original DataFrame! And note how to the left of the numbers there are labels that seem to group the numbers ... I said that DataFrame row-numbering was "complicated," didn't I?)

In case you're curious, here's output from the same code, only with the "grouping by both account and contact name" variation mentioned earlier.

                                            Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name              Contact Name                                                                                            
Costco                 James Brown      0    02848v             Sent email                  Costco       James Brown 2014-10-09
                       Sheryl Larson    5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   Maria Hernandez  1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             Leslie Andrews   6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress Shirley Chisholm 8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
                       Tamika Page      2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09

Before we can export our DataFrame into something that looks like our original CSV file (which is pretty much everything to the right of the row-numbers), we need to strip off that far-left "Acct Name" and the row-numbers, plus strip out the extra line between the column headers to the right and their data.

That's where the ".reset_index(drop=True)" command comes into play. Tacked onto a DataFrame, it produces a new DataFrame with all the row-numbers reset to "0, 1, 2..." and no labels to the left, plus no whitespace above the data. To make the code easier to read, I saved this new DataFrame as "outputdf."

And that's the end of the dirty details.

Breaking Ties

Afterthought: if you want to break ties so you have exactly 1 output row per 'Acct Name', here's how you do it:

Figure out if there's anything else that's a "secondary tiebreaker."
Once you run out of "secondary tiebreakers," pick an arbitrary single record with ".head(n=1)" or ".tail(n=1)" (depending on whether you want the first or the last record).
- Making a final tie-break of "Oh, just pick the first line" is a common pattern in data processing when your business requirements say it's vital not to allow any ties to come out of an aggregation.

We have to change our approach to designing our code a bit to accommodate these extra steps. Here's the pattern:

Break our original DataFrame up into a "Series of DataFrames grouped by (whatever we're computing the Max/Min among)"
(a.k.a. a "DataFrameGroupBy")
The code for this is ".groupBy(...)"
Run an ".apply(lambda ...)" operation on the result of the previous step. For each mini-DataFrame being processed by the ".apply(lambda ...)" operation:
1. Sort the rows of the mini-DataFrame first by the "primary sorter column," then by the "secondary sorter column," and so on (".sort_values(...)"). This produces a new mini-DataFrame with all the same rows, but sorted as specified.
  (Note - "ascending" is the default sort for each column with ".sort_values(...)", but you can specify on a column-by-column basis. I've shown that in the example code, but you can leave the ", ascending = ..." code out if you simply want an "ascending" sort across the board.)
2. Produce a new mini-DataFrame that contains just the first/last row of the mini-DataFrame that came out of the previous step (depending on whether your sorting operation put the min/max you're looking for at the beginning or the end of the mini-DataFrame). The code for this is ".head(n=1)" or ".tail(n=1)".
Strip the extra labels off the resulting big DataFrame so that it's ready for export back to CSV (".reset_index(drop=True)")

Afterthought: if you want to break ties by LastModifiedDate, just do it all over again. I've created a new "sample4.csv" as an example:

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date","Mod Date"
"02848v","Sent email","Costco","James Brown","10/9/2014","10/10/2016 10:00:00"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016","4/13/2016 14:23:04"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016","7/10/2016 9:35:36"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015","3/5/2015 13:01:01"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013","9/30/2015 10:04:58"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016","6/1/2016 11:38:00"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015","2/8/2015 08:36:00"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014","9/4/2014 07:35:00"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016","7/10/2016 08:34:20"

Here's some code to do this:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1))
outputdf = groupedDataFrame .reset_index(drop=True)
print(outputdf)

Or a more concise version of the code, where I strung multiple lines together and didn't stop to give them nicknames (this form gets easier to read as your code gets longer):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
print(df.groupby('Acct Name').apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1)).reset_index(drop=True))

And here's the output text:

    Actv Id           Actv Subject               Acct Name     Contact Name  Actv Date            Mod Date
0     dc89a             Phone call                  Costco    Sheryl Larson 2016-05-29 2016-06-01 11:38:00
1  vsd8923j             Phone call    Gallaudet University  Maria Hernandez 2016-03-08 2016-04-13 14:23:04
2   adf8o32  Conference invitation              Salesforce   Leslie Andrews 2015-02-07 2015-02-08 08:36:00
3  3289vd09             Sent email  United States Congress      Tamika Page 2016-07-09 2016-07-10 09:35:36

Again, I can spot-check it because the tie-break between the "United States Congress" is that Tamika's July 2016 activity record was modified about an hour after Shirley's.

If I change "ascending=[True,True]" to "ascending=[True,False]" (sort 'Actv Date' ascending but 'Mod Date' descending, then pick the last record in the group) the tie-break goes to Shirley's record instead.
Similarly, if I change "ascending=[True,True]" to "ascending=[False,True]" (sort 'Actv Date' descending but 'Mod Date' ascending, then pick the last record in the group), I'm saying I want the least recent 'Actv Date', but I still want ties broken by most recent 'Mod Date.'
And if I change "ascending=[True,True]" to "ascending=[False,False]" (sort both dates descending, then pick the last record in the group), I'm saying I truly want the least recent record in the group.
(In this case, "[False,True]" & "[False,False]" produce the same results because I don't have any ties for "oldest activity in the group" on 'Actv Date' that need breaking by 'Mod Date'. Tamika's March 4, 2015 record is the oldest for Congress, etc.)

Table of Contents

Python for Salesforce Administrators - Installing Python 3 on Windows

2016-12-08T10:42:00.000-06:00

As I mentioned here, an "IDE" is, basically, a text-editor with a "play" button to execute whatever code you've typed into the text editor.

My favorite suite of software for running "Python 3" on Windows is called WinPython.
(Read up here about Python 2 & Python 3.)

You can install and run WinPython without administrator rights.

The latest versions are always listed at the software's Github page, but Windows 7 & Windows 8 users whose systems don't have the "Microsoft Visual C++ Redistributable for Visual Studio 2015" update installed into Windows might not be able to actually run the very latest versions of WinPython after installing them. (And, unfortunately, you can't install this update to Windows without administrator rights.)

The most recent "64-bit" version of WinPython for "Python 3" that I've found that works without that update is WinPython 3.4.3.5, which you can download here.

WinPython doesn't really install anything on your computer. It just puts a bunch of files into whatever folder you tell it to put those files into. When you want to "uninstall" it, you simply delete the folder. (Similarly, you have to create your own Desktop and Start-Menu shortcuts by hand.) If you "install" a certain version of WinPython and find that you can't run certain programs due to an error saying that "api-ms-win-crt-runtime-l1-1-0.dll" is missing, or if those program never show any sign of running at all, delete your "installation" and start over with a different version of WinPython.

Here are screenshots of the steps I took as I installed WinPython 3.4.3.5, 64-bit, on a thumb drive.

Step 1: download the installer file to my hard drive

Step 2: run the installer file by double-clicking on it

Step 3: wait a few seconds for the installer to come up

Step 4: accept the licensing agreement

Step 5: specify where on your computer (or on a thumb drive) you want the folder containing "WinPython" files to live & click "Install"

Step 6: wait for the installer to put all of the files on your hard drive, then click "Next," then click "Finish"

Let's see what we've done so far.

Step 7: navigate to the folder where you installed WinPython and observe the two files "Spyder.exe" and "WinPython Command Prompt.exe"

"Spyder" is your IDE (the text-editor with a play button)
The "WinPython Command Prompt" is what you run if you need to type commands that begin with "pip" (which is command-line-interface software that can download and install Python "modules" you don't already have based on the module name - if you got a "full" version of WinPython, you probably won't need to use this)

You might want to manually create shortcuts to these two files - or at least to Spyder - in your Start Menu and/or on your Desktop.

All right ... let's run some code!

Step 8: run "Spyder.exe" by double-clicking on it (it takes a minute to load -- something went wrong if you don't see a greenish "splash screen" with a spiderweb on it within 20 seconds, though)

Step 9: type the following text into the text-editor in Python and click the "play" button (or hit "F5"), which means "run":

print('hello there')

Note: If you get a pop-up dialog with all sorts of options about executing code consoles or interpreters and such, click "cancel" on the pop-up dialog box and then click the play button again - for some reason. The pop-up box was giving you a chance to highly customize where you want to see any output from running your code. Clicking "cancel" and trying again seems to tell Spyder, "you figure out where to display the output of this code, don't bother me with all these options." (Unfortunately, I didn't catch this dialog box with a screenshot.)

Step 10: observe that the "console" in another corner of your screen" said "hello there" in it!

Note: Typically, when you are programming, the first thing you do with a new tool is try to get it to display some simple text on your screen. That's what we did in Steps 9 & 10. But it's important to remember that not every command in a programming language displays text on the screen when it runs. For example, in some of the Python examples on this blog, certain commands create CSV files - they don't display any text on the screen at all when they're done creating the CSV file. To know when such code is done, you could add a line to your script afterwards along the lines of "print('all done')". You may also pick up on subtle changes in the "console" in Spyder as code starts & stops running (for example, you get a new "[]" prompt - which you can ignore - when code has finished running).

Nevertheless, no matter what your code does, the area of Spyder where you typed in Step 9 is the area where you will want to copy/paste all of the example code from this blog before tweaking it to match your own needs and clicking "Run" (the play-button / F5).

Now that you're up and running, time to learn how to use Python!

Web-Browser-Based Exercises

DO NOT DO NOT DO NOT use your own company's data with this trick.

You have to promise me. Pinky-swear.

You MUST use a properly-installed copy of Python as described above for data you actually want to work with, no matter how inconsequential you think it is.

BUT ... if you'd like to just play along with the "sample1.csv"-type data in this blog, I've found a way you can do so right now, using nothing but your web browser.

I painstakingly made all of that data up, word by word. That's why it's okay to just throw into some random service hosted online.

Real CSV files from your company are NOT okay to throw into some random service hosted online. Even if it doesn't seem like important data. Just do it on your local computer and be safe, not sorry.

Anyway ... for imitating exercises in this blog, you can go to https://repl.it/languages/python3

(Update, 2018: repl.it sometimes seems to run slowly, lately, when using the "Pandas" functionality of Python. At the moment, https://codebunk.com/b/ still seems to work, but that link is just meant for trying their paid product, so please don't go too crazy.)

If an exercise in this blog claims it works with a file "C:\tempexamples\sample000.csv" whose contents look like this:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"

And then if the Python code looks like this:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample000.csv')
print('---Here is a sampling of the first lines---')
print(df.head())

Then your code at Repl.it will be:

import sys
if sys.version_info[0] < 3: from StringIO import StringIO
else: from io import StringIO

sample = StringIO("""
"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
""")

import pandas
df = pandas.read_csv(sample)
print('---Here is a sampling of the first lines---')
print(df.head())

Note that you have to put in 3 lines of importing extra "modules," then you have to do some weird, very precise code surrounding a copy/paste from the contents of the CSV file, and finally, in the "df = pandas.read_csv()" line, instead of saying "'C:\\tempexamples\\sample000.csv'", you refer to the nickname you gave your fake CSV file (in this case, "sample") - without any single-quotes around it.

Drive carefully. Remember your pinky-swear.

Table of Contents

Salesforce Application Deployment and Versioning - a few random thoughts

2016-12-01T15:48:00.001-06:00

At work, we're starting to talk about overhauling our software application build, deployment, and versioning processes and technologies.

One of the tensions in coming up with organized versioning has to do with the fact that these days, it's considered good practice to push software application code onto end-user-facing servers via your version control system. An alternative approach is to push software application code onto end-user-facing servers some other way, and then separately throw a copy of "the latest server code" into your version control system. In my opinion, it can be pretty tough to decide which of these philosophies to follow when trying to use version control software with Salesforce.

Background: Each copy of Salesforce that a company buys comes with at least one ("production"), and possibly several more ("sandbox"), independent databases. (They're vaguely aware of each other's existence, but live on different servers, don't have any configuration-file interdependencies, etc.) "Sandbox" databases can be destroyed and rebuilt -- empty of data -- in the image of the "production" database (application code included), at the click of a few buttons from the "production" database's web-based administrative UI. New "sandbox" databases can be created the same way.

The back-end configuration files from which each of those databases knows its own definition (schema, application code, user permissions) are in plain text (usually XML, but some Java-like source code files as well) and live on Salesforce's servers along with the data stored in the database.

For each database, there are 3 ways to modify those back-end configuration files:

Log into the database's web-based administrative UI and point-and-click your way through changes (think of it like modifying an MS Access database).
- In a "production" database, this option is only allowed for schema and user permission changes, not for application code changes.
- In a "sandbox" database, the web-based administrative UI also involves a small IDE for modifying application code. Such code is often moved to a "production" database via option #2 below.
Log into the web-based UIs of 2 databases related to the same "copy of Salesforce bought by a company" and specify that certain components of one database should be written to the other. (Often used by novices for migrating application code from a "sandbox" database to a "production" database. Known as using "change sets.")
Upload a fresh copy of a configuration file to the database's SOAP HTTP API endpoint in a packaged "build" ZIP file.
- (Downloading/uploading via the Eclipse "Force.com IDE" plugin or the "ANT Force.com Migration Tool" or Workbench or MavensMate or WelkinSuite are all just tools that take care of the "SOAP" part of option #3 for you.)

Option #3, which involves getting/sending copies of the configuration files to and from Salesforce's servers, is the only option that truly integrates with version control systems. That's because it's the only option that actually involves using copies of the configuration files sitting on a machine that isn't the Salesforce server itself. (Salesforce won't let you make a version control repository out of the database server itself the way some other code-hosting cloud services will.)

Because option #3 can involve version control systems, it plays nicely with the modern software application deployment philosophy of "push configuration files onto end-user-facing servers via your version control system." However, a key rule that everyone has to follow to make this philosophy work seems to be: "Do NOT modify the database's configuration files in any way other than via the version control system. Seriously. Don't. Even if the database will let you. Just don't."

Unfortunately, that's pretty much impossible to enforce with Salesforce. Salesforce advertises how easy it is to use option #1 & option #2. They encourage "clicks, not code" for schema and user permission changes (as well as for some "application" development). In the end, it's unlikely that the database configuration files actually out on Salesforce's servers truly match the copies of those configuration files sitting in your version control system if you're only moving copies in the direction of VCS repository -> database.

If you find that your company is mostly using option #1 and option #2 to modify code in your Salesforce databases, you can still use version control software. Essentially, you use the "get copies of database configuration files" functionality of the database's SOAP HTTP API to download copies of code onto a machine that does host a version control system repository. Do this as often as you want to "checkpoint" the current state of a Salesforce database. Example here.

I've also heard of companies that finagle some sort of hybrid solution: they do what I just suggested with a scheduled tool for everything but configuration files representing "application code."
People who write "application code" are expected to follow "push configuration files onto the end-user-facing server via your version control system" practices.
(From what I understand, they have to be pretty skilled with version control tool branching and merging, as well as efficient XML file editing, to handle the fact that lay-people might concurrently use option #1 & option #2 to modify schemas & user permissions that they also need to modify and were planning to modify via option #3.)

Personally, I think it's also hard to decide how many version control system repositories to spin up:

One repository for each database in a "copy of Salesforce a company bought"
One big repository (perhaps with 1 sub-folder per database) for each "copy of Salesforce a company bought"

Novices to Salesforce and Git using the Eclipse "Force.com IDE" plugin to facilitate downloading copies of their databases' configuration files could easily find themselves in the "1 repository per database" approach because version control systems often use a computer's file-system folder as the boundary of a "repository," and default setup with the Eclipse plugin seems to give you one file-system folder per database (not clumped together into any sort of overarching folder representing the "copy of Salesforce a company bought").

That might actually be rather nice, because then your check-in comments about changes you've made in the various databases are naturally segregated from each other. It might also be easier to use graphical user interface tools to search a repository for a file by its name without seeing a bunch of clutter from other databases if the repository only represents other databases.
However, you're leaving your colleagues with more repositories to wade through asking themselves, "What's this one for?"

I don't completely understand it yet, but my understanding is that people in "push configuration files onto the end-user-facing server via your version control system" workplaces prefer the "one big repository for each 'copy of Salesforce a company bought'" repository structure.

Visualforce vs. Lightning Components Side-By-Side 1

2016-11-02T14:51:00.000-05:00

In my last post, I proposed converting an existing Visualforce page to use Lightning Components. I decided instead to start with a few simple "build the same thing in both environments and compare them" exercises.

Visualforce & Lightning Components are different coding environments for writing web applications, hosted in the user interface of a Salesforce database, that interact with data stored in that database.

Here is the first such exercise.

Apex Controller "BlogExample1ContactController.apxc" - used by both files:

public class BlogExample1ContactController {
    @AuraEnabled
    public static List<Contact> getCsFromServer() {
        // Do various proper security stuff and then...
        return [SELECT LastName, FirstName, Id, Email FROM Contact WHERE LastName LIKE 'LastXYZZY-%']; // <<-- DO NOT actually just do this.  Need proper security stuff.
    }
}

This server-side code, when executed, returns a list of "Contact"-typed records from the database (filtered to expose only records where LastName starts with "LastXYZZY-," and exposing only data from the LastName, FirstName, Id, & Email columns of those records).

The VisualForce page only takes 1 more file of code to get data displaying at https://MyCustomDomainName.na##.visual.force.com/apex/BlogExample1VFPage.

Visualforce Page "BlogExample1VFPage.vfp" - used in VisualForce:

<apex:page docType="html-5.0" controller="BlogExample1ContactController">
    <p>Hi</p>
    <apex:repeat value="{!CsFromServer}" var="cntct">
        <hr/>
        <p><apex:outputField value="{!cntct.LastName}"/></p>
        <p><apex:outputField value="{!cntct.FirstName}"/></p>
        <hr/>
    </apex:repeat>
    <p>Bye</p>
</apex:page>

Here's what visiting the VisualForce page looks like:

The Lightning Components app requires 4 more files of code (1 "app" + 1 "component" and 2 more JavaScript files as its "controller" & "helper" files) to get data displaying at https://MyCustomDomainName-dev-ed.lightning.force.com/c/BlogExample1LCApp.app.

Lightning Components App "BlogExample1LCApp.app":

<aura:application >
    <c:BlogExample1LCBasicComponent />
</aura:application>

This app's only job is to exist (lines 1 & 3) - it gets its own URL so I can actually browse to it and see it.

Oh, and it has to say, "go see the Component called 'BlogExample1LCBasicComponent'" (line 2).

"c" is a built-in variable that means "the code-space that is 'all Lightning Components.'"

Lightning Components App "BlogExample1LCBasicComponent.cmp":

<aura:component controller="BlogExample1ContactController">
    <aura:handler name="init" value="{!this}" action="{!c.initializationCode}" />
    <aura:attribute name="cs" type="Contact[]"/>
    <p>Hi</p>
    <aura:iteration items="{!v.cs}" var="con">
        <hr/>
        <p><ui:outputText value="{!con.LastName}"/></p>
        <p><ui:outputText value="{!con.FirstName}"/></p>
        <hr/>
    </aura:iteration>
    <p>Bye</p>
</aura:component>

My Lightning Components example is so simple that it's just got 1 component.

If we'd wanted fancier HTML representing each record of the database that we're displaying, we could have moved the contents of that "aura:iteration" tagset to their own component and put a reference to that component inside the tagset instead. (Although we'd have to make sure, when defining it, to give it an "attribute" that can hold a "Contact"-typed record and, when referencing the component, to "pass" that attribute the current value of the iteration's "con" variable.)

Anyway, this component's definition indicates that it's "controlled" by our Apex controller, much as you see in our Visualforce code.

Overall, this component looks a lot like our Visualforce page. The main difference is that we have to explicitly tell it to execute JavaScript code ("initializationCode(...)"). (Note: "initializationCode(...)" itself also has to be told to actually talk to the server-side Apex controller and fetch data ... we'll see that later.)

If we don't do that, there'll be no data between "Hi" & "Bye," even if there's data in the database that matches the query in our Apex controller.

The Visualforce line '<apex:repeat value="{!CsFromServer}" var="cntct">' knows to go talk to the server to fill in data summoned by the code in the "value" tag. (It also has the server loop through that data and generate HTML/CSS.)

The Lightning Components line '<aura:iteration items="{!v.cs}" var="con">' doesn't. It just says, "if this component's 'cs' variable (which lives in a web-surfer's browser) has any data in it, have the browser loop through it and generate HTML/CSS."

Line 3 establishes that the component has a variable named "cs" and that its data type is a list of "Contacts."

Line 2 ('<aura:handler name="init" value="{!this}" ... />') actually executes JavaScript that will go & talk to the server and that is responsible for setting the value of "cs" (in this case, upon page load).

Lightning Components JavaScript file "BlogExample1LCBasicComponentController.js":

({
 initializationCode : function(component, event, helper) {
        helper.getCons(component);
 }
})

Every JavaScript function intended to be summoned directly from "component" markup should be in the associated "Controller" file and should be defined to expect 3 parameters: a component, an event, and a helper (in that order).

Every JavaScript function intended to be summoned from other JavaScript, rather than from the "component" markup itself, should be in the associated "Helper" file.

In this case, our "initializationCode" doesn't do any real work - it just executes a JavaScript function found in the "Helper" file and passes it the value of the "component" passed to it (which would be the component from which we summoned it).

Lightning Components App "BlogExample1LCBasicComponentHelper.js":

({
    getCons : function(component) {
        var action = component.get("c.getCsFromServer");
        action.setCallback(this, function(response) {
            var state = response.getState();
            if (component.isValid() && state === "SUCCESS") {
                component.set("v.cs", response.getReturnValue());
            }
        });
        $A.enqueueAction(action);
    }
})

This JavaScript function inspects the "component" it's been passed, notices that that "component" has an Apex "controller" class attached to it in its definition, and goes about talking to that code (calling its "getCsFromServer()" method).

If data actually comes back from that process, it assigns that data to the component's variable "cs."

("v," by the way, is included in the framework and is a variable that an instance of a component uses to refer to itself. It also appears in the component's code above.)

Here's what visiting the Lightning Components app looks like, when viewed on its own:

(Notice how "vanilla" it looks - what you code is what you get!)

Finally, I have a bunch of random ways of abbreviating the word "contact" or "contacts" scattered throughout the source code to make variable scope clear. Hope it helps, sorry if it confuses!