Wednesday, December 21, 2016

Python Miscellany - Exploring A 1-Million-Row Task Table Export

Mass-mailing-system e-mail histories have put us over our Salesforce data storage limits.

Various admin tools say it's the Task table hogging all the space. Indeed, when I exported the entire Task table as a CSV via Apex Data Loader, the file weighed in at about a gigabyte on my hard drive. So this was obviously the table we needed to delete some records from (archiving their CSV-export values on a fileserver we own), but it sure wasn't going to be an efficient file to play with in Excel.

This post brings you along on some of the miscellaneous play I did with the file. I'll be describing output, rather than showing it to you, as this is real company data and I'd rather not take the time to clean it up with fake data for public display.

First, I executed the following statements:

import pandas
df = pandas.read_csv('C:\\example\\tasks.csv', encoding='latin1', dtype='object')

Then I commented out those lines (put a "#" at the beginning of each line) because that 2nd line of code had taken a LONG time to run, even in Python. I knew that Spyder, the IDE that came with "WinPython," keeps the variables from the last run in memory when you hit the "Run" button again, so in further experimentation, I was simply careful not to accidentally overwrite the value of "df" (that way, I could keep executing code against it).
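
In other words, the top of my script now looked like this:

#import pandas
#df = pandas.read_csv('C:\\example\\tasks.csv', encoding='latin1', dtype='object')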


The next thing I did was export the first 5 rows of "df" to their own CSV so that they WOULD be openable in Excel, just in case I wanted to look at them there.

df.head().to_csv('C:\\example\\tasks-HEAD.csv', index=0, quoting=1)

I also checked the row count of the DataFrame, which is how I knew the CSV file had 1 million lines (I had exported it yesterday and had already forgotten the count Apex Data Loader reported):

print(len(df.index))

Next, I interested myself in the subject headings, since I knew that Marketo & Pardot always started the Subject line of mass e-mails in distinctive ways. I just couldn't remember what those ways were.

For starters, I was curious how many distinct subject headings were to be found among the 1 million rows.

print(len(df['SUBJECT'].unique()))

7,000, it turns out.

Next, I wanted to skim through those 7,000 values and see if I could remember what the Marketo-related ones were called.

Just as Excel's filter down-arrow (the one that lists all of a column's unique values in alphabetical order) gets a little useless when you have too many long, similar values, a Python output console isn't ideal for reading lots of data at once.

Therefore, I exported the data into a 7,000-line, 1-column CSV file (which would open easily & read well in a text editor like Notepad).

pandas.DataFrame(df['SUBJECT'].sort_values().unique()).to_csv('C:\\example\\task-uniquesubjects.csv', index=0, header=False)

Aha: Pardot used "Pardot List Email:" and Marketo used "Was Sent Email:" at the beginning of all subject lines in records that I considered candidates for deletion if they were over 2 years old.

(Note: if an error message comes up that says, "AttributeError: 'Series' object has no attribute 'sort_values'," your version of the Pandas module is too old for the ".sort_values()" command against Series-typed data. Replace that part of the code with ".sort(inplace=False)".)


I have two different business units willing to let me delete old e-mail copies, but they have different timeframes. One is willing to let me delete anything of theirs over 2 years old. The other is willing to let me delete anything of theirs over 1 year old. So now I needed to comb through the raw data matching my two subject heading patterns, looking for anything that would indicate business unit.

print(df[(df['SUBJECT'].str.startswith('Was Sent Email:') | df['SUBJECT'].str.startswith('Pardot List Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))].head())
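
(That one-liner is dense. If you prefer, here's the same filter broken into named pieces - a sketch that should behave identically:)

is_marketo = df['SUBJECT'].str.startswith('Was Sent Email:')
is_pardot = df['SUBJECT'].str.startswith('Pardot List Email:')
two_years_ago = pandas.to_datetime('today') - pandas.DateOffset(years=2)
is_old = pandas.to_datetime(df['ACTIVITYDATE']) < two_years_ago
print(df[(is_marketo | is_pardot) & is_old].head())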

Unfortunately, nothing in any of the other columns gave me a hint which business unit had been responsible for sending the e-mail.

I knew that we were using Marketo long before Pardot ... if I just deleted Marketo outbound e-mails over 2 years old (the least common denominator between the two business units' decisions), how many rows would I be deleting?

print(len(df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]))

Hmmm. 85,000. Out of 1 million. Well ... not a bad start (about 85MB out of 1GB) ... but I need to find more.

Come to think of it, how many is "both Pardot and Marketo, more than a year old," period?

print(len(df[(df['SUBJECT'].str.startswith('Was Sent Email:') | df['SUBJECT'].str.startswith('Pardot List Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1)))]))

200,000. Out of 1 million. So now we're talking about getting rid of 200MB of data (the table is about 1GB), which is probably enough to buy us some time with Salesforce.


So our happy medium lies somewhere between those two cuts, and we probably can't do much better than getting rid of 200MB of data given the business units' requests.

Maybe, though, it wouldn't be a bad idea to just start with the low-hanging fruit and get rid of that first 80MB of data or so (Marketo >2 years old).

Let's dive just a little deeper into what I think are "Marketo 2-year-old e-mail copies" to make absolutely sure that's what they are.

First ... when I printed the "head()" of the Marketo e-mails, I noticed that there seemed to be some diversity in a custom field called "ACTIVITY_REPORT_DEPT__C." Let's see what we've got in there:

tempmksbj2yo = df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]
print(tempmksbj2yo['ACTIVITY_REPORT_DEPT__C'].unique())

"ERP API", "Marketo Sync," & "Student Worker

How many do we have of each?

tempmksbj2yo = df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]
print(tempmksbj2yo.groupby('ACTIVITY_REPORT_DEPT__C').size())

Good to know - there are just a few dozen rows that aren't "Marketo Sync." In future logic (as below), I'll further specify that a "Marketo" e-mail has this "Marketo Sync" trait.


Now let's take the records we want to delete from Salesforce and get them ready. We'll export the raw data to one CSV (for archiving on an on-premise file server) and the record IDs to a different CSV (for feeding to "Apex Data Loader" as a DELETE operation). Let's also export to CSV what remains, to make it easier to pick up where we left off hunting for more deletable records (this ".to_csv()" operation takes a LONG time to execute because it's a huge file!). And before doing all that, let's verify that the "remaining" DataFrame is the same size as the original DataFrame minus our "to delete" DataFrame (while first verifying this, I commented out that last slow line to keep it from running).

mk2yologic = (df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync')

mk2yo = df[mk2yologic]
remaining = df[~mk2yologic]

print('df size is ' + str(len(df.index)))
print('mk2yo size is ' + str(len(mk2yo.index)))
print('df - mk2yo size is ' + str(len(df.index) - len(mk2yo.index)))
print('remaining size is ' + str(len(remaining.index)))

mk2yo.to_csv('C:\\example\\task-marketo-2yo-or-more-deleting.csv', index=0, quoting=1)
mk2yo['ID'].to_csv('C:\\example\\task-marketo-2yo-or-more-idstodelete.csv', index=0, header=True)
remaining.to_csv('C:\\example\\task-remaining-after-removing-marketo-2yo-or-more.csv', index=0, quoting=1)
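
(If you'd rather have Python check that arithmetic than eyeball the printed numbers, a one-line sketch like this, added before the exports, will stop the script with an error message if the counts don't add up:)

assert len(df.index) == len(mk2yo.index) + len(remaining.index), 'row counts do not add up!'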

Next, it's on out of Python-land for a while, and into Apex trigger-editing (there's one from an AppExchange plugin that it helps to disable when deleting this many Tasks) and using the "Apex Data Loader" to execute the DELETE operation.


P.S. That didn't help enough. You know where I found another problem? E-mails that went out to a business unit's entire mailing list. 1 task per recipient per e-mail. Here's how I came up with a list of the (10) Subject-ActivityDate combinations that had the largest number of records:

rmn = df[~((df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync'))]
print(rmn.groupby(['SUBJECT','ACTIVITYDATE']).size().order(ascending=True).reset_index(name='count').query('count>10000'))

Those 10 mailings represent 250,000 of our 1 million "Task" records, as verified here:

rmn = df[~((df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync'))]
print(rmn.groupby(['SUBJECT','ACTIVITYDATE']).size().order(ascending=True).reset_index(name='count').query('count>10000')['count'].sum())
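
(Note: as with ".sort_values()" above, ".order()" is version-sensitive - it was deprecated and later removed from Series-typed data in newer versions of Pandas. If an error message comes up complaining about ".order()", replace ".order(ascending=True)" with ".sort_values(ascending=True)".)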

Time to talk with some business users about whether this data really needs to live in Salesforce, even though it's recent...

Thursday, December 8, 2016

Python for Salesforce Administrators - Filtering Rows By Maximum Date Per Group

I had a special request to show off Python for filtering a CSV file to only leave behind the "most recent activity" for a given "Account" record. Happy to oblige!

However, I can't explain as well as I've been doing in other examples exactly what's going on here. It's a bit of black magic to me, but it works. Follow along.


Preparing A CSV

Now I'm going to open Notepad, make myself a CSV file, and save it as "c:\tempexamples\sample3.csv" (when doing File->Save As in Notepad, change "Save as type" from ".TXT" to "All Files (*.*)").

Here's what that file contains (representing 5 columns worth of fields & 9 rows worth of data):

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date"
"02848v","Sent email","Costco","James Brown","10/9/2014"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016"

If I run this code (note that I've given "read_csv" an extra option specifying to treat 'Actv Date' as a sortable date):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
print(df)

The output looks like this:
    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
3     das90      Lunch appointment  United States Congress       Tamika Page 2015-03-04
4   vad0923             Sent email              Salesforce    Leslie Andrews 2013-04-28
5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
7     fa9s3  Breakfast appointment                  Costco       James Brown 2014-09-03
8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Grouping and Filtering

First, I'm going to show you the final code.

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)

The output looks like this:

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
3  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

(Instructions for exporting back to CSV are here, under the 5th program, "Writing To CSV.")

Note that "United States Congress" still has 2 records, because two things happened on the same date. This particular script I gave you doesn't "toss a coin" and break the tie between e-mailing Tamika and calling Shirley. It just leaves both "most recent" records in place.
(See the end of this post for tie-breaking.)

I can spot-check that it works because when line 4 ends with "max()" (as in most-recent activity), the result for Costco is Sheryl Larson's activity in May 2016. If I change it to "min()" (as in least-recent activity), the result for Costco is James Brown's activity in September 2014.
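
(In code, that spot-check is a one-word change to line 4:)

groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].min()])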

Variation - Multi-Column Group

If I change ".groupby('Acct Name')" at the end of line 3 in the previous example to ".groupby(['Acct Name', 'Contact Name'])" and leave line 4 as "max," I get the following output (note 2 Costco rows - one for James, one for Sheryl, but with James's October 2014 activity as his most recent):

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
2  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
3   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
5  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
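
(For copy/paste convenience, here's that variation as a full script - it differs from the earlier code only in the ".groupby(...)" line, though I've also renamed the "grouping" nickname to match what it now does:)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctAndContact = df.groupby(['Acct Name', 'Contact Name'])
groupedDataFrame = groupingByAcctAndContact.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)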

What's Really Going On?

Line 3, "df.groupby('Acct Name')," produces data that we can save with a nickname ("groupingByAcctName"), but that doesn't really print well (try it - try adding a "print(groupingByAcctName)" line to the script). However, trying to print it does tell us that it's a "type" of data called a "DataFrameGroupBy." (Just like there are "types" of data called "DataFrames" & "Series.")

Line 4 applies a "lambda" function to our cryptic data that we stored under the nickname "groupingByAcctName."

The output, which we give a nickname of "groupedDataFrame," is just a "DataFrame." (I can tell by running the line "print(type(groupedDataFrame))".)

We've seen the ".apply(lambda ...)" operation before, in "Fancier Row-Filtering and Data-Editing."

It turns out that it can also be tacked onto the end of "DataFrameGroupBy"-typed data, not just "Series"-typed data.

When ".apply(lambda ...)" is tacked onto Series-type data, it does whatever's inside the "lambda ..." to each item in the Series and spits out a new Series of the same length with contents that have been transformed accordingly.

A DataFrameGroupBy can be thought of kind of like a "Series that holds DataFrames" (each mini-DataFrame being a row of the original DataFrame that falls into the "group").
When ".apply(lambda ...)" is tacked onto DataFrameGroupBy-type data, it does whatever's inside the "lambda ..." to each DataFrame in the DataFrameGroupBy and spits out a new DataFrame that jams the results back together (although, as we'll see later, the labeling is a bit complex).

In this case, there were 4 unique 'Acct Name' values. What we did to each mini-DataFrame inside was say, "Just show the rows of this mini-DataFrame that have a maximal 'Actv Date.'" One 'Acct Name' had a tie on 'Actv Date,' so we actually ended up with 5 rows in the output DataFrame (which we then stored under the nickname "groupedDataFrame").

Our "CodeThatMakesAValueGoHere" was "x[x['Actv Date'] == x['Actv Date'].max()]"

What's going on here is that "x" is a DataFrame.
"lambda x : ..." means, "Use 'x' as a placeholder for each of whatever is in the thing we're running this '.apply(lambda ...)' on."
If we were to use our original straight-from-the-CSV DataFrame and say "print(df[df['Actv Date'] == df['Actv Date'].max()])" we would get 2 rows: the tie for "most recent activity in the whole original DataFrame" (the July 9, 2016 activities).
However, now we're saying "Do the same thing, but within each of the mini-DataFrames that have already been broken up into unique 'Acct Name' clumps."

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
print(groupedDataFrame)

Produces this output:

                           Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name                                                                                                     
Costco                 5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress 2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
                       8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Here we have something with "rows" and "columns," with its rows "numbered" by Pandas - it is indeed a DataFrame, even though it's got some weird groupings and gaps. (Although note how the row-numbers have been preserved from the original DataFrame! And note how to the left of the numbers there are labels that seem to group the numbers ... I said that DataFrame row-numbering was "complicated," didn't I?)

In case you're curious, here's output from the same code, only with the "grouping by both account and contact name" variation mentioned earlier.

                                            Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name              Contact Name                                                                                            
Costco                 James Brown      0    02848v             Sent email                  Costco       James Brown 2014-10-09
                       Sheryl Larson    5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   Maria Hernandez  1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             Leslie Andrews   6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress Shirley Chisholm 8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
                       Tamika Page      2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09

Before we can export our DataFrame into something that looks like our original CSV file (which is pretty much everything to the right of the row-numbers), we need to strip off that far-left "Acct Name" and the row-numbers, plus strip out the extra line between the column headers to the right and their data.

That's where the ".reset_index(drop=True)" command comes into play. Tacked onto a DataFrame, it produces a new DataFrame with all the row-numbers reset to "0, 1, 2..." and no labels to the left, plus no whitespace above the data. To make the code easier to read, I saved this new DataFrame as "outputdf."

And that's the end of the dirty details.


Breaking Ties

Afterthought: if you want to break ties so you have exactly 1 output row per 'Acct Name', here's how you do it:

  1. Figure out if there's anything else that's a "secondary tiebreaker."
  2. Once you run out of "secondary tiebreakers," pick an arbitrary single record with ".head(n=1)" or ".tail(n=1)" (depending on whether you want the first or the last record).
    • Making a final tie-break of "Oh, just pick the first line" is a common pattern in data processing when your business requirements say it's vital not to allow any ties to come out of an aggregation.

We have to change our approach to designing our code a bit to accommodate these extra steps. Here's the pattern:

  1. Break our original DataFrame up into a "Series of DataFrames grouped by (whatever we're computing the Max/Min among)"
    (a.k.a. a "DataFrameGroupBy")
    The code for this is ".groupby(...)"
  2. Run an ".apply(lambda ...)" operation on the result of the previous step. For each mini-DataFrame being processed by the ".apply(lambda ...)" operation:
    1. Sort the rows of the mini-DataFrame first by the "primary sorter column," then by the "secondary sorter column," and so on (".sort_values(...)"). This produces a new mini-DataFrame with all the same rows, but sorted as specified.
      (Note - "ascending" is the default sort for each column with ".sort_values(...)", but you can specify on a column-by-column basis. I've shown that in the example code, but you can leave the ", ascending = ..." code out if you simply want an "ascending" sort across the board.)
    2. Produce a new mini-DataFrame that contains just the first/last row of the mini-DataFrame that came out of the previous step (depending on whether your sorting operation put the min/max you're looking for at the beginning or the end of the mini-DataFrame). The code for this is ".head(n=1)" or ".tail(n=1)".
  3. Strip the extra labels off the resulting big DataFrame so that it's ready for export back to CSV (".reset_index(drop=True)")

Afterthought: if you want to break ties by LastModifiedDate, just do it all over again. I've created a new "sample4.csv" as an example:

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date","Mod Date"
"02848v","Sent email","Costco","James Brown","10/9/2014","10/10/2016 10:00:00"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016","4/13/2016 14:23:04"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016","7/10/2016 9:35:36"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015","3/5/2015 13:01:01"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013","9/30/2015 10:04:58"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016","6/1/2016 11:38:00"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015","2/8/2015 08:36:00"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014","9/4/2014 07:35:00"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016","7/10/2016 08:34:20"

Here's some code to do this:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1))
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)

Or a more concise version of the code, where I strung multiple lines together and didn't stop to give them nicknames (this form gets easier to read as your code gets longer):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
print(df.groupby('Acct Name').apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1)).reset_index(drop=True))

And here's the output text:

    Actv Id           Actv Subject               Acct Name     Contact Name  Actv Date            Mod Date
0     dc89a             Phone call                  Costco    Sheryl Larson 2016-05-29 2016-06-01 11:38:00
1  vsd8923j             Phone call    Gallaudet University  Maria Hernandez 2016-03-08 2016-04-13 14:23:04
2   adf8o32  Conference invitation              Salesforce   Leslie Andrews 2015-02-07 2015-02-08 08:36:00
3  3289vd09             Sent email  United States Congress      Tamika Page 2016-07-09 2016-07-10 09:35:36

Again, I can spot-check it because the tie-break between the two "United States Congress" records is that Tamika's July 2016 activity record was modified about an hour after Shirley's.

  • If I change "ascending=[True,True]" to "ascending=[True,False]" (sort 'Actv Date' ascending but 'Mod Date' descending, then pick the last record in the group) the tie-break goes to Shirley's record instead.
  • Similarly, if I change "ascending=[True,True]" to "ascending=[False,True]" (sort 'Actv Date' descending but 'Mod Date' ascending, then pick the last record in the group), I'm saying I want the least recent 'Actv Date', but I still want ties broken by most recent 'Mod Date.'
  • And if I change "ascending=[True,True]" to "ascending=[False,False]" (sort both dates descending, then pick the last record in the group), I'm saying I truly want the least recent record in the group.
    (In this case, "[False,True]" & "[False,False]" produce the same results because I don't have any ties for "oldest activity in the group" on 'Actv Date' that need breaking by 'Mod Date'. Tamika's March 4, 2015 record is the oldest for Congress, etc.)

Table of Contents

Python for Salesforce Administrators - Installing Python 3 on Windows

As I mentioned here, an "IDE" is, basically, a text-editor with a "play" button to execute whatever code you've typed into the text editor.

My favorite suite of software for running "Python 3" on Windows is called WinPython.
(Read up here about Python 2 & Python 3.)

You can install and run WinPython without administrator rights.

The latest versions are always listed at the software's Github page, but Windows 7 & Windows 8 users whose systems don't have the "Microsoft Visual C++ Redistributable for Visual Studio 2015" update installed into Windows might not be able to actually run the very latest versions of WinPython after installing them. (And, unfortunately, you can't install this update to Windows without administrator rights.)

The most recent "64-bit" version of WinPython for "Python 3" that I've found that works without that update is WinPython 3.4.3.5, which you can download here.

WinPython doesn't really install anything on your computer. It just puts a bunch of files into whatever folder you tell it to put those files into. When you want to "uninstall" it, you simply delete the folder. (Similarly, you have to create your own Desktop and Start-Menu shortcuts by hand.) If you "install" a certain version of WinPython and find that you can't run certain programs due to an error saying that "api-ms-win-crt-runtime-l1-1-0.dll" is missing, or if those programs never show any sign of running at all, delete your "installation" and start over with a different version of WinPython.


Here are screenshots of the steps I took as I installed WinPython 3.4.3.5, 64-bit, on a thumb drive.

Step 1: download the installer file to my hard drive

Step 2: run the installer file by double-clicking on it

Step 3: wait a few seconds for the installer to come up

Step 4: accept the licensing agreement

Step 5: specify where on your computer (or on a thumb drive) you want the folder containing "WinPython" files to live & click "Install"

Step 6: wait for the installer to put all of the files on your hard drive, then click "Next," then click "Finish"


Let's see what we've done so far.

Step 7: navigate to the folder where you installed WinPython and observe the two files "Spyder.exe" and "WinPython Command Prompt.exe"

  • "Spyder" is your IDE (the text-editor with a play button)
  • The "WinPython Command Prompt" is what you run if you need to type commands that begin with "pip" (which is command-line-interface software that can download and install Python "modules" you don't already have based on the module name - if you got a "full" version of WinPython, you probably won't need to use this)

You might want to manually create shortcuts to these two files - or at least to Spyder - in your Start Menu and/or on your Desktop.


All right ... let's run some code!

Step 8: run "Spyder.exe" by double-clicking on it (it takes a minute to fully load, but something has probably gone wrong if you don't see a greenish "splash screen" with a spiderweb on it within 20 seconds)

Step 9: type the following text into Spyder's text-editor and click the "play" button (or hit "F5"), which means "run":

print('hello there')

Note: If you get a pop-up dialog with all sorts of options about executing code in consoles or interpreters and such, click "cancel" on the pop-up dialog box and then click the play button again. The pop-up box was giving you a chance to highly customize where you want to see any output from running your code. Clicking "cancel" and trying again seems to tell Spyder, "you figure out where to display the output of this code, don't bother me with all these options." (Unfortunately, I didn't catch this dialog box with a screenshot.)

Step 10: observe that the "console" in another corner of your screen now says "hello there"!

Note: Typically, when you are programming, the first thing you do with a new tool is try to get it to display some simple text on your screen. That's what we did in Steps 9 & 10. But it's important to remember that not every command in a programming language displays text on the screen when it runs. For example, in some of the Python examples on this blog, certain commands create CSV files - they don't display any text on the screen at all when they're done creating the CSV file. To know when such code is done, you could add a line to your script afterwards along the lines of "print('all done')". You may also pick up on subtle changes in the "console" in Spyder as code starts & stops running (for example, you get a new "[]" prompt - which you can ignore - when code has finished running).

Nevertheless, no matter what your code does, the area of Spyder where you typed in Step 9 is the area where you will want to copy/paste all of the example code from this blog before tweaking it to match your own needs and clicking "Run" (the play-button / F5).

Now that you're up and running, time to learn how to use Python!


Web-Browser-Based Exercises

DO NOT DO NOT DO NOT use your own company's data with this trick.

You have to promise me. Pinky-swear.

You MUST use a properly-installed copy of Python as described above for data you actually want to work with, no matter how inconsequential you think it is.

BUT ... if you'd like to just play along with the "sample1.csv"-type data in this blog, I've found a way you can do so right now, using nothing but your web browser.

I painstakingly made all of that data up, word by word. That's why it's okay to just throw it into some random service hosted online.

Real CSV files from your company are NOT okay to throw into some random service hosted online. Even if it doesn't seem like important data. Just do it on your local computer and be safe, not sorry.

Anyway ... for imitating exercises in this blog, you can go to https://repl.it/languages/python3

(Update, 2018: repl.it sometimes seems to run slowly, lately, when using the "Pandas" functionality of Python. At the moment, https://codebunk.com/b/ still seems to work, but that link is just meant for trying their paid product, so please don't go too crazy.)

If an exercise in this blog claims it works with a file "C:\tempexamples\sample000.csv" whose contents look like this:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"

And then if the Python code looks like this:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample000.csv')
print('---Here is a sampling of the first lines---')
print(df.head())

Then your code at Repl.it will be:

import sys
if sys.version_info[0] < 3: from StringIO import StringIO
else: from io import StringIO

sample = StringIO("""
"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
""")

import pandas
df = pandas.read_csv(sample)
print('---Here is a sampling of the first lines---')
print(df.head())

Note that you have to put in 3 extra lines importing additional "modules," then wrap the copied/pasted contents of the CSV file in some weird, very precise code, and finally, in the "df = pandas.read_csv()" line, instead of saying "'C:\\tempexamples\\sample000.csv'", you refer to the nickname you gave your fake CSV file (in this case, "sample") - without any single-quotes around it.

Drive carefully. Remember your pinky-swear.


Table of Contents

Thursday, December 1, 2016

Salesforce Application Deployment and Versioning - a few random thoughts

At work, we're starting to talk about overhauling our software application build, deployment, and versioning processes and technologies.

One of the tensions in coming up with organized versioning has to do with the fact that these days, it's considered good practice to push software application code onto end-user-facing servers via your version control system. An alternative approach is to push software application code onto end-user-facing servers some other way, and then separately throw a copy of "the latest server code" into your version control system. In my opinion, it can be pretty tough to decide which of these philosophies to follow when trying to use version control software with Salesforce.


Background: Each copy of Salesforce that a company buys comes with at least one ("production"), and possibly several more ("sandbox"), independent databases. (They're vaguely aware of each other's existence, but live on different servers, don't have any configuration-file interdependencies, etc.) "Sandbox" databases can be destroyed and rebuilt -- empty of data -- in the image of the "production" database (application code included), at the click of a few buttons from the "production" database's web-based administrative UI. New "sandbox" databases can be created the same way.


The back-end configuration files from which each of those databases knows its own definition (schema, application code, user permissions) are in plain text (usually XML, but some Java-like source code files as well) and live on Salesforce's servers along with the data stored in the database.

For each database, there are 3 ways to modify those back-end configuration files:

  1. Log into the database's web-based administrative UI and point-and-click your way through changes (think of it like modifying an MS Access database).
    • In a "production" database, this option is only allowed for schema and user permission changes, not for application code changes.
    • In a "sandbox" database, the web-based administrative UI also involves a small IDE for modifying application code. Such code is often moved to a "production" database via option #2 below.
  2. Log into the web-based UIs of 2 databases related to the same "copy of Salesforce bought by a company" and specify that certain components of one database should be written to the other. (Often used by novices for migrating application code from a "sandbox" database to a "production" database. Known as using "change sets.")
  3. Upload a fresh copy of a configuration file to the database's SOAP HTTP API endpoint in a packaged "build" ZIP file.
    • (Downloading/uploading via the Eclipse "Force.com IDE" plugin or the "ANT Force.com Migration Tool" or Workbench or MavensMate or WelkinSuite are all just tools that take care of the "SOAP" part of option #3 for you.)

Option #3, which involves getting/sending copies of the configuration files to and from Salesforce's servers, is the only option that truly integrates with version control systems. That's because it's the only option that actually involves using copies of the configuration files sitting on a machine that isn't the Salesforce server itself. (Salesforce won't let you make a version control repository out of the database server itself the way some other code-hosting cloud services will.)

Because option #3 can involve version control systems, it plays nicely with the modern software application deployment philosophy of "push configuration files onto end-user-facing servers via your version control system." However, a key rule that everyone has to follow to make this philosophy work seems to be: "Do NOT modify the database's configuration files in any way other than via the version control system. Seriously. Don't. Even if the database will let you. Just don't."

Unfortunately, that's pretty much impossible to enforce with Salesforce. Salesforce advertises how easy it is to use option #1 & option #2. They encourage "clicks, not code" for schema and user permission changes (as well as for some "application" development). In the end, it's unlikely that the database configuration files actually out on Salesforce's servers truly match the copies of those configuration files sitting in your version control system if you're only moving copies in the direction of VCS repository -> database.


If you find that your company is mostly using option #1 and option #2 to modify code in your Salesforce databases, you can still use version control software. Essentially, you use the "get copies of database configuration files" functionality of the database's SOAP HTTP API to download copies of code onto a machine that does host a version control system repository. Do this as often as you want to "checkpoint" the current state of a Salesforce database. Example here.

I've also heard of companies that finagle some sort of hybrid solution: they do what I just suggested with a scheduled tool for everything but configuration files representing "application code."
People who write "application code" are expected to follow "push configuration files onto the end-user-facing server via your version control system" practices.
(From what I understand, they have to be pretty skilled with version control tool branching and merging, as well as efficient XML file editing, to handle the fact that lay-people might concurrently use option #1 & option #2 to modify schemas & user permissions that they also need to modify and were planning to modify via option #3.)



Personally, I think it's also hard to decide how many version control system repositories to spin up:

  • One repository for each database in a "copy of Salesforce a company bought"
  • One big repository (perhaps with 1 sub-folder per database) for each "copy of Salesforce a company bought"

Novices to Salesforce and Git who use the Eclipse "Force.com IDE" plugin to facilitate downloading copies of their databases' configuration files could easily find themselves in the "1 repository per database" approach. Version control systems often use a computer's file-system folder as the boundary of a "repository," and the plugin's default setup seems to give you one file-system folder per database (not clumped together into any sort of overarching folder representing the "copy of Salesforce a company bought").

  • That might actually be rather nice, because then your check-in comments about changes you've made in the various databases are naturally segregated from each other. It might also be easier to use graphical user interface tools to search a repository for a file by its name, without seeing a bunch of clutter from other databases, if the repository only represents one database.
  • However, you're leaving your colleagues with more repositories to wade through asking themselves, "What's this one for?"

I don't completely understand it yet, but my understanding is that people in "push configuration files onto the end-user-facing server via your version control system" workplaces prefer the "one big repository for each 'copy of Salesforce a company bought'" repository structure.

Wednesday, November 2, 2016

Visualforce vs. Lightning Components Side-By-Side 1

In my last post, I proposed converting an existing Visualforce page to use Lightning Components. I decided instead to start with a few simple "build the same thing in both environments and compare them" exercises.

Visualforce & Lightning Components are different coding environments for writing web applications, hosted in the user interface of a Salesforce database, that interact with data stored in that database.

Here is the first such exercise.


Apex Controller "BlogExample1ContactController.apxc" - used by both files:

public class BlogExample1ContactController {
    @AuraEnabled
    public static List<Contact> getCsFromServer() {
        // Do various proper security stuff and then...
        return [SELECT LastName, FirstName, Id, Email FROM Contact WHERE LastName LIKE 'LastXYZZY-%']; // <<-- DO NOT actually just do this.  Need proper security stuff.
    }
}

This server-side code, when executed, returns a list of "Contact"-typed records from the database (filtered to expose only records where LastName starts with "LastXYZZY-," and exposing only data from the LastName, FirstName, Id, & Email columns of those records).

 


The Visualforce page only takes 1 more file of code to get data displaying at https://MyCustomDomainName.na##.visual.force.com/apex/BlogExample1VFPage.

Visualforce Page "BlogExample1VFPage.vfp" - used in Visualforce:

<apex:page docType="html-5.0" controller="BlogExample1ContactController">
    <p>Hi</p>
    <apex:repeat value="{!CsFromServer}" var="cntct">
        <hr/>
        <p><apex:outputField value="{!cntct.LastName}"/></p>
        <p><apex:outputField value="{!cntct.FirstName}"/></p>
        <hr/>
    </apex:repeat>
    <p>Bye</p>
</apex:page>

 

Here's what visiting the Visualforce page looks like:


The Lightning Components app requires 4 more files of code (1 "app" + 1 "component" and 2 more JavaScript files as its "controller" & "helper" files) to get data displaying at https://MyCustomDomainName-dev-ed.lightning.force.com/c/BlogExample1LCApp.app.

Lightning Components App "BlogExample1LCApp.app":

<aura:application >
    <c:BlogExample1LCBasicComponent />
</aura:application>

This app's only job is to exist (lines 1 & 3) - it gets its own URL so I can actually browse to it and see it.

Oh, and it has to say, "go see the Component called 'BlogExample1LCBasicComponent'" (line 2).

"c" is a built-in variable that means "the code-space that is 'all Lightning Components.'"

 

Lightning Components App "BlogExample1LCBasicComponent.cmp":

<aura:component controller="BlogExample1ContactController">
    <aura:handler name="init" value="{!this}" action="{!c.initializationCode}" />
    <aura:attribute name="cs" type="Contact[]"/>
    <p>Hi</p>
    <aura:iteration items="{!v.cs}" var="con">
        <hr/>
        <p><ui:outputText value="{!con.LastName}"/></p>
        <p><ui:outputText value="{!con.FirstName}"/></p>
        <hr/>
    </aura:iteration>
    <p>Bye</p>
</aura:component>

My Lightning Components example is so simple that it's just got 1 component.

If we'd wanted fancier HTML representing each record of the database that we're displaying, we could have moved the contents of that "aura:iteration" tagset to their own component and put a reference to that component inside the tagset instead. (Although we'd have to make sure, when defining it, to give it an "attribute" that can hold a "Contact"-typed record and, when referencing the component, to "pass" that attribute the current value of the iteration's "con" variable.)

Anyway, this component's definition indicates that it's "controlled" by our Apex controller, much as you see in our Visualforce code.

Overall, this component looks a lot like our Visualforce page. The main difference is that we have to explicitly tell it to execute JavaScript code ("initializationCode(...)"). (Note: "initializationCode(...)" itself also has to be told to actually talk to the server-side Apex controller and fetch data ... we'll see that later.)

If we don't do that, there'll be no data between "Hi" & "Bye," even if there's data in the database that matches the query in our Apex controller.

The Visualforce line '<apex:repeat value="{!CsFromServer}" var="cntct">' knows to go talk to the server to fill in data summoned by the code in the "value" tag. (It also has the server loop through that data and generate HTML/CSS.)

The Lightning Components line '<aura:iteration items="{!v.cs}" var="con">' doesn't. It just says, "if this component's 'cs' variable (which lives in a web-surfer's browser) has any data in it, have the browser loop through it and generate HTML/CSS."

Line 3 establishes that the component has a variable named "cs" and that its data type is a list of "Contacts."

Line 2 ('<aura:handler name="init" value="{!this}" ... />') actually executes JavaScript that will go & talk to the server and that is responsible for setting the value of "cs" (in this case, upon page load).

 

Lightning Components JavaScript file "BlogExample1LCBasicComponentController.js":

({
    initializationCode : function(component, event, helper) {
        helper.getCons(component);
    }
})

Every JavaScript function intended to be summoned directly from "component" markup should be in the associated "Controller" file and should be defined to expect 3 parameters: a component, an event, and a helper (in that order).

Every JavaScript function intended to be summoned from other JavaScript, rather than from the "component" markup itself, should be in the associated "Helper" file.

In this case, our "initializationCode" doesn't do any real work - it just executes a JavaScript function found in the "Helper" file and passes it the value of the "component" passed to it (which would be the component from which we summoned it).

 

Lightning Components App "BlogExample1LCBasicComponentHelper.js":

({
    getCons : function(component) {
        var action = component.get("c.getCsFromServer");
        action.setCallback(this, function(response) {
            var state = response.getState();
            if (component.isValid() && state === "SUCCESS") {
                component.set("v.cs", response.getReturnValue());
            }
        });
        $A.enqueueAction(action);
    }
})

This JavaScript function inspects the "component" it's been passed, notices that that "component" has an Apex "controller" class attached to it in its definition, and goes about talking to that code (calling its "getCsFromServer()" method).

If data actually comes back from that process, it assigns that data to the component's variable "cs."

("v," by the way, is included in the framework and is a variable that an instance of a component uses to refer to itself. It also appears in the component's code above.)

 

Here's what visiting the Lightning Components app looks like, when viewed on its own:

(Notice how "vanilla" it looks - what you code is what you get!)


Finally, I have a bunch of random ways of abbreviating the word "contact" or "contacts" scattered throughout the source code to make variable scope clear. Hope it helps, sorry if it confuses!

Sunday, October 23, 2016

Lightning Components vs. Visualforce

In my last post, I wrote that "there's a new, alternative style of coding available for building custom reporting/data-entry screens inside Salesforce ('Lightning Components' instead of 'Visualforce'). Our web developers don't write either right now, but if they did, they'd be happy to know that the new option is more modern-web-development-ey than the old option."

I've been slowly reading the thick book about Lightning Components that I picked up at Dreamforce.

After getting through chapter 2 ("quick start"), I thought, "I should take an existing Visualforce page I've written and rewrite it in Lightning Components as a programming exercise."


Most Visualforce pages I've written are simply a way of displaying the results of an SOQL query to an end-user, so they weren't going to be very exciting.

The most "interactive" one I've written prompts the user for a date and plugs that into an SOQL query whose results are displayed to the end-user. Some of the fields in the display allow their values to be changed, and there are Save/Cancel buttons for all changes made so far on the page. This seemed like my best candidate for a conversion.


Unfortunately, even this isn't a great candidate for conversion, as far as I can tell. Here's why.

Step 1: Think of everything a user of your VF/LC app "does" to the page as a "browser event."
Here are all of mine:

  • User fills a date in the date-picker box
  • User clicks the "submit chosen date and refresh page with data from server, showing event attendees for that date" button**
  • User fills in an "attendance status" picklist with a given value
  • User clicks the "save changes to server and refresh page with latest data from server" button**
  • User clicks the "discard 'attendance status' picklist-value changes made in browser" button*

Step 2: For each "browser event," ask yourself two questions:

  1. Does it actually make any substantial changes happen, besides what the user expected?
    (If "user clicks a checkbox" is the "browser event", then "checkmark toggles inside checkbox" is NOT a "substantial change," but "text next to checkbox changes color" IS.)
  2. Does the "substantial change" depend upon the Salesforce.com server being contacted?
    ("...saves changes to server..." or "...gets data from server..." DO require the server to be contacted. "...Text next to checkbox changes color..." DOES NOT.)

In my case, the "user fills in" items don't do anything "substantial."

My "substantial changes" (marked with at least 1 asterisk & boldfaced) occur when the user clicks buttons.

Of those 3 "substantial changes," 2 are completely dependent upon the Salesforce.com server being contacted (marked with 2 asterisks).

Only the "discard" button's behavior could be done without contacting the server, if you had been saving "old values" in the browser's memory as the user made changes.
Even there, though, it's probably much simpler to just ask the server to provide clean data.


From what I can tell so far, Lightning Components makes "substantial changes that don't have to talk to the Salesforce.com server" easier to write.

However, "substantial changes have to talk to the Salesforce.com server" involve writing way more code in Lightning Components than they require in Visualforce.


Although it has a few "sections" to it, not all of which are always visible, overall, my page is merely a single "editable SOQL query" user-interface with a few extra controls.

More importantly, I want it to behave "synchronously" with the server.
That is, once a user has clicked a button, I don't want them messing around with the clickable parts of the page until the data-refresh from the server is complete. I want them to lose any work they try to do between button-click and page-refresh.

I have never written a Visualforce page that includes any "substantial" results of users' "browser events" (e.g. button-clicks) that are easily handled entirely inside their browser with JavaScript (e.g. "change the color of the text").

Therefore, from what I can tell, my conversion will inherently involve a lot more code in Lightning Components than it involves in Visualforce.
(And more memory usage in the user's browser.)
Please correct me if I'm wrong!


Visualforce has a very efficient one-line syntax for communicating with the Salesforce.com server and refreshing the Visualforce page in response.

Lightning Components, from what I can tell (please correct me if I'm wrong!), makes you:

  • Code a browser-event handler to fire LC-event-#1
  • Code a LC-event-#1 handler to talk to the server and to fire LC-event-#2 upon response
  • Register a listener for LC-event-#2
  • Code a LC-event-#2 handler to refresh the page with data returned from the server

I still plan to do the conversion as practice, but I wonder if there's a way to get the best of both worlds for my apps: Lightning-Components-looking responsive styling coded in a minimal amount of Visualforce.

Wednesday, October 12, 2016

Dreamforce 2016 Lessons Learned

Lucky me, I got to attend Dreamforce. I had colleagues going to plain-English "what you can do with Salesforce" talks, so I filled my days with "how to program Salesforce" lessons. Here's a summary of what I learned.

(Also, I plan to put out a "how to install WinPython on Windows" walk-through but haven't had time to capture screenshots of the version I found that works well without administrator rights. It'll happen, though!)

  1. I got to go to a "gripe to the people who maintain Salesforce" session and gripe about certain "relational databases have been doing this for decades--why doesn't Salesforce?" things (e.g. the fact that data-cleansing "trigger" code can't be written against certain data tables in Salesforce). I learned some interesting history about why such things are broken. (For the triggers: the tables in question had data in them for about 3 years before triggers existed at all, which means the tables have to be rebuilt or something, and apparently that's expensive in Salesforce-employee-time for not-many-customers, lots-of-rows tables, so money to pay someone to do it has to be budgeted from the top at Salesforce.) Ultimately, as I suspected, the almighty dollar is an issue. Fixing such basic functionality doesn't easily compete with adding other features that are better at bringing in new multi-million-dollar contracts to Salesforce, so the problem gets picked at slowly as limited budgets allow. But it was fun to get to tell someone whose job it is to fix these issues how annoying they are to his face and get sympathetically nodded to!
  2. Salesforce is pushing a new "artificial intelligence" product they just acquired pretty hard, but it seems to me that it's for companies whose business models mean they can be "fuzzy" about how they handle customer data. (One of the brag-stories was something along the lines of, "We didn't like our 'unsubscribe rate,' so we used AI to predict who was likely to unsubscribe and to preemptively unsubscribe them!") That said, I didn't actually go to any sessions about Einstein - it just had a display in the way of my coding classes. A marketing colleague was way more interested (e.g. mining Facebook/LinkedIn data).
  3. There's a new, alternative style of coding available for building custom reporting/data-entry screens inside Salesforce ("Lightning Components" instead of "Visualforce"). Our web developers don't write either right now, but if they did, they'd be happy to know that the new option is more modern-web-development-ey than the old option.
  4. There's an option to turn on a new look & feel for the user interface in Salesforce ("Lightning Experience"). It's been pretty hideous (a bunch of icons instead of words), but it's getting better (they brought the words back). Not sure if it's worth the work. The new & old user-interfaces have different features + some overlap, so it's not just flipping a switch. I learned some ways to find out what's around & what isn't + to preview what our org would look like in Lightning Experience. Overall, I think our users tend to want to be able to see lots of data at a time, and font sizes and whitespace seem smaller in the "Classic" UI, so I think "Classic" is better for us.
  5. I learned some things I can do to make sure that my administrator rights for a web application I have hosted externally are secure (e.g. lock down my admin account with two-factor authentication).
  6. I thought I was writing one type of test ("unit tests") against code I was writing in Salesforce, but it turns out it's actually closer to another ("integration tests"). Probably OK because "integration tests" tend to be closer to answering the question, "does this do what the end-user asked for?" Not sure if writing true "unit tests" is actually necessary for what I'm writing, but good to know the difference and some new things to learn ("mocking") if I ever want to do so.
  7. I learned some code necessary to ensure that code I write in Salesforce respects normal user-access settings (it goes beyond "with sharing"). That said, most of what I've written purposely "plays God" & takes care of stuff a user can't. But perhaps some of it doesn't have to "play God," so this is a nice tool to have in my toolbelt. I learned it last year but went for a refresher this year and plan to start using it more.
  8. Database triggers (code that executes when people save data in Salesforce) can make "asynchronous" calls to external services that take a while to return results (e.g. "Hey Google, what's the driving time between these two addresses?"). However, you want to make sure you write such code only for database-data-change conditions that aren't likely to happen too often (e.g. less than 20 times per 10 minutes, or less than a few hundred times per day), because only 50 such calls can stack up in Salesforce at a time, and the 51st & beyond just kind of fail, possibly irrecoverably.
  9. I learned that, with Apex, I can write my own REST API that interacts with my Salesforce org, rather than relying on the Salesforce REST API. This can potentially let me reduce the number of "API calls" I have to make when, say, interacting with my Salesforce org from Python running on my hard drive. Potentially. It doesn't yet pass the XKCD automation test for my work, but there's a rough sketch of the idea just below this list.
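
For flavor, here's a rough sketch of what calling such a custom endpoint from Python might look like. Everything org-specific here is a placeholder/assumption - the "/myreports" URL mapping, the org address, & the session token (you'd get a real one via OAuth or a login call) - but the "/services/apexrest/" part of the URL is genuinely where Salesforce exposes custom Apex REST classes.

import requests

# All of these values are placeholders - substitute your own org & endpoint
instance = 'https://yourInstance.salesforce.com'  # hypothetical org address
session_token = 'PASTE_A_VALID_SESSION_ID_HERE'   # hypothetical session token

# Custom Apex REST classes are exposed under "/services/apexrest/" in your org
resp = requests.get(instance + '/services/apexrest/myreports',
                    headers={'Authorization': 'Bearer ' + session_token})
print(resp.status_code)
print(resp.text)  # whatever your Apex code chose to return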

Tuesday, August 30, 2016

Python for Salesforce Administrators - Combining Multiple Tables

In our last (huge!) post, we did a number of "conditional" row filters and edits (e.g. changing the value of the "email" field if someone's last name began with a certain letter).

In this post, we'll put two CSV files together (by putting two "DataFrames" representing them into a new "DataFrame" with data from both sources). This can replace VLOOKUP(...) or INDEX(MATCH(...)...) operations in Excel.

We'll just do exercises with 2 CSV files at a time, but to combine 3 or more, just combine them "2 at a time" in sequence.
(Combine #1 & #2, then combine the output of that w/ #3, etc.)


(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)


More prep work: create a 2nd CSV

Create another ".CSV" file with the following contents - I am calling mine "sample2.csv" and storing it in the same folder where I stored "sample1.csv":

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"9e84f","Andrea","Smith","as@example.com","Kale"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"x934","Marilyn","Monroe","mm@example.com","Carrots"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"
"02e","Vandana","Shiva","vs@example.com","Amaranth"

Note the following traits about this 2nd CSV file:

  • Some people are in both files, others are in just #1, & others are in just #2
  • None of the column names are spelled exactly the same way between the two CSV files. However, certain columns seem to express the same general idea (name, e-mail).
    • That said, there isn't total consistency - the potato-loving "Albert Howard" in #2 has a different e-mail address from the "Albert Howard" in #1.
      • Q: Is this additional contact information about the same real-world person or are we looking at two different real-world "Albert Howard"s?
      • A: Tough call. This is for you to decide according to your knowledge of your data.
        What's important is to realize that this is a business problem you must solve before telling the computer how to decide that two rows from two different CSV files "belong together."
        (But you probably already knew that if you're an Excel CSV-joining whiz.)
  • Other columns seem to be unique to one CSV or the other in terms of actual meaning: "Company" is just in #1, "FavoriteFood" is just in #2, and #1's "Id" & #2's "PersonId" don't seem to have anything to do with each other with respect to people who are in both CSV files, so those are probably each unique to their respective CSV files as well.

Merging Two DataFrames: Code Overview

Here's the basic way of writing a command for merging two DataFrames (the output from this command is yet another DataFrame):

ReferenceToDataFrameNumber1.merge(ReferenceToDataFrameNumber2, how='MergeTypeCode', on=['ColumnNameHere', 'AnotherColumnNameHere'])

First, pick one of our two DataFrames (doesn't matter which) and decide to call it the "left"-side DataFrame (its columns will come out on the left-hand side of the output DataFrame).

It could be the "nickname" where we've stored the contents of a DataFrame (such as "df1"), or it could be a chunk of code whose output is a DataFrame.

After that, we type ".merge()"

Inside the parentheses of the ".merge()" command, the first thing we type is a reference to our other DataFrame (which we will refer to as the "right"-side DataFrame).

Again, it could be the "nickname" where we've stored the contents of a DataFrame (such as "df2"), or it could be a chunk of code whose output is a DataFrame.

Then we type a comma and a space.

After that, still inside the parentheses of the ".merge()" command, we type "how='...'" where the "..." is one of the following 4 lower-cased words: "outer", "inner", "left", or "right".

  1. "Outer" means, "include every row from both DataFrames in the output, but put matching rows' data together on the same line."
  2. "Inner" means, "in the output, only include rows where the values from both DataFrames match."
  3. "Left" means, "include every row from the 'left-side' DataFrame in the output, and put data from the 'right-side' DataFrame on the same line if it has data that just so happens to match."
  4. "Right" means, "include every row from the 'right-side' DataFrame in the output, and put data from the 'left-side' DataFrame on the same line if it has data that just so happens to match."

Then we type a comma and a space.

Finally, we type "on=...".

  • The "..." can be of the format "'JustOneColumnNameHere'", "['JustOneColumnNameHere']", or, to say that values must match across MULTIPLE column names at the same time (e.g. a first AND last name match) "['ColumnNameHere','AnotherColumnNameHere','AndSoOn']".
  • However, the column names specified in "on" must be spelled the same - all the way down to capitalization - between the two DataFrames.
    So if the spelling between "equivalent" columns is different, as in our data, we're going to have to rename columns from one DataFrame or the other before passing it to the ".merge()" operation. We'll go over the code for that in an example later.

Important Pandas Note: Unlike with Excel VLOOKUP() or INDEX(MATCH()), in any type of Pandas ".merge()" operation, if multiple lines from the DataFrames being merged match each other, you'll get multiple lines of output.

It's actually more accurate than the way Excel handles things, but it takes some getting used to.

So an "inner" merge on DataFrames from these two CSV files using "on='Id'":

Input #1:

"Id","PastEmployer"
"1","RCA"
"2","United States Congress"
"2","Fox"

Input #2:

"Id","BelovedFood"
"1","Snow Peas"
"2","Carrots"
"2","Beets"
"2","Cookies"

Would produce output that looks like this. Note the 6 lines for ID #2 - that's 2 multiplied by 3 (2 lines from input #1, matching 3 lines from input #2):

"Id","PastEmployer","BelovedFood"
"1","RCA","Snow Peas"
"2","United States Congress","Carrots"
"2","United States Congress","Beets"
"2","United States Congress","Cookies"
"2","Fox","Carrots"
"2","Fox","Beets"
"2","Fox","Cookies"

Example With "Outer"

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
print('---Contents of DataFrame "df1":---')
print(df1)
print('---Contents of DataFrame "df2":---')
print(df2)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='outer', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "df1":---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---Contents of DataFrame "df2":---
  PersonId FirstName LastName                     Em FavoriteFood
0    983mv   Shirley   Temple         st@example.com    Lollipops
1    9e84f    Andrea    Smith         as@example.com         Kale
2    k28fo    Donald     Duck         dd@example.com     Pancakes
3     x934   Marilyn   Monroe         mm@example.com      Carrots
4      8xi    Albert   Howard  ahotherem@example.com     Potatoes
5      02e   Vandana    Shiva         vs@example.com     Amaranth
---Contents of DataFrame "newdf":---
      Id    First      Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   5829    Jimmy    Buffet  jb@example.com                          RCA      NaN                    NaN          NaN
1   2894  Shirley  Chisholm  sc@example.com       United States Congress      NaN                    NaN          NaN
2    294  Marilyn    Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers      NaN                    NaN          NaN
4    827  Vandana     Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
5   9284   Andrea     Smith  as@example.com     University of California    9e84f         as@example.com         Kale
6    724   Albert    Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes
7    NaN  Shirley    Temple             NaN                          NaN    983mv         st@example.com    Lollipops
8    NaN   Donald      Duck             NaN                          NaN    k28fo         dd@example.com     Pancakes

"sample1.csv" has 7 rows and "sample2.csv" has 6 rows. In this output, there are 9 rows because 4 rows from each CSV "match" a row from the other CSV exactly once (same first+last name) and can share a line in the output data, but 5 rows are unique to one CSV or the other, for a total of 9 output rows.

I printed the contents of DataFrames I nicknamed "df1" & "df2" in the output, but the area to focus on is the output of the DataFrame we saved by the nickname "newdf" and the code on line 8 that produced it.

There's a lot to go over here, so let's dive in.

First - remember that note from above saying that the column names needed to match? Instead of just passing in the nickname "df2" as a reference to a "right-side" DataFrame at the beginning of the ".merge()" parentheses, I passed in code that generates a new DataFrame.

"df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'})" produces a DataFrame almost exactly like the one stored under the nickname "df2," only its "LastName," "FirstName," & "Em" columns have been renamed to match the naming of columns from the DataFrame stored under the nickname "df1."

That's the main tricky thing I did!

There's a lot of code to read on line 8, but if you think of that entire chunk before the first comma (",") as one big reference to a DataFrame to merge with "df1," you'll see that it still follows the normal pattern that the ".merge()" command requires us to follow.

I set the "how" parameter in line 8 to "'outer'", so our output file includes data from both spreadsheets, with blanks ("NaN") filling in the gaps where we don't know data about someone (because they weren't in the spreadsheet containing that data).

Finally, although I renamed 3 columns as if I were going to merge on them all, in the end, I decided that I just wanted to "match" on First Name & Last Name. I decided I was okay with the e-mail addresses being different.

Note that the "on=" columns got combined (the output only has 1 "First" column and 1 "Last" column), but the other columns that coincidentally had the same name in both inputs (that is, "Email") did not. To help you tell them apart, Pandas renamed them "Email_x" (from the left side) and "Email_y" (from the right side) in the output.


Examples With "Inner", "Left," & "Right"

Let's change just one thing in our code (besides not printing out df1 & df2): let's change "outer" to "inner."

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='inner', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "newdf":---
     Id    First    Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   294  Marilyn  Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
1   827  Vandana   Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
2  9284   Andrea   Smith  as@example.com     University of California    9e84f         as@example.com         Kale
3   724   Albert  Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes

Of our 7 rows from "sample1.csv" and our 6 rows from "sample2.csv," only 4 rows from each share both a first and last name with a row in the other CSV.

Since this is an "inner" merge, we've said we only want the output to contain rows where the two files actually matched each other (here, on first & last name).

---

Now let's change "inner" to "left."

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='left', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "newdf":---
      Id    First      Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   5829    Jimmy    Buffet  jb@example.com                          RCA      NaN                    NaN          NaN
1   2894  Shirley  Chisholm  sc@example.com       United States Congress      NaN                    NaN          NaN
2    294  Marilyn    Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers      NaN                    NaN          NaN
4    827  Vandana     Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
5   9284   Andrea     Smith  as@example.com     University of California    9e84f         as@example.com         Kale
6    724   Albert    Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes

There are 7 rows because that's how many rows "sample1.csv" (which became DataFrame "df1") had. 4 of those rows have "extra" data at right from "sample2.csv".

---

Now let's change "left" to "right."

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='right', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "newdf":---
     Id    First    Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   294  Marilyn  Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
1   827  Vandana   Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
2  9284   Andrea   Smith  as@example.com     University of California    9e84f         as@example.com         Kale
3   724   Albert  Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes
4   NaN  Shirley  Temple             NaN                          NaN    983mv         st@example.com    Lollipops
5   NaN   Donald    Duck             NaN                          NaN    k28fo         dd@example.com     Pancakes

There are 6 rows because that's how many rows "sample2.csv" (which became DataFrame "df2" and then its "renamed" version) had. 4 of those rows have "extra" data at left from "sample1.csv".


Example: Merging On Different Columns Than Before

Let's change just one thing in our code - we're going to say that a first-last name match isn't good enough.

We're going to tell our code to only merge people if first name, last name, AND e-mail match.

(In other words, treat the "Albert Howard" records as two different people since their e-mail addresses don't match.)

This example runs all 4 merge types in one script, so to save typing...

  • I gave that awful piece of code based on "df2" a nickname of "df2renamed"
  • I gave my list of column-names to "match" on a nickname of "joincols".
    (You can give code like "'zzz'" or "['yyy','zzz']" nicknames, too!)

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
df2renamed = df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'})
joincols = ['Last', 'First', 'Email']
newdfouter = df1.merge(df2renamed, how='outer', on=joincols)
newdfinner = df1.merge(df2renamed, how='inner', on=joincols)
newdfleft = df1.merge(df2renamed, how='left', on=joincols)
newdfright = df1.merge(df2renamed, how='right', on=joincols)
print('---Contents of DataFrame "newdfouter":---')
print(newdfouter)
print('---Contents of DataFrame "newdfinner":---')
print(newdfinner)
print('---Contents of DataFrame "newdfleft":---')
print(newdfleft)
print('---Contents of DataFrame "newdfright":---')
print(newdfright)

Output text:

---Contents of DataFrame "newdfouter":---
      Id    First      Last                  Email                      Company PersonId FavoriteFood
0   5829    Jimmy    Buffet         jb@example.com                          RCA      NaN          NaN
1   2894  Shirley  Chisholm         sc@example.com       United States Congress      NaN          NaN
2    294  Marilyn    Monroe         mm@example.com                          Fox     x934      Carrots
3  30829    Cesar    Chavez         cc@example.com          United Farm Workers      NaN          NaN
4    827  Vandana     Shiva         vs@example.com                     Navdanya      02e     Amaranth
5   9284   Andrea     Smith         as@example.com     University of California    9e84f         Kale
6    724   Albert    Howard         ah@example.com  Imperial College of Science      NaN          NaN
7    NaN  Shirley    Temple         st@example.com                          NaN    983mv    Lollipops
8    NaN   Donald      Duck         dd@example.com                          NaN    k28fo     Pancakes
9    NaN   Albert    Howard  ahotherem@example.com                          NaN      8xi     Potatoes
---Contents of DataFrame "newdfinner":---
     Id    First    Last           Email                   Company PersonId FavoriteFood
0   294  Marilyn  Monroe  mm@example.com                       Fox     x934      Carrots
1   827  Vandana   Shiva  vs@example.com                  Navdanya      02e     Amaranth
2  9284   Andrea   Smith  as@example.com  University of California    9e84f         Kale
---Contents of DataFrame "newdfleft":---
      Id    First      Last           Email                      Company PersonId FavoriteFood
0   5829    Jimmy    Buffet  jb@example.com                          RCA      NaN          NaN
1   2894  Shirley  Chisholm  sc@example.com       United States Congress      NaN          NaN
2    294  Marilyn    Monroe  mm@example.com                          Fox     x934      Carrots
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers      NaN          NaN
4    827  Vandana     Shiva  vs@example.com                     Navdanya      02e     Amaranth
5   9284   Andrea     Smith  as@example.com     University of California    9e84f         Kale
6    724   Albert    Howard  ah@example.com  Imperial College of Science      NaN          NaN
---Contents of DataFrame "newdfright":---
     Id    First    Last                  Email                   Company PersonId FavoriteFood
0   294  Marilyn  Monroe         mm@example.com                       Fox     x934      Carrots
1   827  Vandana   Shiva         vs@example.com                  Navdanya      02e     Amaranth
2  9284   Andrea   Smith         as@example.com  University of California    9e84f         Kale
3   NaN  Shirley  Temple         st@example.com                       NaN    983mv    Lollipops
4   NaN   Donald    Duck         dd@example.com                       NaN    k28fo     Pancakes
5   NaN   Albert  Howard  ahotherem@example.com                       NaN      8xi     Potatoes

Note that there are now only 3 rows, not 4, that are considered to "match," because the two Albert Howard records have different e-mail addresses.

That means that there are now (see the sketch after this list for a way to make Pandas count these for you):

  • 10 rows in "outer"
    (3 "matches" and 7 "unique-to-one-or-the-other" rows, instead of 4 & 5)
  • 3 rows in "inner"
    (3 "matches")
  • The same number of rows as before in "left" and "right," but no data filled in from "the other file" on Albert Howard's row in either
    (because it wasn't considered "matching")
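
(If you'd rather have Pandas do this counting itself, newer versions of the ".merge()" command accept an "indicator" parameter - a sketch, assuming your Pandas is version 0.17 or newer:)

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
df2renamed = df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'})
joincols = ['Last', 'First', 'Email']
# indicator=True adds a "_merge" column saying whether each output row was
# found in just the left input ("left_only"), just the right input
# ("right_only"), or both inputs ("both")
newdfouter = df1.merge(df2renamed, how='outer', on=joincols, indicator=True)
print(newdfouter['_merge'].value_counts())  # expect 4 left_only, 3 right_only, 3 both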

Style Notes

  1. I always like to save the results of ".merge()" commands to a new nickname. I tend to call it something like "joindf." So, "joindf = ..."
    I've found that the ".merge()" code takes up so much space on the screen, I really don't want to combine too much more code with it. I prefer to make a clean break at this point, which is where nicknaming comes in handy.
     
  2. You may have noticed all my ".read_csv()" commands added ", dtype=object" inside the parentheses after the filename. I was getting errors when I merged my DataFrames - I think Pandas was trying a little too hard to decide whether values it was importing from CSV were numbers, plain text, etc - so this little piece of code said, "Don't worry about it. Just import what you see as plain text."
    • Q: How did I guess my errors had to do with Pandas trying to figure out numbers vs. plain-text?
    • A: Because the errors seemed to have something to do with Pandas trying to do "absolute value" operations against my data.
      Whenever you have some data in your file that kind of looks numeric / date-like / etc., if you don't actually want to do math against it, a good way to fix errors is to just force your data to import as plain-text.
       
  3. I lied to you a little bit about having to rename columns before you merge two DataFrames. There is another way, and you might run into it online. But I don't like it. Nevertheless, I'll explain it.

Instead of "on=...", you have a "left_on=..." and "right_on=...", where each "..." is a single column-name or a list of column names (of the same size, in the "same order").

However, it produces ugly output, in my opinion. Your data from those columns ends up split across two different output columns, instead of being neatly under one same-named column.

Here's an example - I'll throw in some code we haven't covered in lines 5, 6, & 8 to re-order the columns and expose why I consider it "ugly." Note that in line 4, we just pass the ".merge()" operation a reference to "df2" itself (because in this case, we're not renaming its columns).

Take a look at how Jimmy Buffet's last name is in "Last," but Donald Duck's last name is in "LastName":

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdfugly = df1.merge(df2, how='outer', left_on=['Last','First'], right_on=['LastName','FirstName'])
def put_cols_at_left(referenceToDataFrame, listOfColumnNames):  # moves the named columns to the far left
    return referenceToDataFrame[listOfColumnNames + [c for c in referenceToDataFrame.columns if c not in listOfColumnNames]]  # then keeps every other column in its original order
print('---Contents of DataFrame "newdfugly" (with "last" & "first" columns moved to the left):---')
print(put_cols_at_left(newdfugly, ['Last','LastName','First','FirstName']))

Output text:

---Contents of DataFrame "newdfugly" (with "last" & "first" columns moved to the left):---
       Last LastName    First FirstName     Id           Email                      Company PersonId                     Em FavoriteFood
0    Buffet      NaN    Jimmy       NaN   5829  jb@example.com                          RCA      NaN                    NaN          NaN
1  Chisholm      NaN  Shirley       NaN   2894  sc@example.com       United States Congress      NaN                    NaN          NaN
2    Monroe   Monroe  Marilyn   Marilyn    294  mm@example.com                          Fox     x934         mm@example.com      Carrots
3    Chavez      NaN    Cesar       NaN  30829  cc@example.com          United Farm Workers      NaN                    NaN          NaN
4     Shiva    Shiva  Vandana   Vandana    827  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
5     Smith    Smith   Andrea    Andrea   9284  as@example.com     University of California    9e84f         as@example.com         Kale
6    Howard   Howard   Albert    Albert    724  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes
7       NaN   Temple      NaN   Shirley    NaN             NaN                          NaN    983mv         st@example.com    Lollipops
8       NaN     Duck      NaN    Donald    NaN             NaN                          NaN    k28fo         dd@example.com     Pancakes

It's up to you, though. I sometimes use the "left_on" & "right_on" style when I'm merely experimenting with joining data and don't need any "pretty" output yet because it's just a little less text to type.
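
(If you do want to tidy up that "ugly" output afterward, one possibility - just a sketch, assuming you're happy treating the two name-column pairs as interchangeable - is to fill each blank in "Last"/"First" from its "LastName"/"FirstName" twin and then drop the twins.)

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdfugly = df1.merge(df2, how='outer', left_on=['Last','First'], right_on=['LastName','FirstName'])
# Fill the blanks ("NaN") in "Last"/"First" with values from "LastName"/"FirstName"...
newdfugly['Last'] = newdfugly['Last'].fillna(newdfugly['LastName'])
newdfugly['First'] = newdfugly['First'].fillna(newdfugly['FirstName'])
# ...then drop the now-redundant right-side name columns
newdftidied = newdfugly.drop(['LastName', 'FirstName'], axis=1)
print(newdftidied)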


That's a good pausing point!

Please comment: what would you like to learn next?

Table of Contents