
Wednesday, December 21, 2016

Python Miscellany - Exploring A 1-Million-Row Task Table Export

Mass-mailing-system e-mail histories have put us over our Salesforce data storage limits.

Various admin tools say it's the Task table hogging all the space. Indeed, when I exported the entire Task table as a CSV via Apex Data Loader, the file was about a gigabyte on my hard drive. So this was obviously the table we needed to delete some records from (archiving their CSV-export values on a fileserver we own), but it sure wasn't going to be an efficient file to play with in Excel.

This post brings you along on some of the miscellaneous play I did with the file. I'll be describing output, rather than showing it to you, as this is real company data and I'd rather not take the time to clean it up with fake data for public display.

First, I executed the following statements:

import pandas
df = pandas.read_csv('C:\\example\\tasks.csv', encoding='latin1', dtype='object')

Then I commented out those lines (put a "#" sign at the beginning of each one) because that 2nd line of code had taken a LONG time to run. I knew that Spyder, the "IDE" that came with "WinPython," keeps the previous run's variables in memory when you hit the "Run" button again, so in further experimentation, I was simply careful not to accidentally overwrite the value of "df" (that way, I could keep executing code against it).
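In other words, after that first successful run, the top of my script looked like this, so that hitting "Run" again wouldn't repeat the slow file-read:

# import pandas
# df = pandas.read_csv('C:\\example\\tasks.csv', encoding='latin1', dtype='object')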


The next thing I did was export the first 5 lines of "df" to CSV so that they WOULD open easily in Excel, just in case I wanted to look at them there.

df.head().to_csv('C:\\example\\tasks-HEAD.csv', index=0, quoting=1)

I also checked the row count in my Task table, which is how I knew the CSV file had 1 million lines (I exported it yesterday and forgot the result from Apex Data Loader by today):

print(len(df.index))

Next, I interested myself in the subject headings, since I knew that Marketo & Pardot always started the Subject line of mass e-mails in distinctive ways. I just couldn't remember what those ways were.

For starters, I was curious how many distinct subject headings were to be found among the 1 million rows.

print(len(df['SUBJECT'].unique()))

7,000, it turns out.

Next, I wanted to skim through those 7,000 values and see if I could remember what the Marketo-related ones were called.

Just as filtering an Excel spreadsheet and clicking the down-arrow icon that lists all the unique values in alphabetical order gets a little useless when you have too many long, similar values, a Python output console isn't ideal for reading lots of data at once.

Therefore, I exported the data into a 7,000-line, 1-column CSV file (which would open easily & read well in a text editor like Notepad).

pandas.DataFrame(df['SUBJECT'].sort_values().unique()).to_csv('C:\\example\\task-uniquesubjects.csv', index=0, header=False)

Aha: Pardot used "Pardot List Email:" and Marketo used "Was Sent Email:" at the beginning of all subject lines in records that I considered candidates for deletion if they were over 2 years old.

(Note: if an error message comes up that says, "AttributeError: 'Series' object has no attribute 'sort_values'," your version of the pandas library is too old for the ".sort_values()" method against Series-typed data. Replace that part of the code with ".sort(inplace=False)".)
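If you'd rather not hand-edit the script each time you move between pandas versions, here's a small version-agnostic sketch (my own workaround, not part of the original exercise):

if hasattr(df['SUBJECT'], 'sort_values'):  # pandas 0.17 and newer
    sortedsubjects = df['SUBJECT'].sort_values()
else:  # older pandas
    sortedsubjects = df['SUBJECT'].sort(inplace=False)
pandas.DataFrame(sortedsubjects.unique()).to_csv('C:\\example\\task-uniquesubjects.csv', index=0, header=False)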


I have two different business units willing to let me delete old e-mail copies, but they have different timeframes. One is willing to let me delete anything of theirs over 2 years old. The other is willing to let me delete anything of theirs over 1 year old. So now I needed to comb through the raw data matching my two subject heading patterns, looking for anything that would indicate business unit.

print(df[(df['SUBJECT'].str.startswith('Was Sent Email:') | df['SUBJECT'].str.startswith('Pardot List Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))].head())

Unfortunately, nothing in any of the other columns gave me a hint which business unit had been responsible for sending the e-mail.

I knew that we were using Marketo long before Pardot ... if I just deleted Marketo outbound e-mails over 2 years old (the least common denominator between the two business units' decisions), how many rows would I be deleting?

print(len(df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]))

Hmmm. 85,000. Out of 1 million. Well ... not a bad start (about 85MB out of 1GB)...but I need to find more.

Come to think of it, how many is "both Pardot and Marketo, more than a year old," period?

print(len(df[(df['SUBJECT'].str.startswith('Was Sent Email:') | df['SUBJECT'].str.startswith('Pardot List Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1)))]))

200,000. Out of 1 million. So now we're talking about getting rid of 200MB of data (the table is about 1GB), which is probably enough to buy us some time with Salesforce.


So the sweet spot is somewhere between those two numbers, and given the business units' requests, we probably can't do much better than getting rid of 200MB of data.

Maybe, though, it wouldn't be a bad idea to just start with the low-hanging fruit and get rid of that first 85MB of data or so (Marketo >2 years old).

Let's dive just a little deeper into what I think is "Marketo 2-Year-Old e-mail copies" to make absolutely sure that that's what they are.

First ... when I printed the "head()" of Marketo e-mails, I noticed that there seemed to be some diversity in a custom field called "ACTIVITY_REPORT_DEPT__C." Let's see what we've got in there:

tempmksbj2yo = df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]
print(tempmksbj2yo['ACTIVITY_REPORT_DEPT__C'].unique())

"ERP API", "Marketo Sync," & "Student Worker

How many do we have of each?

tempmksbj2yo = df[(df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2)))]
print(tempmksbj2yo.groupby('ACTIVITY_REPORT_DEPT__C').size())

Good to know - there are just a few dozen rows that aren't "Marketo Sync." In future logic (as below), I'll further specify that a "Marketo" e-mail has this "Marketo Sync" trait.


Now let's take the records we want to delete from Salesforce and get them ready. We'll export the raw data to one CSV (for archiving on an on-premise file server) and the record IDs to a second CSV (for feeding to "Apex Data Loader" as a DELETE operation). Let's also export what remains to a third CSV, to make it easier to pick up where we left off hunting for more deletable records (that ".to_csv()" operation takes a LONG time to execute because it's a huge file!). And before all that exporting, let's verify that the "remaining" set is the same size as the original file minus our "to delete" set. (While first verifying the counts, I commented out that last, slow-running export line to keep it from executing.)

mk2yologic = (df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=2))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync')

mk2yo = df[mk2yologic]
remaining = df[~mk2yologic]

print('df size is ' + str(len(df.index)))
print('mk2yo size is ' + str(len(mk2yo.index)))
print('df - mk2yo size is ' + str(len(df.index) - len(mk2yo.index)))
print('remaining size is ' + str(len(remaining.index)))
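# My addition while verifying (not in the original run): fail loudly if the split doesn't add up
assert len(df.index) == len(mk2yo.index) + len(remaining.index)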

mk2yo.to_csv('C:\\example\\task-marketo-2yo-or-more-deleting.csv', index=0, quoting=1)
mk2yo['ID'].to_csv('C:\\example\\task-marketo-2yo-or-more-idstodelete.csv', index=0, header=True)
remaining.to_csv('C:\\example\\task-remaining-after-removing-marketo-2yo-or-more.csv', index=0, quoting=1)

Next, it's on out of Python-land for a while, and into Apex trigger-editing (there's one from an AppExchange plugin that it helps to disable when deleting this many Tasks) and using the "Apex Data Loader" to execute the DELETE operation.


P.S. That didn't help enough. You know where I found another problem? E-mails that went out to a business unit's entire mailing list. 1 task per recipient per e-mail. Here's how I came up with a list of the (10) Subject-ActivityDate combinations that had the largest number of records:

rmn = df[~((df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync'))]
print(rmn.groupby(['SUBJECT','ACTIVITYDATE']).size().sort_values(ascending=True).reset_index(name='count').query('count>10000'))

Those 10 mailings represent 250,000 of our 1 million "Task" records, as verified here:

rmn = df[~((df['SUBJECT'].str.startswith('Was Sent Email:')) & (pandas.to_datetime(df['ACTIVITYDATE']) < (pandas.to_datetime('today')-pandas.DateOffset(years=1))) & (df['ACTIVITY_REPORT_DEPT__C'] == 'Marketo Sync'))]
print(rmn.groupby(['SUBJECT','ACTIVITYDATE']).size().sort_values(ascending=True).reset_index(name='count').query('count>10000')['count'].sum())

Time to talk with some business users about whether this data really needs to live in Salesforce, even though it's recent...

Thursday, December 8, 2016

Python for Salesforce Administrators - Filtering Rows By Maximum Date Per Group

I had a special request to show off Python for filtering a CSV file to only leave behind the "most recent activity" for a given "Account" record. Happy to oblige!

However, I can't explain as well as I've been doing in other examples exactly what's going on here. It's a bit of black magic to me, but it works. Follow along.


Preparing A CSV

I'm going to open Notepad, make myself a CSV file (when doing File->Save As, change "Save as type" from .TXT to "All Files (*.*)"), and save the file as "c:\tempexamples\sample3.csv".

Here's what that file contains (representing 5 columns worth of fields & 9 rows worth of data):

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date"
"02848v","Sent email","Costco","James Brown","10/9/2014"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016"

If I run this code (note that I've got a special "read_csv" line that specifies to treat 'Actv Date' as a sortable date):
import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
print(df)
The output looks like this:
    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
3     das90      Lunch appointment  United States Congress       Tamika Page 2015-03-04
4   vad0923             Sent email              Salesforce    Leslie Andrews 2013-04-28
5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
7     fa9s3  Breakfast appointment                  Costco       James Brown 2014-09-03
8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Grouping and Filtering

Firstly, I'm going to show you the final code.

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)

The output looks like this:

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
3  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

(Exporting back to CSV instructions here, under the 5th program, "Writing To CSV.")

Note that "United States Congress" still has 2 records, because two things happened on the same date. This particular script I gave you doesn't "toss a coin" and break the tie between e-mailing Tamika and calling Shirley. It just leaves both "most recent" records in place.
(See the end of this post for tie-breaking.)

I can spot-check that it works because when line 4 ends with "max()" (as in most-recent activity), the result for Costco is Sheryl Larson's activity in May 2016. If I change it to "min()" (as in least-recent activity), the result for Costco is James Brown's activity in September 2014.

Variation - Multi-Column Group

If I change ".groupby('Acct Name')" at the end of line 3 in the previous example to ".groupby(['Acct Name', 'Contact Name'])" and leave line 4 as "max," I get the following output (note 2 Costco rows - one for James, one for Sheryl, but with James's October 2014 activity as his most recent):

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
2  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
3   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
5  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
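Spelled out in full, that variation reads like this (only the ".groupby(...)" on line 3 changes; "groupingByAcctAndContact" is just my nickname for it):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctAndContact = df.groupby(['Acct Name', 'Contact Name'])
groupedDataFrame = groupingByAcctAndContact.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)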

What's Really Going On?

Line 3, "df.groupby('Acct Name')," produces data that we can save with a nickname ("groupingByAcctName"), but that doesn't really print well (try it - try adding a "print(groupingByAcctName)" line to the script). However, trying to print it does tell us that it's a "type" of data called a "DataFrameGroupBy." (Just like there are "types" of data called "DataFrames" & "Series.")

Line 4 applies a "lambda" function to our cryptic data that we stored under the nickname "groupingByAcctName."

The output, which we give a nickname of "groupedDataFrame," is just a "DataFrame." (I can tell by running the line "print(type(groupedDataFrame))".)

We've seen the ".apply(lambda ...)" operation before, in "Fancier Row-Filtering and Data-Editing."

It turns out that it can also be tacked onto the end of "DataFrameGroupBy"-typed data, not just "Series"-typed data.

When ".apply(lambda ...)" is tacked onto Series-type data, it does whatever's inside the "lambda ..." to each item in the Series and spits out a new Series of the same length with contents that have been transformed accordingly.

A DataFrameGroupBy can be thought of kind of like a "Series that holds DataFrames" (each mini-DataFrame being a row of the original DataFrame that falls into the "group").
When ".apply(lambda ...)" is tacked onto DataFrameGroupBy-type data, it does whatever's inside the "lambda ..." to each DataFrame in the DataFrameGroupBy and spits out a new DataFrame that jams the results back together (although, as we'll see later, the labeling is a bit complex).

In this case, there were 4 unique 'Acct Name' values. What we did to each mini-DataFrame inside was say, "Just show the rows of this mini-DataFrame that have a maximal 'Actv Date.'" One 'Acct Name' had a tie on 'Actv Date,' so we actually ended up with 5 rows in the output DataFrame (which we then stored under the nickname "groupedDataFrame").

Our "CodeThatMakesAValueGoHere" was "x[x['Actv Date'] == x['Actv Date'].max()]"

What's going on here is that "x" is a DataFrame.
"lambda x : ..." means, "Use 'x' as a placeholder for each of whatever is in the thing we're running this '.apply(lambda ...)' on."
If we were to use our original straight-from-the-CSV DataFrame and say "print(df[df['Actv Date'] == df['Actv Date'].max()])" we would get 2 rows: the tie for "most recent activity in the whole original DataFrame" (the July 9, 2016 activities).
However, now we're saying, "Do the same thing, but within each of the mini-DataFrames that have already been broken up into clumps by unique 'Acct Name' value."

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
print(groupedDataFrame)

Produces this output:

                           Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name                                                                                                     
Costco                 5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress 2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
                       8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Here we have something with "rows" and "columns," with its rows "numbered" by Pandas - it is indeed a DataFrame, even though it's got some weird groupings and gaps. (Although note how the row-numbers have been preserved from the original DataFrame! And note how to the left of the numbers there are labels that seem to group the numbers ... I said that DataFrame row-numbering was "complicated," didn't I?)

In case you're curious, here's output from the same code, only with the "grouping by both account and contact name" variation mentioned earlier.

                                            Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name              Contact Name                                                                                            
Costco                 James Brown      0    02848v             Sent email                  Costco       James Brown 2014-10-09
                       Sheryl Larson    5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   Maria Hernandez  1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             Leslie Andrews   6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress Shirley Chisholm 8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
                       Tamika Page      2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09

Before we can export our DataFrame into something that looks like our original CSV file (which is pretty much everything to the right of the row-numbers), we need to strip off that far-left "Acct Name" and the row-numbers, plus strip out the extra line between the column headers to the right and their data.

That's where the ".reset_index(drop=True)" command comes into play. Tacked onto a DataFrame, it produces a new DataFrame with all the row-numbers reset to "0, 1, 2..." and no labels to the left, plus no whitespace above the data. To make the code easier to read, I saved this new DataFrame as "outputdf."

And that's the end of the dirty details.


Breaking Ties

Afterthought: if you want to break ties so you have exactly 1 output row per 'Acct Name', here's how you do it:

  1. Figure out if there's anything else that's a "secondary tiebreaker."
  2. Once you run out of "secondary tiebreakers," pick an arbitrary single record with ".head(n=1)" or ".tail(n=1)" (depending on whether you want the first or the last record).
    • Making a final tie-break of "Oh, just pick the first line" is a common pattern in data processing when your business requirements say it's vital not to allow any ties to come out of an aggregation.

We have to change our approach to designing our code a bit to accommodate these extra steps. Here's the pattern:

  1. Break our original DataFrame up into a "Series of DataFrames grouped by (whatever we're computing the Max/Min among)"
    (a.k.a. a "DataFrameGroupBy")
    The code for this is ".groupby(...)"
  2. Run an ".apply(lambda ...)" operation on the result of the previous step. For each mini-DataFrame being processed by the ".apply(lambda ...)" operation:
    1. Sort the rows of the mini-DataFrame first by the "primary sorter column," then by the "secondary sorter column," and so on (".sort_values(...)"). This produces a new mini-DataFrame with all the same rows, but sorted as specified.
      (Note - "ascending" is the default sort for each column with ".sort_values(...)", but you can specify on a column-by-column basis. I've shown that in the example code, but you can leave the ", ascending = ..." code out if you simply want an "ascending" sort across the board.)
    2. Produce a new mini-DataFrame that contains just the first/last row of the mini-DataFrame that came out of the previous step (depending on whether your sorting operation put the min/max you're looking for at the beginning or the end of the mini-DataFrame). The code for this is ".head(n=1)" or ".tail(n=1)".
  3. Strip the extra labels off the resulting big DataFrame so that it's ready for export back to CSV (".reset_index(drop=True)")

Afterthought: if you want to break ties by LastModifiedDate, it's the same exercise all over again, with a "Mod Date" column as the secondary tiebreaker. I've created a new "sample4.csv" as an example:

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date","Mod Date"
"02848v","Sent email","Costco","James Brown","10/9/2014","10/10/2016 10:00:00"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016","4/13/2016 14:23:04"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016","7/10/2016 9:35:36"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015","3/5/2015 13:01:01"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013","9/30/2015 10:04:58"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016","6/1/2016 11:38:00"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015","2/8/2015 08:36:00"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014","9/4/2014 07:35:00"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016","7/10/2016 08:34:20"

Here's some code to do this:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1))
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)

Or a more concise version of the code, where I strung multiple lines together and didn't stop to give them nicknames (this form gets easier to read as your code gets longer):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
print(df.groupby('Acct Name').apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1)).reset_index(drop=True))

And here's the output text:

    Actv Id           Actv Subject               Acct Name     Contact Name  Actv Date            Mod Date
0     dc89a             Phone call                  Costco    Sheryl Larson 2016-05-29 2016-06-01 11:38:00
1  vsd8923j             Phone call    Gallaudet University  Maria Hernandez 2016-03-08 2016-04-13 14:23:04
2   adf8o32  Conference invitation              Salesforce   Leslie Andrews 2015-02-07 2015-02-08 08:36:00
3  3289vd09             Sent email  United States Congress      Tamika Page 2016-07-09 2016-07-10 09:35:36

Again, I can spot-check it: the tie-break between the two "United States Congress" records goes to Tamika because her July 2016 activity record was modified about an hour after Shirley's.

  • If I change "ascending=[True,True]" to "ascending=[True,False]" (sort 'Actv Date' ascending but 'Mod Date' descending, then pick the last record in the group) the tie-break goes to Shirley's record instead.
  • Similarly, if I change "ascending=[True,True]" to "ascending=[False,True]" (sort 'Actv Date' descending but 'Mod Date' ascending, then pick the last record in the group), I'm saying I want the least recent 'Actv Date', but I still want ties broken by most recent 'Mod Date.'
  • And if I change "ascending=[True,True]" to "ascending=[False,False]" (sort both dates descending, then pick the last record in the group), I'm saying I truly want the least recent record in the group.
    (In this case, "[False,True]" & "[False,False]" produce the same results because I don't have any ties for "oldest activity in the group" on 'Actv Date' that need breaking by 'Mod Date'. Tamika's March 4, 2015 record is the oldest for Congress, etc.)

Table of Contents

Python for Salesforce Administrators - Installing Python 3 on Windows

As I mentioned here, an "IDE" is, basically, a text-editor with a "play" button to execute whatever code you've typed into the text editor.

My favorite suite of software for running "Python 3" on Windows is called WinPython.
(Read up here about Python 2 & Python 3.)

You can install and run WinPython without administrator rights.

The latest versions are always listed at the software's Github page, but Windows 7 & Windows 8 users whose systems don't have the "Microsoft Visual C++ Redistributable for Visual Studio 2015" update installed into Windows might not be able to actually run the very latest versions of WinPython after installing them. (And, unfortunately, you can't install this update to Windows without administrator rights.)

The most recent "64-bit" version of WinPython for "Python 3" that I've found that works without that update is WinPython 3.4.3.5, which you can download here.

WinPython doesn't really install anything on your computer. It just puts a bunch of files into whatever folder you tell it to put those files into. When you want to "uninstall" it, you simply delete the folder. (Similarly, you have to create your own Desktop and Start-Menu shortcuts by hand.) If you "install" a certain version of WinPython and find that you can't run certain programs due to an error saying that "api-ms-win-crt-runtime-l1-1-0.dll" is missing, or if those programs never show any sign of running at all, delete your "installation" and start over with a different version of WinPython.


Here are screenshots of the steps I took as I installed WinPython 3.4.3.5, 64-bit, on a thumb drive.

Step 1: download the installer file to my hard drive

Step 2: run the installer file by double-clicking on it

Step 3: wait a few seconds for the installer to come up

Step 4: accept the licensing agreement

Step 5: specify where on your computer (or on a thumb drive) you want the folder containing "WinPython" files to live & click "Install"

Step 6: wait for the installer to put all of the files on your hard drive, then click "Next," then click "Finish"


Let's see what we've done so far.

Step 7: navigate to the folder where you installed WinPython and observe the two files "Spyder.exe" and "WinPython Command Prompt.exe"

  • "Spyder" is your IDE (the text-editor with a play button)
  • The "WinPython Command Prompt" is what you run if you need to type commands that begin with "pip" (which is command-line-interface software that can download and install Python "modules" you don't already have based on the module name - if you got a "full" version of WinPython, you probably won't need to use this)

You might want to manually create shortcuts to these two files - or at least to Spyder - in your Start Menu and/or on your Desktop.


All right ... let's run some code!

Step 8: run "Spyder.exe" by double-clicking on it (it takes a minute to load -- something went wrong if you don't see a greenish "splash screen" with a spiderweb on it within 20 seconds, though)

Step 9: type the following text into the text-editor in Spyder and click the "play" button (or hit "F5"), which means "run":

print('hello there')

Note: If you get a pop-up dialog with all sorts of options about executing code in consoles or interpreters and such, click "Cancel" on the pop-up dialog box and then click the play button again. The pop-up box was giving you a chance to highly customize where you want to see any output from running your code; clicking "Cancel" and trying again seems to tell Spyder, "you figure out where to display the output of this code, don't bother me with all these options." (Unfortunately, I didn't catch this dialog box with a screenshot.)

Step 10: observe that the "console" in another corner of your screen now says "hello there"!

Note: Typically, when you are programming, the first thing you do with a new tool is try to get it to display some simple text on your screen. That's what we did in Steps 9 & 10. But it's important to remember that not every command in a programming language displays text on the screen when it runs. For example, in some of the Python examples on this blog, certain commands create CSV files - they don't display any text on the screen at all when they're done creating the CSV file. To know when such code is done, you could add a line to your script afterwards along the lines of "print('all done')". You may also pick up on subtle changes in the "console" in Spyder as code starts & stops running (for example, you get a new "[]" prompt - which you can ignore - when code has finished running).
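For instance, in this toy sketch ("sample3-out.csv" is just a made-up output filename), nothing appears on the screen until the final line runs:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv')
df.to_csv('C:\\tempexamples\\sample3-out.csv', index=0)  # creates a file; displays nothing
print('all done')  # signals that the export above finished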

Nevertheless, no matter what your code does, the area of Spyder where you typed in Step 9 is the area where you will want to copy/paste all of the example code from this blog before tweaking it to match your own needs and clicking "Run" (the play-button / F5).

Now that you're up and running, time to learn how to use Python!


Web-Browser-Based Exercises

DO NOT DO NOT DO NOT use your own company's data with this trick.

You have to promise me. Pinky-swear.

You MUST use a properly-installed copy of Python as described above for data you actually want to work with, no matter how inconsequential you think it is.

BUT ... if you'd like to just play along with the "sample1.csv"-type data in this blog, I've found a way you can do so right now, using nothing but your web browser.

I painstakingly made all of that data up, word by word. That's why it's okay to throw it into some random service hosted online.

Real CSV files from your company are NOT okay to throw into some random service hosted online. Even if it doesn't seem like important data. Just do it on your local computer and be safe, not sorry.

Anyway ... for imitating exercises in this blog, you can go to https://repl.it/languages/python3

(Update, 2018: repl.it sometimes seems to run slowly, lately, when using the "Pandas" functionality of Python. At the moment, https://codebunk.com/b/ still seems to work, but that link is just meant for trying their paid product, so please don't go too crazy.)

If an exercise in this blog claims it works with a file "C:\tempexamples\sample000.csv" whose contents look like this:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"

And then if the Python code looks like this:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample000.csv')
print('---Here is a sampling of the first lines---')
print(df.head())

Then your code at Repl.it will be:

import sys
if sys.version_info[0] < 3: from StringIO import StringIO
else: from io import StringIO

sample = StringIO("""
"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
""")

import pandas
df = pandas.read_csv(sample)
print('---Here is a sampling of the first lines---')
print(df.head())

Note that you have to put in 3 lines importing extra "modules," then you have to do some weird, very precise code surrounding a copy/paste of the CSV file's contents, and finally, in the "df = pandas.read_csv()" line, instead of saying "'C:\\tempexamples\\sample000.csv'", you refer to the nickname you gave your fake CSV file (in this case, "sample") - without any single-quotes around it.

Drive carefully. Remember your pinky-swear.


Table of Contents

Thursday, December 1, 2016

Salesforce Application Deployment and Versioning - a few random thoughts

At work, we're starting to talk about overhauling our software application build, deployment, and versioning processes and technologies.

One of the tensions in coming up with organized versioning has to do with the fact that these days, it's considered good practice to push software application code onto end-user-facing servers via your version control system. An alternative approach is to push software application code onto end-user-facing servers some other way, and then separately throw a copy of "the latest server code" into your version control system. In my opinion, it can be pretty tough to decide which of these philosophies to follow when trying to use version control software with Salesforce.


Background: Each copy of Salesforce that a company buys comes with at least one ("production"), and possibly several more ("sandbox"), independent databases. (They're vaguely aware of each other's existence, but live on different servers, don't have any configuration-file interdependencies, etc.) "Sandbox" databases can be destroyed and rebuilt -- empty of data -- in the image of the "production" database (application code included), at the click of a few buttons from the "production" database's web-based administrative UI. New "sandbox" databases can be created the same way.


The back-end configuration files from which each of those databases knows its own definition (schema, application code, user permissions) are in plain text (usually XML, but some Java-like source code files as well) and live on Salesforce's servers along with the data stored in the database.

For each database, there are 3 ways to modify those back-end configuration files:

  1. Log into the database's web-based administrative UI and point-and-click your way through changes (think of it like modifying an MS Access database).
    • In a "production" database, this option is only allowed for schema and user permission changes, not for application code changes.
    • In a "sandbox" database, the web-based administrative UI also involves a small IDE for modifying application code. Such code is often moved to a "production" database via option #2 below.
  2. Log into the web-based UIs of 2 databases related to the same "copy of Salesforce bought by a company" and specify that certain components of one database should be written to the other. (Often used by novices for migrating application code from a "sandbox" database to a "production" database. Known as using "change sets.")
  3. Upload a fresh copy of a configuration file to the database's SOAP HTTP API endpoint in a packaged "build" ZIP file.
    • (Downloading/uploading via the Eclipse "Force.com IDE" plugin or the "ANT Force.com Migration Tool" or Workbench or MavensMate or WelkinSuite are all just tools that take care of the "SOAP" part of option #3 for you.)

Option #3, which involves getting/sending copies of the configuration files to and from Salesforce's servers, is the only option that truly integrates with version control systems. That's because it's the only option that actually involves using copies of the configuration files sitting on a machine that isn't the Salesforce server itself. (Salesforce won't let you make a version control repository out of the database server itself the way some other code-hosting cloud services will.)

Because option #3 can involve version control systems, it plays nicely with the modern software application deployment philosophy of "push configuration files onto end-user-facing servers via your version control system." However, a key rule that everyone has to follow to make this philosophy work seems to be: "Do NOT modify the database's configuration files in any way other than via the version control system. Seriously. Don't. Even if the database will let you. Just don't."

Unfortunately, that's pretty much impossible to enforce with Salesforce. Salesforce advertises how easy it is to use option #1 & option #2. They encourage "clicks, not code" for schema and user permission changes (as well as for some "application" development). In the end, it's unlikely that the database configuration files actually out on Salesforce's servers truly match the copies of those configuration files sitting in your version control system if you're only moving copies in the direction of VCS repository -> database.


If you find that your company is mostly using option #1 and option #2 to modify code in your Salesforce databases, you can still use version control software. Essentially, you use the "get copies of database configuration files" functionality of the database's SOAP HTTP API to download copies of code onto a machine that does host a version control system repository. Do this as often as you want to "checkpoint" the current state of a Salesforce database. Example here.
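As a rough sketch of such a scheduled "checkpoint" script (my own illustration, not the linked example; "retrieveCode" is a build-target name you'd define yourself in the ANT Force.com Migration Tool's build.xml, and the working folder is assumed to already be a Git repository):

import subprocess, datetime
subprocess.check_call(['ant', 'retrieveCode'])  # download config-file copies from Salesforce (on Windows, you may need the full path to ant.bat)
subprocess.check_call(['git', 'add', '-A'])  # stage whatever changed since the last checkpoint
subprocess.check_call(['git', 'commit', '-m', 'Salesforce checkpoint ' + datetime.date.today().isoformat()])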

I've also heard of companies that finagle some sort of hybrid solution: they do what I just suggested with a scheduled tool for everything except configuration files representing "application code." People who write "application code" are expected to follow "push configuration files onto the end-user-facing server via your version control system" practices. (From what I understand, they have to be pretty skilled with version-control branching and merging, as well as efficient XML file editing, to handle the fact that lay-people might concurrently use option #1 & option #2 to modify schemas & user permissions that they also need to modify and were planning to modify via option #3.)



Personally, I think it's also hard to decide how many version control system repositories to spin up:

  • One repository for each database in a "copy of Salesforce a company bought"
  • One big repository (perhaps with 1 sub-folder per database) for each "copy of Salesforce a company bought"

Novices to Salesforce and Git who use the Eclipse "Force.com IDE" plugin to download copies of their databases' configuration files can easily end up with the "1 repository per database" approach, because version control systems often treat a file-system folder as the boundary of a "repository," and the plugin's default setup seems to give you one folder per database (not clumped together into any overarching folder representing the "copy of Salesforce a company bought").

  • That might actually be rather nice, because then your check-in comments about changes you've made in the various databases are naturally segregated from each other. It might also be easier to use graphical user interface tools to search a repository for a file by its name, without seeing a bunch of clutter from other databases, if the repository only represents one database.
  • However, you're leaving your colleagues with more repositories to wade through asking themselves, "What's this one for?"

I don't completely understand it yet, but my understanding is that people in "push configuration files onto the end-user-facing server via your version control system" workplaces prefer the "one big repository for each 'copy of Salesforce a company bought'" repository structure.