Pages

Thursday, December 8, 2016

Python for Salesforce Administrators - Filtering Rows By Maximum Date Per Group

I had a special request to show off Python for filtering a CSV file to only leave behind the "most recent activity" for a given "Account" record. Happy to oblige!

However, I can't explain as well as I've been doing in other examples exactly what's going on here. It's a bit of black magic to me, but it works. Follow along.


Preparing A CSV

Now I'm going to open Notepad and make myself a CSV file (when I do File->Save As, change the "Save as File Type" from .TXT to "All Files (*.*)") and save the file as "c:\tempexamples\sample3.csv".

Here's what that file contains (representing 5 columns worth of fields & 7 rows worth of data):

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date"
"02848v","Sent email","Costco","James Brown","10/9/2014"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016"

If I run this code (note that I've got a special "read_csv" line that specifies to treat 'Actv Date' as a sortable date):
import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
print(df)
The output looks like this:
    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
3     das90      Lunch appointment  United States Congress       Tamika Page 2015-03-04
4   vad0923             Sent email              Salesforce    Leslie Andrews 2013-04-28
5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
7     fa9s3  Breakfast appointment                  Costco       James Brown 2014-09-03
8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Grouping and Filtering

Firstly, I'm going to show you the final code.

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
outputdf = groupedDataFrame.reset_index(drop=True)
print(outputdf)

The output looks like this:

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
2   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
3  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

(Exporting back to CSV instructions here, under the 5th program, "Writing To CSV.")

Note that "United States Congress" still has 2 records, because two things happened on the same date. This particular script I gave you doesn't "toss a coin" and break the tie between e-mailing Tamika and calling Shirley. It just leaves both "most recent" records in place.
(See the end of this post for tie-breaking.)

I can spot-check that it works because when line 4 ends with "max()" (as in most-recent activity), the result for Costco is Sheryl Larson's activity in May 2016. If I change it to "min()" (as in least-recent activity), the result for Costco is James Brown's activity in September 2014.

Variation - Multi-Column Group

If I change ".groupby('Acct Name')" at the end of line 3 in the previous example to ".groupby(['Acct Name', 'Contact Name'])" and leave line 4 as "max," I get the following output (note 2 Costco rows - one for James, one for Sheryl, but with James's October 2014 activity as his most recent):

    Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
0    02848v             Sent email                  Costco       James Brown 2014-10-09
1     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
2  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
3   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
4    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
5  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09

What's Really Going On?

Line 3, "df.groupby('Acct Name')," produces data that we can save with a nickname ("groupingByAcctName"), but that doesn't really print well (try it - try adding a "print(groupingByAcctName)" line to the script). However, trying to print it does tell us that it's a "type" of data called a "DataFrameGroupBy." (Just like there are "types" of data called "DataFrames" & "Series.")

Line 4 applies a "lambda" function to our cryptic data that we stored under the nickname "groupingByAcctName."

The output, which we give a nickname of "groupedDataFrame," is just a "DataFrame." (I can tell by running the line "print(type(groupedDataFrame))".)

We've seen the ".apply(lambda ...)" operation before, in "Fancier Row-Filtering and Data-Editing."

It turns out that it can also be tacked onto the end of "DataFrameGroupBy"-typed data, not just "Series"-typed data.

When ".apply(lambda ...)" is tacked onto Series-type data, it does whatever's inside the "lambda ..." to each item in the Series and spits out a new Series of the same length with contents that have been transformed accordingly.

A DataFrameGroupBy can be thought of kind of like a "Series that holds DataFrames" (each mini-DataFrame being a row of the original DataFrame that falls into the "group").
When ".apply(lambda ...)" is tacked onto DataFrameGroupBy-type data, it does whatever's inside the "lambda ..." to each DataFrame in the DataFrameGroupBy and spits out a new DataFrame that jams the results back together (although, as we'll see later, the labeling is a bit complex).

In this case, there were 4 unique 'Acct Name' values. What we did to each mini-DataFrame inside was say, "Just show the rows of this mini-DataFrame that have a maximal 'Actv Date.' One 'Acct Name' had a tie on 'Actv Date,' so we actually ended up with 5 rows in the output DataFrame (which we then stored under the nickname "groupedDataFrame").

Our "CodeThatMakesAValueGoHere" was "x[x['Actv Date'] == x['Actv Date'].max()]"

What's going here is that "x" is a DataFrame.
"lambda x : ..." means, "Use 'x' as a placeholder for each of whatever is in the thing we're running this '.apply(lambda ...)' on."
If we were to use our original straight-from-the-CSV DataFrame and say "print(df[df['Actv Date'] == df['Actv Date'].max()])" we would get 2 rows: the tie for "most recent activity in the whole original DataFrame" (the July 10, 2016 activities).
However, now we're saying "Do the same thing, but within each of the mini-DataFrames that have already been broken up into unique 'Actv Date' clumps."

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample3.csv', dtype=object, parse_dates=['Actv Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x: x[x['Actv Date'] == x['Actv Date'].max()])
print(groupedDataFrame)

Produces this output:

                           Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name                                                                                                     
Costco                 5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress 2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09
                       8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09

Here we have something with "rows" and "columns," with its rows "numbered" by Pandas - it is indeed a DataFrame, even though it's got some weird groupings and gaps. (Although note how the row-numbers have been preserved from the original DataFrame! And note how to the left of the numbers there are labels that seem to group the numbers ... I said that DataFrame row-numbering was "complicated," didn't I?)

In case you're curious, here's output from the same code, only with the "grouping by both account and contact name" variation mentioned earlier.

                                            Actv Id           Actv Subject               Acct Name      Contact Name  Actv Date
Acct Name              Contact Name                                                                                            
Costco                 James Brown      0    02848v             Sent email                  Costco       James Brown 2014-10-09
                       Sheryl Larson    5     dc89a             Phone call                  Costco     Sheryl Larson 2016-05-29
Gallaudet University   Maria Hernandez  1  vsd8923j             Phone call    Gallaudet University   Maria Hernandez 2016-03-08
Salesforce             Leslie Andrews   6   adf8o32  Conference invitation              Salesforce    Leslie Andrews 2015-02-07
United States Congress Shirley Chisholm 8    938fk3             Phone call  United States Congress  Shirley Chisholm 2016-07-09
                       Tamika Page      2  3289vd09             Sent email  United States Congress       Tamika Page 2016-07-09

Before we can export our DataFrame into something that looks like our original CSV file (which is pretty much everything to the right of the row-numbers), we need to strip off that far-left "Acct Name" and the row-numbers, plus strip out the extra line between the column headers to the right and their data.

That's where the ".reset_index(drop=True)" command comes into play. Tacked onto a DataFrame, it produces a new DataFrame with all the row-numbers reset to "0, 1, 2..." and no labels to the left, plus no whitespace above the data. To make the code easier to read, I saved this new DataFrame as "outputdf."

And that's the end of the dirty details.


Breaking Ties

Afterthought: if you want to break ties so you have exactly 1 output row per 'Acct Name', here's how you do it:

  1. Figure out if there's anything else that's a "secondary tiebreaker."
  2. Once you run out of "secondary tiebreakers," pick an arbitrary single record with ".head(n=1)" or ".tail(n=1)" (depending on whether you want the first or the last record).
    • Making a final tie-break of "Oh, just pick the first line" is a common pattern in data processing when your business requirements say it's vital not to allow any ties to come out of an aggregation.

We have to change our approach to designing our code a bit to accommodate these extra steps. Here's the pattern:

  1. Break our original DataFrame up into a "Series of DataFrames grouped by (whatever we're computing the Max/Min among)"
    (a.k.a. a "DataFrameGroupBy")
    The code for this is ".groupBy(...)"
  2. Run an ".apply(lambda ...)" operation on the result of the previous step. For each mini-DataFrame being processed by the ".apply(lambda ...)" operation:
    1. Sort the rows of the mini-DataFrame first by the "primary sorter column," then by the "secondary sorter column," and so on (".sort_values(...)"). This produces a new mini-DataFrame with all the same rows, but sorted as specified.
      (Note - "ascending" is the default sort for each column with ".sort_values(...)", but you can specify on a column-by-column basis. I've shown that in the example code, but you can leave the ", ascending = ..." code out if you simply want an "ascending" sort across the board.)
    2. Produce a new mini-DataFrame that contains just the first/last row of the mini-DataFrame that came out of the previous step (depending on whether your sorting operation put the min/max you're looking for at the beginning or the end of the mini-DataFrame). The code for this is ".head(n=1)" or ".tail(n=1)".
  3. Strip the extra labels off the resulting big DataFrame so that it's ready for export back to CSV (".reset_index(drop=True)")

Afterthought: if you want to break ties by LastModifiedDate, just do it all over again. I've created a new "sample4.csv" as an example:

"Actv Id","Actv Subject","Acct Name","Contact Name","Actv Date","Mod Date"
"02848v","Sent email","Costco","James Brown","10/9/2014","10/10/2016 10:00:00"
"vsd8923j","Phone call","Gallaudet University","Maria Hernandez","3/8/2016","4/13/2016 14:23:04"
"3289vd09","Sent email","United States Congress","Tamika Page","7/9/2016","7/10/2016 9:35:36"
"das90","Lunch appointment","United States Congress","Tamika Page","3/4/2015","3/5/2015 13:01:01"
"vad0923","Sent email","Salesforce","Leslie Andrews","4/28/2013","9/30/2015 10:04:58"
"dc89a","Phone call","Costco","Sheryl Larson","5/29/2016","6/1/2016 11:38:00"
"adf8o32","Conference invitation","Salesforce","Leslie Andrews","2/7/2015","2/8/2015 08:36:00"
"fa9s3","Breakfast appointment","Costco","James Brown","9/3/2014","9/4/2014 07:35:00"
"938fk3","Phone call","United States Congress","Shirley Chisholm","7/9/2016","7/10/2016 08:34:20"

Here's some code to do this:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
groupingByAcctName = df.groupby('Acct Name')
groupedDataFrame = groupingByAcctName.apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1))
outputdf = groupedDataFrame .reset_index(drop=True)
print(outputdf)

Or a more concise version of the code, where I strung multiple lines together and didn't stop to give them nicknames (this form gets easier to read as your code gets longer):

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample4.csv', dtype=object, parse_dates=['Actv Date', 'Mod Date'])
print(df.groupby('Acct Name').apply(lambda x : x.sort_values(['Actv Date','Mod Date'], ascending=[True,True]).tail(n=1)).reset_index(drop=True))

And here's the output text:

    Actv Id           Actv Subject               Acct Name     Contact Name  Actv Date            Mod Date
0     dc89a             Phone call                  Costco    Sheryl Larson 2016-05-29 2016-06-01 11:38:00
1  vsd8923j             Phone call    Gallaudet University  Maria Hernandez 2016-03-08 2016-04-13 14:23:04
2   adf8o32  Conference invitation              Salesforce   Leslie Andrews 2015-02-07 2015-02-08 08:36:00
3  3289vd09             Sent email  United States Congress      Tamika Page 2016-07-09 2016-07-10 09:35:36

Again, I can spot-check it because the tie-break between the "United States Congress" is that Tamika's July 2016 activity record was modified about an hour after Shirley's.

  • If I change "ascending=[True,True]" to "ascending=[True,False]" (sort 'Actv Date' ascending but 'Mod Date' descending, then pick the last record in the group) the tie-break goes to Shirley's record instead.
  • Similarly, if I change "ascending=[True,True]" to "ascending=[False,True]" (sort 'Actv Date' descending but 'Mod Date' ascending, then pick the last record in the group), I'm saying I want the least recent 'Actv Date', but I still want ties broken by most recent 'Mod Date.'
  • And if I change "ascending=[True,True]" to "ascending=[False,False]" (sort both dates descending, then pick the last record in the group), I'm saying I truly want the least recent record in the group.
    (In this case, "[False,True]" & "[False,False]" produce the same results because I don't have any ties for "oldest activity in the group" on 'Actv Date' that need breaking by 'Mod Date'. Tamika's March 4, 2015 record is the oldest for Congress, etc.)

Table of Contents

No comments:

Post a Comment