Pages

Wednesday, August 24, 2016

Python for Salesforce Administrators - More Scripts - Rename/Drop/Add Columns

Review: Previous Post

In our last post, we learned to use a programming language called Python and one of its "modules" (additional sets of commands) called Pandas to:

  • Display text on the "standard output" part of our screen
  • Import a CSV file from our hard drive into a special kind of data structure called a "DataFrame," which can be thought of as a virtual Excel spreadsheet
  • Displayed the contents of our "DataFrame" to the screen (all the data & just the first five lines)
  • Displayed just the "last name" column's worth of data from our "DataFrame"
  • Displayed, for each row, whether or not the "last name" column's worth of data started with a given letter
  • Filtered out rows from our "DataFrame" that didn't start with a given letter
  • Exported a CSV file back to our hard drive containing just the rows of our original CSV that started with a given letter

Today we'll rename columns, strip out unneeded columns, and add new columns with "constant" (unchanging from row to row) values. As a real-world example for these tasks:

  • We've exported a list of Contact records as a CSV file.
  • We need to add them all to an existing Campaign.
  • We already know the Campaign's 18-digit ID, so all we need to do is associate the Contacts' 18-digit IDs to this "Campaign ID" and insert these records into the "CampaignMember" table.
    (Nevermind that there are other easy ways to add Contacts to a Campaign...it's an example.)

Prep Work

First, make sure you've created a ".CSV" file like the one described in "First Scripts" (I called mine "sample1.csv"). It looks like this in a text editor:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"294","Marilyn","Monroe","mm@example.com","Fox"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"827","Vandana","Shiva","vs@example.com","Navdanya"
"9284","Andrea","Smith","as@example.com","University of California"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.

Again, the first two lines of most of our examples will be:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')

This tells our code-running software that we will be using the "Pandas" module and imports data from our .CSV file into a "DataFrame" that we give a nickname of "df" for referring to later in our code.


Removing Columns

Let's get rid of the "Email" & "Company" columns. They're just clutter for the task at hand.
(With certain jobs, I prefer to leave "name" fields in the data, even if it won't be imported back into Salesforce, so I can spot-check the output of my script more easily.)

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---The contents of "df" right after import---')
print(df)
df = df.drop(['Email','Company'], axis=1)
print('---The contents of "df" after removing two columns---')
print(df)

Produces this text output:

---The contents of "df" right after import---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---The contents of "df" after removing two columns---
      Id    First      Last
0   5829    Jimmy    Buffet
1   2894  Shirley  Chisholm
2    294  Marilyn    Monroe
3  30829    Cesar    Chavez
4    827  Vandana     Shiva
5   9284   Andrea     Smith
6    724   Albert    Howard

Lines 3 & 6 just make it easier to read our script's output.

Line 4 displays the contents of our "DataFrame" that we called "df" immediately after generating it by importing our CSV file.

Line 5 of our script comes in two parts.

The first part is after the "equals" sign and means, "generate a new 'DataFrame' that is what you get when you strip columns named 'Email' and 'Company' out of the 'DataFrame' I've called 'df'."
(Trivia: Yesterday, I said that sometimes, certain operations in Pandas could have very similar ways of typing them depending on whether you wanted them to perform the behavior against a row or a column, and that sometimes, Pandas would just figure it out, whereas other times, you had to specify exactly what you meant to get the behavior you want. ".drop()" is one of those Pandas operations that needs things spelled out. In this code, the "axis=1" specifies that we want to drop columns, not rows.)

The second part is before + including the "equals" sign says, "You know that new 'DataFrame' you just made me? Save it over the old contents of 'df' - I don't need those anymore; I like this 'DataFrame' better. Whenever I refer to 'df' from now on, I'm referring to this new 'DataFrame' you just made me."

Line 7 displays the contents of the new version of "df," which now only has 3 columns.


Renaming Columns

When you insert records into the "CampaignMember" table, you have to provide a "ContactId" field & a "CampaignId" field. Since our data came straight out of the "Contact" table, its ID is still just called "Id." Let's rename it to "ContactId." And let's rename "First" to "First Name" and "Last" to "Last Name" while we're at it, just to show you how to rename multiple columns at once.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df = df.drop(['Email','Company'], axis=1)
df = df.rename(columns = {'Id':'ContactId', 'First':'First Name', 'Last':'Last Name'})
print('---The contents of "df" after removing two columns and renaming columns---')
print(df)

Text output:

---The contents of "df" after removing two columns and renaming columns---
   ContactId First Name Last Name
0       5829      Jimmy    Buffet
1       2894    Shirley  Chisholm
2        294    Marilyn    Monroe
3      30829      Cesar    Chavez
4        827    Vandana     Shiva
5       9284     Andrea     Smith
6        724     Albert    Howard

Line 3 was "line 5" in our previous example. It makes a new "DataFrame" with just 3 columns (right-side of the equals sign) and overwrites the value of "df" with this new "DataFrame" (the equals sign and to its left).

Line 4 works similarly.

To the right of the equals sign, the code says, "generate a new 'DataFrame' that looks just like the contents 'df' (as 'df' stands after line 3), only with the various columns renamed."

The second part is before + including the "equals" sign says, "You know that new 'DataFrame' you just made me? Save it over the old contents of 'df' - I don't need those anymore; I like this 'DataFrame' better. Whenever I refer to 'df' from now on, I'm referring to this new 'DataFrame' you just made me."

Line 5 displays a label for readability.

Line 6 displays the contents of the newest version of "df," which now has 3 columns, all of which have been renamed.

Adding A Column With A Fixed Value

Now let's add the "18-digit Id" of the Campaign we want to put all these people into. (I'm using "xyzzy" instead for legibility.)

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df = df.drop(['Email','Company'], axis=1)
df = df.rename(columns = {'Id':'ContactId', 'First':'First Name', 'Last':'Last Name'})
df['CampaignId']='xyzzy'
print('---The contents of "df" after removing two columns, renaming columns, and adding a column.---')
print(df)

Text output:

---The contents of "df" after removing two columns and renaming columns---
   ContactId First Name Last Name CampaignId
0       5829      Jimmy    Buffet      xyzzy
1       2894    Shirley  Chisholm      xyzzy
2        294    Marilyn    Monroe      xyzzy
3      30829      Cesar    Chavez      xyzzy
4        827    Vandana     Shiva      xyzzy
5       9284     Andrea     Smith      xyzzy
6        724     Albert    Howard      xyzzy

This is the same as our previous code example, only I added a new line of code on line 5.

Note that we don't have the "df = df.something()" syntax anymore.

Instead, we put brackets on the "df" to the left of the equals sign and the value we wanted in every single row for this new column on the right of the equals sign.

That's just a difference in syntax (code-spelling & "grammar") to get used to in Pandas commands.

When you add a column, you say, "You know 'df'? Well, I'd like you to modify it, and here's how."

When you drop or rename a column, you say, "Take the old copy of 'df', make me something new based on it, and then overwrite 'df' with this 'something new.'"

(Trivia - actually, that's only sorta true. There is usually an "inPlace=true" flag you can add to "df.something()" operations to make them use less space on your computer (memory) while they run - in return, you drop the "df = " part of the code. However, it's not always reliable when you run the code, so if you're not having any problems with with my code examples, I don't recommend using this flag.

In other words, "df = df.drop(['Email','Company'], axis=1)" and "df.drop(['Email','Company'], axis=1, inplace=True)" get the same job done, and the 2nd is theoretically more efficient, but every once in a while with the 2nd version, when you run your code, the whole thing will crash with a random error.)

Anyway, line 5 says, "modify 'df' so that it has a new column called 'CampaignId,' and in every row in that new column, put a value of 'xyzzy'."

What's really interesting is that this is very similar to the syntax for referring to the contents of an existing column - the only difference is that the word we typed inside the square brackets didn't yet exist in our "DataFrame." Instead of giving us an error, though, Pandas just said, "Oh, you must want me to create that for you - just put an '=' to the right of your reference to this nonexistent column and a value for me to put in its rows, and I'll go right ahead & do that."

Line 6 displays a label for readability.

Line 7 displays the contents of the newest version of "df," which now has 4 columns, 3 of which are renamed from their original names, and 1 of which is brand new.


That's it for this post! Coming up, we'll work on more "variable" ways of adding/changing data in our DataFrame. (e.g. changing "Male" to "M" or adding a new column, but setting its value based on the values we find in existing columns.)

This is where Pandas can really start to get faster than Excel - in Excel, you would filter your CSV file, copy/paste a value over every matching row, filter it again, copy/paste again, etc.

For now, practice these, make sure you can read them, and ask in the comments if you have any questions about the code.


Table of Contents

No comments:

Post a Comment