Review: Previous Post
In our last post, we learned to use a programming language called Python and one of its "modules" (additional sets of commands) called Pandas to:
- Display text on the "standard output" part of our screen
- Import a CSV file from our hard drive into a special kind of data structure called a "DataFrame," which can be thought of as a virtual Excel spreadsheet
- Displayed the contents of our "DataFrame" to the screen (all the data & just the first five lines)
- Displayed just the "last name" column's worth of data from our "DataFrame"
- Displayed, for each row, whether or not the "last name" column's worth of data started with a given letter
- Filtered out rows from our "DataFrame" that didn't start with a given letter
- Exported a CSV file back to our hard drive containing just the rows of our original CSV that started with a given letter
Today we'll rename columns, strip out unneeded columns, and add new columns with "constant" (unchanging from row to row) values. As a real-world example for these tasks:
- We've exported a list of Contact records as a CSV file.
- We need to add them all to an existing Campaign.
- We already know the Campaign's 18-digit ID, so all we need to do is associate the Contacts' 18-digit IDs to this "Campaign ID" and insert these records into the "CampaignMember" table.
(Nevermind that there are other easy ways to add Contacts to a Campaign...it's an example.)
Prep Work
First, make sure you've created a ".CSV" file like the one described in "First Scripts" (I called mine "sample1.csv"). It looks like this in a text editor:
"Id","First","Last","Email","Company" "5829","Jimmy","Buffet","jb@example.com","RCA" "2894","Shirley","Chisholm","sc@example.com","United States Congress" "294","Marilyn","Monroe","mm@example.com","Fox" "30829","Cesar","Chavez","cc@example.com","United Farm Workers" "827","Vandana","Shiva","vs@example.com","Navdanya" "9284","Andrea","Smith","as@example.com","University of California" "724","Albert","Howard","ah@example.com","Imperial College of Science"
As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.
Again, the first two lines of most of our examples will be:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
This tells our code-running software that we will be using the "Pandas" module and imports data from our .CSV file into a "DataFrame" that we give a nickname of "df" for referring to later in our code.
Removing Columns
Let's get rid of the "Email" & "Company" columns. They're just clutter for the task at hand.
(With certain jobs, I prefer to leave "name" fields in the data, even if it won't be imported back into Salesforce, so I can spot-check the output of my script more easily.)
This code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') print('---The contents of "df" right after import---') print(df) df = df.drop(['Email','Company'], axis=1) print('---The contents of "df" after removing two columns---') print(df)
Produces this text output:
---The contents of "df" right after import--- Id First Last Email Company 0 5829 Jimmy Buffet jb@example.com RCA 1 2894 Shirley Chisholm sc@example.com United States Congress 2 294 Marilyn Monroe mm@example.com Fox 3 30829 Cesar Chavez cc@example.com United Farm Workers 4 827 Vandana Shiva vs@example.com Navdanya 5 9284 Andrea Smith as@example.com University of California 6 724 Albert Howard ah@example.com Imperial College of Science ---The contents of "df" after removing two columns--- Id First Last 0 5829 Jimmy Buffet 1 2894 Shirley Chisholm 2 294 Marilyn Monroe 3 30829 Cesar Chavez 4 827 Vandana Shiva 5 9284 Andrea Smith 6 724 Albert Howard
Lines 3 & 6 just make it easier to read our script's output.
Line 4 displays the contents of our "DataFrame" that we called "df" immediately after generating it by importing our CSV file.
Line 5 of our script comes in two parts.
The first part is after the "equals" sign and means, "generate a new 'DataFrame' that is what you get when you strip columns named 'Email' and 'Company' out of the 'DataFrame' I've called 'df'."
(Trivia: Yesterday, I said that sometimes, certain operations in Pandas could have very similar ways of typing them depending on whether you wanted them to perform the behavior against a row or a column, and that sometimes, Pandas would just figure it out, whereas other times, you had to specify exactly what you meant to get the behavior you want. ".drop()" is one of those Pandas operations that needs things spelled out. In this code, the "axis=1" specifies that we want to drop columns, not rows.)
The second part is before + including the "equals" sign says, "You know that new 'DataFrame' you just made me? Save it over the old contents of 'df' - I don't need those anymore; I like this 'DataFrame' better. Whenever I refer to 'df' from now on, I'm referring to this new 'DataFrame' you just made me."
Line 7 displays the contents of the new version of "df," which now only has 3 columns.
Renaming Columns
When you insert records into the "CampaignMember" table, you have to provide a "ContactId" field & a "CampaignId" field. Since our data came straight out of the "Contact" table, its ID is still just called "Id." Let's rename it to "ContactId." And let's rename "First" to "First Name" and "Last" to "Last Name" while we're at it, just to show you how to rename multiple columns at once.
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') df = df.drop(['Email','Company'], axis=1) df = df.rename(columns = {'Id':'ContactId', 'First':'First Name', 'Last':'Last Name'}) print('---The contents of "df" after removing two columns and renaming columns---') print(df)
Text output:
---The contents of "df" after removing two columns and renaming columns--- ContactId First Name Last Name 0 5829 Jimmy Buffet 1 2894 Shirley Chisholm 2 294 Marilyn Monroe 3 30829 Cesar Chavez 4 827 Vandana Shiva 5 9284 Andrea Smith 6 724 Albert Howard
Line 3 was "line 5" in our previous example. It makes a new "DataFrame" with just 3 columns (right-side of the equals sign) and overwrites the value of "df" with this new "DataFrame" (the equals sign and to its left).
Line 4 works similarly.
To the right of the equals sign, the code says, "generate a new 'DataFrame' that looks just like the contents 'df' (as 'df' stands after line 3), only with the various columns renamed."
The second part is before + including the "equals" sign says, "You know that new 'DataFrame' you just made me? Save it over the old contents of 'df' - I don't need those anymore; I like this 'DataFrame' better. Whenever I refer to 'df' from now on, I'm referring to this new 'DataFrame' you just made me."
Line 5 displays a label for readability.
Line 6 displays the contents of the newest version of "df," which now has 3 columns, all of which have been renamed.
Adding A Column With A Fixed Value
Now let's add the "18-digit Id" of the Campaign we want to put all these people into. (I'm using "xyzzy" instead for legibility.)
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') df = df.drop(['Email','Company'], axis=1) df = df.rename(columns = {'Id':'ContactId', 'First':'First Name', 'Last':'Last Name'}) df['CampaignId']='xyzzy' print('---The contents of "df" after removing two columns, renaming columns, and adding a column.---') print(df)
Text output:
---The contents of "df" after removing two columns and renaming columns--- ContactId First Name Last Name CampaignId 0 5829 Jimmy Buffet xyzzy 1 2894 Shirley Chisholm xyzzy 2 294 Marilyn Monroe xyzzy 3 30829 Cesar Chavez xyzzy 4 827 Vandana Shiva xyzzy 5 9284 Andrea Smith xyzzy 6 724 Albert Howard xyzzy
This is the same as our previous code example, only I added a new line of code on line 5.
Note that we don't have the "df = df.something()" syntax anymore.
Instead, we put brackets on the "df" to the left of the equals sign and the value we wanted in every single row for this new column on the right of the equals sign.
That's just a difference in syntax (code-spelling & "grammar") to get used to in Pandas commands.
When you add a column, you say, "You know 'df'? Well, I'd like you to modify it, and here's how."
When you drop or rename a column, you say, "Take the old copy of 'df', make me something new based on it, and then overwrite 'df' with this 'something new.'"
(Trivia - actually, that's only sorta true. There is usually an "inPlace=true" flag you can add to "df.something()" operations to make them use less space on your computer (memory) while they run - in return, you drop the "df = " part of the code. However, it's not always reliable when you run the code, so if you're not having any problems with with my code examples, I don't recommend using this flag.
In other words, "df = df.drop(['Email','Company'], axis=1)" and "df.drop(['Email','Company'], axis=1, inplace=True)" get the same job done, and the 2nd is theoretically more efficient, but every once in a while with the 2nd version, when you run your code, the whole thing will crash with a random error.)
Anyway, line 5 says, "modify 'df' so that it has a new column called 'CampaignId,' and in every row in that new column, put a value of 'xyzzy'."
What's really interesting is that this is very similar to the syntax for referring to the contents of an existing column - the only difference is that the word we typed inside the square brackets didn't yet exist in our "DataFrame." Instead of giving us an error, though, Pandas just said, "Oh, you must want me to create that for you - just put an '=' to the right of your reference to this nonexistent column and a value for me to put in its rows, and I'll go right ahead & do that."
Line 6 displays a label for readability.
Line 7 displays the contents of the newest version of "df," which now has 4 columns, 3 of which are renamed from their original names, and 1 of which is brand new.
That's it for this post! Coming up, we'll work on more "variable" ways of adding/changing data in our DataFrame. (e.g. changing "Male" to "M" or adding a new column, but setting its value based on the values we find in existing columns.)
This is where Pandas can really start to get faster than Excel - in Excel, you would filter your CSV file, copy/paste a value over every matching row, filter it again, copy/paste again, etc.
For now, practice these, make sure you can read them, and ask in the comments if you have any questions about the code.
Table of Contents
- Introduction & Table of Contents
- Before You Begin Programming
- Installing Python 3 on Windows
- First Scripts - Import/Export CSVs, Filter Out Rows
- Rename/Drop/Add Columns
- Recognizing Pandas DataFrames vs. Series (Important!)
- Fancier Row-Filtering and Data-Editing
- Combining Multiple Tables
- Filtering Rows By Maximum Date Per Group
- Introduction to XML and JSON
No comments:
Post a Comment