
Tuesday, August 30, 2016

Python for Salesforce Administrators - Combining Multiple Tables

In our last (huge!) post, we did a number of "conditional" row filters and edits (e.g. changing the value of the "email" field if someone's last name began with a certain letter).

In this post, we'll put two CSV files together by merging the two "DataFrames" representing them into a new "DataFrame" with data from both sources. This can replace VLOOKUP(...) or INDEX(MATCH(...)...) operations in Excel.

We'll just do exercises with 2 CSV files at a time, but to combine 3 or more, just combine them "2 at a time" in sequence.
(Combine #1 & #2, then combine the output of that w/ #3, etc.)
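That "2 at a time" idea can be sketched like this, using three tiny made-up DataFrames (built in code instead of read from CSV files):

```python
import pandas

# Three tiny made-up DataFrames sharing an "Id" column.
df1 = pandas.DataFrame({'Id': ['1', '2'], 'A': ['a1', 'a2']})
df2 = pandas.DataFrame({'Id': ['1', '2'], 'B': ['b1', 'b2']})
df3 = pandas.DataFrame({'Id': ['1', '2'], 'C': ['c1', 'c2']})

# Combine #1 & #2 first, then combine that output with #3.
combined = df1.merge(df2, how='outer', on='Id').merge(df3, how='outer', on='Id')
print(combined)
```

The output of each ".merge()" is itself a DataFrame, so you can keep chaining ".merge()" calls onto it.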


(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)


More prep work: create a 2nd CSV

Create another ".CSV" file with the following contents - I am calling mine "sample2.csv" and storing it in the same folder where I stored "sample1.csv":

"PersonId","FirstName","LastName","Em","FavoriteFood"
"983mv","Shirley","Temple","st@example.com","Lollipops"
"9e84f","Andrea","Smith","as@example.com","Kale"
"k28fo","Donald","Duck","dd@example.com","Pancakes"
"x934","Marilyn","Monroe","mm@example.com","Carrots"
"8xi","Albert","Howard","ahotherem@example.com","Potatoes"
"02e","Vandana","Shiva","vs@example.com","Amaranth"

Note the following traits about this 2nd CSV file:

  • Some people are in both files, others are in just #1, & others are in just #2
  • None of the column names are spelled exactly the same way between the two CSV files. However, certain columns seem to express the same general idea (name, email)
    • That said, there isn't total consistency - the potato-loving "Albert Howard" in #2 has a different e-mail address from the "Albert Howard" in #1.
      • Q: Is this additional contact information about the same real-world person or are we looking at two different real-world "Albert Howard"s?
      • A: Tough call. This is for you to decide according to your knowledge of your data.
        What's important is to realize that this is a business problem you must solve before telling the computer how to decide that two rows from two different CSV files "belong together."
        (But you probably already knew that if you're an Excel CSV-joining whiz.)
  • Other columns seem to be unique to one CSV or another in terms of actual meaning ("Company" is only in #1; "FavoriteFood" is only in #2)
  • #1's "Id" & #2's "PersonId" columns don't really seem to have anything to do with each other, with respect to people who are in both CSV files, so they are probably each unique to their respective CSV files as well

Merging Two DataFrames: Code Overview

Here's the basic way of writing a command for merging two DataFrames (the output from this command is yet another DataFrame):

ReferenceToDataFrameNumber1.merge(ReferenceToDataFrameNumber2, how='MergeTypeCode', on=['ColumnNameHere', 'AnotherColumnNameHere'])

First, pick one of our two DataFrames (doesn't matter which) and decide to call it the "left"-side DataFrame (its columns will come out on the left-hand side of the output DataFrame).

It could be the "nickname" where we've stored the contents of a DataFrame (such as "df1"), or it could be a chunk of code whose output is a DataFrame.

After that, we type ".merge()"

Inside the parentheses of the ".merge()" command, the first thing we type is a reference to our other DataFrame (which we will refer to as the "right"-side DataFrame).

Again, it could be the "nickname" where we've stored the contents of a DataFrame (such as "df2"), or it could be a chunk of code whose output is a DataFrame.

Then we type a comma and a space.

After that, still inside the parentheses of the ".merge()" command, we type "how='...'" where the "..." is one of the following 4 lower-cased words: "outer", "inner", "left", or "right".

  1. "Outer" means, "include every row from both DataFrames in the output, but put matching rows' data together on the same line."
  2. "Inner" means, "in the output, only include rows where the values from both DataFrames match."
  3. "Left" means, "include every row from the 'left-side' DataFrame in the output, and put data from the 'right-side' DataFrame on the same line if it has data that just so happens to match."
  4. "Right" means, "include every row from the 'right-side' DataFrame in the output, and put data from the 'left-side' DataFrame on the same line if it has data that just so happens to match."

Then we type a comma and a space.

Finally, we type "on=...".

  • The "..." can be a single column name in quotes ("'JustOneColumnNameHere'"), a one-item list ("['JustOneColumnNameHere']"), or, to say that values must match across MULTIPLE columns at the same time (e.g. a first AND last name match), a longer list ("['ColumnNameHere','AnotherColumnNameHere','AndSoOn']").
  • However, the column names specified in "on" must be spelled the same - all the way down to capitalization - between the two DataFrames.
    So if the spelling between "equivalent" columns is different, as in our data, we're going to have to rename columns from one DataFrame or the other before passing it to the ".merge()" operation. We'll go over the code for that in an example later.

Important Pandas Note: Unlike with Excel VLOOKUP() or INDEX(MATCH()), in any type of Pandas ".merge()" operation, if multiple lines from the DataFrames being merged match each other, you'll get multiple lines of output.

It's actually more accurate than the way Excel handles things, but it takes some getting used to.

So an "inner" merge on DataFrames from these two CSV files using "on='Id'":

Input #1:

"Id","PastEmployer"
"1","RCA"
"2","United States Congress"
"2","Fox"

Input #2:

"Id","BelovedFood"
"1","Snow Peas"
"2","Carrots"
"2","Beets"
"2","Cookies"

Would produce output that looks like this. Note the 6 lines for ID #2 - that's 2 multiplied by 3 (2 lines from input #1, matching 3 lines from input #2):

"Id","PastEmployer","BelovedFood"
"1","RCA","Snow Peas"
"2","United States Congress","Carrots"
"2","United States Congress","Beets"
"2","United States Congress","Cookies"
"2","Fox","Carrots"
"2","Fox","Beets"
"2","Fox","Cookies"
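If you'd like to see that multiplication happen for yourself, here's a minimal sketch that builds those two inputs directly in code (no CSV files needed) and merges them:

```python
import pandas

# The two tiny inputs from the text, built as DataFrames in code.
input1 = pandas.DataFrame({'Id': ['1', '2', '2'],
                           'PastEmployer': ['RCA', 'United States Congress', 'Fox']})
input2 = pandas.DataFrame({'Id': ['1', '2', '2', '2'],
                           'BelovedFood': ['Snow Peas', 'Carrots', 'Beets', 'Cookies']})

# Id "2" appears 2 times in input1 and 3 times in input2,
# so it produces 2 x 3 = 6 output rows; Id "1" produces 1 x 1 = 1 row.
merged = input1.merge(input2, how='inner', on='Id')
print(merged)
```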

Example With "Outer"

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
print('---Contents of DataFrame "df1":---')
print(df1)
print('---Contents of DataFrame "df2":---')
print(df2)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='outer', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "df1":---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---Contents of DataFrame "df2":---
  PersonId FirstName LastName                     Em FavoriteFood
0    983mv   Shirley   Temple         st@example.com    Lollipops
1    9e84f    Andrea    Smith         as@example.com         Kale
2    k28fo    Donald     Duck         dd@example.com     Pancakes
3     x934   Marilyn   Monroe         mm@example.com      Carrots
4      8xi    Albert   Howard  ahotherem@example.com     Potatoes
5      02e   Vandana    Shiva         vs@example.com     Amaranth
---Contents of DataFrame "newdf":---
      Id    First      Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   5829    Jimmy    Buffet  jb@example.com                          RCA      NaN                    NaN          NaN
1   2894  Shirley  Chisholm  sc@example.com       United States Congress      NaN                    NaN          NaN
2    294  Marilyn    Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers      NaN                    NaN          NaN
4    827  Vandana     Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
5   9284   Andrea     Smith  as@example.com     University of California    9e84f         as@example.com         Kale
6    724   Albert    Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes
7    NaN  Shirley    Temple             NaN                          NaN    983mv         st@example.com    Lollipops
8    NaN   Donald      Duck             NaN                          NaN    k28fo         dd@example.com     Pancakes

"sample1.csv" has 7 rows and "sample2.csv" has 6 rows. In this output, there are 9 rows because 4 rows from each CSV "match" a row from the other CSV exactly once (same first+last name) and can share a line in the output data, but 5 rows are unique to one CSV or the other, for a total of 9 output rows.

I printed the contents of DataFrames I nicknamed "df1" & "df2" in the output, but the area to focus on is the output of the DataFrame we saved by the nickname "newdf" and the code on line 8 that produced it.

There's a lot to go over here, so let's dive in.

First - remember that note from above saying that the column names needed to match? Instead of just passing in the nickname "df2" as a reference to a "right-side" DataFrame at the beginning of the ".merge()" parentheses, I passed in a chunk of code that generates a new DataFrame.

"df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'})" produces a DataFrame almost exactly like the one stored under the nickname "df2," only its "LastName," "FirstName," & "Em" columns have been renamed to match the naming of columns from the DataFrame stored under the nickname "df1."

That's the main tricky thing I did!

There's a lot of code to read on line 8, but if you think of that entire chunk before the first comma (",") as one big reference to a DataFrame to merge with "df1," you'll see that it still follows the normal pattern that the ".merge()" command requires us to follow.

I set the "how" parameter in line 8 to "'outer'", so our output file includes data from both spreadsheets, with blanks ("NaN") filling in the gaps where we don't know data about someone (because they weren't in the spreadsheet containing that data).

Finally, although I renamed 3 columns as if I were going to merge on them all, in the end, I decided that I just wanted to "match" on First Name & Last Name. I decided I was okay with the e-mail addresses being different.

Note that the "on=" columns got combined (the output only has 1 "First" column and 1 "Last" column), but the other columns that coincidentally had the same name in both inputs (that is, "Email") did not. To help you tell them apart, Pandas renamed them "Email_x" (from the left side) and "Email_y" (from the right side) in the output.
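(As an aside, if you don't like "_x" and "_y", the ".merge()" command also accepts a "suffixes=" parameter that lets you pick your own labels. A minimal sketch with made-up one-row DataFrames:)

```python
import pandas

# Made-up one-row stand-ins for df1 and the renamed df2,
# each with its own "Email" column.
left = pandas.DataFrame({'Last': ['Monroe'], 'First': ['Marilyn'],
                         'Email': ['mm@example.com']})
right = pandas.DataFrame({'Last': ['Monroe'], 'First': ['Marilyn'],
                          'Email': ['mm2@example.com']})

# suffixes= replaces the default "_x"/"_y" labels on clashing columns.
merged = left.merge(right, how='outer', on=['Last', 'First'],
                    suffixes=('_csv1', '_csv2'))
print(merged)
```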


Examples With "Inner", "Left," & "Right"

Let's change just one thing in our code (besides not printing out df1 & df2): let's change "outer" to "inner."

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='inner', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "newdf":---
     Id    First    Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   294  Marilyn  Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
1   827  Vandana   Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
2  9284   Andrea   Smith  as@example.com     University of California    9e84f         as@example.com         Kale
3   724   Albert  Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes

Of our 7 rows from "sample1.csv" and our 6 rows from "sample2.csv," only 4 rows from each share both a first and last name with a row in the other CSV.

Since this is an "inner" merge, we've said we only want the output to contain rows where both files agree - here, rows sharing both a first and last name.

---

Now let's change "inner" to "left."

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='left', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "newdf":---
      Id    First      Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   5829    Jimmy    Buffet  jb@example.com                          RCA      NaN                    NaN          NaN
1   2894  Shirley  Chisholm  sc@example.com       United States Congress      NaN                    NaN          NaN
2    294  Marilyn    Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers      NaN                    NaN          NaN
4    827  Vandana     Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
5   9284   Andrea     Smith  as@example.com     University of California    9e84f         as@example.com         Kale
6    724   Albert    Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes

There are 7 rows because that's how many rows "sample1.csv" (which became DataFrame "df1") had. 4 of those rows have "extra" data at right from "sample2.csv".

---

Now let's change "left" to "right."

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdf = df1.merge(df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'}), how='right', on=['Last', 'First'])
print('---Contents of DataFrame "newdf":---')
print(newdf)

Output text:

---Contents of DataFrame "newdf":---
     Id    First    Last         Email_x                      Company PersonId                Email_y FavoriteFood
0   294  Marilyn  Monroe  mm@example.com                          Fox     x934         mm@example.com      Carrots
1   827  Vandana   Shiva  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
2  9284   Andrea   Smith  as@example.com     University of California    9e84f         as@example.com         Kale
3   724   Albert  Howard  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes
4   NaN  Shirley  Temple             NaN                          NaN    983mv         st@example.com    Lollipops
5   NaN   Donald    Duck             NaN                          NaN    k28fo         dd@example.com     Pancakes

There are 6 rows because that's how many rows "sample2.csv" (which became DataFrame "df2" and then its "renamed" version) had. 4 of those rows have "extra" data at left from "sample1.csv".


Example: Merging On Different Columns Than Before

Let's change just one thing in our code - we're going to say that a first-last name match isn't good enough.

We're going to tell our code to only merge people if first name, last name, AND e-mail match.

(In other words, treat the "Albert Howard" records as two different people since their e-mail addresses don't match.)

This example runs all 4 merge types in one script, so to save typing...

  • I gave that awful piece of code based on "df2" a nickname of "df2renamed"
  • I gave my list of column-names to "match" on a nickname of "joincols".
    (You can give code like "'zzz'" or "['yyy','zzz']" nicknames, too!)

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
df2renamed = df2.rename(columns = {'LastName':'Last', 'FirstName':'First', 'Em':'Email'})
joincols = ['Last', 'First', 'Email']
newdfouter = df1.merge(df2renamed, how='outer', on=joincols)
newdfinner = df1.merge(df2renamed, how='inner', on=joincols)
newdfleft = df1.merge(df2renamed, how='left', on=joincols)
newdfright = df1.merge(df2renamed, how='right', on=joincols)
print('---Contents of DataFrame "newdfouter":---')
print(newdfouter)
print('---Contents of DataFrame "newdfinner":---')
print(newdfinner)
print('---Contents of DataFrame "newdfleft":---')
print(newdfleft)
print('---Contents of DataFrame "newdfright":---')
print(newdfright)

Output text:

---Contents of DataFrame "newdfouter":---
      Id    First      Last                  Email                      Company PersonId FavoriteFood
0   5829    Jimmy    Buffet         jb@example.com                          RCA      NaN          NaN
1   2894  Shirley  Chisholm         sc@example.com       United States Congress      NaN          NaN
2    294  Marilyn    Monroe         mm@example.com                          Fox     x934      Carrots
3  30829    Cesar    Chavez         cc@example.com          United Farm Workers      NaN          NaN
4    827  Vandana     Shiva         vs@example.com                     Navdanya      02e     Amaranth
5   9284   Andrea     Smith         as@example.com     University of California    9e84f         Kale
6    724   Albert    Howard         ah@example.com  Imperial College of Science      NaN          NaN
7    NaN  Shirley    Temple         st@example.com                          NaN    983mv    Lollipops
8    NaN   Donald      Duck         dd@example.com                          NaN    k28fo     Pancakes
9    NaN   Albert    Howard  ahotherem@example.com                          NaN      8xi     Potatoes
---Contents of DataFrame "newdfinner":---
     Id    First    Last           Email                   Company PersonId FavoriteFood
0   294  Marilyn  Monroe  mm@example.com                       Fox     x934      Carrots
1   827  Vandana   Shiva  vs@example.com                  Navdanya      02e     Amaranth
2  9284   Andrea   Smith  as@example.com  University of California    9e84f         Kale
---Contents of DataFrame "newdfleft":---
      Id    First      Last           Email                      Company PersonId FavoriteFood
0   5829    Jimmy    Buffet  jb@example.com                          RCA      NaN          NaN
1   2894  Shirley  Chisholm  sc@example.com       United States Congress      NaN          NaN
2    294  Marilyn    Monroe  mm@example.com                          Fox     x934      Carrots
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers      NaN          NaN
4    827  Vandana     Shiva  vs@example.com                     Navdanya      02e     Amaranth
5   9284   Andrea     Smith  as@example.com     University of California    9e84f         Kale
6    724   Albert    Howard  ah@example.com  Imperial College of Science      NaN          NaN
---Contents of DataFrame "newdfright":---
     Id    First    Last                  Email                   Company PersonId FavoriteFood
0   294  Marilyn  Monroe         mm@example.com                       Fox     x934      Carrots
1   827  Vandana   Shiva         vs@example.com                  Navdanya      02e     Amaranth
2  9284   Andrea   Smith         as@example.com  University of California    9e84f         Kale
3   NaN  Shirley  Temple         st@example.com                       NaN    983mv    Lollipops
4   NaN   Donald    Duck         dd@example.com                       NaN    k28fo     Pancakes
5   NaN   Albert  Howard  ahotherem@example.com                       NaN      8xi     Potatoes

Note that there are now only 3 rows, not 4, that are considered to "match," because the two Albert Howard records have different e-mail addresses.

That means that there are now:

  • 10 rows in "outer"
    (3 "matches" and 7 "unique-to-one-or-the-other" rows, instead of 4 & 5)
  • 3 rows in "inner"
    (3 "matches")
  • The same number of rows as before in "left" and "right," but no data filled in from "the other file" on Albert Howard's row in either
    (because it wasn't considered "matching")
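One handy option worth knowing when you're double-checking counts like these (it isn't used in this post's examples) is the ".merge()" command's "indicator=True" parameter, which adds a "_merge" column saying where each output row came from. A minimal sketch with made-up data:

```python
import pandas

# Made-up DataFrames: Id "2" is in both, "1" only at left, "3" only at right.
left = pandas.DataFrame({'Id': ['1', '2'], 'A': ['x', 'y']})
right = pandas.DataFrame({'Id': ['2', '3'], 'B': ['p', 'q']})

# indicator=True adds a "_merge" column whose values are
# 'left_only', 'right_only', or 'both'.
merged = left.merge(right, how='outer', on='Id', indicator=True)
print(merged)
```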

Style Notes

  1. I always like to save the results of ".merge()" commands to a new nickname. I tend to call it something like "joindf." So, "joindf = ..."
    I've found that the ".merge()" code takes up so much space on the screen, I really don't want to combine too much more code with it. I prefer to make a clean break at this point, which is where nicknaming comes in handy.
     
  2. You may have noticed all my ".read_csv()" commands added ", dtype=object" inside the parentheses after the filename. I was getting errors when I merged my DataFrames - I think Pandas was trying a little too hard to decide whether values it was importing from CSV were numbers, plain text, etc - so this little piece of code said, "Don't worry about it. Just import what you see as plain text."
    • Q: How did I guess my errors had to do with Pandas trying to figure out numbers vs. plain-text?
    • A: Because the errors seemed to have something to do with Pandas trying to do "absolute value" operations against my data.
      Whenever you have some data in your file that kind of looks numeric / date-like / etc., if you don't actually want to do math against it, a good way to fix errors is to just force your data to import as plain text.
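      As a sketch of what that guessing looks like, here's a made-up in-memory CSV with a leading-zero "Id", imported both ways (using io.StringIO so no file is needed):

```python
import io
import pandas

# A made-up two-line CSV whose "Id" has leading zeroes.
csv_text = 'Id,Name\n00042,Pat\n'

# Default behavior: Pandas guesses "Id" is a number and drops the zeroes.
guessed = pandas.read_csv(io.StringIO(csv_text))
print(guessed['Id'][0])  # 42

# With dtype=object, everything imports as plain text, zeroes intact.
astext = pandas.read_csv(io.StringIO(csv_text), dtype=object)
print(astext['Id'][0])  # 00042
```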
       
  3. I lied to you a little bit about having to rename columns before you merge two DataFrames. There is another way, and you might run into it online. But I don't like it. Nevertheless, I'll explain it.

Instead of "on=...", you have a "left_on=..." and "right_on=...", where each "..." is a single column-name or a list of column names (of the same size, in the "same order").

However, it produces ugly output, in my opinion. Your data from those columns ends up split across two different output columns, instead of being neatly under one same-named column.

Here's an example - I'll throw in some code we haven't covered in lines 5, 6, & 8 to re-order the columns and expose why I consider it "ugly." Note that in line 4, we just pass the ".merge()" operation a reference to "df2" itself (because in this case, we're not renaming its columns).

Take a look at how Jimmy Buffet's last name is in "Last," but Donald Duck's last name is in "LastName":

Code:

import pandas
df1 = pandas.read_csv('C:\\tempexamples\\sample1.csv', dtype=object)
df2 = pandas.read_csv('C:\\tempexamples\\sample2.csv', dtype=object)
newdfugly = df1.merge(df2, how='outer', left_on=['Last','First'], right_on=['LastName','FirstName'])
def put_cols_at_left(referenceToDataFrame, listOfColumnNames):
    return referenceToDataFrame[listOfColumnNames+[c for c in referenceToDataFrame.columns if c not in listOfColumnNames]]
print('---Contents of DataFrame "newdfugly" (with "last" & "first" columns moved to the left):---')
print(put_cols_at_left(newdfugly, ['Last','LastName','First','FirstName']))

Output text:

---Contents of DataFrame "newdfugly" (with "last" & "first" columns moved to the left):---
       Last LastName    First FirstName     Id           Email                      Company PersonId                     Em FavoriteFood
0    Buffet      NaN    Jimmy       NaN   5829  jb@example.com                          RCA      NaN                    NaN          NaN
1  Chisholm      NaN  Shirley       NaN   2894  sc@example.com       United States Congress      NaN                    NaN          NaN
2    Monroe   Monroe  Marilyn   Marilyn    294  mm@example.com                          Fox     x934         mm@example.com      Carrots
3    Chavez      NaN    Cesar       NaN  30829  cc@example.com          United Farm Workers      NaN                    NaN          NaN
4     Shiva    Shiva  Vandana   Vandana    827  vs@example.com                     Navdanya      02e         vs@example.com     Amaranth
5     Smith    Smith   Andrea    Andrea   9284  as@example.com     University of California    9e84f         as@example.com         Kale
6    Howard   Howard   Albert    Albert    724  ah@example.com  Imperial College of Science      8xi  ahotherem@example.com     Potatoes
7       NaN   Temple      NaN   Shirley    NaN             NaN                          NaN    983mv         st@example.com    Lollipops
8       NaN     Duck      NaN    Donald    NaN             NaN                          NaN    k28fo         dd@example.com     Pancakes

It's up to you, though. I sometimes use the "left_on" & "right_on" style when I'm merely experimenting with joining data and don't need any "pretty" output yet because it's just a little less text to type.


That's a good pausing point!

Please comment: what would you like to learn next?


Thursday, August 25, 2016

Python for Salesforce Administrators - Fancier Row-Filtering And Data-Editing

In our last post, we talked about "DataFrames" vs. "Series" in the Pandas module of the Python programming language. Please read it before you continue with these examples.

In this post, we'll take a deeper dive into row-filtering. We'll also explore different ways to change the values of cells in your "DataFrame" (an abstraction of a CSV file).

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)


Previously, we saw that Python/Pandas commands structured as a "true/false-series-making-thing inside 'df[]'" generate a DataFrame with the same 5 columns as "df" and just the rows where the "true/false" Series is "True," as in this snippet of code:

SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df[SeriesShowingWhichRowsHaveLastNameS]

Filtering Rows Based On Multiple Columns At Once

We can get fancier than that. Now we'll put two similar-but-different "true/false" Series together and use them as the filter instead.

In this example, our output will be a DataFrame showing us all rows from our main DataFrame "df" where the last name starts with "S" OR the first name starts with "M."

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
SeriesShowingWhichRowsHaveFirstNameM = df['First'].str.startswith('M')
SeriesShowingRowsWhereEitherIsTrue = SeriesShowingWhichRowsHaveLastNameS | SeriesShowingWhichRowsHaveFirstNameM
print('---SeriesShowingWhichRowsHaveLastNameS contents:---')
print(SeriesShowingWhichRowsHaveLastNameS)
print('---SeriesShowingWhichRowsHaveFirstNameM contents:---')
print(SeriesShowingWhichRowsHaveFirstNameM)
print('---SeriesWhereEitherIsTrue contents:---')
print(SeriesShowingRowsWhereEitherIsTrue)
print('---DataFrame Where Either Is True:---')
print(df[SeriesShowingRowsWhereEitherIsTrue])

Output text:

---SeriesShowingWhichRowsHaveLastNameS contents:---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---SeriesShowingWhichRowsHaveFirstNameM contents:---
0    False
1    False
2     True
3    False
4    False
5    False
6    False
Name: First, dtype: bool
---SeriesWhereEitherIsTrue contents:---
0    False
1    False
2     True
3    False
4     True
5     True
6    False
dtype: bool
---DataFrame Where Either Is True:---
     Id    First    Last           Email                   Company
2   294  Marilyn  Monroe  mm@example.com                       Fox
4   827  Vandana   Shiva  vs@example.com                  Navdanya
5  9284   Andrea   Smith  as@example.com  University of California

In this code, we make 2 Series (lines 3 & 4) that each filter the contents of one column.

Then we put them together into a 3rd Series with an "or" ("|") operator that can combine series (line 5).

Then we take a look at what's inside these Series (lines 6-11).

Note how the output from the "combined" Series we made in line 5 has 3 rows set to "True" - these are the 3 rows where either of the previous two Series had a value set to "True."

Finally, we call our trusty "df[]" on the "combined" series and take a look at it (lines 12-13).

Sure enough, we have a 3-line extract of "df" on the screen as a new DataFrame!


At this point, I could write another blog post about what exactly should go into the building-block Series you use for filters (e.g. the right-hand side of the "=" in lines 3 & 4 of the previous code example).

You can get way crazier than you can in Excel. You're definitely not limited to "starts with a capital S."

However, that'd be a pretty big diversion, so I'm not going to address it now.

If I forget to come back to the subject somewhere in this series, please post in the comments and remind me. But know that it won't be comprehensive - if you can imagine it, Google for it!
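One quick teaser before moving on, though: besides the "or" ("|") operator, there's also an "and" ("&") operator and a "not" ("~") operator for combining and flipping true/false Series. A minimal sketch with a made-up three-row DataFrame (so it runs without the CSV file):

```python
import pandas

# A tiny made-up stand-in for "df".
df = pandas.DataFrame({'First': ['Marilyn', 'Andrea', 'Vandana'],
                       'Last': ['Monroe', 'Smith', 'Shiva']})

last_s = df['Last'].str.startswith('S')
first_m = df['First'].str.startswith('M')

print(df[last_s & first_m])  # AND: both must be True (no rows here)
print(df[~last_s])           # NOT: rows where last name does NOT start with "S"
```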


Editing Data - Modifying Every Row In 1 Column

Now let's get into writing data, not just reading it.

Remember how when we were just reading data from the DataFrame "df," "df['Last']" produced a Series but "df[CodeThatMakesATrueFalseSeries]" or "df[['Last']]" or "df[['First','Last']]" produced a DataFrame?

Similarly, when it's used for writing data (it's on the left side of an "="), exactly what you put into the "[]" impacts how Pandas behaves.

When writing data to "df," on the left-hand side of an "=" command, "df['Last']", "df[['Last']]", and "df[['First','Last']]" all behave similarly.

They all take whatever is on the right-hand side of the "=" and overwrite every row of the specified columns with that value. Take a look.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df['Last']='aaa'
df[['Email']]='bbb'
df[['Id','Company']]='ccc'
print(df)

Output text:

    Id    First Last Email Company
0  ccc    Jimmy  aaa   bbb     ccc
1  ccc  Shirley  aaa   bbb     ccc
2  ccc  Marilyn  aaa   bbb     ccc
3  ccc    Cesar  aaa   bbb     ccc
4  ccc  Vandana  aaa   bbb     ccc
5  ccc   Andrea  aaa   bbb     ccc
6  ccc   Albert  aaa   bbb     ccc

Editing Data - Adding Columns With Every Row The Same

When writing data to "df," on the left-hand side of an "=" command, "df['Last']", "df[['Last']]", and "df[['First','Last']]" behave differently from each other when the columns you specify don't exist in "df".

"df['NewColumnName'] = 'zzz'" will add a column called "NewColumnName" to our DataFrame nicknamed "df" and fill every row of that column with the text "zzz".

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df['NewColumn'] = 'zzz'
print(df)

Output text:

      Id    First      Last           Email                      Company  NewColumn
0   5829    Jimmy    Buffet  jb@example.com                          RCA        zzz
1   2894  Shirley  Chisholm  sc@example.com       United States Congress        zzz
2    294  Marilyn    Monroe  mm@example.com                          Fox        zzz
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers        zzz
4    827  Vandana     Shiva  vs@example.com                     Navdanya        zzz
5   9284   Andrea     Smith  as@example.com     University of California        zzz
6    724   Albert    Howard  ah@example.com  Imperial College of Science        zzz

(Don't worry if your output is getting so wide that your Python-running software wraps the output around to the next line, which you can detect by the presence of a "\" floating out in space to the right of your data just before it jumps to the next line. It'll look fine when you export your DataFrame to CSV.)

"df[['NewColumnName']] = 'zzz'" or "df[['NewColumnName1', 'NewColumnName2']] = 'zzz'", however, will produce errors.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df[['NewColumn']] = 'zzz'
print(df)

Output text:

Traceback (most recent call last):
   ...(lots of garble)...
KeyError: "['NewColumn'] not in index"

The takeaway: Only the one-column-name-at-a-time approach can add a new column to your DataFrame.
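Here's a minimal sketch of that takeaway (using a tiny made-up stand-in for "sample1.csv" so it's self-contained): two brand-new columns means two separate single-bracket assignments, one apiece.

```python
import pandas

# Hypothetical two-row stand-in for sample1.csv
df = pandas.DataFrame({'Id': [5829, 2894], 'Last': ['Buffet', 'Chisholm']})

df['NewColumn1'] = 'yyy'  # Works: single column name creates the column
df['NewColumn2'] = 'zzz'  # Works: a second single-column assignment

# df[['NewColumn3']] = 'www'  # In the pandas version used in this series, this raises
#                             # a KeyError because "NewColumn3" doesn't exist yet

print(df.columns.tolist())
```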


Editing Data - Modifying Every Column In Certain Rows

What about "df[CodeThatMakesATrueFalseSeries]"? How does it behave on the left side of an "="?

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df[SeriesShowingWhichRowsHaveLastNameS]='aaa'
print(df)

Output text:

      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    aaa      aaa       aaa             aaa                          aaa
5    aaa      aaa       aaa             aaa                          aaa
6    724   Albert    Howard  ah@example.com  Imperial College of Science

Hmm...not sure how useful this is. (If we really didn't want a row, why not just filter it out, rather than filling it in with random things?) But at least now we know how it behaves.

"df[CodeThatMakesATrueFalseSeries]" takes whatever is on the right-hand side of the "=" and overwrites every column of "df" with with that value in the rows where the "Series" has a value of true.


Editing Data - Modifying Certain Columns In Certain Rows

What we often want, when editing data in a CSV file, is to make changes to values in a certain column but only for certain rows. Examples:

  • We want to inspect every row's value in a "Suffix" column and, if the value for that row is "M.D." or "Ph.D.", change that row's value in the "Prefix" column to "Dr."
  • We want to inspect every row's value in the "Email" column and, if the value for that row is obviously bad, change that row's value in the "Email" column to something better.
    (The "Email" example requires fancier "filter" and fancier "data-change" commands than we are going to cover right now, but the gist of solving these two problems is the same.)

Let's think through this.

We've seen that we can use "df[...]" to display just rows or just columns of a DataFrame called "df." We've seen that we can use "df[...]" to change just rows or just columns of a DataFrame called "df" if we put it on the left-hand side of an "=".

Is there any way we could just cleverly combine all this?
(The answer is "no," but let's see for ourselves.)

Can we...just do "df[...][...] = 'zzz'"?

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df[SeriesShowingWhichRowsHaveLastNameS]['Email']='ccc'

Output text:

C:/tempexamples/script.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[SeriesShowingWhichRowsHaveLastNameS]['Email']='ccc'

Nope. Well ... what if we put "()" parentheses around "df[SeriesShowingWhichRowsHaveLastNameS]" to make it clear that it's a "DataFrame" and thus should be able to have its own "[]" after it?

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
(df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'

Output text:

C:/tempexamples/script.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  (df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'

Nope. Same warning. It looks like Pandas didn't need the parentheses for clarification.

The problem seems to be that Pandas just won't let any old "YourDataFrameHere[]" be on the left-hand side of an "=".

Certain DataFrames that came out of other DataFrames are considered "slices" and aren't allowed to be written with the "YourDataFrameHere[] = 'zzz'" style of commands.

(Trivia: those "slices" can, however, be read with the "YourDataFrameHere[]" syntax. Try changing line 4 of the previous example to "print(df[SeriesShowingWhichRowsHaveLastNameS]['Email'])" and see what you get. Also note whether it is a "Series" or a "DataFrame" and ponder why. If you have any questions, ask me in the comments.)
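Here's a quick hedged sketch of that trivia (with a small made-up stand-in for "sample1.csv"): reading through a chained "[]" works fine, and the result is a Series, because the final "['Email']" is a single-bracket column selection.

```python
import pandas

# Hypothetical stand-in for sample1.csv
df = pandas.DataFrame({
    'Last': ['Buffet', 'Shiva', 'Smith'],
    'Email': ['jb@example.com', 'vs@example.com', 'as@example.com'],
})

mask = df['Last'].str.startswith('S')
chained_read = df[mask]['Email']  # Reading a "slice" is allowed; only writing warns

print(type(chained_read))  # It's a Series - single-bracket column selection
print(chained_read)
```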

---

It looks like we're going to need something more complicated than "YourDataFrameHere[] = 'zzz'" to do this "both-row-and-column-based" editing.

We need something that lets us clearly specify which part of our code is filtering our rows, and which part of our code is specifying how to change the matching columns.

Enter the "df.loc[]" command.

This command asks for TWO values:

  1. A "true/false Series" for using as a filter against the rows of "df" (e.g. our "SeriesShowingWhichRowsHaveLastNameS" series)
  2. A column name (or list of column names) in need of displaying/editing

To edit values, the whole "df.loc[...,...]" command goes on the left side of an "=", with whatever value you put on the right side of the "=" being what value you want filled in in the matching rows.

Note that you can only put one value on the right-hand side of the "=", so if you have different things to put in different parts of your DataFrame nicknamed "df," you'll need to do back-to-back "df.loc[...,...] = 'zzz'" statements, one apiece.

Let's have a look - we'll do the same "aaa bbb ccc" replacement as two examples ago, only we'll just do it to people whose last names start with "S":

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df.loc[SeriesShowingWhichRowsHaveLastNameS,'Last'] = 'aaa'
df.loc[SeriesShowingWhichRowsHaveLastNameS,['Email']] = 'bbb'
df.loc[SeriesShowingWhichRowsHaveLastNameS,['Id','Company']] = 'ccc'
print(df)

Output text:

      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    ccc  Vandana       aaa             bbb                          ccc
5    ccc   Andrea       aaa             bbb                          ccc
6    724   Albert    Howard  ah@example.com  Imperial College of Science

The Takeaway:

  • For simple, every-column changes, you can use a simple "df[...] = 'zzz'" syntax.
  • For "depends on row AND column" changes, you need to say "df.loc[YourFilteringSeries,AColumnNameOrFew] = 'blahblahblah'".

Editing Data - Adding A Column With Data Only In Certain Rows

We've used the simpler style "df['NewColumnName'] = 'zzz'" before to add a "constant" value to our DataFrame "df."

But what would we do if we wanted to add a column that was full of blank values, except in certain rows?

Again, we can use our "df.loc[...,...]" style.

Just like "df['NewColumnName'] = 'zzz'" works, but ""df[['NewColumnName']] = 'zzz'" and "df['NewColumnName1','NewColumnName2'] = 'zzz'" don't work, "df.loc[...,...]" has similar restrictions.

If a column doesn't yet exist, the second value passed to "df.loc[...,...]" has to be a plain-old single column name, not a list of column names.

If we want to add multiple similar columns, we'll have to do back-to-back statements, one apiece.

In other words, this won't work:

df.loc[SeriesShowingWhichRowsHaveLastNameS,['NewColumn1','NewColumn2']] = 'zzz'

But this will:

df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'zzz'
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'zzz'

Let's see it in action.

Example code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'yyy'
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'zzz'
print(df)

Output text:

      Id    First      Last           Email                      Company  NewColumn1 NewColumn2  
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN        NaN  
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN        NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         NaN        NaN 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN        NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy        zzz 
5   9284   Andrea     Smith  as@example.com     University of California         yyy        zzz 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         NaN        NaN 

"NaN" is Pandas-language for, "nothing here - not even an empty-piece-of-text. Just NOTHING."

The "NaN" spots will come out as blank areas (as will "empty-pieces-of-text") when you export your DataFrame to CSV, so don't worry about them in your DataFrame.


Editing Data - Adding Complexly-Filled-In Columns

  • Q: What if we want a "NewColumn1" + "NewColumn2" where "people whose last name starts with S" get "yyy," "people whose first name starts with A or M" get "zzz," and "everyone else" gets a blank value?
  • A: We put together the previous two examples. First we add the columns, then we edit them.

But look carefully at our data - "Andrea Smith" falls into both categories. So whether Andrea ends up with "yyy" or "zzz" will depend on the last command we run that modifies Andrea's data. In other words, it will depend on the order in which we type our commands.

Here's one example where she ends up with "zzz" (because that's the command we run last):

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
SeriesShowingWhichRowsHaveFirstNameAOrM = df['First'].str.startswith('A') | df['First'].str.startswith('M')
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'yyy' # Column name doesn't exist yet
print('---How "df" looks after adding NewColumn1---')
print(df)
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'yyy' # Column name doesn't exist yet
print('---How "df" looks after adding NewColumn2---')
print(df)
df.loc[SeriesShowingWhichRowsHaveFirstNameAOrM,['NewColumn1','NewColumn2']] = 'zzz' # Column name already exists now
print('---How "df" looks after editing NewColumn1 & NewColumn2---')
print(df)

Output text:

---How "df" looks after adding NewColumn1---
      Id    First      Last           Email                      Company  NewColumn1 
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN 
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         NaN 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy 
5   9284   Andrea     Smith  as@example.com     University of California         yyy 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         NaN 
---How "df" looks after adding NewColumn2---
      Id    First      Last           Email                      Company  NewColumn1  NewColumn2  
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN         NaN  
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN         NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         NaN         NaN 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN         NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy         yyy 
5   9284   Andrea     Smith  as@example.com     University of California         yyy         yyy 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         NaN         NaN 
---How "df" looks after editing NewColumn1 & NewColumn2---
      Id    First      Last           Email                      Company  NewColumn1  NewColumn2  
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN         NaN  
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN         NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         zzz         zzz 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN         NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy         yyy 
5   9284   Andrea     Smith  as@example.com     University of California         zzz         zzz 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         zzz         zzz 

Simple as that!

Take a look at how the data changed in the new columns, and for whom it changed to what, over the 3 commands.

Just to make sure you're still awake, I threw in a few tricks we've learned before in the code.

For example, we know we can "OR"-together various series with the "|" operator, so on Line 4, instead of giving nicknames to "df['First'].str.startswith('A')" & "df['First'].str.startswith('M')" and putting the nicknames together with "|", I just put them together directly with "|" and saved the resulting series to an "A or M" nickname.

Furthermore, we know that "NewColumn1" & "NewColumn2" don't yet exist in "df" by line 5, so we can't combine them into a list for the 2nd "..." of a "df.loc[...,...] = 'yyy'" command. Each one has to be added to "df" separately.

However, both columns exist by the time we get to line 11. And we just so happen to want to put the same fixed value (right-hand side of the "=") into both columns. So in that case, we can pass a list to the 2nd "..." of "df.loc[...,...]"

(If we wanted different things in each column - say, "zzz" in NewColumn1 and capital-"ZZZ" in NewColumn2, we would have to do it on separate lines.)
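That "separate lines" approach might look like this sketch (again using a small made-up stand-in for "sample1.csv"):

```python
import pandas

# Hypothetical stand-in for sample1.csv
df = pandas.DataFrame({'First': ['Andrea', 'Jimmy'], 'Last': ['Smith', 'Buffet']})
mask = df['Last'].str.startswith('S')

# A different value per column means one "df.loc[...,...] =" statement apiece
df.loc[mask, 'NewColumn1'] = 'zzz'
df.loc[mask, 'NewColumn2'] = 'ZZZ'
print(df)
```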

This whole "write a different command on a new line for each different way you want to edit your data" approach isn't actually too different from how you probably do things in Excel.

  • You would normally add "Filtering" to the CSV file, expose/hide various rows, copy/paste a value into a certain column for all exposed rows, then re-filter the spreadsheet some other way, do it again, etc.
  • The difference is that with Python & Pandas, once you've written the code, it can run much faster than you can click your way around Excel.

Editing Data: Putting Complicated Values Into The Cells

In Excel, when you're "filtering and copying/pasting," you aren't always using a fixed value. Oftentimes, you use a formula to compute the value that should go into each cell.

You can do this in Pandas, too, with the ".apply(lambda ...)" operation.

".apply(lambda x : CodeThatMakesAValueGoesHere)" is something that can go onto the end of a Series (or code that generates a Series).

It generates a new Series with the same number of items as the Series it's been tacked onto.

To build the contents of that new Series, ".apply()" looks at each member of the old Series. While it's looking at any given member, it decides to call that member "x."

Based on the contents of "CodeThatMakesAValueGoesHere," it decides (perhaps based on that member's value) what should be in the corresponding position for the new Series instead of "x."

Here's a really simple example that isn't very useful:

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series
print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---')
print(SeriesShowingWhichRowsHaveLastNameS)
SeriesWithZZZ = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'zzz')
print('---Contents of "SeriesWithZZZ":---')
print(SeriesWithZZZ)

Output text:

---Contents of "SeriesShowingWhichRowsHaveLastNameS":---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---Contents of "SeriesWithZZZ":---
0    zzz
1    zzz
2    zzz
3    zzz
4    zzz
5    zzz
6    zzz
Name: Last, dtype: object

You can see that even though we had all sorts of nice True/False values in "SeriesShowingWhichRowsHaveLastNameS", "SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'zzz')" generated a Series that just blasted over all of them and put "zzz" in every row.

---

Let's replace our "'zzz'", which served as our contents of "CodeThatMakesAValueGoesHere" in our last example, with something fancier.

This is a special kind of if-then-else code in Python that can be inserted into your code wherever you would normally just type a value like "'zzz'".

Instead of "'zzz'", we'll say "'YesLastS' if x==True else 'NotLastS'".

We can use "x" in this code, because that's how "lambda x: CodeThatMakesAValueGoesHere" works. You can refer to "x" in your "CodeThatMakesAValueGoesHere."
(You can also give "x" a cooler name - just remember to refer to it by the same name on both sides of the colon.)

Our "'YesLastS' if x==True else 'NotLastS'" is a little hard to read, but all it means is, "If the value of whatever Pandas currently has in "X" is true, then generate a value of "YesLastS." Otherwise, generate a value of "NotLastS."

Let's see what our new code does.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series
print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---')
print(SeriesShowingWhichRowsHaveLastNameS)
SeriesWithYesNoLastNameS = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS')
print('---Contents of "SeriesWithYesNoLastNameS":---')
print(SeriesWithYesNoLastNameS)

Output text:

---Contents of "SeriesShowingWhichRowsHaveLastNameS":---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---Contents of "SeriesWithYesNoLastNameS":---
0    NotLastS
1    NotLastS
2    NotLastS
3    NotLastS
4    YesLastS
5    YesLastS
6    NotLastS
Name: Last, dtype: object

Great! We have now generated a Series (an ordinary data-containing one, not a "true/false"-containing one) where position-numbers corresponding to records from "df" whose last name starts with "S" say "YesLastS" and other records say "NotLastS".
The code that did this was "SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS')"
"SeriesShowingWhichRowsHaveLastNameS" is a "true/false"-containing Series, and then we asked ".apply(lambda...)" to go through that series, one by one, and generate a new series depending on what was found in each position of "SeriesShowingWhichRowsHaveLastNameS."

---

  • Q: Why did we need such a fancy Series?
  • A: Because a "Series" can go on the right-hand side of an "=".

Watch. This is a long example, so read all the code and look at all the output carefully. It combines everything we've learned in this post.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df = df.drop(['Id','Email','Company'], axis=1)
print('---Contents of "df" after dropping columns:---')
print(df)
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series
SeriesShowingWhichRowsHaveFirstNameA = df['First'].str.startswith('A') # True/False-typed series
SeriesWithYesNoLastNameS = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS') # Series with data in it
MiscSeries=pandas.Series(['aaa','bbb','ccc','ddd','eee','fff','ggg'], index=[0,1,2,3,4,5,6], name='AnyOldName')
print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---')
print(SeriesShowingWhichRowsHaveLastNameS)
print('---Contents of "SeriesShowingWhichRowsHaveFirstNameA":---')
print(SeriesShowingWhichRowsHaveFirstNameA)
print('---Contents of "SeriesWithYesNoLastNameS":---')
print(SeriesWithYesNoLastNameS)
print('---Contents of "MiscSeries"---')
print(MiscSeries)
df['NewCol1'] = SeriesWithYesNoLastNameS # Add "NewCol1" & set all rows to right-hand side
print('---Contents of "df" after adding "NewCol1":---')
print(df)
df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol2'] = SeriesWithYesNoLastNameS # Add "NewCol2" & set only certain rows to right-hand side
print('---Contents of "df" after adding "NewCol2":---')
print(df)
df['NewCol3'] = MiscSeries
print('---Contents of "df" after adding "NewCol3":---')
print(df)
df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol4'] = MiscSeries
print('---Contents of "df" after adding "NewCol4":---')
print(df)

Output text:

---Contents of "df" after dropping columns:---
     First      Last
0    Jimmy    Buffet
1  Shirley  Chisholm
2  Marilyn    Monroe
3    Cesar    Chavez
4  Vandana     Shiva
5   Andrea     Smith
6   Albert    Howard
---Contents of "SeriesShowingWhichRowsHaveLastNameS":---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---Contents of "SeriesShowingWhichRowsHaveFirstNameA":---
0    False
1    False
2    False
3    False
4    False
5     True
6     True
Name: First, dtype: bool
---Contents of "SeriesWithYesNoLastNameS":---
0    NotLastS
1    NotLastS
2    NotLastS
3    NotLastS
4    YesLastS
5    YesLastS
6    NotLastS
Name: Last, dtype: object
---Contents of "MiscSeries"---
0    aaa
1    bbb
2    ccc
3    ddd
4    eee
5    fff
6    ggg
Name: AnyOldName, dtype: object
---Contents of "df" after adding "NewCol1":---
     First      Last   NewCol1
0    Jimmy    Buffet  NotLastS
1  Shirley  Chisholm  NotLastS
2  Marilyn    Monroe  NotLastS
3    Cesar    Chavez  NotLastS
4  Vandana     Shiva  YesLastS
5   Andrea     Smith  YesLastS
6   Albert    Howard  NotLastS
---Contents of "df" after adding "NewCol2":---
     First      Last   NewCol1   NewCol2
0    Jimmy    Buffet  NotLastS       NaN
1  Shirley  Chisholm  NotLastS       NaN
2  Marilyn    Monroe  NotLastS       NaN
3    Cesar    Chavez  NotLastS       NaN
4  Vandana     Shiva  YesLastS       NaN
5   Andrea     Smith  YesLastS  YesLastS
6   Albert    Howard  NotLastS  NotLastS
---Contents of "df" after adding "NewCol3":---
     First      Last   NewCol1   NewCol2 NewCol3
0    Jimmy    Buffet  NotLastS       NaN     aaa
1  Shirley  Chisholm  NotLastS       NaN     bbb
2  Marilyn    Monroe  NotLastS       NaN     ccc
3    Cesar    Chavez  NotLastS       NaN     ddd
4  Vandana     Shiva  YesLastS       NaN     eee
5   Andrea     Smith  YesLastS  YesLastS     fff
6   Albert    Howard  NotLastS  NotLastS     ggg
---Contents of "df" after adding "NewCol4":---
     First      Last   NewCol1   NewCol2 NewCol3 NewCol4
0    Jimmy    Buffet  NotLastS       NaN     aaa     NaN
1  Shirley  Chisholm  NotLastS       NaN     bbb     NaN
2  Marilyn    Monroe  NotLastS       NaN     ccc     NaN
3    Cesar    Chavez  NotLastS       NaN     ddd     NaN
4  Vandana     Shiva  YesLastS       NaN     eee     NaN
5   Andrea     Smith  YesLastS  YesLastS     fff     fff
6   Albert    Howard  NotLastS  NotLastS     ggg     ggg

Both our line that adds a column called "NewCol1" and our line that adds a column called "NewCol2" show different values for "Andrea Smith" (whose last name begins with "S") and "Albert Howard" (whose last name doesn't).

The difference is how we added the new columns.

  1. With our simple "df['NewCol1'] = ..." command, we said, "add whatever is on the right-hand side of the '=' to every row in the new column."
    • Since "whatever was on the right-hand side" was a "Series" instead of a simple value like "'zzz'", "the values from positions 0-6" got filled in for "rows 0-6" of the new column, respectively.
      (You can also see this in action with the code that adds "NewCol3" based on the contents of "MiscSeries").
      • Q: "But wait...what if the Series on the right-hand side had had a different number of values than there are rows in the DataFrame "df"?
      • A: Python would complain or you'd get unexpected data back. Don't.
  2. On the other hand, with our fancy "df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol2'] = ..." command, we said:
    "Add whatever is on the right-hand side of the '=' to the rows in the new column where the 'df' DataFrame's first name starts with 'A'. And in the rows that don't, just leave the cells blank."
    (You can also see this in action with the code that adds "NewCol4" based on the contents of "MiscSeries" AND based on "SeriesShowingWhichRowsHaveFirstNameA".)
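Here's a hedged sketch of what that "complain or unexpected data" answer can look like when the right-hand side has the wrong number of values (exact behavior may vary a bit by pandas version):

```python
import pandas

# Hypothetical three-row DataFrame
df = pandas.DataFrame({'First': ['Jimmy', 'Shirley', 'Marilyn']})

# A plain list of the wrong length: pandas complains outright
try:
    df['NewCol'] = ['a', 'b']
except ValueError as e:
    print('Complained:', e)

# A too-short Series: no complaint, but rows beyond its positions get NaN -
# the "unexpected data" case
df['NewCol'] = pandas.Series(['a', 'b'])
print(df)
```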

That's a wrap for this post's "fancy filtering and editing."

We are probably ready to move on to "combining two CSV files" - stay tuned.


Table of Contents

Python for Salesforce Administrators - Recognizing Pandas DataFrames vs. Series

In our last post, we covered a few code examples that let you manipulate the contents of a ".CSV" (spreadsheet-like plain-text) file by renaming columns, dropping unwanted columns, and adding new columns.

Before we move on into any fancier filtering and modification tricks, we need to talk about the difference between a "DataFrame" and a "Series" in Python's Pandas module.


DataFrame vs. Series: What They Are

  • A "DataFrame" is 2-dimensional - it has both rows & columns. Think of it like an Excel spreadsheet. Its rows are numbered by Pandas behind the scenes. ("Numbered" - for our purposes. It's more complicated than this, but don't worry about it.)
  • A "Series" is 1-dimensional. The concept of "row" & "column" don't even make sense - it's just a list. Although the items in the list are numbered by Pandas behind the scenes. ("Numbered - for our purposes. It's more complicated than this, but don't worry about it.)

Why Do We Care?

Sometimes, they can be used in code interchangeably. For example, you can put either inside a "print()" command and your software for running Python programs will figure out what to do.

Other times, they can't.

Because they can't, it's time to learn to recognize when you're dealing with a "DataFrame" and when you're dealing with a "Series."


How To Recognize A DataFrame vs. A Series

When in doubt, print it out.

Take a look at the following examples.

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)

Memorize These Clues: What A DataFrame Looks Like:

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI = df
print(WhatAmI)

Generates this text output:

      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science

Lines 1 & 2 are our classic "get our CSV file into Pandas and give it a nickname of 'df'" steps (make sure you've made the CSV file from our first exercises).

In line 3, we create a new nickname called "WhatAmI", and we store a reference to the current contents of "df" in it.
(Trivia - in this case, we're literally talking about 2 nicknames for the same "DataFrame." Data-changing operations against either will impact them both.)

In line 4, we print out the contents of "WhatAmI" so we can see what kind of data it is - "DataFrame" or "Series."

There are three clues here that "WhatAmI" is a "DataFrame:"

  1. (Except above the "row-numbers",) There are column labels with nice neat familiar names.
  2. There is nothing displayed on our screen immediately after the line labeled "6."
  3. We have more than 1 column's worth of data
    (some DataFrames are only 1 column, but only DataFrames can have more than 1 column)

Memorize These Clues: What A Series Looks Like:

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI = df['Last']
print(WhatAmI)

Generates this text output:

0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object

The code is the same as in the previous example, only in Line 3, we store "the value of the column 'Last' from the DataFrame 'df'" into "WhatAmI."

There are three clues here that "WhatAmI" is a "Series:"

  1. Besides the row numbers, there is only 1 vertical line's worth of data.
  2. There are no column labels anymore.
  3. On our screen immediately after the line labeled "6," there is a line of text announcing that the thing we're looking at has a Name of "Last" and what type of data Pandas has decided is contained within the thing we're looking at.

Because of the row-numbers, it's easy to fail to notice that this is just a "position-numbered list," not a "row-numbered table," but that's how we need to think of it when we start putting complicated commands together.


Got that?

Review the clues again and make sure you see how they apply to the output-text examples.

On your screen, the "Series" looks like a 2-dimensional table because of the position-numbers.

The important things to remember are that:

  1. A "Series" is not actually 2-dimensional ... it's just a position-numbered list
  2. A "Series" is fundamentally different from a "DataFrame" as far as being able to use it in "Pandas" code is concerned

More Code Examples: Series vs. DataFrame

Let's look at another example of a series. You've seen the value that we're putting into this "WhatAmI" before (line 3, to the right of the "=") - it was part of our very first exercises.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI = df['Last'].str.startswith('S')
print(WhatAmI)

Output text:

0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool

Again, we now know that "df['Last'].str.startswith('S')" produced a "Series" because there's just 1 vertical line worth of data, because there are no column-labels, and because at the end of the output, we have a little "Name" & "Type" announcement.

---

Two examples ago, when we looked at "df['Last']", we saw that it produced a "series."

What if we put something besides "'Last'" into those "[]" brackets after the reference to our main DataFrame "df"?

In our last example, we were just playing with "df['Last'].str.startswith('S')". We know it produces a series.

Some questions to explore:

  • Will our code run if we put something that produces a "Series" (like "df['Last'].str.startswith('S')") inside the brackets of "df[]"?
  • What type of data will the result be? "Series" or "DataFrame?"

Let's find out.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WeKnowThisProducesASeries = df['Last'].str.startswith('S')
WhatAmI = df[WeKnowThisProducesASeries] # (Note the lack of single-quotes inside the [] - we are now referring to our "nickname" for some data we stored, not telling Pandas how to spell the name of a column.)
print(WhatAmI)

Output text:

     Id    First   Last           Email                   Company
4   827  Vandana  Shiva  vs@example.com                  Navdanya
5  9284   Andrea  Smith  as@example.com  University of California

Nice column-labels, nothing after the end of our data talking about "Name" & "Type"...this is a "DataFrame"!

(For readability, I broke "Line 3" into two parts, saving our "series" from the previous example with a new nickname and then referring to that nickname on the next line.
A single "line 3" of "WhatAmI = df[df['Last'].str.startswith('S')]" would have been equivalent and produced the same results. Try it!)

---

Let's look at one more example - I want to show you how subtle the differences in the way you spell Pandas commands can be.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI1 = df['Last']
WhatAmI2 = df[['Last']]
WhatAmI3 = df[['First','Last']]
print('---WhatAmI1 contents (a 7-item Series):---')
print(WhatAmI1)
print('---WhatAmI2 contents (a 7-row, 1-column DataFrame):---')
print(WhatAmI2)
print('---WhatAmI3 contents (a 7-row, 2-column DataFrame):---')
print(WhatAmI3)

Output text:

---WhatAmI1 contents (a 7-item Series):---
0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object
---WhatAmI2 contents (a 7-row, 1-column DataFrame):---
       Last
0    Buffet
1  Chisholm
2    Monroe
3    Chavez
4     Shiva
5     Smith
6    Howard
---WhatAmI3 contents (a 7-row, 2-column DataFrame):---
     First      Last
0    Jimmy    Buffet
1  Shirley  Chisholm
2  Marilyn    Monroe
3    Cesar    Chavez
4  Vandana     Shiva
5   Andrea     Smith
6   Albert    Howard

This example has 3 "WhatAmI"s to examine. Note how subtle the differences are between #1 & #2.

  • In #1, you put "'Last'" between "df[]" brackets and get a Series.
  • In #2, you put "['Last']" between "df[]" brackets and get a DataFrame.

Woah! Why does Pandas do this?

"WhatAmI" #3 gives us a better view into what's going on with the "double-square-bracket" style of coding.

In Python, "[something, something, something]" is the basic syntax for a list of something-or-other.

In these code examples, the outer [] is attached to a reference to our main DataFrame "df" and gets interpreted as, "do whatever Pandas judges appropriate based on what's inside the 'df[]' brackets."

The inner "[]" brackets, however, are part of these "what's inside" the outer brackets. And these inner "[]" brackets specify that "what's inside" is a list of column-names, not just a single column-name.

In #2, the "list of column names" just so happens to have exactly 1 thing in it.

But in #3, we see that Pandas has decided that whenever "what's inside df[]" is a list, it would be a good idea to generate a DataFrame with the listed names as column-headers and the corresponding data from "df."

If you think about it, it makes sense that putting a list of column names inside "df[]" would make Pandas generate a DataFrame as output instead of a Series:

  • Series can't have "multiple columns." Only "DataFrames" can.
  • How else would it make sense for Pandas to interpret you saying that you want to deal with "multiple columns from" our main DataFrame "df"?

Observations On "YourDataFrameNicknameHere[]":

Note that we have now seen a total of three ways that Pandas interprets the "[]" command after a reference to a DataFrame like "df," depending on what we put inside it.

  1. We can put 'ColumnNameHere' inside it.
    We'll get back a "Series" showing us the contents of "ColumnNameHere" from the "DataFrame" we nicknamed "df."
     
  2. We can put code that generates a "Series" full of True/False values inside it.
    We'll get back a "DataFrame" showing data from all of "df"'s columns, but only its rows where the "Series" values were "True."
     
  3. We can put ['ColumnName1Here','ColumnName2Here'] inside it.
    We'll get back a "DataFrame" showing data from all of "df"'s rows, but only from "df"'s columns named in the list.

If you don't yet understand all this, or still have trouble recognizing "DataFrames" vs. "Series," please review all the examples and, if needed, ask for help in the comments.


Bigger Picture: Why Do We Care?

One day, you'll want to do something that isn't in these examples.

You'll find something close on Google or StackOverflow, but you won't know exactly how it works.

If you can learn to isolate and dissect chunks of code, put them into "print()" statements, and figure out what kinds of data they produce, then you are better-equipped to figure out what kinds of data the code around them requires.

This will better let you "swap out chunks of code" from the example to suit your own needs.

Even I have to do this.

Much of what I'm teaching you in this and future blog posts came from studying this StackOverflow post's answers until I fully understood what everything in the answers does.

I had to write and run at least 40 lines of code, none of which were exactly in the examples of that StackOverflow post, before I was ready to write this blog post.

Not every command I played with even does anything useful. I just wanted to see what would come out the other side so I'd be better-equipped to write example code for you.

We'll be back to "fancy tricks against table-style data" in the next post!


Table of Contents

Wednesday, August 24, 2016

Python for Salesforce Administrators - More Scripts - Rename/Drop/Add Columns

Review: Previous Post

In our last post, we learned to use a programming language called Python and one of its "modules" (additional sets of commands) called Pandas to:

  • Display text on the "standard output" part of our screen
  • Import a CSV file from our hard drive into a special kind of data structure called a "DataFrame," which can be thought of as a virtual Excel spreadsheet
  • Display the contents of our "DataFrame" on the screen (all the data & just the first five lines)
  • Display just the "last name" column's worth of data from our "DataFrame"
  • Display, for each row, whether or not the "last name" column's worth of data started with a given letter
  • Filter out rows from our "DataFrame" whose "last name" didn't start with a given letter
  • Export a CSV file back to our hard drive containing just the rows of our original CSV whose "last name" started with a given letter

Today we'll rename columns, strip out unneeded columns, and add new columns with "constant" (unchanging from row to row) values. As a real-world example for these tasks:

  • We've exported a list of Contact records as a CSV file.
  • We need to add them all to an existing Campaign.
  • We already know the Campaign's 18-digit ID, so all we need to do is associate the Contacts' 18-digit IDs to this "Campaign ID" and insert these records into the "CampaignMember" table.
    (Nevermind that there are other easy ways to add Contacts to a Campaign...it's an example.)

Prep Work

First, make sure you've created a ".CSV" file like the one described in "First Scripts" (I called mine "sample1.csv"). It looks like this in a text editor:

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"294","Marilyn","Monroe","mm@example.com","Fox"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"827","Vandana","Shiva","vs@example.com","Navdanya"
"9284","Andrea","Smith","as@example.com","University of California"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.

Again, the first two lines of most of our examples will be:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')

This tells our code-running software that we will be using the "Pandas" module and imports data from our .CSV file into a "DataFrame" that we give a nickname of "df" for referring to later in our code.


Removing Columns

Let's get rid of the "Email" & "Company" columns. They're just clutter for the task at hand.
(With certain jobs, I prefer to leave "name" fields in the data, even if it won't be imported back into Salesforce, so I can spot-check the output of my script more easily.)

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---The contents of "df" right after import---')
print(df)
df = df.drop(['Email','Company'], axis=1)
print('---The contents of "df" after removing two columns---')
print(df)

Produces this text output:

---The contents of "df" right after import---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---The contents of "df" after removing two columns---
      Id    First      Last
0   5829    Jimmy    Buffet
1   2894  Shirley  Chisholm
2    294  Marilyn    Monroe
3  30829    Cesar    Chavez
4    827  Vandana     Shiva
5   9284   Andrea     Smith
6    724   Albert    Howard

Lines 3 & 6 just make it easier to read our script's output.

Line 4 displays the contents of our "DataFrame" that we called "df" immediately after generating it by importing our CSV file.

Line 5 of our script comes in two parts.

The first part is after the "equals" sign and means, "generate a new 'DataFrame' that is what you get when you strip columns named 'Email' and 'Company' out of the 'DataFrame' I've called 'df'."
(Trivia: In the last post, I mentioned that certain Pandas operations are typed almost identically whether you want them to act on rows or on columns - sometimes Pandas figures out which you mean, and other times you have to spell it out to get the behavior you want. ".drop()" is one of the operations that needs things spelled out. In this code, the "axis=1" specifies that we want to drop columns, not rows.)
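To see what that "axis" parameter controls, compare dropping with "axis=1" (columns, by name) against "axis=0" (rows, by their row-number labels). A quick sketch, building a tiny DataFrame in code rather than reading the CSV:

```python
import pandas

# Two rows of stand-in data, built inline.
df = pandas.DataFrame({
    'Id': ['5829', '2894'],
    'First': ['Jimmy', 'Shirley'],
    'Email': ['jb@example.com', 'sc@example.com'],
})

no_email = df.drop(['Email'], axis=1)  # axis=1: drop the column named 'Email'
no_row_0 = df.drop([0], axis=0)        # axis=0: drop the row labeled 0

print(list(no_email.columns))  # ['Id', 'First']
print(len(no_row_0))           # 1 row left
```

Same operation name, completely different result - which is exactly why ".drop()" makes you say which axis you mean.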

The second part - before and including the "equals" sign - says, "You know that new 'DataFrame' you just made me? Save it over the old contents of 'df' - I don't need those anymore; I like this 'DataFrame' better. Whenever I refer to 'df' from now on, I'm referring to this new 'DataFrame' you just made me."

Line 7 displays the contents of the new version of "df," which now only has 3 columns.


Renaming Columns

When you insert records into the "CampaignMember" table, you have to provide a "ContactId" field & a "CampaignId" field. Since our data came straight out of the "Contact" table, its ID is still just called "Id." Let's rename it to "ContactId." And let's rename "First" to "First Name" and "Last" to "Last Name" while we're at it, just to show you how to rename multiple columns at once.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df = df.drop(['Email','Company'], axis=1)
df = df.rename(columns = {'Id':'ContactId', 'First':'First Name', 'Last':'Last Name'})
print('---The contents of "df" after removing two columns and renaming columns---')
print(df)

Text output:

---The contents of "df" after removing two columns and renaming columns---
   ContactId First Name Last Name
0       5829      Jimmy    Buffet
1       2894    Shirley  Chisholm
2        294    Marilyn    Monroe
3      30829      Cesar    Chavez
4        827    Vandana     Shiva
5       9284     Andrea     Smith
6        724     Albert    Howard

Line 3 was "line 5" in our previous example. It makes a new "DataFrame" with just 3 columns (right-side of the equals sign) and overwrites the value of "df" with this new "DataFrame" (the equals sign and to its left).

Line 4 works similarly.

To the right of the equals sign, the code says, "generate a new 'DataFrame' that looks just like the contents of 'df' (as 'df' stands after line 3), only with the various columns renamed."

The second part - before and including the "equals" sign - says, "You know that new 'DataFrame' you just made me? Save it over the old contents of 'df' - I don't need those anymore; I like this 'DataFrame' better. Whenever I refer to 'df' from now on, I'm referring to this new 'DataFrame' you just made me."

Line 5 displays a label for readability.

Line 6 displays the contents of the newest version of "df," which now has 3 columns, all of which have been renamed.
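One quirk of ".rename()" worth knowing before you rely on it: if a key in your dictionary doesn't match any existing column name (say, because of a typo), Pandas by default silently leaves that column alone rather than raising an error. A small sketch with inline stand-in data:

```python
import pandas

df = pandas.DataFrame({'Id': ['5829'], 'First': ['Jimmy']})

# 'Idd' is a deliberate typo - by default, Pandas just ignores it, with no warning.
renamed = df.rename(columns={'Idd': 'ContactId', 'First': 'First Name'})
print(list(renamed.columns))  # ['Id', 'First Name'] - 'Id' never got renamed
```

So after a ".rename()", it's worth printing your DataFrame (or its ".columns") to confirm every rename you expected actually happened.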

Adding A Column With A Fixed Value

Now let's add the "18-digit Id" of the Campaign we want to put all these people into. (I'm using "xyzzy" instead for legibility.)

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df = df.drop(['Email','Company'], axis=1)
df = df.rename(columns = {'Id':'ContactId', 'First':'First Name', 'Last':'Last Name'})
df['CampaignId']='xyzzy'
print('---The contents of "df" after removing two columns, renaming columns, and adding a column.---')
print(df)

Text output:

---The contents of "df" after removing two columns and renaming columns---
   ContactId First Name Last Name CampaignId
0       5829      Jimmy    Buffet      xyzzy
1       2894    Shirley  Chisholm      xyzzy
2        294    Marilyn    Monroe      xyzzy
3      30829      Cesar    Chavez      xyzzy
4        827    Vandana     Shiva      xyzzy
5       9284     Andrea     Smith      xyzzy
6        724     Albert    Howard      xyzzy

This is the same as our previous code example, only I added a new line of code on line 5.

Note that we don't have the "df = df.something()" syntax anymore.

Instead, we put brackets on the "df" to the left of the equals sign and the value we wanted in every single row for this new column on the right of the equals sign.

That's just a difference in syntax (code-spelling & "grammar") to get used to in Pandas commands.

When you add a column, you say, "You know 'df'? Well, I'd like you to modify it, and here's how."

When you drop or rename a column, you say, "Take the old copy of 'df', make me something new based on it, and then overwrite 'df' with this 'something new.'"

(Trivia - actually, that's only sorta true. There is usually an "inplace=True" flag you can add to "df.something()" operations to make them use less space on your computer (memory) while they run - in return, you drop the "df = " part of the code. However, it's not always reliable when you run the code, so if you're not having any problems with my code examples, I don't recommend using this flag.

In other words, "df = df.drop(['Email','Company'], axis=1)" and "df.drop(['Email','Company'], axis=1, inplace=True)" get the same job done, and the 2nd is theoretically more efficient, but every once in a while with the 2nd version, when you run your code, the whole thing will crash with a random error.)

Anyway, line 5 says, "modify 'df' so that it has a new column called 'CampaignId,' and in every row in that new column, put a value of 'xyzzy'."

What's really interesting is that this is very similar to the syntax for referring to the contents of an existing column - the only difference is that the word we typed inside the square brackets didn't yet exist in our "DataFrame." Instead of giving us an error, though, Pandas just said, "Oh, you must want me to create that for you - just put an '=' to the right of your reference to this nonexistent column and a value for me to put in its rows, and I'll go right ahead & do that."
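In other words, the same bracket-assignment syntax either creates or overwrites, depending on whether the name already exists in your DataFrame. A quick sketch (with stand-in data built inline rather than read from the CSV):

```python
import pandas

df = pandas.DataFrame({'First': ['Jimmy', 'Shirley']})

df['CampaignId'] = 'xyzzy'  # 'CampaignId' doesn't exist yet -> a new column appears
df['First'] = 'REDACTED'    # 'First' already exists -> every row's value is overwritten

print(df)
```

That second line is also why a typo in a column name can quietly create a brand-new column instead of updating the one you meant - another good reason to print your DataFrame after each step.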

Line 6 displays a label for readability.

Line 7 displays the contents of the newest version of "df," which now has 4 columns, 3 of which are renamed from their original names, and 1 of which is brand new.


That's it for this post! Coming up, we'll work on more "variable" ways of adding/changing data in our DataFrame. (e.g. changing "Male" to "M" or adding a new column, but setting its value based on the values we find in existing columns.)

This is where Pandas can really start to get faster than Excel - in Excel, you would filter your CSV file, copy/paste a value over every matching row, filter it again, copy/paste again, etc.

For now, practice these, make sure you can read them, and ask in the comments if you have any questions about the code.


Table of Contents

Monday, August 22, 2016

Python for Salesforce Administrators - First Scripts - Import/Export CSVs, Filter Out Rows

Let's write a few scripts!

By the time you complete all of these exercises, you will be able to filter out unwanted rows from a CSV file and save the "good" rows as a new CSV file.

This can be very handy if you have a CSV file with so many rows that Excel freezes for 30 seconds every time you try to open or filter it.


Program 1: Displaying Text

Our first program is just one line long - I saved it on my hard drive as "c:\tempexamples\script.py" and am running it through the "play"-looking button in my "Spyder" IDE, where I am also editing it.
(Click here for playable sample...do not type your own company's data into this link!)

print('Hello World')

The text output, when I run it, looks like this:

Hello World

Preparing A CSV For The Next Exercises

Now I'm going to open Notepad and make myself a CSV file (when I do File->Save As, change the "Save as File Type" from .TXT to "All Files (*.*)") and save the file as "c:\tempexamples\sample1.csv".

Here's what that file contains (representing 5 columns worth of fields & 7 rows worth of data):

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"294","Marilyn","Monroe","mm@example.com","Fox"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"827","Vandana","Shiva","vs@example.com","Navdanya"
"9284","Andrea","Smith","as@example.com","University of California"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Program 2: Reading a CSV

Now I'm going to replace the contents of "script.py" with the following 6 lines and run them:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---Here are all 7 lines---')
print(df)
print('---Here are the first 5 lines---')
print(df.head())

The text output is:

---Here are all 7 lines---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---Here are the first 5 lines---
      Id    First      Last           Email                 Company
0   5829    Jimmy    Buffet  jb@example.com                     RCA
1   2894  Shirley  Chisholm  sc@example.com  United States Congress
2    294  Marilyn    Monroe  mm@example.com                     Fox
3  30829    Cesar    Chavez  cc@example.com     United Farm Workers
4    827  Vandana     Shiva  vs@example.com                Navdanya

Let's talk about this program a little bit.

The first thing we did was declare that we would need to be able to write code using commands from the "pandas" module.

Next, we asked the Pandas module to execute its "read_csv()" operation against a CSV file on our hard drive.

We had to put double-backslashes in place because in Python, inside of a "string" (the piece of code between '' or "" punctuation representing "plain text, not code"), a single backslash is an "escape character" - it signals that the next character should be treated specially (for example, "\n" means a line break, not a backslash followed by an "n"). To say, "I really do mean just a backslash," therefore, we need to type "\\".
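An alternative spelling you'll often see in examples online is a "raw string" - prefix the quotes with "r" and backslashes are taken literally, no doubling needed. Both spellings produce the exact same path text:

```python
# Two ways of writing the same Windows path as a Python string.
escaped = 'C:\\tempexamples\\sample1.csv'  # double each backslash
raw = r'C:\tempexamples\sample1.csv'       # r'' raw string - backslashes taken literally

print(escaped == raw)  # True - they are the identical piece of text
```

Which one you use is personal preference; my examples stick with the double-backslash style.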

In this same line, we also save the output from Pandas "reading the CSV" to a variable I decided to call "df."

It doesn't have to be called "df." It could be called "myvariable."

But because the output from a "pandas.read_csv()" operation is, to Python, a special type of data store called a "DataFrame," when you have nothing better to call the variable you're saving it to, it's convention to call that variable "df."

(Note - you might also see code examples with something like "pd.read_csv()". When you import a module, you have an option to give it a different - e.g. shorter - name. Some examples online presume you hate typing and did your first line as "import pandas as pd" so you could type "pd" instead of "pandas". I don't find 6 characters too much to type, so my examples don't follow that convention.)

Our third line of code just put a caption on the screen.

Our fourth line of code said, "show me the contents of the variable "df" as output on my screen."

The spreadsheet-looking formatting and the addition of row numbers come from the sophistication under the covers of the way "DataFrame"s are programmed to behave (thanks, authors of the "pandas" module!).

Our fifth line of code just put a caption on the screen.

Our sixth line of code said, "show me the contents of the first five lines of the variable "df" as output on my screen."

This ".head()" operation that can be run from any "DataFrame" is handy when you have, say, a 60,000-line "DataFrame," you know all rows look similar, and all you want to do is check to see what it looks like.


Program 3: Filtering a CSV

Let's write another program.

Now I want to find out which rows of my CSV file include someone whose last name starts with a capital "c." I'm going to replace the contents of "script.py" with the following code and run it:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---What is in "Last" for each row?---')
print(df['Last'])
print('---For each row, does "Last" start with capital "C"?---')
print(df['Last'].str.startswith('C'))
print('---Show all columns, but only rows where "Last" starts with capital "C"---')
print(df[df['Last'].str.startswith('C')])

The output looks like this:

---What is in "Last" for each row?---
0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object
---For each row, does "Last" start with capital "C"?---
0    False
1     True
2    False
3     True
4    False
5    False
6    False
Name: Last, dtype: bool
---Show all columns, but only rows where "Last" starts with capital "C"---
      Id    First      Last           Email                 Company
1   2894  Shirley  Chisholm  sc@example.com  United States Congress
3  30829    Cesar    Chavez  cc@example.com     United Farm Workers

The first two lines of this example do the same thing as in our last example - load up the "pandas" module and save our CSV to a special data structure called a "DataFrame" and give that "DataFrame" a nickname of "df" for further reference in later code.

Lines 3, 5, & 7 just print labels so we can see where we are in the text output.

The call to "df['Last']" in line 4 asks our DataFrame to analyze its own "Last" column and give us back another special data structure called a "Series." We don't bother to save that "series" to a variable with a name - we just pass it to the "print()" command, and Pandas does a great job formatting it on our screen.

(Note that the column header is missing - we typed "'Last'" by hand, so in theory, we know it - and that after we see all 7 rows (numbered 0-6, not 1-7!), we get a little bit of data about the Series on the next line.)

Line 6 is a copy of line 4, only we took advantage of the fact that "Series" data-structures have an operation called ".str.startswith()" that transforms every value into a True/False answering the question, "Does this value start with this?"

The text-output is formatted similarly to the output of line 4, although in the note at the end, it points out that our values are all of type "Boolean," which means true/false.

Line 8 gets a little crazy. Think of it like arithmetic, where you read parentheses from the inside out. Notice that the inside of line 8 is just the part of line 6 from inside the "print()", only now with a "df[]" wrapped around it.

What this does is ask the DataFrame we nicknamed "df" to compare itself, row by row, to the "series" we generated with code from line 6, throw away the row if the corresponding (by line-number) value was "false," and keep the row if the corresponding value was "true."

We then put all that inside a "print()" statement, and we see a 2-line DataFrame (although the original line numbering is preserved) representing everyone from our CSV file with a last name that starts with a capital "C."

---

If you looked carefully, you might have noticed that in line 4, "df[SomethingSomethingSomething]" meant, "show me columns from the DataFrame where...," but in line 8, the outer usage of "df[SomethingSomethingSomething]" meant, "show me rows from the DataFrame where..."

Confusing?

I think so.

But basically, which way Pandas will interpret the command depends on what's inside the square-brackets.

In the first case, "inside the brackets" was a string - the text "'Last'"; in the second case, "inside the brackets" was a reference to a special data structure called a "Series."

In certain cases, Pandas doesn't actually know whether you're trying to filter by row or by column, so you have to add a comma and some extra "parameters" to let it know ... but the example-code you're copying from the internet will include such text if it's necessary.
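The fully-spelled-out "comma" version usually looks like ".loc[row-filter, column-names]" - you don't need it for anything in this post, but since you'll run into it in copied examples, here's a hedged sketch with inline stand-in data:

```python
import pandas

df = pandas.DataFrame({
    'Last': ['Buffet', 'Chisholm', 'Chavez'],
    'Email': ['jb@example.com', 'sc@example.com', 'cc@example.com'],
})

# .loc[rows, columns]: before the comma picks rows, after it picks columns.
c_emails = df.loc[df['Last'].str.startswith('C'), 'Email']
print(c_emails)  # just the email addresses of the capital-"C" last names
```

When you see ".loc[]" in an example online, read it as "rows on the left of the comma, columns on the right" and the rest usually falls into place.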

My advice to Salesforce sysadmins is to simply copy/paste examples from the internet until they work and move on with your life.


Program 4: More Filtering

Let's get a little more complicated - now we're going to say "either starting with a C or starting with an S" is fine.

Again, I replace the contents of "script.py" with the following code:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---For each row, does "Last" start with capital "C" or "S"?---')
print(df['Last'].str.startswith('C') | df['Last'].str.startswith('S'))
print('---Show all columns, but only rows where "Last" starts with capital "C" or "S"---')
print(df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')])

And here is the output:

---For each row, does "Last" start with capital "C" or "S"?---
0    False
1     True
2    False
3     True
4     True
5     True
6    False
Name: Last, dtype: bool
---Show all columns, but only rows where "Last" starts with capital "C" or "S"---
      Id    First      Last           Email                   Company
1   2894  Shirley  Chisholm  sc@example.com    United States Congress
3  30829    Cesar    Chavez  cc@example.com       United Farm Workers
4    827  Vandana     Shiva  vs@example.com                  Navdanya
5   9284   Andrea     Smith  as@example.com  University of California

It's a lot like lines 6 & 8 of our last program, only we added " | df['Last'].str.startswith('S')" before the final close-parenthesis.
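That "|" symbol means "or" here. Its siblings - not used in this post, but built the same way - are "&" ("and") and "~" ("not"). A quick sketch with inline stand-in data:

```python
import pandas

df = pandas.DataFrame({
    'First': ['Shirley', 'Cesar', 'Andrea'],
    'Last': ['Chisholm', 'Chavez', 'Smith'],
})

# "&": rows where BOTH conditions are True; "~": rows where the condition is False.
both = df[df['Last'].str.startswith('C') & df['First'].str.startswith('S')]
neither = df[~df['Last'].str.startswith('C')]

print(list(both['Last']))     # ['Chisholm'] - C last name AND S first name
print(list(neither['Last']))  # ['Smith'] - last name does NOT start with C
```

Between "|", "&", and "~", you can build up most of the row-filters you'd otherwise do with repeated filtering in Excel.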


Program 5: Writing To CSV

It's all very nice to be able to filter a spreadsheet and see your output, but as a Salesforce administrator, you probably need a new CSV file so you can do something else with it and send it back to Salesforce.

Let's take our 4-line output from the last example, strip off those meaningless row numbers, and write it to disk at "C:\tempexamples\filteredoutput.csv".

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
newdf = df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')]
newdf.to_csv('C:\\tempexamples\\filteredoutput.csv', index=False, quoting=1)

Check the folder where you told the ".to_csv()" operation to save the file. You should have a file called "filteredoutput.csv" whose contents look like this (when opened in Notepad - they'll probably look like a normal spreadsheet if you open the file in Excel):

"Id","First","Last","Email","Company"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"827","Vandana","Shiva","vs@example.com","Navdanya"
"9284","Andrea","Smith","as@example.com","University of California"

In this 4-line program, again we said we wanted to use the "Pandas" module and then saved a "DataFrame" abstraction of our CSV file into a variable we named "df."

Next, because our manipulations of "df" were getting a little long-winded, we saved a copy of the output of those manipulations (which is also a DataFrame) to a variable we called "newdf."

One of the operations that all DataFrames come with is ".to_csv()". So we invoked that operation from "newdf" and gave it "parameter" values telling that operation where to save the file, not to bother writing the "row numbers" to the file, and to put quotes around each individual piece of data (between the commas).
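If you'd like to see what "index=False" and "quoting=1" actually change without touching your hard drive, ".to_csv()" called with no filename hands the CSV text back as a string instead of writing a file. ("quoting=1" corresponds to "quote everything" in Python's csv module.) A sketch with one inline row of stand-in data:

```python
import pandas

df = pandas.DataFrame({'Id': ['2894'], 'Last': ['Chisholm']})

# No filename -> .to_csv() returns the CSV text instead of writing a file.
text = df.to_csv(index=False, quoting=1)
print(text)  # every value wrapped in quotes, no row-number column
```

Try removing "index=False" or "quoting=1" and re-running to watch the row numbers reappear or the quote marks vanish.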


You Did It

Now you're using Python and Pandas!

From here, your life is largely a matter of Googling other things you want to do and including the word "pandas" in your search. A lot of it will be cryptic, but not all of it is. Also try direct searching in StackOverflow's search box for less-cryptic results.

We'll cover more complicated operations, but this is already a great start.

As I mentioned at the top, you can already put your learning into action if you frequently manipulate CSV files that are annoyingly slow to open in Excel.

For more fun filtering Pandas, if you're used to SQL or SOQL, visit this web page and play around.

Or, if you're comfortable with code, peek at some more advanced operations and see if you can replicate them against your own data.


Table of Contents