Pages

Thursday, August 25, 2016

Python for Salesforce Administrators - Fancier Row-Filtering And Data-Editing

In our last post, we talked about "DataFrames" vs. "Series" in the Pandas module of the Python programming language. Please read it before you continue with these examples.

In this post, we'll do a deeper diver into row-filtering. We'll also explore different ways to change the values of cells in your "DataFrame" (an abstraction of a CSV file).

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)


Previously, we saw that Python/Pandas commands structured as a "true/false-series-making-thing inside 'df[]'" generates a DataFrame with the same 5 columns as "df", and just the rows where the "true/false" series is "true," as in this snippet of code:

SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df[SeriesShowingWhichRowsHaveLastNameS]

Filtering Rows Based On Multiple Columns At Once

We can get fancier than that. Now we'll put two similar-but-different "true/false" Series together and use them as the filter instead.

In this example, our output will be a DataFrame showing us all rows from our main DataFrame "df" where the last name starts with "S" OR the first name starts with "M."

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
SeriesShowingWhichRowsHaveFirstNameM = df['First'].str.startswith('M')
SeriesShowingRowsWhereEitherIsTrue = SeriesShowingWhichRowsHaveLastNameS | SeriesShowingWhichRowsHaveFirstNameM
print('---SeriesShowingWhichRowsHaveLastNameS contents:---')
print(SeriesShowingWhichRowsHaveLastNameS)
print('---SeriesShowingWhichRowsHaveFirstNameM contents:---')
print(SeriesShowingWhichRowsHaveFirstNameM)
print('---SeriesWhereEitherIsTrue contents:---')
print(SeriesShowingRowsWhereEitherIsTrue)
print('---DataFrame Where Either Is True:---')
print(df[SeriesShowingRowsWhereEitherIsTrue])

Output text:

---SeriesShowingWhichRowsHaveLastNameS contents:---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---SeriesShowingWhichRowsHaveFirstNameM contents:---
0    False
1    False
2     True
3    False
4    False
5    False
6    False
Name: First, dtype: bool
---SeriesWhereEitherIsTrue contents:---
0    False
1    False
2     True
3    False
4     True
5     True
6    False
dtype: bool
---DataFrame Where Either Is True:---
     Id    First    Last           Email                   Company
2   294  Marilyn  Monroe  mm@example.com                       Fox
4   827  Vandana   Shiva  vs@example.com                  Navdanya
5  9284   Andrea   Smith  as@example.com  University of California

In this code, we make 2 Series (lines 3 & 4) that each filter the contents of one column.

Then we put them together into a 3rd Series with an "or" ("|") operator that can combine series (line 5).

Then we take a look at what's inside these Series (lines 6-11).

Note how the output from the "combined" Series we made in line 5 has 3 rows set to "True" - these are the 3 rows where either of the previous two Series had a value set to "True."

Finally, we call our trusty "df[]" on the "combined" series and take a look at it (lines 12-13).

Sure enough, we have a 3-line extract of "df" on the screen as a new DataFrame!


At this point, I could write another blog post about what exactly should go into the building-block Series you use for filters (e.g. the right-hand side of the "=" in lines 3 & 4 of the previous code example).

You can get way crazier than you can in Excel. You're definitely not limited to "starts with a capital S."

However, that'd be a pretty big diversion, so I'm not going to address it now.

If I forget to come back to the subject somewhere in this series, please post in the comments and remind me. But know that it won't be comprehensive - if you can imagine it, Google for it!


Editing Data - Modifying Every Row In 1 Column

Now let's get into writing data, not just reading it.

Remember how when we were just reading data from the DataFrame "df," "df['Last']" produced a Series but "df[CodeThatMakesATrueFalseSeries]" or "df[['Last']]" or "df[['First','Last']]" produced a DataFrame?

Similarly, when it's used for writing data (it's on the left side of an "="), exactly what you put into the "[]" impacts how Pandas behaves.

When writing data to "df," on the left-hand side of an "=" command, "df['Last']", "df[['Last']]", and "df[['First','Last']]" all behave similarly.

They all take whatever is on the right-hand side of the "=" and overwrite every row of the specified columns with that value. Take a look.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df['Last']='aaa'
df[['Email']]='bbb'
df[['Id','Company']]='ccc'
print(df)

Output text:

    Id    First Last Email Company
0  ccc    Jimmy  aaa   bbb     ccc
1  ccc  Shirley  aaa   bbb     ccc
2  ccc  Marilyn  aaa   bbb     ccc
3  ccc    Cesar  aaa   bbb     ccc
4  ccc  Vandana  aaa   bbb     ccc
5  ccc   Andrea  aaa   bbb     ccc
6  ccc   Albert  aaa   bbb     ccc

Editing Data - Adding Columns With Every Row The Same

When writing data to "df," on the left-hand side of an "=" command, "df['Last']", "df[['Last']]", and "df[['First','Last']]" behave differently from each other when the columns you specify don't exist in "df".

"df['NewColumnName'] = 'zzz'" will add a column called "NewColumnName" to our DataFrame nicknamed "df" and fill every row of that column with the text"zzz."

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df['NewColumn'] = 'zzz'
print(df)

Output text:

      Id    First      Last           Email                      Company  NewColumn
0   5829    Jimmy    Buffet  jb@example.com                          RCA        zzz
1   2894  Shirley  Chisholm  sc@example.com       United States Congress        zzz
2    294  Marilyn    Monroe  mm@example.com                          Fox        zzz
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers        zzz
4    827  Vandana     Shiva  vs@example.com                     Navdanya        zzz
5   9284   Andrea     Smith  as@example.com     University of California        zzz
6    724   Albert    Howard  ah@example.com  Imperial College of Science        zzz

(Don't worry if your output is getting so wide that your Python-running software wraps the output around to the next line, which you can detect by the presence of a "\" floating out in space to the right of your data just before it jumps to the next line. It'll look fine when you export your DataFrame to CSV.)

"df[['NewColumnName']] = 'zzz'" or "df[['NewColumnName1', 'NewColumnName2']] = 'zzz'", however, will produce errors.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df[['NewColumn']] = 'zzz'
print(df)

Output text:

Traceback (most recent call last):
   ...(lots of garble)...
KeyError: "['NewColumn'] not in index"

The takeaway: Only the one-column-name-at-a-time approach can add a new column to your DataFrame.


Editing Data - Modifying Every Column In Certain Rows

What about "df[CodeThatMakesATrueFalseSeries]"? How does it behave on the left side of an "="?

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df[SeriesShowingWhichRowsHaveLastNameS]='aaa'
print(df)

Output text:

      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    aaa      aaa       aaa             aaa                          aaa
5    aaa      aaa       aaa             aaa                          aaa
6    724   Albert    Howard  ah@example.com  Imperial College of Science

Hmm...not sure how useful this is. (If we really didn't want a row, why not just filter it out, rather than filling it in with random things?) But at least now we know how it behaves.

"df[CodeThatMakesATrueFalseSeries]" takes whatever is on the right-hand side of the "=" and overwrites every column of "df" with with that value in the rows where the "Series" has a value of true.


Editing Data - Modifying Certain Columns In Certain Rows

What we often want, when editing data in a CSV file, is to make changes to values in a certain column but only for certain rows. Examples:

  • We want to inspect every row's value in a "Suffix" column and, if the value for that row is "M.D." or "Ph.D.", change that row's value in the "Prefix" column to "Dr."
  • We want to inspect every row's value in the "Email" column and, if the value for that row is obviously bad, change that row's value in the "Email" column to something better.
    (The "Email" example requires fancier "filter" and fancier "data-change" commands than we are going to cover right now, but the gist of solving these two problems is the same.)

Let's think through this.

We've seen that we can use "df[...]" to display just rows or just columns of a DataFrame called "df." We've seen that we can use "df[...]" to change just rows or just columns of a DataFrame called "df" if we put it on the left-hand side of an "=".

Is there any way we could just cleverly combine all this?
(The answer is "no," but let's see for ourselves.)

Can we...just do "df[...][...]" = 'zzz'"?

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df[SeriesShowingWhichRowsHaveLastNameS]['Email']='ccc'

Output text:

C:/tempexamples/script.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  print(((df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'))

Nope. Well ... what if we put "()" parentheses around 's around df[SeriesShowingWhichRowsHaveLastNameS] to make it clear that it's a "DataFrame" and thus should be able to have its own "[]" after it?

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
(df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'

Output text:

C:/tempexamples/script.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  print(((df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'))

Nope. Same error. It looks like Pandas didn't need the parentheses for clarification.

The problem seems to be that Pandas just won't let any old "YourDataFrameHere[]" be on the left-hand side of an "=".

Certain DataFrames that came out of other DataFrames are considered "slices" and aren't allowed to be written with the "YourDataFrameHere[] = 'zzz'" style of commands.

(Trivia: those "slices" can, however, be read with the "YourDataFrameHere[]" syntax. Try changing line 4 of the previous example to "print(df[SeriesShowingWhichRowsHaveLastNameS]['Email'])" and see what you get. Also note whether it is a "Series" or a "DataFrame" and ponder why. If you have any questions, ask me in the comments.)

---

It looks like we're going to need something more complicated than "YourDataFrameHere[] = 'zzz'" to do this "both-row-and-column-based" editing.

We need something that lets us clearly specify which part of our code is filtering our rows, and which part of our code is specifying how to change the matching columns.

Enter the "df.loc[]" command.

This command asks for TWO values:

  1. A "true/false Series" for using as a filter against the rows of "df" (e.g. our "SeriesShowingWhichRowsHaveLastNameS" series)
  2. A column name (or list of column names) in need of displaying/editing)

To edit values, the whole "df.loc[...,...]" command goes on the left side of an "=", with whatever value you put on the right side of the "=" being what value you want filled in in the matching rows.

Note that you can only put one value on the right-hand side of the "=", so if you have different things to put in different parts of your DataFrame nicknamed "df," you'll need to do back-to-back "df.loc[...,...] = 'zzz'" statements, one apiece.

Let's have a look - we'll do the same "aaa bbb ccc" replacement as two examples ago, only we'll just do it to people whose last names start with "S":

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df.loc[SeriesShowingWhichRowsHaveLastNameS,'Last'] = 'aaa'
df.loc[SeriesShowingWhichRowsHaveLastNameS,['Email']] = 'bbb'
df.loc[SeriesShowingWhichRowsHaveLastNameS,['Id','Company']] = 'ccc'
print(df)

Output text:

      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    ccc  Vandana       aaa             bbb                          ccc
5    ccc   Andrea       aaa             bbb                          ccc
6    724   Albert    Howard  ah@example.com  Imperial College of Science

The Takeaway:

  • For simple, every-column changes, you can use a simple "df[...] = 'zzz'" syntax.
  • For "depends on row AND column" changes, you need to say "df.loc[YourFilteringSeries,AColumnNameOrFew] = 'blahblahblah'".

Editing Data - Adding A Column With Data Only In Certain Rows

We've used the simpler style "df['NewColumnName'] = 'zzz'" before to add a "constant" value to our DataFrame "df."

But what would we do if we wanted to add a column that was full of blank values, except in certain rows?

Again, we can use our "df.loc[...,...]" style.

Just like "df['NewColumnName'] = 'zzz'" works, but ""df[['NewColumnName']] = 'zzz'" and "df['NewColumnName1','NewColumnName2'] = 'zzz'" don't work, "df.loc[...,...]" has similar restrictions.

If a column doesn't yet exist, the second value passed to "df.loc[...,...]" has to be a plain-old single column name, not a list of column names.

If we want to add multiple similar columns, we'll have to do have back-to-back statements, one apiece.

In other words, this won't work:

df.loc[SeriesShowingWhichRowsHaveLastNameS,['NewColumn1','NewColumn2']] = 'zzz'

But this will:

df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'zzz'
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'zzz'

Let's see it in action.

Example code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'yyy'
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'zzz'
print(df)

Output text:

      Id    First      Last           Email                      Company  NewColumn1 NewColumn2  
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN        NaN  
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN        NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         NaN        NaN 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN        NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy        zzz 
5   9284   Andrea     Smith  as@example.com     University of California         yyy        zzz 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         NaN        NaN 

"NaN" is Pandas-language for, "nothing here - not even an empty-piece-of-text. Just NOTHING."

The "NaN" spots will come out as blank areas (as will "empty-pieces-of-text") when you export your DataFrame to CSV, so don't worry about them in your DataFrame.


Editing Data - Adding Complexly-Filled-In Columns

  • Q: What if we want a "NewColumn1" + "NewColumn2" where "people whose last name starts with S" get "yyy," "people whose first name starts with A or M" get "zzz," and "everyone else" gets a blank value?
  • A: We put together the previous two examples. First we add the columns, then we edit them.

But look carefully at our data - "Andrea Smith" falls into both categories. So whether Andrea ends up with "yyy" or "zzz" will depend on the last command we run that modifies Andrea's data. In other words, it will depend on the order in which we type our commands.

Here's one example where she ends up with "zzz" (because that's the command we run last):

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S')
SeriesShowingWhichRowsHaveFirstNameAOrM = df['First'].str.startswith('A') | df['First'].str.startswith('M')
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'yyy' # Column name doesn't exist yet
print('---How "df" looks after adding NewColumn1---')
print(df)
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'yyy' # Column name doesn't exist yet
print('---How "df" looks after adding NewColumn2---')
print(df)
df.loc[SeriesShowingWhichRowsHaveFirstNameAOrM,['NewColumn1','NewColumn2']] = 'zzz' # Column name already exists now
print('---How "df" looks after editing NewColumn1 & NewColumn2---')
print(df)

Output text:

---How "df" looks after adding NewColumn1---
      Id    First      Last           Email                      Company  NewColumn1 
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN 
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         NaN 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy 
5   9284   Andrea     Smith  as@example.com     University of California         yyy 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         NaN 
---How "df" looks after adding NewColumn2---
      Id    First      Last           Email                      Company  NewColumn1  NewColumn2  
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN         NaN  
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN         NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         NaN         NaN 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN         NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy         yyy 
5   9284   Andrea     Smith  as@example.com     University of California         yyy         yyy 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         NaN         NaN 
---How "df" looks after editing NewColumn1 & NewColumn2---
      Id    First      Last           Email                      Company  NewColumn1  NewColumn2  
0   5829    Jimmy    Buffet  jb@example.com                          RCA         NaN         NaN  
1   2894  Shirley  Chisholm  sc@example.com       United States Congress         NaN         NaN 
2    294  Marilyn    Monroe  mm@example.com                          Fox         zzz         zzz 
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers         NaN         NaN 
4    827  Vandana     Shiva  vs@example.com                     Navdanya         yyy         yyy 
5   9284   Andrea     Smith  as@example.com     University of California         zzz         zzz 
6    724   Albert    Howard  ah@example.com  Imperial College of Science         zzz         zzz 

Simple as that!

Take a look at how the data changed in the new columns, and for whom it changed to what, over the 3 commands.

Just to make sure you're still awake, I threw in a few tricks we've learned before in the code.

For example, we know we can "OR"-together various series with the "|" operator, so on Line 4, instead of giving nicknames to "df['First'].str.startswith('A')" & "df['First'].str.startswith('M')" and putting the nicknames together with "|", I just put them together directly with "|" and saved the resulting series to an "A or M" nickname.

Furthermore, we know that "NewColumn1" & "NewColumn2" don't yet exist in "df" by line 5, so we can't combine them into a list for the 2nd "..." of a "df.loc[...,...] = 'yyy'" command. Each one has to be added to "df" separately.

However, both columns exist by the time we get to line 11, both columns exist. And we just so happen to want to put the same fixed value (right-hand side of the "=") into both columns. So in that case, we can pass a list to the 2nd "..." of "df.loc[...,...]"

(If we wanted different things in each column - say, "zzz" in NewColumn1 and capital-"ZZZ" in NewColumn2, we would have to do it on separate lines.)

This whole "write a different command on a new line for each different way you want to edit your data" approach isn't actually too different from how you probably do things in Excel.

  • You would normally add "Filtering" to the CSV file, expose/hide various rows, copy/paste a value into a certain column for all exposed rows, then re-filter the spreadsheet some other way, do it again, etc.
  • The difference is that with Python & Pandas, once you've written the code, it can run much faster than you can click your way around Excel.

Editing Data: Putting Complicated Values Into The Cells

In Excel, when you're "filtering and copying/pasting," you aren't always using a fixed value. Oftentimes, you use a formula to compute the value that should go into each cell.

You can do this in Pandas, too, with the ".apply(lambda ...)" operation.

".apply(lambda x : CodeThatMakesAValueGoesHere)" is something that can go onto the end of a Series (or code that generates a Series).

It generates a new Series with the same number of items as the Series it's been tacked onto.

To build the contents of that new Series, ".apply()" looks at each member of the old Series. While it's looking at any given member, it decides to call that member "x."

Based on the contents of "CodeThatMakesAValueGoesHere," it decides (perhaps based on that member's value) what should be in the corresponding position for the new Series instead of "x."

Here's a really simple example that isn't very useful:

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series
print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---')
print(SeriesShowingWhichRowsHaveLastNameS)
SeriesWithZZZ = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'zzz')
print('---Contents of "SeriesWithZZZ":---')
print(SeriesWithZZZ)

Output text:

---Contents of "SeriesShowingWhichRowsHaveLastNameS":---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---Contents of "SeriesWithZZZ":---
0    zzz
1    zzz
2    zzz
3    zzz
4    zzz
5    zzz
6    zzz
Name: Last, dtype: object

You can see that even though we had all sorts of nice True/False values in "SeriesShowingWhichRowsHaveLastNameS", "SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'zzz')" generated a Series that just blasted over all of them and put "zzz" in every row.

---

Let's replace our "'zzz'", which served as our contents of "CodeThatMakesAValueGoesHere" in our last example, with something fancier.

This is a special kind of if-then-else code in Python that can be inserted into your code wherever you would normally just type a value like "'zzz'".

Instead of "'zzz'", we'll say "'YesLastS' if x==True else 'NotLastS'".

We can use "x" in this code, because that's how "lambda x: CodeThatMakesAValueGoesHere" works. You can refer to "x" in your "CodeThatMakesAValueGoesHere."
(You can also give "x" a cooler name - just remember to refer to it by the same name on both sides of the colon.)

Our "'YesLastS' if x==True else 'NotLastS'" is a little hard to read, but all it means is, "If the value of whatever Pandas currently has in "X" is true, then generate a value of "YesLastS." Otherwise, generate a value of "NotLastS."

Let's see what our new code does.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series
print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---')
print(SeriesShowingWhichRowsHaveLastNameS)
SeriesWithYesNoLastNameS = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS')
print('---Contents of "SeriesWithYesNoLastNameS":---')
print(SeriesWithYesNoLastNameS)

Output text:

---Contents of "SeriesShowingWhichRowsHaveLastNameS":---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---Contents of "SeriesWithYesNoLastNameS":---
0    NotLastS
1    NotLastS
2    NotLastS
3    NotLastS
4    YesLastS
5    YesLastS
6    NotLastS
Name: Last, dtype: object

Great! We have now generated a Series (an ordinary data-containing one, not a "true/false"-containing one) where position-numbers corresponding to records from "df" whose last name starts with "S" say "YesLastS" and other records say "NotLastS".
The code that did this was "SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS')"
"SeriesShowingWhichRowsHaveLastNameS" is a "true/false"-containing Series, and then we asked ".apply(lambda...)" to go through that series, one by one, and generate a new series depending on what was found in each position of "SeriesShowingWhichRowsHaveLastNameS."

---

  • Q: Why did we need such a fancy Series?
  • A: Because a "Series" can go on the right-hand side of an "=".

Watch. This is a long example, so read all the code and look at all the output carefully. It combines everything we've learned in this post.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
df = df.drop(['Id','Email','Company'], axis=1)
print('---Contents of "df" after dropping columns:---')
print(df)
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series
SeriesShowingWhichRowsHaveFirstNameA = df['First'].str.startswith('A') # True/False-typed series
SeriesWithYesNoLastNameS = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS') # Series with data in it
MiscSeries=pandas.Series(['aaa','bbb','ccc','ddd','eee','fff','ggg'], index=[0,1,2,3,4,5,6], name='AnyOldName')
print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---')
print(SeriesShowingWhichRowsHaveLastNameS)
print('---Contents of "SeriesShowingWhichRowsHaveFirstNameA":---')
print(SeriesShowingWhichRowsHaveFirstNameA)
print('---Contents of "SeriesWithYesNoLastNameS":---')
print(SeriesWithYesNoLastNameS)
print('---Contents of "MiscSeries"---')
print(MiscSeries)
df['NewCol1'] = SeriesWithYesNoLastNameS # Add "NewCol1" & set all rows to right-hand side
print('---Contents of "df" after adding "NewCol1":---')
print(df)
df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol2'] = SeriesWithYesNoLastNameS # Add "NewCol2" & set only certain rows to right-hand side
print('---Contents of "df" after adding "NewCol2":---')
print(df)
df['NewCol3'] = MiscSeries
print('---Contents of "df" after adding "NewCol3":---')
print(df)
df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol4'] = MiscSeries
print('---Contents of "df" after adding "NewCol4":---')
print(df)

Output text:

---Contents of "df" after dropping columns:---
     First      Last
0    Jimmy    Buffet
1  Shirley  Chisholm
2  Marilyn    Monroe
3    Cesar    Chavez
4  Vandana     Shiva
5   Andrea     Smith
6   Albert    Howard
---Contents of "SeriesShowingWhichRowsHaveLastNameS":---
0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool
---Contents of "SeriesShowingWhichRowsHaveFirstNameA":---
0    False
1    False
2    False
3    False
4    False
5     True
6     True
Name: First, dtype: bool
---Contents of "SeriesWithYesNoLastNameS":---
0    NotLastS
1    NotLastS
2    NotLastS
3    NotLastS
4    YesLastS
5    YesLastS
6    NotLastS
Name: Last, dtype: object
---Contents of "MiscSeries"---
0    aaa
1    bbb
2    ccc
3    ddd
4    eee
5    fff
6    ggg
Name: AnyOldName, dtype: object
---Contents of "df" after adding "NewCol1":---
     First      Last   NewCol1
0    Jimmy    Buffet  NotLastS
1  Shirley  Chisholm  NotLastS
2  Marilyn    Monroe  NotLastS
3    Cesar    Chavez  NotLastS
4  Vandana     Shiva  YesLastS
5   Andrea     Smith  YesLastS
6   Albert    Howard  NotLastS
---Contents of "df" after adding "NewCol2":---
     First      Last   NewCol1   NewCol2
0    Jimmy    Buffet  NotLastS       NaN
1  Shirley  Chisholm  NotLastS       NaN
2  Marilyn    Monroe  NotLastS       NaN
3    Cesar    Chavez  NotLastS       NaN
4  Vandana     Shiva  YesLastS       NaN
5   Andrea     Smith  YesLastS  YesLastS
6   Albert    Howard  NotLastS  NotLastS
---Contents of "df" after adding "NewCol3":---
     First      Last   NewCol1   NewCol2 NewCol3
0    Jimmy    Buffet  NotLastS       NaN     aaa
1  Shirley  Chisholm  NotLastS       NaN     bbb
2  Marilyn    Monroe  NotLastS       NaN     ccc
3    Cesar    Chavez  NotLastS       NaN     ddd
4  Vandana     Shiva  YesLastS       NaN     eee
5   Andrea     Smith  YesLastS  YesLastS     fff
6   Albert    Howard  NotLastS  NotLastS     ggg
---Contents of "df" after adding "NewCol4":---
     First      Last   NewCol1   NewCol2 NewCol3 NewCol4
0    Jimmy    Buffet  NotLastS       NaN     aaa     NaN
1  Shirley  Chisholm  NotLastS       NaN     bbb     NaN
2  Marilyn    Monroe  NotLastS       NaN     ccc     NaN
3    Cesar    Chavez  NotLastS       NaN     ddd     NaN
4  Vandana     Shiva  YesLastS       NaN     eee     NaN
5   Andrea     Smith  YesLastS  YesLastS     fff     fff
6   Albert    Howard  NotLastS  NotLastS     ggg     ggg

Both our line that adds a column called "NewCol1" and our line that adds a column called "NewCol2" show different values for "Andrea Smith" (whose last name begins with "S") and "Albert Howard" (whose last name doesn't).

The difference is how we added the new lines.

  1. With our simple "df['NewCol1'] = ..." command, we said, "add whatever is on the right-hand side of the '=' to every row in the new column."
    • Since "whatever was on the right-hand side" was a "Series" instead of a simple value like "'zzz'", "the values from positions 0-6" got filled in for "rows 0-6" of the new column, respectively.
      (You can also see this in action with the code that adds "NewCol3" based on the contents of "MiscSeries").
      • Q: "But wait...what if the Series on the right-hand side had had a different number of values than there are rows in the DataFrame "df"?
      • A: Python would complain or you'd get unexpected data back. Don't.
  2. On the other hand, with our fancy "df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol2'] = ..." command, we said:
    "Add whatever is on the right-hand side of the '=' to the rows in the new column where the 'df' DataFrame's first name starts with 'A'. And in the rows that don't, just leave the cells blank."
    (You can also see this in action in with the code that adds "NewCol4" based on the contents of "MiscSeries" AND based on "SeriesShowingWhichRowsHaveFirstNameA".)

That's a wrap for this post's "fancy filtering and editing."

We are probably ready to move on to "combining two CSV files" - stay tuned.


Table of Contents

No comments:

Post a Comment