In our last post, we talked about "DataFrames" vs. "Series" in the Pandas module of the Python programming language. Please read it before you continue with these examples.
In this post, we'll do a deeper diver into row-filtering. We'll also explore different ways to change the values of cells in your "DataFrame" (an abstraction of a CSV file).
(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)
Previously, we saw that Python/Pandas commands structured as a "true/false-series-making-thing inside 'df[]'" generates a DataFrame with the same 5 columns as "df", and just the rows where the "true/false" series is "true," as in this snippet of code:
SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') df[SeriesShowingWhichRowsHaveLastNameS]
Filtering Rows Based On Multiple Columns At Once
We can get fancier than that. Now we'll put two similar-but-different "true/false" Series together and use them as the filter instead.
In this example, our output will be a DataFrame showing us all rows from our main DataFrame "df" where the last name starts with "S" OR the first name starts with "M."
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') SeriesShowingWhichRowsHaveFirstNameM = df['First'].str.startswith('M') SeriesShowingRowsWhereEitherIsTrue = SeriesShowingWhichRowsHaveLastNameS | SeriesShowingWhichRowsHaveFirstNameM print('---SeriesShowingWhichRowsHaveLastNameS contents:---') print(SeriesShowingWhichRowsHaveLastNameS) print('---SeriesShowingWhichRowsHaveFirstNameM contents:---') print(SeriesShowingWhichRowsHaveFirstNameM) print('---SeriesWhereEitherIsTrue contents:---') print(SeriesShowingRowsWhereEitherIsTrue) print('---DataFrame Where Either Is True:---') print(df[SeriesShowingRowsWhereEitherIsTrue])
Output text:
---SeriesShowingWhichRowsHaveLastNameS contents:--- 0 False 1 False 2 False 3 False 4 True 5 True 6 False Name: Last, dtype: bool ---SeriesShowingWhichRowsHaveFirstNameM contents:--- 0 False 1 False 2 True 3 False 4 False 5 False 6 False Name: First, dtype: bool ---SeriesWhereEitherIsTrue contents:--- 0 False 1 False 2 True 3 False 4 True 5 True 6 False dtype: bool ---DataFrame Where Either Is True:--- Id First Last Email Company 2 294 Marilyn Monroe mm@example.com Fox 4 827 Vandana Shiva vs@example.com Navdanya 5 9284 Andrea Smith as@example.com University of California
In this code, we make 2 Series (lines 3 & 4) that each filter the contents of one column.
Then we put them together into a 3rd Series with an "or" ("|") operator that can combine series (line 5).
Then we take a look at what's inside these Series (lines 6-11).
Note how the output from the "combined" Series we made in line 5 has 3 rows set to "True" - these are the 3 rows where either of the previous two Series had a value set to "True."
Finally, we call our trusty "df[]" on the "combined" series and take a look at it (lines 12-13).
Sure enough, we have a 3-line extract of "df" on the screen as a new DataFrame!
At this point, I could write another blog post about what exactly should go into the building-block Series you use for filters (e.g. the right-hand side of the "=" in lines 3 & 4 of the previous code example).
You can get way crazier than you can in Excel. You're definitely not limited to "starts with a capital S."
However, that'd be a pretty big diversion, so I'm not going to address it now.
If I forget to come back to the subject somewhere in this series, please post in the comments and remind me. But know that it won't be comprehensive - if you can imagine it, Google for it!
Editing Data - Modifying Every Row In 1 Column
Now let's get into writing data, not just reading it.
Remember how when we were just reading data from the DataFrame "df," "df['Last']" produced a Series but "df[CodeThatMakesATrueFalseSeries]" or "df[['Last']]" or "df[['First','Last']]" produced a DataFrame?
Similarly, when it's used for writing data (it's on the left side of an "="), exactly what you put into the "[]" impacts how Pandas behaves.
When writing data to "df," on the left-hand side of an "=" command, "df['Last']", "df[['Last']]", and "df[['First','Last']]" all behave similarly.
They all take whatever is on the right-hand side of the "=" and overwrite every row of the specified columns with that value. Take a look.
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') df['Last']='aaa' df[['Email']]='bbb' df[['Id','Company']]='ccc' print(df)
Output text:
Id First Last Email Company 0 ccc Jimmy aaa bbb ccc 1 ccc Shirley aaa bbb ccc 2 ccc Marilyn aaa bbb ccc 3 ccc Cesar aaa bbb ccc 4 ccc Vandana aaa bbb ccc 5 ccc Andrea aaa bbb ccc 6 ccc Albert aaa bbb ccc
Editing Data - Adding Columns With Every Row The Same
When writing data to "df," on the left-hand side of an "=" command, "df['Last']", "df[['Last']]", and "df[['First','Last']]" behave differently from each other when the columns you specify don't exist in "df".
"df['NewColumnName'] = 'zzz'" will add a column called "NewColumnName" to our DataFrame nicknamed "df" and fill every row of that column with the text"zzz."
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') df['NewColumn'] = 'zzz' print(df)
Output text:
Id First Last Email Company NewColumn 0 5829 Jimmy Buffet jb@example.com RCA zzz 1 2894 Shirley Chisholm sc@example.com United States Congress zzz 2 294 Marilyn Monroe mm@example.com Fox zzz 3 30829 Cesar Chavez cc@example.com United Farm Workers zzz 4 827 Vandana Shiva vs@example.com Navdanya zzz 5 9284 Andrea Smith as@example.com University of California zzz 6 724 Albert Howard ah@example.com Imperial College of Science zzz
(Don't worry if your output is getting so wide that your Python-running software wraps the output around to the next line, which you can detect by the presence of a "\" floating out in space to the right of your data just before it jumps to the next line. It'll look fine when you export your DataFrame to CSV.)
"df[['NewColumnName']] = 'zzz'" or "df[['NewColumnName1', 'NewColumnName2']] = 'zzz'", however, will produce errors.
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') df[['NewColumn']] = 'zzz' print(df)
Output text:
Traceback (most recent call last): ...(lots of garble)... KeyError: "['NewColumn'] not in index"
The takeaway: Only the one-column-name-at-a-time approach can add a new column to your DataFrame.
Editing Data - Modifying Every Column In Certain Rows
What about "df[CodeThatMakesATrueFalseSeries]"? How does it behave on the left side of an "="?
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') df[SeriesShowingWhichRowsHaveLastNameS]='aaa' print(df)
Output text:
Id First Last Email Company 0 5829 Jimmy Buffet jb@example.com RCA 1 2894 Shirley Chisholm sc@example.com United States Congress 2 294 Marilyn Monroe mm@example.com Fox 3 30829 Cesar Chavez cc@example.com United Farm Workers 4 aaa aaa aaa aaa aaa 5 aaa aaa aaa aaa aaa 6 724 Albert Howard ah@example.com Imperial College of Science
Hmm...not sure how useful this is. (If we really didn't want a row, why not just filter it out, rather than filling it in with random things?) But at least now we know how it behaves.
"df[CodeThatMakesATrueFalseSeries]" takes whatever is on the right-hand side of the "=" and overwrites every column of "df" with with that value in the rows where the "Series" has a value of true.
Editing Data - Modifying Certain Columns In Certain Rows
What we often want, when editing data in a CSV file, is to make changes to values in a certain column but only for certain rows. Examples:
- We want to inspect every row's value in a "Suffix" column and, if the value for that row is "M.D." or "Ph.D.", change that row's value in the "Prefix" column to "Dr."
- We want to inspect every row's value in the "Email" column and, if the value for that row is obviously bad, change that row's value in the "Email" column to something better.
(The "Email" example requires fancier "filter" and fancier "data-change" commands than we are going to cover right now, but the gist of solving these two problems is the same.)
Let's think through this.
We've seen that we can use "df[...]" to display just rows or just columns of a DataFrame called "df." We've seen that we can use "df[...]" to change just rows or just columns of a DataFrame called "df" if we put it on the left-hand side of an "=".
Is there any way we could just cleverly combine all this?
(The answer is "no," but let's see for ourselves.)
Can we...just do "df[...][...]" = 'zzz'"?
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') df[SeriesShowingWhichRowsHaveLastNameS]['Email']='ccc'
Output text:
C:/tempexamples/script.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy print(((df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'))
Nope. Well ... what if we put "()" parentheses around 's around df[SeriesShowingWhichRowsHaveLastNameS] to make it clear that it's a "DataFrame" and thus should be able to have its own "[]" after it?
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') (df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'
Output text:
C:/tempexamples/script.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy print(((df[SeriesShowingWhichRowsHaveLastNameS])['Email']='ccc'))
Nope. Same error. It looks like Pandas didn't need the parentheses for clarification.
The problem seems to be that Pandas just won't let any old "YourDataFrameHere[]" be on the left-hand side of an "=".
Certain DataFrames that came out of other DataFrames are considered "slices" and aren't allowed to be written with the "YourDataFrameHere[] = 'zzz'" style of commands.
(Trivia: those "slices" can, however, be read with the "YourDataFrameHere[]" syntax. Try changing line 4 of the previous example to "print(df[SeriesShowingWhichRowsHaveLastNameS]['Email'])" and see what you get. Also note whether it is a "Series" or a "DataFrame" and ponder why. If you have any questions, ask me in the comments.)
---
It looks like we're going to need something more complicated than "YourDataFrameHere[] = 'zzz'" to do this "both-row-and-column-based" editing.
We need something that lets us clearly specify which part of our code is filtering our rows, and which part of our code is specifying how to change the matching columns.
Enter the "df.loc[]" command.
This command asks for TWO values:
- A "true/false Series" for using as a filter against the rows of "df" (e.g. our "SeriesShowingWhichRowsHaveLastNameS" series)
- A column name (or list of column names) in need of displaying/editing)
To edit values, the whole "df.loc[...,...]" command goes on the left side of an "=", with whatever value you put on the right side of the "=" being what value you want filled in in the matching rows.
Note that you can only put one value on the right-hand side of the "=", so if you have different things to put in different parts of your DataFrame nicknamed "df," you'll need to do back-to-back "df.loc[...,...] = 'zzz'" statements, one apiece.
Let's have a look - we'll do the same "aaa bbb ccc" replacement as two examples ago, only we'll just do it to people whose last names start with "S":
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') df.loc[SeriesShowingWhichRowsHaveLastNameS,'Last'] = 'aaa' df.loc[SeriesShowingWhichRowsHaveLastNameS,['Email']] = 'bbb' df.loc[SeriesShowingWhichRowsHaveLastNameS,['Id','Company']] = 'ccc' print(df)
Output text:
Id First Last Email Company 0 5829 Jimmy Buffet jb@example.com RCA 1 2894 Shirley Chisholm sc@example.com United States Congress 2 294 Marilyn Monroe mm@example.com Fox 3 30829 Cesar Chavez cc@example.com United Farm Workers 4 ccc Vandana aaa bbb ccc 5 ccc Andrea aaa bbb ccc 6 724 Albert Howard ah@example.com Imperial College of Science
The Takeaway:
- For simple, every-column changes, you can use a simple "df[...] = 'zzz'" syntax.
- For "depends on row AND column" changes, you need to say "df.loc[YourFilteringSeries,AColumnNameOrFew] = 'blahblahblah'".
Editing Data - Adding A Column With Data Only In Certain Rows
We've used the simpler style "df['NewColumnName'] = 'zzz'" before to add a "constant" value to our DataFrame "df."
But what would we do if we wanted to add a column that was full of blank values, except in certain rows?
Again, we can use our "df.loc[...,...]" style.
Just like "df['NewColumnName'] = 'zzz'" works, but ""df[['NewColumnName']] = 'zzz'" and "df['NewColumnName1','NewColumnName2'] = 'zzz'" don't work, "df.loc[...,...]" has similar restrictions.
If a column doesn't yet exist, the second value passed to "df.loc[...,...]" has to be a plain-old single column name, not a list of column names.
If we want to add multiple similar columns, we'll have to do have back-to-back statements, one apiece.
In other words, this won't work:
df.loc[SeriesShowingWhichRowsHaveLastNameS,['NewColumn1','NewColumn2']] = 'zzz'
But this will:
df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'zzz' df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'zzz'
Let's see it in action.
Example code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'yyy' df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'zzz' print(df)
Output text:
Id First Last Email Company NewColumn1 NewColumn2 0 5829 Jimmy Buffet jb@example.com RCA NaN NaN 1 2894 Shirley Chisholm sc@example.com United States Congress NaN NaN 2 294 Marilyn Monroe mm@example.com Fox NaN NaN 3 30829 Cesar Chavez cc@example.com United Farm Workers NaN NaN 4 827 Vandana Shiva vs@example.com Navdanya yyy zzz 5 9284 Andrea Smith as@example.com University of California yyy zzz 6 724 Albert Howard ah@example.com Imperial College of Science NaN NaN
"NaN" is Pandas-language for, "nothing here - not even an empty-piece-of-text. Just NOTHING."
The "NaN" spots will come out as blank areas (as will "empty-pieces-of-text") when you export your DataFrame to CSV, so don't worry about them in your DataFrame.
Editing Data - Adding Complexly-Filled-In Columns
- Q: What if we want a "NewColumn1" + "NewColumn2" where "people whose last name starts with S" get "yyy," "people whose first name starts with A or M" get "zzz," and "everyone else" gets a blank value?
- A: We put together the previous two examples. First we add the columns, then we edit them.
But look carefully at our data - "Andrea Smith" falls into both categories. So whether Andrea ends up with "yyy" or "zzz" will depend on the last command we run that modifies Andrea's data. In other words, it will depend on the order in which we type our commands.
Here's one example where she ends up with "zzz" (because that's the command we run last):
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') SeriesShowingWhichRowsHaveFirstNameAOrM = df['First'].str.startswith('A') | df['First'].str.startswith('M') df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn1'] = 'yyy' # Column name doesn't exist yet print('---How "df" looks after adding NewColumn1---') print(df) df.loc[SeriesShowingWhichRowsHaveLastNameS,'NewColumn2'] = 'yyy' # Column name doesn't exist yet print('---How "df" looks after adding NewColumn2---') print(df) df.loc[SeriesShowingWhichRowsHaveFirstNameAOrM,['NewColumn1','NewColumn2']] = 'zzz' # Column name already exists now print('---How "df" looks after editing NewColumn1 & NewColumn2---') print(df)
Output text:
---How "df" looks after adding NewColumn1--- Id First Last Email Company NewColumn1 0 5829 Jimmy Buffet jb@example.com RCA NaN 1 2894 Shirley Chisholm sc@example.com United States Congress NaN 2 294 Marilyn Monroe mm@example.com Fox NaN 3 30829 Cesar Chavez cc@example.com United Farm Workers NaN 4 827 Vandana Shiva vs@example.com Navdanya yyy 5 9284 Andrea Smith as@example.com University of California yyy 6 724 Albert Howard ah@example.com Imperial College of Science NaN ---How "df" looks after adding NewColumn2--- Id First Last Email Company NewColumn1 NewColumn2 0 5829 Jimmy Buffet jb@example.com RCA NaN NaN 1 2894 Shirley Chisholm sc@example.com United States Congress NaN NaN 2 294 Marilyn Monroe mm@example.com Fox NaN NaN 3 30829 Cesar Chavez cc@example.com United Farm Workers NaN NaN 4 827 Vandana Shiva vs@example.com Navdanya yyy yyy 5 9284 Andrea Smith as@example.com University of California yyy yyy 6 724 Albert Howard ah@example.com Imperial College of Science NaN NaN ---How "df" looks after editing NewColumn1 & NewColumn2--- Id First Last Email Company NewColumn1 NewColumn2 0 5829 Jimmy Buffet jb@example.com RCA NaN NaN 1 2894 Shirley Chisholm sc@example.com United States Congress NaN NaN 2 294 Marilyn Monroe mm@example.com Fox zzz zzz 3 30829 Cesar Chavez cc@example.com United Farm Workers NaN NaN 4 827 Vandana Shiva vs@example.com Navdanya yyy yyy 5 9284 Andrea Smith as@example.com University of California zzz zzz 6 724 Albert Howard ah@example.com Imperial College of Science zzz zzz
Simple as that!
Take a look at how the data changed in the new columns, and for whom it changed to what, over the 3 commands.
Just to make sure you're still awake, I threw in a few tricks we've learned before in the code.
For example, we know we can "OR"-together various series with the "|" operator, so on Line 4, instead of giving nicknames to "df['First'].str.startswith('A')" & "df['First'].str.startswith('M')" and putting the nicknames together with "|", I just put them together directly with "|" and saved the resulting series to an "A or M" nickname.
Furthermore, we know that "NewColumn1" & "NewColumn2" don't yet exist in "df" by line 5, so we can't combine them into a list for the 2nd "..." of a "df.loc[...,...] = 'yyy'" command. Each one has to be added to "df" separately.
However, both columns exist by the time we get to line 11, both columns exist. And we just so happen to want to put the same fixed value (right-hand side of the "=") into both columns. So in that case, we can pass a list to the 2nd "..." of "df.loc[...,...]"
(If we wanted different things in each column - say, "zzz" in NewColumn1 and capital-"ZZZ" in NewColumn2, we would have to do it on separate lines.)
This whole "write a different command on a new line for each different way you want to edit your data" approach isn't actually too different from how you probably do things in Excel.
- You would normally add "Filtering" to the CSV file, expose/hide various rows, copy/paste a value into a certain column for all exposed rows, then re-filter the spreadsheet some other way, do it again, etc.
- The difference is that with Python & Pandas, once you've written the code, it can run much faster than you can click your way around Excel.
Editing Data: Putting Complicated Values Into The Cells
In Excel, when you're "filtering and copying/pasting," you aren't always using a fixed value. Oftentimes, you use a formula to compute the value that should go into each cell.
You can do this in Pandas, too, with the ".apply(lambda ...)" operation.
".apply(lambda x : CodeThatMakesAValueGoesHere)" is something that can go onto the end of a Series (or code that generates a Series).
It generates a new Series with the same number of items as the Series it's been tacked onto.
To build the contents of that new Series, ".apply()" looks at each member of the old Series. While it's looking at any given member, it decides to call that member "x."
Based on the contents of "CodeThatMakesAValueGoesHere," it decides (perhaps based on that member's value) what should be in the corresponding position for the new Series instead of "x."
Here's a really simple example that isn't very useful:
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---') print(SeriesShowingWhichRowsHaveLastNameS) SeriesWithZZZ = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'zzz') print('---Contents of "SeriesWithZZZ":---') print(SeriesWithZZZ)
Output text:
---Contents of "SeriesShowingWhichRowsHaveLastNameS":--- 0 False 1 False 2 False 3 False 4 True 5 True 6 False Name: Last, dtype: bool ---Contents of "SeriesWithZZZ":--- 0 zzz 1 zzz 2 zzz 3 zzz 4 zzz 5 zzz 6 zzz Name: Last, dtype: object
You can see that even though we had all sorts of nice True/False values in "SeriesShowingWhichRowsHaveLastNameS", "SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'zzz')" generated a Series that just blasted over all of them and put "zzz" in every row.
---
Let's replace our "'zzz'", which served as our contents of "CodeThatMakesAValueGoesHere" in our last example, with something fancier.
This is a special kind of if-then-else code in Python that can be inserted into your code wherever you would normally just type a value like "'zzz'".
Instead of "'zzz'", we'll say "'YesLastS' if x==True else 'NotLastS'".
We can use "x" in this code, because that's how "lambda x: CodeThatMakesAValueGoesHere" works. You can refer to "x" in your "CodeThatMakesAValueGoesHere."
(You can also give "x" a cooler name - just remember to refer to it by the same name on both sides of the colon.)
Our "'YesLastS' if x==True else 'NotLastS'" is a little hard to read, but all it means is, "If the value of whatever Pandas currently has in "X" is true, then generate a value of "YesLastS." Otherwise, generate a value of "NotLastS."
Let's see what our new code does.
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---') print(SeriesShowingWhichRowsHaveLastNameS) SeriesWithYesNoLastNameS = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS') print('---Contents of "SeriesWithYesNoLastNameS":---') print(SeriesWithYesNoLastNameS)
Output text:
---Contents of "SeriesShowingWhichRowsHaveLastNameS":--- 0 False 1 False 2 False 3 False 4 True 5 True 6 False Name: Last, dtype: bool ---Contents of "SeriesWithYesNoLastNameS":--- 0 NotLastS 1 NotLastS 2 NotLastS 3 NotLastS 4 YesLastS 5 YesLastS 6 NotLastS Name: Last, dtype: object
Great! We have now generated a Series (an ordinary data-containing one, not a "true/false"-containing one) where position-numbers corresponding to records from "df" whose last name starts with "S" say "YesLastS" and other records say "NotLastS".
The code that did this was "SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS')"
"SeriesShowingWhichRowsHaveLastNameS" is a "true/false"-containing Series, and then we asked ".apply(lambda...)" to go through that series, one by one, and generate a new series depending on what was found in each position of "SeriesShowingWhichRowsHaveLastNameS."
---
- Q: Why did we need such a fancy Series?
- A: Because a "Series" can go on the right-hand side of an "=".
Watch. This is a long example, so read all the code and look at all the output carefully. It combines everything we've learned in this post.
Code:
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') df = df.drop(['Id','Email','Company'], axis=1) print('---Contents of "df" after dropping columns:---') print(df) SeriesShowingWhichRowsHaveLastNameS = df['Last'].str.startswith('S') # True/False-typed series SeriesShowingWhichRowsHaveFirstNameA = df['First'].str.startswith('A') # True/False-typed series SeriesWithYesNoLastNameS = SeriesShowingWhichRowsHaveLastNameS.apply(lambda x: 'YesLastS' if x==True else 'NotLastS') # Series with data in it MiscSeries=pandas.Series(['aaa','bbb','ccc','ddd','eee','fff','ggg'], index=[0,1,2,3,4,5,6], name='AnyOldName') print('---Contents of "SeriesShowingWhichRowsHaveLastNameS":---') print(SeriesShowingWhichRowsHaveLastNameS) print('---Contents of "SeriesShowingWhichRowsHaveFirstNameA":---') print(SeriesShowingWhichRowsHaveFirstNameA) print('---Contents of "SeriesWithYesNoLastNameS":---') print(SeriesWithYesNoLastNameS) print('---Contents of "MiscSeries"---') print(MiscSeries) df['NewCol1'] = SeriesWithYesNoLastNameS # Add "NewCol1" & set all rows to right-hand side print('---Contents of "df" after adding "NewCol1":---') print(df) df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol2'] = SeriesWithYesNoLastNameS # Add "NewCol2" & set only certain rows to right-hand side print('---Contents of "df" after adding "NewCol2":---') print(df) df['NewCol3'] = MiscSeries print('---Contents of "df" after adding "NewCol3":---') print(df) df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol4'] = MiscSeries print('---Contents of "df" after adding "NewCol4":---') print(df)
Output text:
---Contents of "df" after dropping columns:--- First Last 0 Jimmy Buffet 1 Shirley Chisholm 2 Marilyn Monroe 3 Cesar Chavez 4 Vandana Shiva 5 Andrea Smith 6 Albert Howard ---Contents of "SeriesShowingWhichRowsHaveLastNameS":--- 0 False 1 False 2 False 3 False 4 True 5 True 6 False Name: Last, dtype: bool ---Contents of "SeriesShowingWhichRowsHaveFirstNameA":--- 0 False 1 False 2 False 3 False 4 False 5 True 6 True Name: First, dtype: bool ---Contents of "SeriesWithYesNoLastNameS":--- 0 NotLastS 1 NotLastS 2 NotLastS 3 NotLastS 4 YesLastS 5 YesLastS 6 NotLastS Name: Last, dtype: object ---Contents of "MiscSeries"--- 0 aaa 1 bbb 2 ccc 3 ddd 4 eee 5 fff 6 ggg Name: AnyOldName, dtype: object ---Contents of "df" after adding "NewCol1":--- First Last NewCol1 0 Jimmy Buffet NotLastS 1 Shirley Chisholm NotLastS 2 Marilyn Monroe NotLastS 3 Cesar Chavez NotLastS 4 Vandana Shiva YesLastS 5 Andrea Smith YesLastS 6 Albert Howard NotLastS ---Contents of "df" after adding "NewCol2":--- First Last NewCol1 NewCol2 0 Jimmy Buffet NotLastS NaN 1 Shirley Chisholm NotLastS NaN 2 Marilyn Monroe NotLastS NaN 3 Cesar Chavez NotLastS NaN 4 Vandana Shiva YesLastS NaN 5 Andrea Smith YesLastS YesLastS 6 Albert Howard NotLastS NotLastS ---Contents of "df" after adding "NewCol3":--- First Last NewCol1 NewCol2 NewCol3 0 Jimmy Buffet NotLastS NaN aaa 1 Shirley Chisholm NotLastS NaN bbb 2 Marilyn Monroe NotLastS NaN ccc 3 Cesar Chavez NotLastS NaN ddd 4 Vandana Shiva YesLastS NaN eee 5 Andrea Smith YesLastS YesLastS fff 6 Albert Howard NotLastS NotLastS ggg ---Contents of "df" after adding "NewCol4":--- First Last NewCol1 NewCol2 NewCol3 NewCol4 0 Jimmy Buffet NotLastS NaN aaa NaN 1 Shirley Chisholm NotLastS NaN bbb NaN 2 Marilyn Monroe NotLastS NaN ccc NaN 3 Cesar Chavez NotLastS NaN ddd NaN 4 Vandana Shiva YesLastS NaN eee NaN 5 Andrea Smith YesLastS YesLastS fff fff 6 Albert Howard NotLastS NotLastS ggg ggg
Both our line that adds a column called "NewCol1" and our line that adds a column called "NewCol2" show different values for "Andrea Smith" (whose last name begins with "S") and "Albert Howard" (whose last name doesn't).
The difference is how we added the new lines.
- With our simple "df['NewCol1'] = ..." command, we said, "add whatever is on the right-hand side of the '=' to every row in the new column."
-
Since "whatever was on the right-hand side" was a "Series" instead of a simple value like "'zzz'", "the values from positions 0-6" got filled in for "rows 0-6" of the new column, respectively.
(You can also see this in action with the code that adds "NewCol3" based on the contents of "MiscSeries").- Q: "But wait...what if the Series on the right-hand side had had a different number of values than there are rows in the DataFrame "df"?
- A: Python would complain or you'd get unexpected data back. Don't.
-
Since "whatever was on the right-hand side" was a "Series" instead of a simple value like "'zzz'", "the values from positions 0-6" got filled in for "rows 0-6" of the new column, respectively.
-
On the other hand, with our fancy "df.loc[SeriesShowingWhichRowsHaveFirstNameA,'NewCol2'] = ..." command, we said:
"Add whatever is on the right-hand side of the '=' to the rows in the new column where the 'df' DataFrame's first name starts with 'A'. And in the rows that don't, just leave the cells blank."
(You can also see this in action in with the code that adds "NewCol4" based on the contents of "MiscSeries" AND based on "SeriesShowingWhichRowsHaveFirstNameA".)
That's a wrap for this post's "fancy filtering and editing."
We are probably ready to move on to "combining two CSV files" - stay tuned.
Table of Contents
- Introduction & Table of Contents
- Before You Begin Programming
- Installing Python 3 on Windows
- First Scripts - Import/Export CSVs, Filter Out Rows
- Rename/Drop/Add Columns
- Recognizing Pandas DataFrames vs. Series (Important!)
- Fancier Row-Filtering and Data-Editing
- Combining Multiple Tables
- Filtering Rows By Maximum Date Per Group
- Introduction to XML and JSON
No comments:
Post a Comment