
Thursday, August 25, 2016

Python for Salesforce Administrators - Recognizing Pandas DataFrames vs. Series

In our last post, we covered a few code examples that let you manipulate the contents of a ".CSV" (spreadsheet-like plain-text) file by renaming columns, dropping unwanted columns, and adding new columns.

Before we move on into any fancier filtering and modification tricks, we need to talk about the difference between a "DataFrame" and a "Series" in Python's Pandas module.


DataFrame vs. Series: What They Are

  • A "DataFrame" is 2-dimensional - it has both rows & columns. Think of it like an Excel spreadsheet. Its rows are numbered by Pandas behind the scenes. ("Numbered" - for our purposes. It's more complicated than this, but don't worry about it.)
  • A "Series" is 1-dimensional. The concepts of "row" & "column" don't even make sense for it - it's just a list, although the items in the list are numbered by Pandas behind the scenes. ("Numbered" - for our purposes. It's more complicated than this, but don't worry about it.)
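
If you'd like to see the "2-dimensional vs. 1-dimensional" idea as something Pandas itself reports, here is a minimal sketch. The tiny two-row table and the nicknames "table" and "just_a_list" are made up for illustration, so this runs without any CSV file.

```python
import pandas

# Hypothetical in-memory data, so no CSV file is needed for this sketch.
table = pandas.DataFrame({'First': ['Jimmy', 'Shirley'],
                          'Last': ['Buffet', 'Chisholm']})
just_a_list = pandas.Series(['Buffet', 'Chisholm'])

print(table.ndim)         # 2 -> a DataFrame has rows AND columns
print(table.shape)        # (2, 2) -> 2 rows, 2 columns
print(just_a_list.ndim)   # 1 -> a Series is just a numbered list
print(just_a_list.shape)  # (2,) -> 2 items, no second dimension
```

The ".ndim" and ".shape" attributes exist on both kinds of data, which makes them a handy sanity check.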

Why Do We Care?

Sometimes, they can be used in code interchangeably. For example, you can put either inside a "print()" command and your software for running Python programs will figure out what to do.

Other times, they can't.

Because they can't, it's time to learn to recognize when you're dealing with a "DataFrame" and when you're dealing with a "Series."


How To Recognize A DataFrame vs. A Series

When in doubt, print it out.

Take a look at the following examples.

(Prep work: First, make sure you've created a ".CSV" file like the one described in "First Scripts". I called mine "sample1.csv". As in that post, in each example, we'll overwrite the contents of a piece of Python code saved to our hard drive as "script.py" and run the entire script.)

Memorize These Clues: What A DataFrame Looks Like:

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI = df
print(WhatAmI)

Generates this text output:

      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science

Lines 1 & 2 are our classic "get our CSV file into Pandas, and give it a nickname of 'df'" (make sure you've made the CSV file from our first exercises).

In line 3, we create a new nickname called "WhatAmI", and we store a reference to the current contents of "df" in it.
(Trivia - in this case, we're literally talking about 2 nicknames for the same "DataFrame." Data-changing operations against either will impact them both.)
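
(A side note on that trivia, if you want to see it for yourself: the sketch below uses a made-up 2-row table read from text instead of our CSV file, and the "Status" column and "IndependentCopy" nickname are hypothetical. It is only meant to show the "2 nicknames, 1 DataFrame" idea, not something we need for the rest of this post.)

```python
import io
import pandas

# Hypothetical stand-in for the CSV file, so the sketch runs anywhere.
csv_text = "Id,First,Last\n5829,Jimmy,Buffet\n2894,Shirley,Chisholm\n"
df = pandas.read_csv(io.StringIO(csv_text))

WhatAmI = df                 # a second nickname for the SAME DataFrame
IndependentCopy = df.copy()  # a separate copy with its own data

WhatAmI['Status'] = 'New'    # add a column by way of the second nickname

print(WhatAmI is df)                        # True: same object, two nicknames
print('Status' in df.columns)               # True: "df" sees the new column too
print('Status' in IndependentCopy.columns)  # False: the copy was not touched
```

If you ever want a nickname that can be changed without affecting "df", ".copy()" is one way to get it.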

In line 4, we print out the contents of "WhatAmI" so we can see what kind of data it is - "DataFrame" or "Series."

There are three clues here that "WhatAmI" is a "DataFrame":

  1. There are column labels with nice neat familiar names above every column of data (everywhere except above the "row-numbers").
  2. There is nothing displayed on our screen immediately after the line labeled "6."
  3. We have more than 1 column's worth of data
    (some DataFrames have only 1 column, but only DataFrames can have more than 1 column)

Memorize These Clues: What A Series Looks Like:

This code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI = df['Last']
print(WhatAmI)

Generates this text output:

0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object

The code is the same as in the previous example, except that in line 3, we store "the value of the column 'Last' from the DataFrame 'df'" into "WhatAmI."

There are three clues here that "WhatAmI" is a "Series":

  1. Besides the row numbers, there is only 1 vertical line's worth of data.
  2. There are no column labels anymore.
  3. On our screen immediately after the line labeled "6," there is a line of text announcing that the thing we're looking at has a Name of "Last" and what type of data Pandas has decided is contained within the thing we're looking at.

Because of the row-numbers, it's easy to fail to notice that this is just a "position-numbered list," not a "row-numbered table," but that's how we need to think of it when we start putting complicated commands together.


Got that?

Review the clues again and make sure you see how they apply to the output-text examples.

On your screen, the "Series" looks like a 2-dimensional table because of the position-numbers.

The important things to remember are that:

  1. A "Series" is not actually 2-dimensional ... it's just a position-numbered list
  2. A "Series" is fundamentally different from a "DataFrame" as far as being able to use it in "Pandas" code is concerned
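
If the visual clues ever leave you unsure, you can also ask Python directly what kind of thing a nickname points to. This is a minimal sketch that swaps our CSV file for the same names typed in as text (I left out the "Email" and "Company" columns to keep it short).

```python
import io
import pandas

# Stand-in for sample1.csv (only 3 of its columns), so no file is needed.
csv_text = """Id,First,Last
5829,Jimmy,Buffet
2894,Shirley,Chisholm
294,Marilyn,Monroe
30829,Cesar,Chavez
827,Vandana,Shiva
9284,Andrea,Smith
724,Albert,Howard
"""
df = pandas.read_csv(io.StringIO(csv_text))

print(type(df).__name__)          # DataFrame
print(type(df['Last']).__name__)  # Series

# isinstance() answers the same question with True/False.
print(isinstance(df, pandas.DataFrame))      # True
print(isinstance(df['Last'], pandas.Series)) # True
```

Printing is still worth doing, because it shows you the data too, but "type()" settles any argument.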

More Code Examples: Series vs. DataFrame

Let's look at another example of a "Series." You've seen the value that we're putting into this "WhatAmI" before (line 3, to the right of the "=") - it was part of our very first exercises.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI = df['Last'].str.startswith('S')
print(WhatAmI)

Output text:

0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: Last, dtype: bool

Again, we now know that "df['Last'].str.startswith('S')" produced a "Series" because there's just 1 vertical line worth of data, because there are no column-labels, and because at the end of the output, we have a little "Name" & "Type" announcement.
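
That little announcement at the end of the output is not just decoration: a "Series" carries its name and its data type around with it, and you can ask for them. Here is a minimal sketch, again with the names typed in as text rather than read from our CSV file; the nickname "TrueFalseList" is just for illustration.

```python
import io
import pandas

# Stand-in for sample1.csv (only 3 of its columns), so no file is needed.
csv_text = """Id,First,Last
5829,Jimmy,Buffet
2894,Shirley,Chisholm
294,Marilyn,Monroe
30829,Cesar,Chavez
827,Vandana,Shiva
9284,Andrea,Smith
724,Albert,Howard
"""
df = pandas.read_csv(io.StringIO(csv_text))

TrueFalseList = df['Last'].str.startswith('S')

print(TrueFalseList.name)      # Last
print(TrueFalseList.dtype)     # bool
print(TrueFalseList.tolist())  # [False, False, False, False, True, True, False]
```

The ".tolist()" at the end strips away the position-numbers and shows the plain list hiding inside the "Series."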

---

Two examples ago, when we looked at "df['Last']", we saw that it produced a "Series."

What if we put something besides "'Last'" into those "[]" brackets after the reference to our main DataFrame "df"?

In our last example, we were just playing with "df['Last'].str.startswith('S')". We know it produces a "Series."

Some questions to explore:

  • Will our code run if we put something that produces a "Series" (like "df['Last'].str.startswith('S')") inside the brackets of "df[]"?
  • What type of data will the result be? "Series" or "DataFrame"?

Let's find out.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WeKnowThisProducesASeries = df['Last'].str.startswith('S')
WhatAmI = df[WeKnowThisProducesASeries] # (Note the lack of single-quotes inside the [] - we are now referring to our "nickname" for some data we stored, not telling Pandas how to spell the name of a column.)
print(WhatAmI)

Output text:

     Id    First   Last           Email                   Company
4   827  Vandana  Shiva  vs@example.com                  Navdanya
5  9284   Andrea  Smith  as@example.com  University of California

Nice column-labels, nothing after the end of our data talking about "Name" & "Type"...this is a "DataFrame"!

(For readability, I broke "line 3" into two parts, saving our "Series" from the previous example with a new nickname and then referring to that nickname on the next line.
A single "line 3" of "WhatAmI = df[df['Last'].str.startswith('S')]" would have been equivalent and produced the same results. Try it!)
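
If you'd rather have Pandas confirm that the two spellings are equivalent, here is a minimal sketch. It uses the names typed in as text instead of our CSV file, and the nicknames "TwoStep" and "OneStep" are made up for this comparison.

```python
import io
import pandas

# Stand-in for sample1.csv (only 3 of its columns), so no file is needed.
csv_text = """Id,First,Last
5829,Jimmy,Buffet
2894,Shirley,Chisholm
294,Marilyn,Monroe
30829,Cesar,Chavez
827,Vandana,Shiva
9284,Andrea,Smith
724,Albert,Howard
"""
df = pandas.read_csv(io.StringIO(csv_text))

# Two-step spelling: save the True/False Series first, then use its nickname.
WeKnowThisProducesASeries = df['Last'].str.startswith('S')
TwoStep = df[WeKnowThisProducesASeries]

# One-step spelling: put the Series-producing code straight into df[].
OneStep = df[df['Last'].str.startswith('S')]

print(TwoStep.equals(OneStep))  # True: same rows, same columns, same values
print(len(OneStep))             # 2: only the "Shiva" and "Smith" rows survive
```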

---

Let's look at one more example - I want to show you how subtle the differences in the way you spell Pandas commands can be.

Code:

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
WhatAmI1 = df['Last']
WhatAmI2 = df[['Last']]
WhatAmI3 = df[['First','Last']]
print('---WhatAmI1 contents (a 7-item Series):---')
print(WhatAmI1)
print('---WhatAmI2 contents (a 7-row, 1-column DataFrame):---')
print(WhatAmI2)
print('---WhatAmI3 contents (a 7-row, 2-column DataFrame):---')
print(WhatAmI3)

Output text:

---WhatAmI1 contents (a 7-item Series):---
0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object
---WhatAmI2 contents (a 7-row, 1-column DataFrame):---
       Last
0    Buffet
1  Chisholm
2    Monroe
3    Chavez
4     Shiva
5     Smith
6    Howard
---WhatAmI3 contents (a 7-row, 2-column DataFrame):---
     First      Last
0    Jimmy    Buffet
1  Shirley  Chisholm
2  Marilyn    Monroe
3    Cesar    Chavez
4  Vandana     Shiva
5   Andrea     Smith
6   Albert    Howard

This example has 3 "WhatAmI"s to examine. Note how subtle the differences are between #1 & #2.

  • In #1, you put "'Last'" between "df[]" brackets and get a Series.
  • In #2, you put "['Last']" between "df[]" brackets and get a DataFrame.

Whoa! Why does Pandas do this?

"WhatAmI" #3 gives us a better view into what's going on with the "double-square-bracket" style of coding.

In Python, "[something, something, something]" is the basic syntax for a list of something-or-other.

In these code examples, the outer [] is attached to a reference to our main DataFrame "df" and gets interpreted as, "do whatever Pandas judges appropriate based on what's inside the 'df[]' brackets."

The inner "[]" brackets, however, are part of "what's inside" the outer brackets. These inner "[]" brackets specify that "what's inside" is a list of column-names, not just a single column-name.

In #2, the "list of column names" just so happens to have exactly 1 thing in it.

But in #3, we see that Pandas has decided that whenever "what's inside df[]" is a list of column-names, it would be a good idea to generate a DataFrame with the names from that list as column-headers and the corresponding data from "df."

If you think about it, it makes sense that putting a list of column names inside "df[]" would make Pandas generate a DataFrame as output instead of a Series:

  • Series can't have "multiple columns." Only "DataFrames" can.
  • How else would it make sense for Pandas to interpret you saying that you want to deal with "multiple columns from" our main DataFrame "df"?
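
One way to make the "inner brackets are just a list" idea less mysterious is to give the list its own nickname first. This is a minimal sketch with the names typed in as text instead of read from our CSV file; "ColumnsIWant" and "JustOneColumn" are made-up nicknames.

```python
import io
import pandas

# Stand-in for sample1.csv (only 3 of its columns), so no file is needed.
csv_text = """Id,First,Last
5829,Jimmy,Buffet
2894,Shirley,Chisholm
294,Marilyn,Monroe
30829,Cesar,Chavez
827,Vandana,Shiva
9284,Andrea,Smith
724,Albert,Howard
"""
df = pandas.read_csv(io.StringIO(csv_text))

ColumnsIWant = ['First', 'Last']  # an ordinary Python list of column-names
JustOneColumn = ['Last']          # still a list, it just has 1 thing in it

print(df[ColumnsIWant].shape)   # (7, 2) -> DataFrame: 7 rows, 2 columns
print(df[JustOneColumn].shape)  # (7, 1) -> DataFrame: 7 rows, 1 column
print(df['Last'].shape)         # (7,)   -> Series: 7 items, no columns at all
```

"df[ColumnsIWant]" and "df[['First','Last']]" mean the same thing to Pandas. The double brackets only appear when you type the list in place.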

Observations On "YourDataFrameNicknameHere[]":

Note that we have now seen a total of three ways that Pandas interprets the "[]" command after a reference to a DataFrame like "df," depending on what we put inside it.

  1. We can put 'ColumnNameHere' inside it.
    We'll get back a "Series" showing us the contents of "ColumnNameHere" from the "DataFrame" we nicknamed "df."
     
  2. We can put code that generates a "Series" full of True/False values inside it.
    We'll get back a "DataFrame" showing data from all of "df"'s columns, but only its rows where the "Series" values were "True."
     
  3. We can put ['ColumnName1Here','ColumnName2Here'] inside it.
    We'll get back a "DataFrame" showing data from all of "df"'s rows, but only from "df"'s columns named in the list.

If you don't yet understand all this, or still have trouble recognizing "DataFrames" vs. "Series," please review all the examples and, if needed, ask for help in the comments.


Bigger Picture: Why Do We Care?

One day, you'll want to do something that isn't in these examples.

You'll find something close on Google or StackOverflow, but you won't know exactly how it works.

If you can learn to isolate and dissect chunks of code, put them into "print()" statements, and figure out what kinds of data they produce, then you are better-equipped to figure out what kinds of data the code around them requires.

This will make it easier to "swap out chunks of code" from the example to suit your own needs.
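
As a rough idea of what that dissecting can look like, here is a minimal sketch. The helper "what_am_i" is something I made up for this illustration (it is not part of Pandas), and the data is typed in as text rather than read from our CSV file.

```python
import io
import pandas

def what_am_i(thing):
    """Hypothetical helper: describe what kind of data 'thing' is."""
    if isinstance(thing, pandas.DataFrame):
        rows, columns = thing.shape
        return f'DataFrame with {rows} row(s) and {columns} column(s)'
    if isinstance(thing, pandas.Series):
        return f'Series with {len(thing)} item(s)'
    return type(thing).__name__  # anything that is not Pandas data

# Stand-in for sample1.csv (only 3 of its columns), so no file is needed.
csv_text = """Id,First,Last
5829,Jimmy,Buffet
2894,Shirley,Chisholm
294,Marilyn,Monroe
30829,Cesar,Chavez
827,Vandana,Shiva
9284,Andrea,Smith
724,Albert,Howard
"""
df = pandas.read_csv(io.StringIO(csv_text))

# Dissect "df[df['Last'].str.startswith('S')]" from the inside out.
print(what_am_i(df['Last']))                          # Series with 7 item(s)
print(what_am_i(df['Last'].str.startswith('S')))      # Series with 7 item(s)
print(what_am_i(df[df['Last'].str.startswith('S')]))  # DataFrame with 2 row(s) and 3 column(s)
print(what_am_i(['First', 'Last']))                   # list
```

Working from the innermost piece outward like this tells you what each layer of a borrowed one-liner hands to the next.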

Even I have to do this.

Much of what I'm teaching you in this and future blog posts came from studying this StackOverflow post's answers until I fully understood what everything in the answers does.

I had to write and run at least 40 lines of code, none of which were exactly in the examples of that StackOverflow post, before I was ready to write this blog post.

Not every command I played with even does anything useful. I just wanted to see what would come out the other side so I'd be better-equipped to write example code for you.

We'll be back to "fancy tricks against table-style data" in the next post!


