Pages

Monday, August 22, 2016

Python for Salesforce Administrators - First Scripts - Import/Export CSVs, Filter Out Rows

Let's write a few scripts!

By the time you complete all of these exercises, you will be able to filter out unwanted rows from a CSV file and save the "good" rows as a new CSV file.

This can be very handy if you have a CSV file with so many rows that Excel freezes for 30 seconds every time you try to open or filter it.


Program 1: Displaying Text

Our first program is just one line long - I saved it on my hard drive as "c:\tempexamples\script.py" and am running it through the "play"-looking button in my "Spyder" IDE, where I am also editing it.
(Click here for playable sample...do not type your own company's data into this link!)

print('Hello World')

The text output, when I run it, looks like this:

Hello World

Preparing A CSV For The Next Exercises

Now I'm going to open Notepad and make myself a CSV file (when I do File->Save As, change the "Save as File Type" from .TXT to "All Files (*.*)") and save the file as "c:\tempexamples\sample1.csv".

Here's what that file contains (representing 5 columns worth of fields & 7 rows worth of data):

"Id","First","Last","Email","Company"
"5829","Jimmy","Buffet","jb@example.com","RCA"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"294","Marilyn","Monroe","mm@example.com","Fox"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"827","Vandana","Shiva","vs@example.com","Navdanya"
"9284","Andrea","Smith","as@example.com","University of California"
"724","Albert","Howard","ah@example.com","Imperial College of Science"

Program 2: Reading a CSV

Now I'm going to replace the contents of "script.py" with the following 6 lines and run them:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---Here are all 7 lines---')
print(df)
print('---Here are the first 5 lines---')
print(df.head())

The text output is:

---Here are all 7 lines---
      Id    First      Last           Email                      Company
0   5829    Jimmy    Buffet  jb@example.com                          RCA
1   2894  Shirley  Chisholm  sc@example.com       United States Congress
2    294  Marilyn    Monroe  mm@example.com                          Fox
3  30829    Cesar    Chavez  cc@example.com          United Farm Workers
4    827  Vandana     Shiva  vs@example.com                     Navdanya
5   9284   Andrea     Smith  as@example.com     University of California
6    724   Albert    Howard  ah@example.com  Imperial College of Science
---Here are the first 5 lines---
      Id    First      Last           Email                 Company
0   5829    Jimmy    Buffet  jb@example.com                     RCA
1   2894  Shirley  Chisholm  sc@example.com  United States Congress
2    294  Marilyn    Monroe  mm@example.com                     Fox
3  30829    Cesar    Chavez  cc@example.com     United Farm Workers
4    827  Vandana     Shiva  vs@example.com                Navdanya

Let's talk about this program a little bit.

The first thing we did was declare that we would need to be able to write code using commands from the "pandas" module.

Next, we asked the Pandas module to execute its "read_csv()" operation against a CSV file on our hard drive.

We had to put double-backslashes in place because in Python, inside of a "string" (the piece of code between '' or "" punctuation representing "plain text, not code"), a single backslash means "the following character would normally be code, even inside a string, but it's just what it looks like in this case." To say, "I really do mean just use a backslash," therefore, we need to say "\\".

In this same line, we also save the output from Pandas "reading the CSV" to a variable I decided to call "df."

It doesn't have to be called "df." It could be called "myvariable."

But because the output from a "pandas.read_csv()" operation is, to Python, a special type of data store called a "DataFrame," when you have nothing better to call the variable you're saving it to, it's convention to call that variable "df."

(Note - you might also see code examples with something like "pd.read_csv()". When you import a module, you have an option to give it a different - e.g. shorter - name. Some examples online presume you hate typing and did your first line as "import pandas as pd" so you could type "pd" instead of "pandas". I don't find 6 characters too much to type, so my examples don't follow that convention.)

Our third line of code just put a caption on the screen.

Our fourth line of code said, "show me the contents of the variable "df" as output on my screen."

The spreadsheet-looking formatting and the addition of line numbers has to do with the sophistication under the covers of the way "DataFrame"s are programmed to behave (thanks, authors of the "pandas" module!)

Our fifth line of code just put a caption on the screen.

Our sixth line of code said, "show me the contents of the first five lines of the variable "df" as output on my screen."

This ".head()" operation that can be run from any "DataFrame" is handy when you have, say, a 60,000-line "DataFrame," you know all rows look similar, and all you want to do is check to see what it looks like.


Program 3: Filtering a CSV

Let's write another program.

Now I want to find out which rows of my CSV file include someone whose last name starts with a capital "c." I'm going to replace the contents of "script.py" with the following code and run it:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---What is in "Last" for each row?---')
print(df['Last'])
print('---For each row, does "Last" start with capital "C"?---')
print(df['Last'].str.startswith('C'))
print('---Show all columns, but only rows where "Last" starts with capital "C"---')
print(df[df['Last'].str.startswith('C')])

The output looks like this:

---What is in "Last" for each row?---
0      Buffet
1    Chisholm
2      Monroe
3      Chavez
4       Shiva
5       Smith
6      Howard
Name: Last, dtype: object
---For each row, does "Last" start with capital "C"?---
0    False
1     True
2    False
3     True
4    False
5    False
6    False
Name: Last, dtype: bool
---Show all columns, but only rows where "Last" starts with capital "C"---
      Id    First      Last           Email                 Company
1   2894  Shirley  Chisholm  sc@example.com  United States Congress
3  30829    Cesar    Chavez  cc@example.com     United Farm Workers

The first two lines of this example do the same thing as in our last example - load up the "pandas" module and save our CSV to a special data structure called a "DataFrame" and give that "DataFrame" a nickname of "df" for further reference in later code.

Lines 3, 5, & 7 just print labels so we can see where we are in the text output.

The call to "df['Last']" in line 4 asks our DataFrame to analyze its own "Last" column and give us back another special data structure called a "Series." We don't bother to save that "series" to a variable with a name - we just pass it to the "print()" command, and Pandas does a great job formatting it on our screen.

(Note that the header is missing - we typed in "'Last'" by hand, so in theory, we know it, and also that after we see all 7 rows (0-6, not 1-7!), we get a little bit of data about the series on the next line.

Line 6 is a copy of line 4, only we took advantage of the fact that "Series" data-structures have an operation called ".str.startswith()" that transforms every value into a True/False answering the question, "Does this value start with this?"

The text-output is formatted similarly to the output of line 4, although in the note at the end, it points out that our values are all of type "Boolean," which means true/false.

Line 8 gets a little crazy. Think of this like arithmetic, where you read parentheses inside out. Notice that the inside of line 8 is just the part of line 6 from inside the "print()", only now with a "df[]" surrounding it.

What this does is ask the DataFrame we nicknamed "df" to compare itself, row by row, to the "series" we generated with code from line 6, throw away the row if the corresponding (by line-number) value was "false," and keep the row if the corresponding value was "true."

We then put all that inside a "print()" statement, and we see a 2-line DataFrame (although the original line numbering is preserved) representing everyone from our CSV file with a last name that starts with a capital "C."

---

If you looked carefully, you might have noticed that in line 4, "df[SomethingSomethingSomething]" meant, "show me columns from the DataFrame where...," but in line 8, the outer usage of "df[SomethingSomethingSomething]" meant, "show me rows from the DataFrame where..."

Confusing?

I think so.

But basically, which way Pandas will interpret the command depends on what's inside the square-brackets.

In the first case, "inside the brackets" was a string - the text "'Last'"; in the second case, "inside the brackets" was a reference to a special data structure called a "Series."

In certain cases, Pandas doesn't actually know whether you're trying to filter by row or by column, so you have to add a comma and some extra "parameters" to let it know ... but the example-code you're copying from the internet will include such text if it's necessary.

My advice to Salesforce sysadmins is to simply copy/paste examples from the internet until they work and move on with your life.


Program 4: More Filtering

Let's get a little more complicated - now we're going to say, "either starting with a C or starting with an S" is fine."

Again, I replace the contents of "script.py" with the following code:
(Click here for playable sample...do not type your own company's data into this link!)

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
print('---For each row, does "Last" start with capital "C" or "S"?---')
print(df['Last'].str.startswith('C') | df['Last'].str.startswith('S'))
print('---Show all columns, but only rows where "Last" starts with capital "C" or "S"---')
print(df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')])

And here is the output:

---For each row, does "Last" start with capital "C" or "S"?---
0    False
1     True
2    False
3     True
4     True
5     True
6    False
Name: Last, dtype: bool
---Show all columns, but only rows where "Last" starts with capital "C" or "S"---
      Id    First      Last           Email                   Company
1   2894  Shirley  Chisholm  sc@example.com    United States Congress
3  30829    Cesar    Chavez  cc@example.com       United Farm Workers
4    827  Vandana     Shiva  vs@example.com                  Navdanya
5   9284   Andrea     Smith  as@example.com  University of California

It's a lot like lines 6 & 8 of our last program, only we added " | df['Last'].str.startswith('S')" before the final close-parenthesis.


Program 5: Writing To CSV

It's all very nice to be able to filter a spreadsheet and see your output, but as a Salesforce administrator, you probably need a new CSV file so you can do something else with it and send it back to Salesforce.

Let's take our 4-line output from the last example, strip off those meaningless row numbers, and write it to disk at "C:\tempexamples\filteredoutput.csv".

import pandas
df = pandas.read_csv('C:\\tempexamples\\sample1.csv')
newdf = df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')]
newdf.to_csv('C:\\tempexamples\\filteredoutput.csv', index=False, quoting=1)

Check the folder where you told the ".to_csv()" operation to save the file. You should have a file called "filteredoutput.csv" whose contents look like this (when opened in Notepad - they'll probably look like a normal spreadsheet if you open the file in Excel):

"Id","First","Last","Email","Company"
"2894","Shirley","Chisholm","sc@example.com","United States Congress"
"30829","Cesar","Chavez","cc@example.com","United Farm Workers"
"827","Vandana","Shiva","vs@example.com","Navdanya"
"9284","Andrea","Smith","as@example.com","University of California"

In this 4-line program, again we said we wanted to use the "Pandas" module and then saved a "DataFrame" abstraction of our CSV file into a variable we named "df."

Next, because our manipulations of "df" were getting a little long-winded, we saved a copy of the output of those manipulations (which is also a DataFrame) to a variable we called "newdf."

One of the operations that all DataFrames come with is ".to_csv()". So we invoked that operation from "newdf" and gave it "parameter" values telling that operation where to save the file, not to bother writing the "row numbers" to the file, and to put quotes around each individual piece of data (between the commas).


You Did It

Now you're using Python and Pandas!

From here, your life is largely a matter of Googling other things you want to do and including the word "pandas" in your search. A lot of it will be cryptic, but not all of it is. Also try direct searching in StackOverflow's search box for less-cryptic results.

We'll cover more complicated operations, but this is already a great start.

As I mentioned at the top, you can already put your learning into action if you frequently manipulate CSV files that are annoyingly slow to open in Excel.

For more fun filtering Pandas, if you're used to SQL or SOQL, visit this web page and play around.

Or, if you're comfortable with code, peek at some more advanced operations and see if you can replicate them against your own data.


Table of Contents

No comments:

Post a Comment