Let's write a few scripts!
By the time you complete all of these exercises, you will be able to filter out unwanted rows from a CSV file and save the "good" rows as a new CSV file.
This can be very handy if you have a CSV file with so many rows that Excel freezes for 30 seconds every time you try to open or filter it.
Program 1: Displaying Text
Our first program is just one line long - I saved it on my hard drive as "c:\tempexamples\script.py" and am running it through the "play"-looking button in my "Spyder" IDE, where I am also editing it.
(Click here for playable sample...do not type your own company's data into this link!)
print('Hello World')
The text output, when I run it, looks like this:
Hello World
Preparing A CSV For The Next Exercises
Now I'm going to open Notepad and make myself a CSV file (when I do File->Save As, change the "Save as File Type" from .TXT to "All Files (*.*)") and save the file as "c:\tempexamples\sample1.csv".
Here's what that file contains (representing 5 columns worth of fields & 7 rows worth of data):
"Id","First","Last","Email","Company" "5829","Jimmy","Buffet","jb@example.com","RCA" "2894","Shirley","Chisholm","sc@example.com","United States Congress" "294","Marilyn","Monroe","mm@example.com","Fox" "30829","Cesar","Chavez","cc@example.com","United Farm Workers" "827","Vandana","Shiva","vs@example.com","Navdanya" "9284","Andrea","Smith","as@example.com","University of California" "724","Albert","Howard","ah@example.com","Imperial College of Science"
Program 2: Reading a CSV
Now I'm going to replace the contents of "script.py" with the following 6 lines and run them:
(Click here for playable sample...do not type your own company's data into this link!)
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') print('---Here are all 7 lines---') print(df) print('---Here are the first 5 lines---') print(df.head())
The text output is:
---Here are all 7 lines--- Id First Last Email Company 0 5829 Jimmy Buffet jb@example.com RCA 1 2894 Shirley Chisholm sc@example.com United States Congress 2 294 Marilyn Monroe mm@example.com Fox 3 30829 Cesar Chavez cc@example.com United Farm Workers 4 827 Vandana Shiva vs@example.com Navdanya 5 9284 Andrea Smith as@example.com University of California 6 724 Albert Howard ah@example.com Imperial College of Science ---Here are the first 5 lines--- Id First Last Email Company 0 5829 Jimmy Buffet jb@example.com RCA 1 2894 Shirley Chisholm sc@example.com United States Congress 2 294 Marilyn Monroe mm@example.com Fox 3 30829 Cesar Chavez cc@example.com United Farm Workers 4 827 Vandana Shiva vs@example.com Navdanya
Let's talk about this program a little bit.
The first thing we did was declare that we would need to be able to write code using commands from the "pandas" module.
Next, we asked the Pandas module to execute its "read_csv()" operation against a CSV file on our hard drive.
We had to put double-backslashes in place because in Python, inside of a "string" (the piece of code between '' or "" punctuation representing "plain text, not code"), a single backslash means "the following character would normally be code, even inside a string, but it's just what it looks like in this case." To say, "I really do mean just use a backslash," therefore, we need to say "\\".
In this same line, we also save the output from Pandas "reading the CSV" to a variable I decided to call "df."
It doesn't have to be called "df." It could be called "myvariable."
But because the output from a "pandas.read_csv()" operation is, to Python, a special type of data store called a "DataFrame," when you have nothing better to call the variable you're saving it to, it's convention to call that variable "df."
(Note - you might also see code examples with something like "pd.read_csv()". When you import a module, you have an option to give it a different - e.g. shorter - name. Some examples online presume you hate typing and did your first line as "import pandas as pd" so you could type "pd" instead of "pandas". I don't find 6 characters too much to type, so my examples don't follow that convention.)
Our third line of code just put a caption on the screen.
Our fourth line of code said, "show me the contents of the variable "df" as output on my screen."
The spreadsheet-looking formatting and the addition of line numbers has to do with the sophistication under the covers of the way "DataFrame"s are programmed to behave (thanks, authors of the "pandas" module!)
Our fifth line of code just put a caption on the screen.
Our sixth line of code said, "show me the contents of the first five lines of the variable "df" as output on my screen."
This ".head()" operation that can be run from any "DataFrame" is handy when you have, say, a 60,000-line "DataFrame," you know all rows look similar, and all you want to do is check to see what it looks like.
Program 3: Filtering a CSV
Let's write another program.
Now I want to find out which rows of my CSV file include someone whose last name starts with a capital "c." I'm going to replace the contents of "script.py" with the following code and run it:
(Click here for playable sample...do not type your own company's data into this link!)
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') print('---What is in "Last" for each row?---') print(df['Last']) print('---For each row, does "Last" start with capital "C"?---') print(df['Last'].str.startswith('C')) print('---Show all columns, but only rows where "Last" starts with capital "C"---') print(df[df['Last'].str.startswith('C')])
The output looks like this:
---What is in "Last" for each row?--- 0 Buffet 1 Chisholm 2 Monroe 3 Chavez 4 Shiva 5 Smith 6 Howard Name: Last, dtype: object ---For each row, does "Last" start with capital "C"?--- 0 False 1 True 2 False 3 True 4 False 5 False 6 False Name: Last, dtype: bool ---Show all columns, but only rows where "Last" starts with capital "C"--- Id First Last Email Company 1 2894 Shirley Chisholm sc@example.com United States Congress 3 30829 Cesar Chavez cc@example.com United Farm Workers
The first two lines of this example do the same thing as in our last example - load up the "pandas" module and save our CSV to a special data structure called a "DataFrame" and give that "DataFrame" a nickname of "df" for further reference in later code.
Lines 3, 5, & 7 just print labels so we can see where we are in the text output.
The call to "df['Last']" in line 4 asks our DataFrame to analyze its own "Last" column and give us back another special data structure called a "Series." We don't bother to save that "series" to a variable with a name - we just pass it to the "print()" command, and Pandas does a great job formatting it on our screen.
(Note that the header is missing - we typed in "'Last'" by hand, so in theory, we know it, and also that after we see all 7 rows (0-6, not 1-7!), we get a little bit of data about the series on the next line.
Line 6 is a copy of line 4, only we took advantage of the fact that "Series" data-structures have an operation called ".str.startswith()" that transforms every value into a True/False answering the question, "Does this value start with this?"
The text-output is formatted similarly to the output of line 4, although in the note at the end, it points out that our values are all of type "Boolean," which means true/false.
Line 8 gets a little crazy. Think of this like arithmetic, where you read parentheses inside out. Notice that the inside of line 8 is just the part of line 6 from inside the "print()", only now with a "df[]" surrounding it.
What this does is ask the DataFrame we nicknamed "df" to compare itself, row by row, to the "series" we generated with code from line 6, throw away the row if the corresponding (by line-number) value was "false," and keep the row if the corresponding value was "true."
We then put all that inside a "print()" statement, and we see a 2-line DataFrame (although the original line numbering is preserved) representing everyone from our CSV file with a last name that starts with a capital "C."
---
If you looked carefully, you might have noticed that in line 4, "df[SomethingSomethingSomething]" meant, "show me columns from the DataFrame where...," but in line 8, the outer usage of "df[SomethingSomethingSomething]" meant, "show me rows from the DataFrame where..."
Confusing?
I think so.
But basically, which way Pandas will interpret the command depends on what's inside the square-brackets.
In the first case, "inside the brackets" was a string - the text "'Last'"; in the second case, "inside the brackets" was a reference to a special data structure called a "Series."
In certain cases, Pandas doesn't actually know whether you're trying to filter by row or by column, so you have to add a comma and some extra "parameters" to let it know ... but the example-code you're copying from the internet will include such text if it's necessary.
My advice to Salesforce sysadmins is to simply copy/paste examples from the internet until they work and move on with your life.
Program 4: More Filtering
Let's get a little more complicated - now we're going to say, "either starting with a C or starting with an S" is fine."
Again, I replace the contents of "script.py" with the following code:
(Click here for playable sample...do not type your own company's data into this link!)
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') print('---For each row, does "Last" start with capital "C" or "S"?---') print(df['Last'].str.startswith('C') | df['Last'].str.startswith('S')) print('---Show all columns, but only rows where "Last" starts with capital "C" or "S"---') print(df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')])
And here is the output:
---For each row, does "Last" start with capital "C" or "S"?--- 0 False 1 True 2 False 3 True 4 True 5 True 6 False Name: Last, dtype: bool ---Show all columns, but only rows where "Last" starts with capital "C" or "S"--- Id First Last Email Company 1 2894 Shirley Chisholm sc@example.com United States Congress 3 30829 Cesar Chavez cc@example.com United Farm Workers 4 827 Vandana Shiva vs@example.com Navdanya 5 9284 Andrea Smith as@example.com University of California
It's a lot like lines 6 & 8 of our last program, only we added " | df['Last'].str.startswith('S')" before the final close-parenthesis.
Program 5: Writing To CSV
It's all very nice to be able to filter a spreadsheet and see your output, but as a Salesforce administrator, you probably need a new CSV file so you can do something else with it and send it back to Salesforce.
Let's take our 4-line output from the last example, strip off those meaningless row numbers, and write it to disk at "C:\tempexamples\filteredoutput.csv".
import pandas df = pandas.read_csv('C:\\tempexamples\\sample1.csv') newdf = df[df['Last'].str.startswith('C') | df['Last'].str.startswith('S')] newdf.to_csv('C:\\tempexamples\\filteredoutput.csv', index=False, quoting=1)
Check the folder where you told the ".to_csv()" operation to save the file. You should have a file called "filteredoutput.csv" whose contents look like this (when opened in Notepad - they'll probably look like a normal spreadsheet if you open the file in Excel):
"Id","First","Last","Email","Company" "2894","Shirley","Chisholm","sc@example.com","United States Congress" "30829","Cesar","Chavez","cc@example.com","United Farm Workers" "827","Vandana","Shiva","vs@example.com","Navdanya" "9284","Andrea","Smith","as@example.com","University of California"
In this 4-line program, again we said we wanted to use the "Pandas" module and then saved a "DataFrame" abstraction of our CSV file into a variable we named "df."
Next, because our manipulations of "df" were getting a little long-winded, we saved a copy of the output of those manipulations (which is also a DataFrame) to a variable we called "newdf."
One of the operations that all DataFrames come with is ".to_csv()". So we invoked that operation from "newdf" and gave it "parameter" values telling that operation where to save the file, not to bother writing the "row numbers" to the file, and to put quotes around each individual piece of data (between the commas).
You Did It
Now you're using Python and Pandas!
From here, your life is largely a matter of Googling other things you want to do and including the word "pandas" in your search. A lot of it will be cryptic, but not all of it is. Also try direct searching in StackOverflow's search box for less-cryptic results.
We'll cover more complicated operations, but this is already a great start.
As I mentioned at the top, you can already put your learning into action if you frequently manipulate CSV files that are annoyingly slow to open in Excel.
For more fun filtering Pandas, if you're used to SQL or SOQL, visit this web page and play around.
Or, if you're comfortable with code, peek at some more advanced operations and see if you can replicate them against your own data.
Table of Contents
- Introduction & Table of Contents
- Before You Begin Programming
- Installing Python 3 on Windows
- First Scripts - Import/Export CSVs, Filter Out Rows
- Rename/Drop/Add Columns
- Recognizing Pandas DataFrames vs. Series (Important!)
- Fancier Row-Filtering and Data-Editing
- Combining Multiple Tables
- Filtering Rows By Maximum Date Per Group
- Introduction to XML and JSON
No comments:
Post a Comment