Pages

Tuesday, April 4, 2017

Python for Salesforce Administrators - Introduction to XML and JSON

XML and JSON are like each other, but not like CSV

We've talked about how useful Python can be for processing table-style data stored in "CSV" plain-text files.

The key properties of table-style data are that:

  1. The "table" always has a certain number of columns
  2. Every single row in the table has the exact same set of "columns" (the exact same "keys") as every other row and as the table at large
  3. Every single row in the table is capable of having a value in each of these columns ... a "blank" placeholder still needs to be indicated if it doesn't.
  4. Every single row in the table can only have one value in each of these columns. (If you have only one "First Name" column, no row can have two first names.)
  5. Each conceptual "item" in the data is represented by a "row"
  6. Each conceptual "item" (row) can have no more than 1 level of "key-value" properties (a "column header" being the "key" and a given cell beneath a column header, in a specific row, being the "value").
    Example: There's no such thing as having the notion of a "Name" that breaks down at a lower level into "First Name" and "Last Name."
    Sure, the database you're exporting a "CSV" file from might have an automatically computed / "formula" field called "Name" that is just a space-separated merge of "First Name" and "Last Name." But the exported "CSV" file itself -- a plain-text representation of your data -- will show "Name," "First Name," and "Last Name" as independent "columns" whose values are at an equal "level" to each other.

There are other styles of data that can be stored in plain-text files as well.

The two main problems with table-style data that alternative textual representations of data try to get around are:

  1. Giving each conceptual "item" in the data key-value properties that are "nested" inside each other
  2. Letting each conceptual "item" in the data have "keys" that have nothing to do with the "keys" that other conceptual "items" in the data have

A plain-text file where punctuation indicates the start/end of each conceptual "item" in the data, and where the "keys" (and their values) inside of each conceptual "item" are also indicated by careful use of punctuation, can handle both of these requirements.

The two most common formats today are "XML" and "JSON." Plain-text exports of your current Salesforce configuration are often formatted in either of these styles.

  • In both formats, conceptual "items" can have more conceptual "items" nested inside of them.
  • In both formats, there is a way (in XML, 2) of defining the "keys and their values" possessed by each conceptual "item"
  • In XML, the word for the thing representing a conceptual "item" is an "element."
  • In JSON, the word for the thing representing a conceptual "item" is an "object."
  • Despite the "element" vs. "object" linguistic difference, XML and JSON represent the same type of data (nested data where each conceptual "item" gets to define its own "keys" and specify values for them).

We'll have a lot of examples in this post.

  • To view my XML examples graphically, paste them here and click "Tree View".
  • To view my JSON examples graphically, paste them here and click "Tree View".

XML

The punctuation that XML uses to define the beginning and end of an "element" is a "tagset." It looks like this:

<Person></Person>

As you can see, it's the same word, surrounded by less-than and greater-than signs, with the one indicating the "end" of the element starting with a forward-slash.

Each piece in the greater-than or less-than signs is considered a "tag," hence "tagset" for the notion of including them both (kind of like "parenthesis" versus "a set of parentheses").

The fact that the tagset exists in your text file means that it exists as a conceptual "item" in your data.

It doesn't matter whether or not there's anything typed between the tags (after the first greater-than, before the last less-than). This is now a conceptual item that "exists" in your data, simply because the tagset exists.

If it doesn't have anything between the tags, you can think of it a little like a row full of nothing but commas in a CSV file. It's there ... it's just blank.

(In fact, there's even a fancy shortcut for typing such tagsets: this single tag is equivalent to a tagset with nothing in the middle like the one above -- note the forward-slash before the greater-than sign:)

<Person/>

Note that already, though, our "blank" element has one big thing different about it than a row in a CSV file does: it has a name! Rows don't have names. We'll come back to this, but this is why XML has two ways of indicating an element's "keys and their values." By giving each element a name, XML allows the element itself to be used as a "key" definition for the larger context inside of which the element finds itself.

Here's an example:

<Shirt>
 <Color></Color>
 <Fabric></Fabric>
</Shirt>

There are 3 conceptual "items," or "elements," in this data, each of which has a name.

All 3 can stand alone as "elements" in the grammar of XML. Analogy:

You can write an English sentence that has multiple complete sentences inside of it; to write a sentence with multiple complete sentences inside, simply separate the two with semicolons.

However, the fact that the elements named "color" and "fabric" are nested between the tags of the element named "shirt" means that they are also indicating that this particular shirt has keys named "color" and "fabric" (the values to both of which are currently blank).

The line breaks and tabs aren't necessary in XML (even for saying where "color" stops and "fabric" begins), but they help humans read XML.

Now might be a good time to show you the other way of indicating that a particular shirt has a "color" and "fabric," but that their values are blank:

<Shirt Color="" Fabric=""></Shirt>
Or, in shortcut notation, since there's now nothing inside the "Shirt" tagset:
<Shirt Color="" Fabric=""/>

Note that this isn't always treated EXACTLY the same as our nested-tag example when it comes to programming languages that read XML. Some software might argue that there's more of a "nothingness" in the nested-tags example (it truly doesn't have a color), and there's more of a "value without any letters in it"-ness in the inside-the-Shirt-opening-tag example. Just sometimes, though, and that's often you, the programmer, deciding to make that distinction.

The big difference, though, is that in this case, "color" and "fabric" are not standalone elements.
They are "attributes" of the element named "Shirt".

You can't put more standalone "elements" between the quotes after the "=" of an "attribute." You're done. Only a plain-text value can go there.

You can't do this:

<Shirt Color="" Fabric="<Washable></Washable>"></Shirt>

But you can do this:

<Shirt>
 <Color></Color>
 <Fabric>
  <Washable></Washable>
 </Fabric>
</Shirt>

You also can't give any element more than one "attribute" of the same name, whereas you can nest as many same-named "elements" inside of an element as you like.

You can't do this:

<Shirt Color="" Color="" Fabric=""></Shirt>

But you can do this:

<Shirt>
 <Color></Color>
 <Color></Color>
 <Fabric>
  <Washable></Washable>
 </Fabric>
</Shirt>

Those are the main differences between the two ways XML gives you to define key-value pairs on an element.

  1. Attributes are, by definition, "the end of the line" when it comes to the "key" definitions attached to an element (and can't conflict with each others' names)
  2. The names of elements nested inside an element also serve as "key" definitions for the outer element, but they're "fuzzier" than attributes.
    This is usually considered a good thing.
    This desire for "fuzziness" & "repeatability" & "nested-ness" is one of the two reasons people choose a "nested" data format instead of a table-style format to represent their data.

A word of warning: this is also valid code:

<Shirt Color="" Fabric="">
 <Color></Color>
 <Color></Color>
 <Fabric>
  <Washable></Washable>
 </Fabric>
</Shirt>

A human might look at the shirt above and think it has 3 colors and 2 fabrics. It's probably better to think of it the way the computer thinks of it -- that the shirt above has 1 color attribute, 1 fabric attribute, 2 full-on elements nested within it each named "color," and 1 full-on element nested within it named "fabric."


Now let's give our shirt some "values!"

First of all, it's essential to remember that in a way, all these example' elements "keys" already had values. The values for the "keys" were just blank, or they were other elements**.

**(Think about the examples where an element named "washable" was nested inside of an element named "fabric" which was nested inside of an element named "shirt." The value for the "shirt" element's "fabric" key wasn't exactly blank -- the value for that key was more like: "an element called 'washable.'")

But when I say "values" for keys like "color" or "fabric" or "washable," you're probably thinking about things like the word "blue" or the word "red" or the word "leather" or the word "yes" or the word "no." So let's talk about those.

In XML, any given "element" can have exactly 0 or 1 plain-text "value" (like "leather" or "blue") between the tags that show where its boundaries are.
The only other thing that can go "inside" the element besides its (optional) plain-text "value" is more elements.

Here's a really simple element with a plain-text value:

<Shape>
 Rectangle
</Shape>

It's not common for the "outer-most" elements in an XML-formatted piece of data to have values--especially because valid XML always has just 1 outermost element. (If you don't care, you can just make up a name like "RootElement" for the tagset that holds all the "elements" you actually think of as your data.).

But even sometimes "2nd-outer-most" elements don't have values. Particularly when they represent some sort of abstract real-world object with a lot of complexity that you want to capture, they have 0 values but a lot of elements nested inside them.

Although you could describe a fleet of cars like this:

<RootElement>
 <Car>
  First car's Vehicle Identification Number here
 </Car>
 <Car>
  Second car's Vehicle Identification Number here
 </Car>
</RootElement>

The above code implies that the conceptual "items" that you've given names of "car" truly are their Vehicle Identification Numbers. Yet they're really not, are they? They're heavy chunks of steel taking up space in the real world. There isn't really a plain-text value that captures what they are. Therefore, you won't really see a lot of XML like that. Although the word "Car" is, technically, a "key" to each "element's" "value," in this case, it doesn't quite make sense to give "Car" a "value."

Here's a more realistic way of writing the data, using nested elements to show that each car has 1 "key" of "VIN" and that the "value" for that "VIN" key is filled in on both cars:

<RootElement>
 <Car>
  <VIN>
   First car's Vehicle Identification Number here
  </VIN>
 </Car>
 <Car>
  <VIN>
  Second car's Vehicle Identification Number here
  </VIN>
 </Car>
</RootElement>

Here's another realistic way of expressing the same idea, only using attributes to show each car's key & values:

<RootElement>
 <Car VIN="First car's Vehicle Identification Number here">
 </Car>
 <Car VIN="Second car's Vehicle Identification Number here">
 </Car>
</RootElement>

Or, for short (using attributes):

<RootElement>
 <Car VIN="First car's Vehicle Identification Number here"/>
 <Car VIN="Second car's Vehicle Identification Number here"/>
</RootElement>

In this little example, we're actually just dealing with multiple conceptual "items," each of which has the exact same keys as each other, and which has just 1 value per key, so remember that table-style (CSV) data could've easily represented the same data -- in this case, we've just got a 1-column CSV file:

"VIN"
"First car's Vehicle Identification Number here"
"Second car's Vehicle Identification Number here"

I digress -- but it's good to recognize what's going on in your data, and which types of plain-text files are capable of representing it.

Going back to our shirt example, let's say that our data set includes just 1 shirt, that it's "blue and red and green" and made out of "leather and cotton" and that the leather isn't washable but the cotton is.
Our XML representation of our data might look like this:

<Shirt>
 <Color>
  Blue
 </Color>
 <Color>
  Red
 </Color>
 <Color>
  Green
 </Color>
 <Fabric>
  Leather
  <Washable>
  No
  </Washable>
 </Fabric>
 <Fabric>
  Cotton
  <Washable>
  Yes
  </Washable>
 </Fabric>
</Shirt>

There isn't really a good way to represent that concept of what traits the shirt possesses in a single row of a CSV file, is there? This is where XML and JSON shine!

Read the XML above carefully. What you have is:

  • 1 element with a name of "shirt" that has 0 plain-text "values," but has 5 more elements nested inside of it
  • 3 elements with a name of "color" (nested inside the one named "shirt"), none of which have any elements nested inside of them, but each of which have 1 plain-text value
  • 2 elements with a name of "fabric" (nested inside the one named "shirt"), each of which have exactly 1 plain-text value, and each of which also have 1 more element nested inside of them
  • 2 elements with a name of "washable" (nested inside various ones named "fabric"), none of which have any elements nested inside of them, but each of which have 1 plain-text value

There also isn't really a good way to represent this shirt using "attributes" on the "shirt" tagset (because it has multiple colors and multiple fabrics, and because the fabrics have nested elements of their own). However, since each "fabric" only has exactly 1 "washable" key & value, you could use attributes for that as follows:

<Shirt>
 <Color>
  Blue
 </Color>
 <Color>
  Red
 </Color>
 <Color>
  Green
 </Color>
 <Fabric Washable="No">
  Leather
 </Fabric>
 <Fabric Washable="Yes">
  Cotton
 </Fabric>
</Shirt>

The choice is up to you, depending on which way you think it's easier to fetch/modify the values using code and which way you think it's easier for humans to read.

Another choice that's up to you is whether "Leather" is what the fabric truly is (the way "blue" is an adjective and therefore describes what the color truly is), or whether the notion of a fabric is too fuzzy in the real world to capture in a single word (like with our car) and should've been a key-value pair with a key like "name."
It's the same choice we had to make when deciding whether a car was its VIN or whether it had a VIN.
Outer-ward elements representing complex concepts usually just have key-value pairs (like with our car or our shirt examples).
For elements at "deeper" levels of nesting, you'll need to decide whether they "are" something (the optional 1 plain-text value they get) or whether they merely "have" things (nested elements & attributes).
That's a judgment call for you to make based on how your data is going to be used. All organization involves judgment calls trading flexibility against simplicity.
When it comes to writing software to process XML someone else already wrote, it's good to be able to recognize which judgment call they made (because the programming-language commands for extracting the two styles of writing key-value pairs are different).


JSON

The punctuation that JSON uses to define the beginning and end of an "object" is a set of "curly braces." It looks like this:

{}

  • Q: Hey waiddaminute -- that "object" doesn't have a name! I thought you said JSON "objects" and XML "elements" were pretty equivalent in terms of both being representations of conceptual "items" in the same style of organizing data!
  • A: Similar. But not the same. Good catch. JSON doesn't give objects a name. Nor do they get an optional "plain-text value" standing apart from anything nested inside of them. JSON objects look pretty different from XML elements.

Also, you can't just put JSON objects back-to-back the way you can put XML elements back-to-back; this isn't valid JSON:

{}
{}

Instead, you have to put them inside square-brackets and separate them with commas (remember not to put a comma after the last one, since nothing comes next--easy copy/paste mistake when you're putting each one on its own line). This is how you show 2 JSON objects at the same level as each other:

[
 {},
 {}
]

Note that the line breaks and tabs, however, are still for human benefit only.

Also, you don't have to include them inside any sort of "RootElement" container. That right there is valid JSON.

But getting back to our "JSON objects don't have names" problem ... what's the equivalent of this XML in JSON?

<Person/>

There isn't an exact translation, but one representation could be:

{
 "type": "Person"
}

In other words, you're making up an "attribute" (a "key") for the JSON object, calling it "type," and giving it a "value" of "Person" (yup, JSON objects have attributes, and like XML element attributes, you can only use a given attribute-name once!) You could have called it anything -- "type" is nothing special.

Similarly, either this XML:

<Shirt>
 <Color></Color>
 <Fabric></Fabric>
</Shirt>

Or this XML:

<Shirt Color="" Fabric=""></Shirt>

Might become this JSON:

{
 "type" : "Shirt",
 "Color" : null,
 "Fabric" : null
}

(Though, getting back to that thing I mentioned earlier about whether an empty tagset is somehow "emptier" than an empty attribute set of quotes, you might argue that the values of this JSON object's "Color" & "Fabric" attributes/keys should be two quotes in a row, rather than the special keyword 'null'. More than I want to get into right now, but software than processes JSON would see the two differently -- null is emptier than the empty-quotes.)

The biggest difference from XML that the lack of names in JSON introduces is this notion of having to make up your own keyword for the name if you really think it needs a name.

Also, as far as how-to-type-JSON, "attribute" names & values on a JSON object are separated from each other with a colon, and there needs to be a comma between attribute name-and-value pairs. (Again, don't forget not to put a comma after the last attribute name-value pair!)

Furthermore, attribute names are often inside quotes in JSON, and the value can be something that doesn't have quotes around it (we haven't gotten there yet).

Sometimes your data might not need names. If you have a bunch of conceptual "items" back-to-back at the same level of nesting, and none of them need a "plain-text value" representing what they truly are, and all of them just have key-value pairs describing what they "have," JSON is a lot shorter to type. (Especially if you take out all the tabs & line breaks I'm putting in to make this blog readable. JSON authors love to take out line breaks & tabs -- if you run into such code, paste it here and click "Beautify" to read it more easily ... just make sure the "code" you're punching in isn't confidential company information!) Consider this example.

Here's some XML representing a fleet of cars, each of which have different sets of key-value traits we care about tracking (we'll use "attribute" style and "tagset-with-nothing-inside shorthand here), but all of which are cars.

<RootElement>
 <Car color="blue" trim="chrome" trunk="hatchback"/>
 <Car appeal="sporty" doors="2"/>
 <Car doors="4" color="red" make="Ford"/>
</RootElement>

Maybe we already know they're all cars based on the context of our data. A JSON equivalent, without forcing each one to have a silly attribute like "type" (with a value of "car"), could be:

[
 {
  "color" : "blue",
  "trim" : "chrome",
  "trunk" : "hatchback"
 },
 {
  "appeal" : "sporty",
  "doors" : "2"
 },
 {
  "doors" : "4",
  "color" : "red",
  "make" : "Ford"
 }
]

So far, JSON doesn't look much more concise. But that note I made about "attribute values in JSON don't have to be in double-quotes" earlier is where its power really lies. Let's take a look at a shirt with two colors (blue, red) and 1 fabric (nylon). Here's the XML:

<Shirt>
 <Color>blue</Color>
 <Color>red</Color>
 <Fabric>nylon</Fabric>
</Shirt>

And here's some similar JSON (note that I didn't bother to force it to be called "shirt" and that I decided that saying "colors" would make more sense than "color"):

{
 "Colors" : ["blue","red"],
 "Fabric" : "nylon"
}

We're using the same square-brackets-and-commas notation to make a list out of "blue" and "red" that we used to put a bunch of JSON objects together into a data set full of cars. The entirety of the set of brackets is the value of this JSON object's attribute/key called "colors."

As you can see, XML and JSON get pretty different when it comes to writing down the fact that a conceptual "item" in your data set has multiple "keys" all with the same name, each with a different value.

  1. XML works like I just described it (multiple key-pair values, where the key names are the same as each other). JSON doesn't allow that. That's something special to XML. If you like that about XML, there you go--use XML. :-)
  2. In JSON, you just give your object one "key" (maybe making the name of that "key" a plural noun) and you shove all those values into a list.

So, to recap the kinds of "value" you can give an attribute/"key" belonging to a JSON "object" (conceptual item):

  • We've seen that the "value" for a given attribute/"key" of a JSON object can be a piece of plain text (put it between double-quotes).
  • We've seen it be the keyword 'null' (no quotes around it).
  • We've seen it be a list (square-brackets, with commas separating multiple values).
    Note that you can certainly just have one value inside the square brackets, or they can even be empty if the list is empty. Just keep in mind that it's still a list of values to a computer, even if it's empty or a list of size 1.
  • We're about to see that it can be another JSON "object" (we're about to nest things, just like we did in XML!)

Let's go back to our data set that includes just 1 shirt, which is "blue and red and green" and made out of "leather and cotton," where the leather isn't washable but the cotton is.
A JSON representation of our data might look like this:

{
 "type" : "Shirt",
 "colors" : ["blue","red","green"],
 "fabrics" :
  [
   {
    "type" : "Leather",
    "washable" : "No"
   },
   {
    "type" : "Cotton",
    "washable" : "Yes"
   }
  ]
}

As a reminder, here was a short XML version of the same shirt:

<Shirt>
 <Color>
  Blue
 </Color>
 <Color>
  Red
 </Color>
 <Color>
  Green
 </Color>
 <Fabric Washable="No">
  Leather
 </Fabric>
 <Fabric Washable="Yes">
  Cotton
 </Fabric>
</Shirt>

What I notice the most is:

  • JSON is awkward when conceptual "items" in our data need to have attributes and a meaningful name of their own (like "shirt" and "leather" and "cotton")
  • XML is overly wordy by making conceptual "items" out of things that already were true "attributes" of what we humans really think of as our data's conceptual "items" (like "color" -- we're probably not really thinking of it as an "item" in our data set, because "color" is no more complicated than its value(s) -- it doesn't have any other traits about the color that we need to describe).

What's In It For You?

In the end, as a Salesforce administrator, what matters most is being able to recognize which kind of data the Salesforce servers have given you (or in which format the servers expect to receive data from you).

You don't exactly get to argue with Salesforce about which format they should have picked.

  • When you're picking apart data Salesforce sent you, you just need to know how to "parse" the file to extract the data of interest to you.
  • When you're composing a file to send Salesforce, you'll model it after an existing file they sent you (or technical documentation) to figure out exactly how they want you to arrange the details.

In future posts, we'll talk about writing Python code that can do both of these tasks.

  • Understanding the relationship between the textual representation of the data and what it means will be crucial to those exercises.
  • Understanding when you're looking at data that can be made "flat" and "consistent from one conceptual item in the data to the next" (that is, when you can interpret it like a table/CSV) and when you're looking at data that really doesn't have those characteristics is crucial as well. Remember how I "diverted" and pointed out that our fleet of cars with nothing but VINs could have just as easily been a 1-column CSV file? Ideally, you want to be able to make that kind of commentary, too, about entire XML/JSON files or about fragments of them.
    Remember to play with the graphical viewers (XML, JSON)!

Hopefully, this blog post will help you with both tasks by better understanding what the example data says and how it's shaped when you see it.


Table of Contents

No comments:

Post a Comment