Difference between revisions of "How to Read JSON Files"

From LMU BioDB 2017

Latest revision as of 04:59, 25 October 2017

To this point, you have been working with what are called “plain text” files and information — that is, information that is viewed as a simple sequence of symbols or characters (letters, numbers, punctuation, spaces, etc.), without any additional structure.

There are, however, other text “formats” that do impose a structure over the included data. One such format is called JSON (short for JavaScript Object Notation). This page seeks to introduce you to this type of text information.

Overall Concept

The core idea behind JSON data is that the information inside it can be thought of as an outline or tree. Our own wiki pages have outlines, in the form of either sections or bulleted lists:

  • Level 1, item 1
    • Level 2, item 1 (of level 1, item 1)
    • Level 2, item 2 (of level 1, item 1)
  • Level 1, item 2
  • Level 1, item 3
    • Level 2, item 1 (of level 1, item 3)
    • Level 2, item 2 (of level 1, item 3)
    • Level 2, item 3 (of level 1, item 3)
    • Level 2, item 4 (of level 1, item 3)

JSON also captures an outline; it just looks different. Here’s an example:

{
  "organism": {
    "key": "2",
    "name": {
      "type": "scientific",
      "text": "Vibrio cholerae"
    },
    "dbReference": {
      "type": "NCBI Taxonomy",
      "key": "3",
      "id": "666"
    },
    "lineage": [
      { "taxon": "Bacteria" },
      { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" },
      { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" },
      { "taxon": "Vibrio" }
    ]
  }
}

This piece of JSON breaks down, roughly, to this outline:

  • The JSON is for a single object, represented by braces { }, that has one property, organism, which itself is an object
    • The organism object has a key property whose value is "2"
    • The name property is another object whose type is "scientific" and text is "Vibrio cholerae"
    • The dbReference property is an object whose type is "NCBI Taxonomy", key is "3", and id is "666"
    • lineage is a list of objects, indicated by the use of brackets [ ] rather than braces { }, where each object has a single taxon property…
      • taxon: "Bacteria"
      • taxon: "Proteobacteria"
      • taxon: "Gammaproteobacteria"
      • taxon: "Vibrionales"
      • taxon: "Vibrionaceae"
      • taxon: "Vibrio"
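
The breakdown above can also be produced mechanically. The following is a minimal sketch, assuming JavaScript and its built-in JSON.parse; the helper name outline and the shortened sample text are hypothetical and only serve as illustration. The sample is written with double-quoted property names, which JSON.parse requires.

```javascript
// Hypothetical helper: turn a parsed JSON value into indented outline lines.
function outline(value, indent = 0) {
  const pad = "  ".repeat(indent);
  const lines = [];
  if (Array.isArray(value)) {
    // A list: visit each item in order, at the current depth.
    for (const item of value) {
      lines.push(...outline(item, indent));
    }
  } else if (value !== null && typeof value === "object") {
    // An object: one outline entry per property.
    for (const [name, child] of Object.entries(value)) {
      if (child !== null && typeof child === "object") {
        // Nested object or list: print the name, then go one level deeper.
        lines.push(`${pad}• ${name}`);
        lines.push(...outline(child, indent + 1));
      } else {
        // Plain value: print the name and value on one line.
        lines.push(`${pad}• ${name}: ${JSON.stringify(child)}`);
      }
    }
  } else {
    // A bare value that is not inside an object.
    lines.push(`${pad}• ${JSON.stringify(value)}`);
  }
  return lines;
}

// Shortened version of the example (an assumption made for brevity).
const exampleText = `{
  "organism": {
    "key": "2",
    "name": { "type": "scientific", "text": "Vibrio cholerae" },
    "lineage": [
      { "taxon": "Bacteria" },
      { "taxon": "Proteobacteria" }
    ]
  }
}`;

const outlineLines = outline(JSON.parse(exampleText));
console.log(outlineLines.join("\n"));
```

Each nested object or list adds one level of indentation, which is the same rule a human reader applies when turning the braces and brackets into an outline.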

Even now, you might already be seeing a pattern in terms of how the JSON looks and what outline it represents. That’s one of the intentions of JSON: it’s meant to strike a balance between human readability and machine readability. The “human readability” part manifests in recognizable words (“name,” “lineage,” “taxon”), while “machine readability” comes in through some special symbols and rules.

Specific Parts

A JSON file consists of three primary parts (objects, properties, and lists), each expressed in a very specific manner. Of these parts, objects and lists can mix and match in any combination and to nearly any depth (e.g., lists within lists, objects within lists, lists within objects, objects within objects, objects within lists within objects, lists within objects within lists, etc.). Properties are strictly associated with objects.

Objects

Objects represent distinct, self-contained items or records of data. They begin with a left brace { followed by the object’s properties—names and values. Whereas humans are generally capable of figuring out where a piece of data starts and ends, computers need more help. Thus, every object { has a matching }.

Computers don’t care about spacing within JSON, but humans can read JSON a lot more easily with proper spacing. For human consumption, the braces of an object are typically on their own lines (as above), with the properties indented by a couple of spaces from the braces.
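
The point about spacing can be checked directly. This is a small sketch, assuming JavaScript's built-in JSON.parse and JSON.stringify; the variable names and the shortened sample data are made up for illustration.

```javascript
// The same data written twice: once with no spacing, once with human-friendly spacing.
const compactText = '{"organism":{"key":"2","name":{"type":"scientific","text":"Vibrio cholerae"}}}';
const spacedText = `
{
  "organism": {
    "key": "2",
    "name": {
      "type": "scientific",
      "text": "Vibrio cholerae"
    }
  }
}`;

const fromCompact = JSON.parse(compactText);
const fromSpaced = JSON.parse(spacedText);

// Writing both back out without spacing gives identical text,
// so the parser saw the same data in both cases.
const sameData = JSON.stringify(fromCompact) === JSON.stringify(fromSpaced);
console.log(sameData); // true

// The third argument asks for 2 spaces of indentation per level,
// which restores the layout that humans prefer.
const prettyText = JSON.stringify(fromCompact, null, 2);
console.log(prettyText);
```

Many editors and command-line tools offer a similar "pretty print" step, so compact JSON received from a program can usually be re-spaced before reading it.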

Properties

Objects by themselves are actually quite meaningless; there is nothing to say about just an object in its own right. What makes an object useful is its properties: named values that are associated with the object. Each property is expressed within the braces as a name, followed by a colon :, followed by its value. Strict JSON requires every name to be enclosed in double quotes, although JavaScript itself and some lenient tools also accept unquoted names. Commas , separate properties from each other.

In the example above, the single main object has a single property called organism, which is itself an object. The organism object in turn has four properties, key, name, dbReference, and lineage.

Properties can be indicated via "dot notation" (where the dot is none other than a period .). Thus, if we represent the object above as some variable obj, the organism property of that object is obj.organism. The key property within that object is then obj.organism.key.
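
Dot notation can be tried out as follows. This is a sketch that assumes JavaScript and its built-in JSON.parse, with the variable obj playing the role described above; the lineage list is shortened here, and the variable unquotedAccepted is a hypothetical name used only for the final check.

```javascript
// Parse a version of the example (lineage shortened) into the variable obj.
const obj = JSON.parse(`{
  "organism": {
    "key": "2",
    "name": { "type": "scientific", "text": "Vibrio cholerae" },
    "dbReference": { "type": "NCBI Taxonomy", "key": "3", "id": "666" },
    "lineage": [ { "taxon": "Bacteria" }, { "taxon": "Proteobacteria" } ]
  }
}`);

// Each dot steps one level deeper into the outline.
console.log(obj.organism.key);            // 2
console.log(obj.organism.name.text);      // Vibrio cholerae
console.log(obj.organism.dbReference.id); // 666

// The names of the organism object's properties, in the order written.
const propertyNames = Object.keys(obj.organism);
console.log(propertyNames.join(", "));    // key, name, dbReference, lineage

// JSON.parse follows the strict rules: unquoted property names are rejected.
let unquotedAccepted = true;
try {
  JSON.parse('{ organism: { key: "2" } }');
} catch (error) {
  unquotedAccepted = false; // a SyntaxError was thrown
}
console.log(unquotedAccepted);            // false
```

Note that the values "2" and "666" are text (strings), not numbers, because they are enclosed in quotes in the JSON.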

Lists

Lists or arrays represent collections of objects or values. They are denoted by brackets [ ] with commas , separating the items in the list. Most lists are meant to contain items of the same type or structure, but JSON does not actually require that. Lists can have a number for the first item and an object for the next; that is permitted. But practically speaking, lists tend to hold uniformly-typed or -structured members.

When talking about the contents of a list, it is sometimes convenient to refer to them by their ordinal position in the list (e.g., first slot, second slot, fifth slot, etc.). Items in a JSON list are counted starting at 0, and that number is referred to as the index of an item in the list. Thus, in the JSON example above, the item at index 2 (the third item) is the object whose taxon property is "Gammaproteobacteria". To combine property notation with indexing, we enclose the index in brackets [ ] after the name of the list. Thus, continuing the example of using obj to represent the example JSON object above, the "Gammaproteobacteria" object is obj.organism.lineage[2]. Spoken out in full, that means "the item at index 2 of the lineage property of the organism property of the object called obj."
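
Indexing can be sketched in the same way, again assuming JavaScript and its built-in JSON.parse. The variable names lineage, mixed, and mixedTypes are hypothetical, and only the lineage portion of the example is included.

```javascript
// Only the lineage part of the example is needed here.
const obj = JSON.parse(`{
  "organism": {
    "lineage": [
      { "taxon": "Bacteria" },
      { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" },
      { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" },
      { "taxon": "Vibrio" }
    ]
  }
}`);

const lineage = obj.organism.lineage;

console.log(lineage.length);                    // 6
console.log(lineage[0].taxon);                  // Bacteria (index 0 is the first item)
console.log(obj.organism.lineage[2].taxon);     // Gammaproteobacteria
console.log(lineage[lineage.length - 1].taxon); // Vibrio (the last index is length - 1)

// A list may legally mix types, even though most lists do not.
const mixed = JSON.parse('[1, "two", { "three": 3 }, [4]]');
const mixedTypes = mixed.map((item) => (Array.isArray(item) ? "list" : typeof item));
console.log(mixedTypes.join(", "));             // number, string, object, list
```

Because counting starts at 0, a list with six items has indices 0 through 5; asking for lineage[6] in JavaScript gives undefined rather than an item.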

The Concept’s the Thing

Recall that the N in JSON stands for notation—meaning that, yes, there are other notations that may be used to express the same information, in the same way that languages like English, Spanish, or Mandarin Chinese can say the same things, but with different sights and sounds. Similarly, there are other “formats” for communicating outlines. For example:

 <organism key="2">
   <name type="scientific">Vibrio cholerae</name>
   <dbReference type="NCBI Taxonomy" key="3" id="666"/>
   <lineage>
     <taxon>Bacteria</taxon>
     <taxon>Proteobacteria</taxon>
     <taxon>Gammaproteobacteria</taxon>
     <taxon>Vibrionales</taxon>
     <taxon>Vibrionaceae</taxon>
     <taxon>Vibrio</taxon>
   </lineage>
 </organism>

Note how the outline you saw previously is recognizable here, even though it looks different. The point here is that these “formats” and “languages” are ultimately meant to express some idea or concept. The ultimate goal of any language or format is the accurate communication of ideas.

(and yes, the language above is real—it is XML, short for eXtensible Markup Language)