# Visualizing Datatypes: Images (raster & vector), Tabular data

1. Images
   1. Raster images: Quantifying the badness of Stitch with math!
   1. EXTRA: Vector images: using Python to make diagrams
1. Tabular data
   1. EXTRA: with Python's `csv` library
   1. with Pandas (https://pandas.pydata.org/)
   1. EXTRA: with NumPy (https://numpy.org/) (probably not time though... maybe next time?)
   
Note: items labeled "EXTRA" are ones we are unlikely to get to in class, but they are here as additional info for you!


Let's start this notebook by importing a few things:

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams["font.family"] = "sans-serif" # note: other font families (e.g. "serif") work too, just google them

# let's also import numpy
import numpy as np

The `%matplotlib inline` line makes plots appear "inline", i.e. inside this notebook.  It is not needed in ALL installs, but we put it in to be on the safe side!

# 1. Images

Depending on interpretation, images can be considered spatial or field data (raster images), or tree/hierarchical data (vector images).

## 1.A Raster images

We'll explore how to manipulate raster images -- where you are given color information at each pixel's (x,y) coordinates.

Link to image: https://uiuc-ischool-dataviz.github.io/spring2019online/week01/images/stitch_reworked.png

To manipulate images, we'll import a module from the `Pillow` library like last time:

In [None]:
import PIL.Image as Image # this imports *only* the Image module from PIL (the Pillow library)

Now I'll import my image.  This assumes you have downloaded the image from our class website and have saved it somewhere where you remember the path.  I'm on a Mac, so mine defaults to the downloads folder.

In [None]:
im = Image.open("/Users/jnaiman/Downloads/stitch_reworked.png")

We can take a quick look at this image inline (i.e. inside this Jupyter notebook) like so:

In [None]:
im

Hi Stitch! :D

We can also transform this image into data using NumPy like so:

In [None]:
im_data = np.array(im)
im_data # I can just put this line right here at the end and it will print out this data "inline" as well

We can then check out some features of this dataset.  For example, what is the shape of this dataset?

In [None]:
im_data.shape

This is a 483x430 image with 4 color channels: (R=Red, G=Green, B=Blue, A=Alpha), where here "Alpha" means opacity.

One representation of RGB color combinations can be seen in a typical "color wheel":

<img src="https://i.pinimg.com/originals/b7/45/3a/b7453aedcbd060c8b842d85f27c083fb.jpg" width="400px">

What are the unique numbers in this dataset? (i.e. this dataset's values w/o any repeats)

In [None]:
np.unique(im_data)

#### ASIDE:
The above lists the unique numbers in this dataset.  What if we wanted to check channel by channel?

In [None]:
channel_labels = ['R', 'G', 'B', 'A']
for i in range(im_data.shape[2]): # this loops over the last entry of the shape array, so the #4
    print('channel=', channel_labels[i], 
          'unique values=', np.unique( im_data[:,:,i] ) ) # print the unique elements in each channel

This tells us some interesting things!  Unless there are weird combinations, it looks like we have only 3 colors here.  We also only have 2 opacity values -- either a pixel is opaque (255) or completely invisible (0).

We can double check the unique colors by once again applying `np.unique`, but using the "axis" argument to tell it what axis to look down.  This is a bit of Python "magic" so feel free to just take my word for it right now, or you can read more details here: https://stackoverflow.com/questions/24780697/numpy-unique-list-of-colors-in-the-image

#### END ASIDE

To see how many unique RGBA combos there are, we can make use of the `np.unique` function.  Let's try using it the naive way:

In [None]:
np.unique(im_data)

Well, this isn't quite what we want -- we want a list of the RGBA combos, not all the unique numbers in the array as a whole.  To do this, we have to mess with the shape of the array we give `np.unique`:

In [None]:
im_data.shape

Instead of an array of x/y pixels, let's line this array up as a list of each pixel's RGBA colors:

In [None]:
im_data.reshape(-1, im_data.shape[2])

Now we can apply `np.unique` to this reshaped array and specify the specific axis (the first or 0th axis):

In [None]:
np.unique(im_data.reshape(-1, im_data.shape[2]), axis=0)
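
To see why the reshape matters, here is a minimal sketch on a made-up 2x2 RGBA array.  The array `toy` and its colors are invented for illustration (they are not read from the Stitch image), but the calls are the same ones used above:

```python
import numpy as np

# hypothetical 2x2 RGBA "image": two white pixels, one dark red, one transparent
toy = np.array([[[255, 255, 255, 255], [126, 22, 33, 255]],
                [[255, 255, 255, 255], [0, 0, 0, 0]]], dtype=np.uint8)

flat_values = np.unique(toy)                    # unique numbers, ignoring which pixel they came from
pixel_list = toy.reshape(-1, toy.shape[2])      # shape (4, 4): one row per pixel
unique_colors = np.unique(pixel_list, axis=0)   # unique rows = unique RGBA combos

print(flat_values)
print(unique_colors)
```

The first print shows five separate numbers, while the second shows three rows, one per distinct color.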

We can also display this image again, but since we have the data, we can use `matplotlib` to plot which will give us nice things like axis marks.  If we recall what we did last time with `fig, ax = plt.subplots...`:

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) # generate a "fig" and "ax" object

ax.imshow(im_data) # use imshow function with "ax" object

plt.show() # this gets rid of the print memory address thing

Here, it's a bit hard to see that there is white in the interior part of Stitch and that it's transparent outside.  We can modify our plot to show this by making a gray background.  We can "cheat" our way into this gray background by multiplying our data by 0 and then adding a constant to give it an overall "gray" RGBA sequence:

In [None]:
im_data*0 + 125

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) # generate a "fig" and "ax" object

ax.imshow(im_data*0+125) # this plots a gray image (all RGBA components = 125, about half of the 0-255 range)

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) # generate a "fig" and "ax" object

ax.imshow(im_data*0+125) # this first plots a gray image underneath (all RGBA components = 125, about half of 255)
# Note: you can also specify colors in RGB space as floats in the range 0.0-1.0
#       (e.g. im_data*0.0+0.5) instead of integers 0-255 like we are doing here!

ax.imshow(im_data) # then we plot our Stitch image over the top

plt.show() # this gets rid of the print memory address thing
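
As a side note on the two ways of writing "gray": `imshow` reads integer RGB(A) arrays on a 0-255 scale and float arrays on a 0-1 scale.  The sketch below, on a made-up array named `toy`, shows what each expression produces.  The `gray_opaque` variant is an assumption about what one might want, not something used in this notebook:

```python
import numpy as np

toy = np.zeros((2, 3, 4), dtype=np.uint8)  # hypothetical 2x3 RGBA image

gray_int = toy * 0 + 125        # stays uint8; every channel (alpha too) is 125
gray_float = toy * 0.0 + 0.5    # becomes float64; every channel is 0.5

# an opaque variant: gray color channels, but alpha forced to fully opaque
gray_opaque = np.full_like(toy, 125)
gray_opaque[:, :, 3] = 255

print(gray_int.dtype, gray_float.dtype)
```

Because the alpha channel is also set to 125 (or 0.5), the gray layer made this way is semi-transparent.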

If we recall from the lecture, we talked a bit about how to use this visualization to figure out how to quantify the goodness or badness of Stitch.  Let's play with this idea a bit more now.

Let's start by counting up all the pixels that are "good" in Stitch.  If we look at our image above, this is denoted by the white parts of the head and ears.  

**NOTE: this is an example of *filtering* our data!**

In color space, white is denoted by (255, 255, 255, 255) so we will create a filtering "mask" for these values by making sure all color channels have the value of 255.

For example, we can make a boolean mask that only looks for when the R channel, the first in the RGBA channels, is at the maximum value = 255

In [None]:
reds_good_mask = im_data[:,:,0] == 255

In [None]:
reds_good_mask

It's mostly `False` because most of the image does NOT have a 255 red channel.  But we can also see which parts of the image do:

In [None]:
im_data[reds_good_mask]

In [None]:
np.unique(im_data[reds_good_mask].reshape(-1, im_data.shape[2]), axis=0)

So, this is, in theory ONLY looking for pixels that have 255 in the red channel, and we are lucky in this case that this is only associated with one color -- the color white.  But, to be sure, we really want to make a check for ALL of the RGBA channels and make a combined mask for all of them.  With boolean masks this looks like:

In [None]:
reds_good_mask = im_data[:,:,0] == 255
greens_good_mask = im_data[:,:,1] == 255
blues_good_mask = im_data[:,:,2] == 255
alphas_good_mask = im_data[:,:,3] == 255

# pixel_mask_good is the combined boolean mask that will check for ALL conditions
pixel_mask_good = reds_good_mask & greens_good_mask & blues_good_mask & alphas_good_mask
# Note: "&" is the element-wise AND, so a pixel is only True here if ALL four channel masks are True
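
The four-mask pattern in the cell above can also be written more compactly.  This is a sketch on a made-up 2x2 array (`toy` and `white` are illustrative names), comparing the channel-by-channel mask with an equivalent one built with `np.all`:

```python
import numpy as np

# hypothetical 2x2 RGBA image: one white pixel, one pure red, one dark red, one transparent
toy = np.array([[[255, 255, 255, 255], [255, 0, 0, 255]],
                [[126, 22, 33, 255], [0, 0, 0, 0]]], dtype=np.uint8)

white = np.array([255, 255, 255, 255], dtype=np.uint8)

# channel-by-channel version, following the same pattern as the notebook
mask_by_channel = ((toy[:, :, 0] == 255) & (toy[:, :, 1] == 255) &
                   (toy[:, :, 2] == 255) & (toy[:, :, 3] == 255))

# compact version: compare every pixel to the target color,
# then require all 4 channels to match
mask_all = np.all(toy == white, axis=-1)

print(mask_all)
print(np.count_nonzero(mask_all))  # number of matching pixels
```

In this toy array the red channel alone would match two pixels, which is why checking every channel is the safer habit.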

Now, using this mask, let's count up the number of "good" pixels.  

We do this by first selecting out only the good pixels:

In [None]:
good_pixels = im_data[pixel_mask_good]

And then we find out the length of this array which is simply the total number of "good" pixels:

In [None]:
ngood = len(good_pixels)
ngood

We can even plot this part of our dataset by creating a masked Stitch image that will just take this "good" part out:

In [None]:
im_data_masked_good = im_data.copy() # first we make a copy of our original dataset to modify
im_data_masked_good[~pixel_mask_good] = 0 # then we set everything that is *NOT* a good pixel to 0 so it will show up gray
# Note here the "~" is the opposite mask of our good pixel mask

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) # generate a "fig" and "ax" object

ax.imshow(im_data*0.0+0.5) # plot gray background as before

ax.imshow(im_data_masked_good) # then we plot our Stitch image over the top

plt.show() # this gets rid of the print memory address thing

We can do the same thing for the "bad" pixels.  This is the color RGBA combo of: (126, 22, 33, 255)

In [None]:
pixel_mask_bad = (im_data[:,:,0] == 126) & \
                  (im_data[:,:,1] == 22) & \
                  (im_data[:,:,2] == 33) & \
                  (im_data[:,:,3] == 255)
# Note the "\" is a line continuation character --> make sure you don't have anything, even a space after it!

In [None]:
nbad = len(im_data[pixel_mask_bad])
nbad

And also plot:

In [None]:
im_data_masked_bad = im_data.copy() # first we make a copy of our original dataset to modify
im_data_masked_bad[~pixel_mask_bad] = 0 # then we set everything that is *NOT* a bad pixel to 0 so it will show up gray
# Note here the "~" is the opposite mask of our bad pixel mask

fig, ax = plt.subplots(figsize=(10,10)) # generate a "fig" and "ax" object

ax.imshow(im_data*0.0+0.5) # plot gray background as before

ax.imshow(im_data_masked_bad) # then we plot our Stitch image over the top

plt.show() # this gets rid of the print memory address thing

We can then calculate the goodness/badness ratio as their respective numbers divided by the total number of interior pixels: 

$\rm{goodness \, \%} = \frac{ngood}{ngood+nbad}$

$\rm{badness \, \%} = \frac{nbad}{ngood+nbad}$

In [None]:
total = ngood + nbad
badness = nbad / total
goodness = ngood / total
print(badness, goodness)

So, it looks like about 77% bad and 23% good, by pixel count.  Does that match up with what you'd think from looking at the above figure?

We can now plot the goodness and badness levels with a little bar plot that may show these levels a bit more accurately.  We'll also add a little legend to show what is "goodness" and "badness" colors.

Note, there is a nice diagram of the named colors in Python below:

<img src="https://matplotlib.org/3.1.0/_images/sphx_glr_named_colors_003.png" width="600px">

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

ax.bar([1], badness, [0.5], color='maroon', label="badness") # this just plots a bar centered at 1 with a width of 0.5

ax.bar([1], goodness, [0.5], color="steelblue", 
       bottom=badness, label="goodness") # this plots a bar *on top* of the badness bar

ax.set_xlim(0.0, 2.0) # to center around our bar

# since our x-axis values are meaningless, we want to "hide" them (see week01's notebook):
ax.xaxis.set_visible(False)

ax.legend() # show the "badness" and "goodness" labels

plt.show()

What if we just counted pixels from our figure above?  It looks like good changes to bad at ~150 pixels, the image top is at ~75 pixels, and the image bottom is at ~425 pixels:

In [None]:
# Let's remind ourselves a bit of what this image looks like
fig,ax = plt.subplots(figsize=(5,5))
ax.imshow(im)

ax.plot([0,450], [150, 150]) # approximate badness line
ax.plot([0,450], [75, 75]) # approximate top line
ax.plot([0,450], [425, 425]) # approximate bottom line

ax.set_xlim(0,450)

plt.show()

In [None]:
# so:
goodness_apparent = (75-150)/(75-425)

In [None]:
# what is badness, apparent
1.0-goodness_apparent
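
Working the numbers by hand, using the approximate pixel rows read off the plot above:

$\rm{goodness_{apparent}} = \frac{150 - 75}{425 - 75} = \frac{75}{350} \approx 0.21$

$\rm{badness_{apparent}} = 1 - \frac{75}{350} = \frac{275}{350} \approx 0.79$

So eyeballing heights lands close to, but not exactly on, the pixel-count estimate of about 23% good and 77% bad.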

### A few more visualizations of this dataset:

#1: A pie chart

In [None]:
fig, ax = plt.subplots(figsize=(5,5))

ax.pie([badness,goodness]) # can also do a pie chart if we want I suppose :D
# note: this uses wedges!!

plt.show()

#2: Side-by-side bar chart:

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

# we'll turn these into arrays to make our lives easier down the road
labels = np.array(['badness', 'goodness'])
values = np.array([badness, goodness])
colors = np.array(['maroon', 'steelblue'])

myBarChart = ax.bar(labels, values) 

# set colors for each bar individually
for i in range(len(myBarChart)):
    myBarChart[i].set_color(colors[i])

plt.show()

Since we have the RGBA color values for the "goodness" and "badness" we can also plot these:

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

labels = np.array(['badness', 'goodness'])
values = np.array([badness, goodness])
colors = np.array([(126, 22, 33, 255), (255, 255, 255, 255)])

myBarChart = ax.bar(labels, values) 

# set colors for each bar individually
for i in range(len(myBarChart)):
    myBarChart[i].set_color( colors[i]/255 )
    myBarChart[i].set_edgecolor('black') # because one of our colors is white
    myBarChart[i].set_linewidth(2) # so we can see the outlines clearly

plt.show()

#3: Bar charts of all colors

Looking at the number of unique colors in this image I can see that there are in fact 4 (we only used 2 to plot "goodness" and "badness"):

In [None]:
np.unique(im_data.reshape(-1, im_data.shape[2]), axis=0)

Let's try plotting all of the colors combining what we have used in this image manipulation lesson:

In [None]:
number_of_pixels_of_a_color = []
color_labels = []
colors = []
for icolor,rgba in enumerate(np.unique(im_data.reshape(-1, im_data.shape[2]), axis=0)):
    #print(icolor, rgba)
    
    # mask each channel
    reds_mask = im_data[:,:,0] == rgba[0]
    greens_mask = im_data[:,:,1] == rgba[1]
    blues_mask = im_data[:,:,2] == rgba[2]
    alphas_mask = im_data[:,:,3] == rgba[3]

    # combined mask
    pixel_mask = reds_mask & greens_mask & blues_mask & alphas_mask
    
    # grab number of pixels
    this_color_pixels = im_data[pixel_mask]
    number_of_pixels_of_a_color.append(len(this_color_pixels))
    # this could be done better...
    color_labels.append( 'Color #' + str(icolor) )
    
    colors.append( rgba/255 )

In [None]:
colors # again, these have to be re-scaled to 0-1

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

myBarChart = ax.bar(color_labels, number_of_pixels_of_a_color) 
# set colors for each bar individually
for i in range(len(myBarChart)):
    myBarChart[i].set_color(colors[i])
    myBarChart[i].set_edgecolor('black') # because one of our colors is white
    myBarChart[i].set_linewidth(2) # so we can see the outlines clearly

plt.show()
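
The counting loop above can also be collapsed into one call, since `np.unique` accepts a `return_counts` argument.  A minimal sketch on a made-up array (`toy` is invented for illustration):

```python
import numpy as np

# hypothetical 2x2 RGBA image with two white pixels
toy = np.array([[[255, 255, 255, 255], [126, 22, 33, 255]],
                [[255, 255, 255, 255], [0, 0, 0, 0]]], dtype=np.uint8)

unique_colors, counts = np.unique(toy.reshape(-1, toy.shape[2]),
                                  axis=0, return_counts=True)

for rgba, n in zip(unique_colors, counts):
    # color, pixel count, and the color rescaled to 0-1 for matplotlib
    print(rgba, n, rgba / 255)
```

Here `counts` lines up with `unique_colors` row by row, so it could be passed to `ax.bar` in place of `number_of_pixels_of_a_color`.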

**NOTE:** the above histograms are some examples of *data mutations* or *mutating our data*.

#### Bonus: Shuffling Stitch

In [None]:
im_shuffle = im_data.copy()

# shuffle horizontal
sub = im_data[0:50, :, :].copy()
im_shuffle[0:50,:,:] = im_data[150:200,:,:]
im_shuffle[150:200,:,:] = sub

sub = im_data[300:350, :, :].copy()
im_shuffle[300:350,:,:] = im_data[400:450,:,:]
im_shuffle[400:450,:,:] = sub

# shuffle vertical
sub = im_data[:, 0:50, :].copy()
im_shuffle[:, 0:50,:] = im_data[:, 150:200,:]
im_shuffle[:,150:200,:] = sub

fig, ax = plt.subplots(figsize=(10,10))
ax.imshow(im_shuffle)

# taking off the axes and using a "tight layout"
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ax.spines['right'].set_visible(False) # takes off right y-axis
ax.spines['left'].set_visible(False) # takes off left y-axis
ax.spines['top'].set_visible(False) # takes off the top x-axis
ax.spines['bottom'].set_visible(False) # takes off the bottom x-axis
fig.tight_layout()

plt.show() # show

fig.savefig('./images/shuffled_stitch.png')
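
The swap-two-bands pattern used in the shuffle can be wrapped in a small function.  This is a sketch only: `swap_row_bands` is a hypothetical helper (not part of NumPy), and it assumes the two bands do not overlap:

```python
import numpy as np

def swap_row_bands(image, start_a, start_b, height):
    """Return a copy of `image` with two equally tall row bands exchanged."""
    out = image.copy()
    # both reads come from the untouched original, so no temporary is needed
    out[start_a:start_a + height] = image[start_b:start_b + height]
    out[start_b:start_b + height] = image[start_a:start_a + height]
    return out

toy = np.arange(6 * 2).reshape(6, 2)    # hypothetical 6x2 "image"; row i holds [2i, 2i+1]
swapped = swap_row_bands(toy, 0, 4, 2)  # exchange rows 0-1 with rows 4-5
print(swapped)
```

Reading from the original and writing to the copy avoids the easy mistake of saving the wrong band in a temporary variable.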

## EXTRA: 1.B Vector images

Vector images are constructed with a set of "instructions" rather than color values at (x,y) coordinates.  This can make their construction and manipulation easier to deal with, since you can tweak these instructions and change the look of the diagram.  In contrast, to change elements of a raster image, you have to make changes pixel-by-pixel.

**NOTE:** since we will mostly be dealing with raster images in this course, we may skip this portion in class if we are short on time.

Let's re-do that diagram of the angular distribution of human vision we had in the slides from last lecture, in particular, the FOV image.

<img src="https://uiuc-ischool-dataviz.github.io/spring2020/week01/images/fov.png" width='600px'>

We'll first choose an edge color for our diagrams.  It looks like we are using black to draw lines:

In [None]:
edgecolor = 'black'

Let's also pick the colors for the large wedge which looks sort of blueish:

In [None]:
facecolor_totalFOV = "steelblue" # check out list of colors above

... and the colors of the binocular part of the FOV diagram:

In [None]:
facecolor_bincFOV = "darkorange" # check out list of colors above

The only color left looks to be the colors of the arrows which are grayish:

In [None]:
facecolor_arrow = "silver" # gray

Now we make use of some vector-drawing capabilities of `matplotlib`.  First we make a blue wedge that shows the full field of view:

In [None]:
totalFOV = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (210/2.0), 90 + (210/2.0), # span of the wedge
                                    lw=2.0, 
                                    facecolor=facecolor_totalFOV, 
                                    edgecolor=edgecolor)
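
The two angle arguments to `Wedge` are the start and end angles in degrees, so `90 - (210/2.0)` and `90 + (210/2.0)` describe a 210 degree span centered on "straight up".  A small sketch of that arithmetic, where `wedge_angles` is a hypothetical helper name and 114 is the binocular span used later in this section:

```python
def wedge_angles(span_degrees, center_degrees=90.0):
    """Return (theta1, theta2) for a wedge of the given span centered on center_degrees."""
    half = span_degrees / 2.0
    return center_degrees - half, center_degrees + half

print(wedge_angles(210))  # total field of view
print(wedge_angles(114))  # binocular field of view
```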

Now we'll build up diagrams by adding "patches" one by one.  First, let's try making a plot of just this FOV wedge:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7), dpi=300)

ax.add_patch(totalFOV)

ax.set_xlim(-1.25, 1.25)
ax.set_ylim(-0.5, 1.25)

# This below takes off the tick marks and axis labels
ax.set_xticks([])
ax.set_yticks([])
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)

plt.show()
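
The tick-and-spine lines are repeated in every plotting cell of this section, so they are a natural candidate for a small function.  This is a sketch: `hide_axes` is a hypothetical helper, and the figure is built with `matplotlib.figure.Figure` only so that the example runs outside a notebook (in the notebook, `plt.subplots` works just as well):

```python
from matplotlib.figure import Figure

def hide_axes(ax):
    """Remove ticks and all four spines from a matplotlib Axes (hypothetical helper)."""
    ax.set_xticks([])
    ax.set_yticks([])
    for spine in ax.spines.values():
        spine.set_visible(False)

fig = Figure(figsize=(10, 7))  # built directly so the sketch does not need pyplot
ax = fig.subplots()
hide_axes(ax)
```

With a helper like this, each later cell would only need a single `hide_axes(ax)` call before `plt.show()`.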

Let's add in the Binocular part and replot:

In [None]:
totalFOV = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (210/2.0), 90 + (210/2.0), # span of the wedge
                                    lw=2.0, 
                                    facecolor=facecolor_totalFOV, 
                                    edgecolor=edgecolor)

binoc = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (114/2.0), 90 + (114/2.0), 
                                 width=0.25, # so that it doesn't overlap totally with total FOV
                                 lw=2.0, 
                                 facecolor=facecolor_bincFOV, edgecolor=edgecolor)

In [None]:
fig, ax = plt.subplots(figsize=(10, 7), dpi=300)

ax.add_patch(totalFOV)
ax.add_patch(binoc)

ax.set_xlim(-1.25, 1.25)
ax.set_ylim(-0.5, 1.25)

# This below takes off the tick marks and axis labels
ax.set_xticks([])
ax.set_yticks([])
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)

plt.show()

... now add in the arrows:

In [None]:
totalFOV = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (210/2.0), 90 + (210/2.0), # span of the wedge
                                    lw=2.0, 
                                    facecolor=facecolor_totalFOV, 
                                    edgecolor=edgecolor)

binoc = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (114/2.0), 90 + (114/2.0), 
                                 width=0.25, # so that it doesn't overlap totally with total FOV
                                 lw=2.0, 
                                 facecolor=facecolor_bincFOV, edgecolor=edgecolor)

arrow = matplotlib.patches.Arrow(-1.10, 0.0, 0.0, 0.75, 
                                 width=0.25, edgecolor=edgecolor, 
                                 facecolor=facecolor_arrow)#, label="forward")



In [None]:
fig, ax = plt.subplots(figsize=(10, 7), dpi=300)

ax.add_patch(totalFOV)
ax.add_patch(binoc)
ax.add_patch(arrow)

ax.set_xlim(-1.25, 1.25)
ax.set_ylim(-0.5, 1.25)

# This below takes off the tick marks and axis labels
ax.set_xticks([])
ax.set_yticks([])
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)

plt.show()

Finally, let's re-do the whole thing, but with text this time:

In [None]:
totalFOV = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (210/2.0), 90 + (210/2.0), # span of the wedge
                                    lw=2.0, 
                                    facecolor=facecolor_totalFOV, 
                                    edgecolor=edgecolor)

binoc = matplotlib.patches.Wedge([0.0, 0.0], 1.0, 90 - (114/2.0), 90 + (114/2.0), 
                                 width=0.25, # so that it doesn't overlap totally with total FOV
                                 lw=2.0, 
                                 facecolor=facecolor_bincFOV, edgecolor=edgecolor)

arrow = matplotlib.patches.Arrow(-1.10, 0.0, 0.0, 0.75, 
                                 width=0.25, edgecolor=edgecolor, 
                                 facecolor=facecolor_arrow)

In [None]:
fig, ax = plt.subplots(figsize=(10, 7), dpi=300)

ax.add_patch(totalFOV)
ax.add_patch(binoc)
ax.add_patch(arrow)

ax.set_xlim(-1.25, 1.25)
ax.set_ylim(-0.5, 1.25)

# Finally, let's overplot the arrow's notation
plt.text(-1.22, 0.25, "Forward", rotation=90, fontsize="xx-large")


# let's also add a legend to remind us what is what
ax.legend([totalFOV, binoc], ["Total FOV", "Binocular FOV"], fontsize="x-large")


# disappear the axis & ticks
ax.set_xticks([])
ax.set_yticks([])
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)

plt.show()

Finally, we can save this image in a vector format to use in vector applications:

In [None]:
fig.savefig('images/wedges_saved.svg') # this saves in a folder named "images" in the present 
#                                         directory - feel free to change as needed

## Take away
* so, that was a lot of effort (maybe) to make a diagram, *but* we can now go back and change things very easily 
* for example we can change all the colors (try this!), or we can change the size of the wedge
* the take away is that Python not only makes graphs, but it can also be used to make diagrams

# 2. Tabular Data 

Let's also try making some histograms from tabular data, in this case a CSV file.

Make sure you have the building inventory downloaded from the class website! https://uiuc-ischool-dataviz.github.io/spring2019online/week02/building_inventory.csv

## EXTRA: 2.A: Python's csv library

We'll use the `csv` library within Python:

In [None]:
import csv

Supply the full path to the CSV file:

In [None]:
f = open("/Users/jnaiman/Downloads/building_inventory.csv")

Here we are just going to read in our data.  We can see it's sort of in a weird format that isn't terribly intuitive to look at.

In [None]:
f.seek(0) # start at the top of the file
for record in csv.reader(f):
    print(record)

We'll try formatting this data ourselves.  Let's create a dictionary with an entry for each column and then add data into it later.

In [None]:
f.seek(0)
reader = csv.reader(f)
header = next(reader) # this is just so that we grab only the data columns
header

Now, let's use these data headers to fill in a dictionary with data entries.

Refer to IS452's intro to dictionaries for reference: https://github.com/jnaiman/IS-452AO-Fall2019/blob/master/Lectures/Week-09-Dictionaries.ipynb

In [None]:
data = {} # empty dictionary
for col in header: # for every column name (key in dictionary) in header
    data[col] = [] # add this key and make its value an empty list
data # now we have a dictionary with named entries ready to be filled

To fill the dictionary we are going to use the function `zip`.

Here is a little example:

In [None]:
a = ["hi", "there", "my", "friends"]
b = [9, 4, 1, 9]
for word, num in zip(a, b):
    print(word, num)

You can think of it kind of like the "enumerate" that we used above, but it's iterating over 2 lists here instead of a number and a list.

Let's use `zip` to fill up our dictionary:

In [None]:
# first, a call like before
f.seek(0)
reader = csv.reader(f)
header = next(reader)

# repeat our creation of our data dictionary & initialize to empty lists
data = {}
# fill column names as dictionary headings
for col in header:
    data[col] = []
    
# finally, fill lists within headers
for row in reader:
    for col, val in zip(header, row):
        data[col].append(val)
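
For comparison, the `csv` library also has a `DictReader` class that handles the header row for us.  The sketch below uses a made-up two-row file held in memory (`fake_file` and its values are invented, and only the column names mirror the building inventory), so it does not depend on any path:

```python
import csv
import io

# made-up two-row CSV standing in for the building inventory file
fake_file = io.StringIO("Agency Name,Zip code\nParks,61801\nRoads,61820\n")

reader = csv.DictReader(fake_file)  # uses the first row as the column names
data = {col: [] for col in reader.fieldnames}
for row in reader:                  # each row is a dict: column name -> value
    for col, val in row.items():
        data[col].append(val)

print(data)
```

Note that every value comes back as a string, zip codes included, which is the same behavior as `csv.reader`.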

In [None]:
data

We can call our data dictionary like we would any other dictionary, by giving it a `key` and seeing what `values` come with that key.  For example, what values are associated with the `Zip code` key?

In [None]:
data['Zip code']

How many?

In [None]:
len(data['Zip code'])

What are the unique ones?

In [None]:
np.unique(data['Zip code'])

Let's try another one:

In [None]:
len(data['Agency Name'])

We can also use keys() to list our dictionary names:

In [None]:
data.keys()

### Making plots using our `data` dictionary

If we want to aggregate our data, we can use something like the `collections` module and the `Counter` object (see IS452 Dictionaries week again for a reminder!)

In [None]:
import collections

For example, here we can create a counter for how many entries have particular agency names:

In [None]:
agency_counter = collections.Counter(data['Agency Name'])
agency_counter

We can use these counter objects to make plots like before.  Let's start with a histogram:

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

ax.bar(agency_counter.keys(), agency_counter.values())

plt.show()

Wow is it hard to read those labels!  Let's try rotating them!

In [None]:
fig, ax = plt.subplots(figsize=(10,6))

ax.bar(agency_counter.keys(), agency_counter.values())
fig.autofmt_xdate(rotation=90)

# or:
#plt.xticks(rotation=90)

plt.show()

So that is starting to look pretty cool.  But what if we just want to plot the top 10, since it looks like most of the interesting stuff happens there?

If we recall, we can do this with a counter object:

In [None]:
agency_counter.most_common?

In [None]:
agency_counter_top_10 = agency_counter.most_common(n=10)

In [None]:
agency_counter_top_10

Note that the above is now a list, not a dictionary so we have to reformat a bit if we want to plot things:

In [None]:
agency_counter_top_10[0]

In [None]:
agency_counter_top_10[0][0], agency_counter_top_10[0][1]

In [None]:
agency_name_top_10 = []
count = []
for ac in agency_counter_top_10:
    agency_name_top_10.append(ac[0])
    count.append(ac[1])
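
The same "split a list of pairs into two lists" step can be done with `zip` and the `*` unpacking operator.  A minimal sketch with made-up agency names (`names` stands in for `data['Agency Name']`):

```python
import collections

# made-up agency names standing in for data['Agency Name']
names = ["Parks", "Roads", "Parks", "Schools", "Parks", "Roads"]
counter = collections.Counter(names)

top_2 = counter.most_common(2)        # list of (name, count) tuples, largest count first
top_names, top_counts = zip(*top_2)   # "unzip" the pairs into two tuples

print(top_names, top_counts)
```

Both results are tuples rather than lists, which `ax.bar` accepts just the same.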

In [None]:
fig, ax = plt.subplots(figsize=(10,6))

ax.bar(agency_name_top_10, count)
fig.autofmt_xdate(rotation=90)

plt.show()

## 2.B Pandas

We can also do a lot of these things with the `pandas` library.

(This is something you can pip or anaconda install if you need to.)

In [None]:
import pandas as pd

In [None]:
buildings = pd.read_csv('/Users/jnaiman/Downloads/building_inventory.csv')

In [None]:
buildings
# formatting here is sort of nice

Pandas comes with a lot of nice built-in functionality.  For example, we can easily see how many entries there are in this dataset:

In [None]:
# how many entries are there? as an iterable
buildings.index

We can print out slices of our dataset by index like so:

In [None]:
buildings.iloc[0:3] # subset by index

We can build up queries, like grabbing the agency names of entries 100 through 109:

In [None]:
buildings.iloc[100:110]["Agency Name"] # grab entries 100-109, and print out the Agency names of those entries

We can use NumPy-like functions, like counting how many unique agency names are in our dataset:

In [None]:
buildings["Agency Name"].nunique() # how many unique agencies

We can also do this with categorical data, like the building status:

In [None]:
buildings["Bldg Status"].unique() 

If you are used to R at all, the `describe` function is sort of like R's `summary` function: it gives some summary statistics for the numerical data in our dataset.

Note that some of these statistics don't make sense, for example the "mean" zip code doesn't make physical sense.

In [None]:
buildings.describe()

Instead of using `.iloc` as before, we can filter our data by using `.loc`, which allows us to pass filtering information.

For example, let's only look at buildings that have zero square footage:

In [None]:
buildings.loc[buildings["Square Footage"] == 0] # boolean operation inside means zero square footage

We can also filter for ongoing construction:

In [None]:
buildings.loc[buildings["Bldg Status"] == "In Progress"] # who is being built now?
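
Filters like these can be combined with `&`, just like the boolean masks we used on the image.  A sketch on a made-up four-row table (only the column names come from the building inventory; the rows and the "In Use" status are invented):

```python
import pandas as pd

# made-up rows; the column names mirror the building inventory
toy = pd.DataFrame({
    "Bldg Status": ["In Use", "In Progress", "In Use", "Abandon"],
    "Square Footage": [1200, 0, 0, 450],
})

zero_mask = toy["Square Footage"] == 0        # boolean Series, one entry per row
in_use_mask = toy["Bldg Status"] == "In Use"

both = toy.loc[zero_mask & in_use_mask]       # rows where both conditions hold
print(both)
```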

There are also a lot of useful functions associated with our datasets, for example, we can plot the distribution of square footage:

In [None]:
buildings["Square Footage"].plot() 
plt.show()

What can we do with this plot?  What are our options?

Check out: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html

While the above was a "quick and dirty" plot, we can do fancier things:

In [None]:
buildings.plot(x = "Address", y="Square Footage", figsize=(20,6), rot=90)

You can also use Pandas to generate the plot and then give you back the matplotlib `ax` objects we've been dealing with before:

In [None]:
ax = buildings.plot(x = "Year Acquired", y="Square Footage", figsize=(20,6), rot=90, kind='scatter')

In [None]:
ax = buildings.plot(x = "Year Acquired", y="Square Footage", figsize=(20,6), rot=90, kind='scatter')
ax.set_xlim(1750, 2010)

There are also some useful grouping functions within Pandas.  The `groupby` function can seem a little nebulous, but it's a way to sort of "re-index" our datasets.  Here we'll re-group our data by the building's status:

In [None]:
buildings.groupby("Bldg Status") # this doesn't do anything until you iterate over it or aggregate it

Now let's actually do something with this object.  Here we just print it out; you can see "abandon" is at the top, so it first lists all of the abandoned buildings.

In [None]:
for grouped in buildings.groupby("Bldg Status"):
    print(grouped)

In [None]:
for status, df in buildings.groupby("Bldg Status"):
    print(status, df.shape[0])
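
The loop above counts rows per group by hand.  As a sketch of the built-in equivalents, here is the same idea on a made-up table (the rows are invented), using `size` for counts and `sum` for a per-group total:

```python
import pandas as pd

# made-up rows; the column names mirror the building inventory
toy = pd.DataFrame({
    "Bldg Status": ["In Use", "In Progress", "In Use", "Abandon"],
    "Square Footage": [1200, 300, 800, 450],
})

counts = toy.groupby("Bldg Status").size()                   # rows per group
totals = toy.groupby("Bldg Status")["Square Footage"].sum()  # summed area per group
print(counts)
print(totals)
```

Both results are Series indexed by the group label, sorted alphabetically by default.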

We can also apply NumPy-like functions, for example `max`:

In [None]:
buildings.max()

In [None]:
buildings["Square Footage"].min()

There are several different options for *how* to read data with Pandas.  For example, we can tell Pandas which values to treat as empty entries, i.e. ones that get a `NaN` tag.

In [None]:
pd.read_csv?

In [None]:
b = pd.read_csv("/Users/jnaiman/Downloads/building_inventory.csv",
           na_values = {'Square Footage': 0,
                       'Year Acquired': 0,
                       'Year Constructed': 0}) 
# specify what to do with incomplete entries, here this just says if any of these columns have a value 0, treat
#  as a NaN or not-a-number
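
To see the effect of `na_values` in isolation, here is a sketch that reads a made-up two-row CSV from memory (`fake_csv` and its contents are invented for illustration):

```python
import io
import pandas as pd

# made-up CSV text; zeros stand in for "unknown" values
fake_csv = io.StringIO("Address,Square Footage,Year Acquired\nA St,0,1950\nB St,900,0\n")

b_toy = pd.read_csv(fake_csv, na_values={"Square Footage": 0, "Year Acquired": 0})

print(b_toy)
print(b_toy["Square Footage"].min())  # NaN entries are skipped by min()
```

The zeros in the listed columns come back as `NaN`, so they no longer drag the minimum down to 0.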

In [None]:
b["Square Footage"].min()

In [None]:
b["Year Constructed"].min()

In [None]:
b["Year Acquired"].min()

In [None]:
b.loc[b["Year Acquired"] < 1800]

We can also mutate Pandas dataframes into new data with operations like sorting:

In [None]:
b2 = b.sort_values("Year Constructed")

In [None]:
b2.iloc[0] # this gives the oldest building - the one that was constructed in 1753

We can build up Pandas commands to get new sorts of aggregated data:

In [None]:
b.groupby("Year Acquired")["Square Footage"].sum()

We can then use that to make different plots:

In [None]:
aggregated_data = b.groupby("Year Acquired")["Square Footage"].sum()

In [None]:
aggregated_data.index

In [None]:
aggregated_data.values

In [None]:
plt.plot(aggregated_data.index, aggregated_data.values)

We can also use pandas plots to do this as well:

In [None]:
aggregated_data = b.groupby("Year Acquired")["Square Footage"].sum()
aggregated_data.plot()

We can aggregate in a bunch of different ways!

In [None]:
aggregated_data_average = b.groupby("Year Acquired")["Square Footage"].mean()
aggregated_data_average.plot()

Let's go bananas!

In [None]:
b.loc[b["Agency Name"] == "University of Illinois"].groupby("Year Acquired")["Square Footage"].sum()

# ASIDE: Python tips and tricks! We won't go through this in lecture but it's here if you want it!

* We've been playing around with a few complex things in Python, but let's take a step back for a moment and delve into how Python deals with data in a bit more detail

In [None]:
# initialize a
a = []

In [None]:
# take a gander at a
a
# hey look a is an empty list!

In [None]:
# we can mix types in our lists
a = [1, 2, "hey"]
# here we have a few integers and a string

In [None]:
# lets look at a again
a

In [None]:
# also, for our general purposes, we can call a string with a single or double quotes
'hey' == "hey"

In [None]:
# we can also easily add to our list with the append statement
a.append("there")
a

In [None]:
# returns an item at an index, & removes item, default is the last item
a.pop()

In [None]:
# now a is back to what we had before
a

In [None]:
# we can also grab elements of a by their indices
a[1]

In [None]:
# note that indexing starts from 0 in python
a[0]

In [None]:
# the -# can be used to grab starting from the last element of the list
a[-1]

In [None]:
# the colon means "all the things"
a[:]

In [None]:
# we can also take subsets easily, for example, ignoring the first element of a
# this is a way to filter data
a[1:]

In [None]:
# can also take all but the last element
a[:-1]

In [None]:
# we can also combine these two things to grab from the second to the 2nd-to-last element
# in this case, that is just the one element
a[1:-1]

In [None]:
# there are also some nice string manipulations we can do
#  like splitting a string into a list object
a = "this is a much longer list, where i have taken a sentence and split it based on the spaces".split()

In [None]:
a

In [None]:
# we can grab every other element in the list
a[::2]

In [None]:
# we can also reorder this list back-to-front
a[::-1]

In [None]:
# we can also update individual strings in this list
a[3] = 'sorta'

In [None]:
a

In [None]:
# now lets look quickly at some funny things about strings in Python
name = "jill"

In [None]:
name[0]

In [None]:
# this will produce an error
name[0] = 'J'

In [None]:
# have to use something like replace
name.replace("j","J")

In [None]:
# python also has stuff called dictionaries
d = {'bevier': 'building', 'green' : 'road', 'champaign': 'city'}

In [None]:
d

In [None]:
# here the "champaign" entry is of type "city"
d['champaign']

In [None]:
# it's super easy to add to dictionaries, here we add an empty list
d['mylist'] = []

In [None]:
d

In [None]:
# we can add to this list in the usual way - with the above "append" function we used before
d['mylist'].append(10)

In [None]:
d

In [None]:
# there are these other cool objects called "sets"
myset = set()

In [None]:
myset

In [None]:
# let's check out some operations with sets, for example some movies I like
jill_movies = set(['last jedi', 'girls trip', 'frozen'])
# let's say we have another person named bob and these are the movies he likes
bob_movies = set(['last jedi', 'other movie'])

In [None]:
jill_movies

In [None]:
bob_movies

In [None]:
# we can create a set that is made up of my movies, but without those movies that appear in bob's movies list
jill_movies - bob_movies

In [None]:
jill_movies[0] # note we can't index a set, so this will produce an error

In [None]:
# we can take the union of sets
jill_movies.union(bob_movies)

In [None]:
# for some final string manipulation, we can use a thing called enumerate 
# to both count in a for loop and use an element of our list directly
for i, word in enumerate(reversed(a)):
    print(i, word.upper())

In [None]:
# continue and break are flow control statements
for i, word in enumerate(sorted(a)):
    if word == "and":
        continue
    if word == "it":
        break
    print(i, word.upper())

In [None]:
# also, we can use the "?" to figure out things we don't know, for example the reader
#  function from  the csv library
import csv

In [None]:
csv.reader?