{
"cells": [
{
"cell_type": "markdown",
"id": "a860826e",
"metadata": {},
"source": [
"# Prep Notebook, Week 12\n",
"\n",
"So, the last lecture we ended passing data through Python to Altair to output as vega-lite. What is the benefit to using Python for data analysis? Well, for some of us Python is our bestie and so we want to hang out with it the most. For others, the benefit is that we can do data cleaning in Python and then put the cleaned data into our plots.\n",
"\n",
"Let's work through a few examples:\n",
"\n",
"1. With the buildings dataset\n",
"1. With the corgis dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9225a116",
"metadata": {},
"outputs": [
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import altair as alt\n",
"import matplotlib.pyplot as plt # just in case"
]
},
{
"cell_type": "markdown",
"id": "603ef36f",
"metadata": {},
"source": [
"## 1. Altair with the buildings dataset\n",
"\n",
"Ok! So one dataset we know has some cleaning that needs to happen is the buildings dataset, so let's read this in and take a look to remember:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a6fd9dae",
"metadata": {},
"outputs": [],
"source": [
"data_url = 'https://github.com/UIUC-iSchool-DataViz/is445_data/raw/main/building_inventory.csv'\n",
"buildings = pd.read_csv(data_url)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "163d4d2a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Agency Name
\n",
"
Location Name
\n",
"
Address
\n",
"
City
\n",
"
Zip code
\n",
"
County
\n",
"
Congress Dist
\n",
"
Congressional Full Name
\n",
"
Rep Dist
\n",
"
Rep Full Name
\n",
"
...
\n",
"
Bldg Status
\n",
"
Year Acquired
\n",
"
Year Constructed
\n",
"
Square Footage
\n",
"
Total Floors
\n",
"
Floors Above Grade
\n",
"
Floors Below Grade
\n",
"
Usage Description
\n",
"
Usage Description 2
\n",
"
Usage Description 3
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Department of Natural Resources
\n",
"
Anderson Lake Conservation Area - Fulton County
\n",
"
Anderson Lake C.a.
\n",
"
Astoria
\n",
"
61501
\n",
"
Fulton
\n",
"
17
\n",
"
Cheri Bustos
\n",
"
93
\n",
"
Hammond Norine K.
\n",
"
...
\n",
"
In Use
\n",
"
1975
\n",
"
1975
\n",
"
144
\n",
"
1
\n",
"
1
\n",
"
0
\n",
"
Unusual
\n",
"
Unusual
\n",
"
Not provided
\n",
"
\n",
"
\n",
"
1
\n",
"
Department of Natural Resources
\n",
"
Anderson Lake Conservation Area - Fulton County
\n",
"
Anderson Lake C.a.
\n",
"
Astoria
\n",
"
61501
\n",
"
Fulton
\n",
"
17
\n",
"
Cheri Bustos
\n",
"
93
\n",
"
Hammond Norine K.
\n",
"
...
\n",
"
In Use
\n",
"
2004
\n",
"
2004
\n",
"
144
\n",
"
1
\n",
"
1
\n",
"
0
\n",
"
Unusual
\n",
"
Unusual
\n",
"
Not provided
\n",
"
\n",
"
\n",
"
2
\n",
"
Department of Natural Resources
\n",
"
Anderson Lake Conservation Area - Fulton County
\n",
"
Anderson Lake C.a.
\n",
"
Astoria
\n",
"
61501
\n",
"
Fulton
\n",
"
17
\n",
"
Cheri Bustos
\n",
"
93
\n",
"
Hammond Norine K.
\n",
"
...
\n",
"
In Use
\n",
"
2004
\n",
"
2004
\n",
"
144
\n",
"
1
\n",
"
1
\n",
"
0
\n",
"
Unusual
\n",
"
Unusual
\n",
"
Not provided
\n",
"
\n",
"
\n",
"
3
\n",
"
Department of Natural Resources
\n",
"
Anderson Lake Conservation Area - Fulton County
\n",
"
Anderson Lake C.a.
\n",
"
Astoria
\n",
"
61501
\n",
"
Fulton
\n",
"
17
\n",
"
Cheri Bustos
\n",
"
93
\n",
"
Hammond Norine K.
\n",
"
...
\n",
"
In Use
\n",
"
2004
\n",
"
2004
\n",
"
144
\n",
"
1
\n",
"
1
\n",
"
0
\n",
"
Unusual
\n",
"
Unusual
\n",
"
Not provided
\n",
"
\n",
"
\n",
"
4
\n",
"
Department of Natural Resources
\n",
"
Anderson Lake Conservation Area - Fulton County
\n",
"
Anderson Lake C.a.
\n",
"
Astoria
\n",
"
61501
\n",
"
Fulton
\n",
"
17
\n",
"
Cheri Bustos
\n",
"
93
\n",
"
Hammond Norine K.
\n",
"
...
\n",
"
In Use
\n",
"
2004
\n",
"
2004
\n",
"
144
\n",
"
1
\n",
"
1
\n",
"
0
\n",
"
Unusual
\n",
"
Unusual
\n",
"
Not provided
\n",
"
\n",
" \n",
"
\n",
"
5 rows × 22 columns
\n",
"
"
],
"text/plain": [
" Agency Name \\\n",
"0 Department of Natural Resources \n",
"1 Department of Natural Resources \n",
"2 Department of Natural Resources \n",
"3 Department of Natural Resources \n",
"4 Department of Natural Resources \n",
"\n",
" Location Name Address \\\n",
"0 Anderson Lake Conservation Area - Fulton County Anderson Lake C.a. \n",
"1 Anderson Lake Conservation Area - Fulton County Anderson Lake C.a. \n",
"2 Anderson Lake Conservation Area - Fulton County Anderson Lake C.a. \n",
"3 Anderson Lake Conservation Area - Fulton County Anderson Lake C.a. \n",
"4 Anderson Lake Conservation Area - Fulton County Anderson Lake C.a. \n",
"\n",
" City Zip code County Congress Dist Congressional Full Name Rep Dist \\\n",
"0 Astoria 61501 Fulton 17 Cheri Bustos 93 \n",
"1 Astoria 61501 Fulton 17 Cheri Bustos 93 \n",
"2 Astoria 61501 Fulton 17 Cheri Bustos 93 \n",
"3 Astoria 61501 Fulton 17 Cheri Bustos 93 \n",
"4 Astoria 61501 Fulton 17 Cheri Bustos 93 \n",
"\n",
" Rep Full Name ... Bldg Status Year Acquired Year Constructed \\\n",
"0 Hammond Norine K. ... In Use 1975 1975 \n",
"1 Hammond Norine K. ... In Use 2004 2004 \n",
"2 Hammond Norine K. ... In Use 2004 2004 \n",
"3 Hammond Norine K. ... In Use 2004 2004 \n",
"4 Hammond Norine K. ... In Use 2004 2004 \n",
"\n",
" Square Footage Total Floors Floors Above Grade Floors Below Grade \\\n",
"0 144 1 1 0 \n",
"1 144 1 1 0 \n",
"2 144 1 1 0 \n",
"3 144 1 1 0 \n",
"4 144 1 1 0 \n",
"\n",
" Usage Description Usage Description 2 Usage Description 3 \n",
"0 Unusual Unusual Not provided \n",
"1 Unusual Unusual Not provided \n",
"2 Unusual Unusual Not provided \n",
"3 Unusual Unusual Not provided \n",
"4 Unusual Unusual Not provided \n",
"\n",
"[5 rows x 22 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"buildings.head()"
]
},
{
"cell_type": "markdown",
"id": "7d1f1afe",
"metadata": {},
"source": [
"Let's make a quick plot with matplotlib to see what might need to be cleaned:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "11dc9147",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"buildings.plot(x='Year Acquired', y='Square Footage', figsize=(20,5),kind='scatter')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "84c825ab",
"metadata": {},
"source": [
"So, if we remember to when we first saw this dataset, we had a bunch of zeros that we decided we should tag as missing data with an NaN. Let's clean this dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "065324a3",
"metadata": {},
"outputs": [],
"source": [
"buildings.loc[buildings['Year Acquired'] == 0,'Year Acquired'] = np.nan\n",
"buildings.loc[buildings['Square Footage'] == 0,'Square Footage'] = np.nan"
]
},
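{
"cell_type": "markdown",
"id": "nan-tagging-sketch-md",
"metadata": {},
"source": [
"As a quick aside, here is a minimal sketch of why tagging zeros as `NaN` matters, using a small made-up dataframe (the `toy` name and its values are assumptions for illustration, not part of the buildings data). pandas skips `NaN` when computing statistics like the mean, so the placeholder zeros no longer drag the result down:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "nan-tagging-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# made-up stand-in for the buildings data; a 0 means the value was not recorded\n",
"toy = pd.DataFrame({'Year Acquired': [1975.0, 0.0, 2004.0],\n",
"                    'Square Footage': [144.0, 0.0, 300.0]})\n",
"\n",
"mean_with_zeros = toy['Square Footage'].mean()  # the 0 counts as a real value\n",
"\n",
"# same tagging pattern as in the cell above: zeros become NaN\n",
"toy.loc[toy['Year Acquired'] == 0, 'Year Acquired'] = np.nan\n",
"toy.loc[toy['Square Footage'] == 0, 'Square Footage'] = np.nan\n",
"\n",
"mean_with_nans = toy['Square Footage'].mean()  # NaN is skipped by default\n",
"print(mean_with_zeros, mean_with_nans)"
]
},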
{
"cell_type": "markdown",
"id": "e12bea6c",
"metadata": {},
"source": [
"And then re-make this plot:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b13d1297",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"buildings.plot(x='Year Acquired', y='Square Footage', figsize=(20,5),kind='scatter')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "36e727e0",
"metadata": {},
"source": [
"Hey that looks much better! Though, we probably want a log-scale on the y-axis, just to be safe:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ddd631cc",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"buildings.plot(x='Year Acquired', y='Square Footage', figsize=(20,5),kind='scatter',logy=True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "9c3f78fe",
"metadata": {},
"source": [
"Nice. \n",
"\n",
"Ok, now that we have our data cleaned, we can further transform our data by creating a statistics dataframe out of our data:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f58325ec",
"metadata": {},
"outputs": [],
"source": [
"stats = buildings.groupby(\"Year Acquired\")[\"Square Footage\"].describe()"
]
},
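{
"cell_type": "markdown",
"id": "groupby-describe-sketch-md",
"metadata": {},
"source": [
"To get a feel for what `describe()` gives us after a `groupby`, here is a small sketch on made-up numbers (the `toy` and `toy_stats` names and values are illustrative assumptions). Each row of the result is one year, each column is one summary statistic, and rows whose year is `NaN` are left out of the groups by default:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "groupby-describe-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# made-up data: two buildings from 1975, one from 2004, one with a missing year\n",
"toy = pd.DataFrame({'Year Acquired': [1975, 1975, 2004, np.nan],\n",
"                    'Square Footage': [100.0, 300.0, 144.0, 999.0]})\n",
"\n",
"toy_stats = toy.groupby('Year Acquired')['Square Footage'].describe()\n",
"\n",
"print(toy_stats.columns.tolist())     # the summary statistics, one per column\n",
"print(toy_stats.loc[1975.0, 'mean'])  # mean square footage for 1975"
]
},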
{
"cell_type": "code",
"execution_count": 7,
"id": "e5010562",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Country Longitude Latitude Year cumulative_sum\n",
"0 United States -100.445882 39.783730 1917-01-01 0\n",
"1 Brazil -53.200000 -10.333333 1917-01-01 0\n",
"2 Russia 97.745306 64.686314 1917-01-01 0\n",
"3 Japan 139.239418 36.574844 1917-01-01 0\n",
"4 Vietnam 107.965086 15.926666 1917-01-01 0"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corg_melt = corg_clean2.reset_index().melt(['Country','Longitude','Latitude'], \n",
" var_name='Year', value_name='cumulative_sum')\n",
"corg_melt.head()"
]
},
{
"cell_type": "markdown",
"id": "b9dd0f22",
"metadata": {},
"source": [
"Now we can make a little [slider in Altair](https://altair-viz.github.io/user_guide/interactions.html#selection-values-in-expressions) to change the date range for our circles interactively:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"id": "b5ea8d6a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Timestamp('1917-01-01 00:00:00'), Timestamp('2020-01-01 00:00:00'))"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corg_melt['Year'].min(), corg_melt['Year'].max()"
]
},
{
"cell_type": "markdown",
"id": "ccfc3675",
"metadata": {},
"source": [
"Since sliders (at least at the time of writing) [can't have datetime inputs](https://stackoverflow.com/questions/62046930/altair-adding-date-slider-for-interactive-scatter-chart-pot) let's cheat a bit by making another column called \"year_int\":"
]
},
{
"cell_type": "code",
"execution_count": 113,
"id": "5724da74",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Country
\n",
"
Longitude
\n",
"
Latitude
\n",
"
Year
\n",
"
cumulative_sum
\n",
"
year_int
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
United States
\n",
"
-100.445882
\n",
"
39.783730
\n",
"
1917-01-01
\n",
"
0
\n",
"
1917
\n",
"
\n",
"
\n",
"
1
\n",
"
Brazil
\n",
"
-53.200000
\n",
"
-10.333333
\n",
"
1917-01-01
\n",
"
0
\n",
"
1917
\n",
"
\n",
"
\n",
"
2
\n",
"
Russia
\n",
"
97.745306
\n",
"
64.686314
\n",
"
1917-01-01
\n",
"
0
\n",
"
1917
\n",
"
\n",
"
\n",
"
3
\n",
"
Japan
\n",
"
139.239418
\n",
"
36.574844
\n",
"
1917-01-01
\n",
"
0
\n",
"
1917
\n",
"
\n",
"
\n",
"
4
\n",
"
Vietnam
\n",
"
107.965086
\n",
"
15.926666
\n",
"
1917-01-01
\n",
"
0
\n",
"
1917
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country Longitude Latitude Year cumulative_sum year_int\n",
"0 United States -100.445882 39.783730 1917-01-01 0 1917\n",
"1 Brazil -53.200000 -10.333333 1917-01-01 0 1917\n",
"2 Russia 97.745306 64.686314 1917-01-01 0 1917\n",
"3 Japan 139.239418 36.574844 1917-01-01 0 1917\n",
"4 Vietnam 107.965086 15.926666 1917-01-01 0 1917"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corg_melt['year_int'] = corg_melt['Year'].dt.year.astype('int')\n",
"corg_melt.head()"
]
},
{
"cell_type": "markdown",
"id": "acbfe05d",
"metadata": {},
"source": [
"Note here to that for [equity selections we don't have to use == signs](https://stackoverflow.com/questions/68071713/in-altair-equality-condition-doesnt-work) in Altair (it won't work... just for fun I guess)."
]
},
{
"cell_type": "code",
"execution_count": 117,
"id": "07fa6c96-851c-4eab-a01b-bfdc496a58e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u001b[0;31mSignature:\u001b[0m\n",
"\u001b[0malt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mselection_point\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mbind\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mempty\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mexpr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mencodings\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mclear\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mresolve\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mtoggle\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0mnearest\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mUndefined\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
"\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mDocstring:\u001b[0m\n",
"Create a point selection parameter. Selection parameters define data queries that are driven by direct manipulation from user input (e.g., mouse clicks or drags). Point selection parameters are used to select multiple discrete data values; the first value is selected on click and additional values toggled on shift-click. To select a continuous range of data values on drag interval selection parameters (`selection_interval`) can be used instead.\n",
"\n",
"Parameters\n",
"----------\n",
"name : string (optional)\n",
" The name of the parameter. If not specified, a unique name will be\n",
" created.\n",
"value : any (optional)\n",
" The default value of the parameter. If not specified, the parameter\n",
" will be created without a default value.\n",
"bind : :class:`Binding` (optional)\n",
" Binds the parameter to an external input element such as a slider,\n",
" selection list or radio button group.\n",
"empty : boolean (optional)\n",
" For selection parameters, the predicate of empty selections returns\n",
" True by default. Override this behavior, by setting this property\n",
" 'empty=False'.\n",
"expr : :class:`Expr` (optional)\n",
" An expression for the value of the parameter. This expression may\n",
" include other parameters, in which case the parameter will\n",
" automatically update in response to upstream parameter changes.\n",
"encodings : List[str] (optional)\n",
" A list of encoding channels. The corresponding data field values\n",
" must match for a data tuple to fall within the selection.\n",
"fields : List[str] (optional)\n",
" A list of field names whose values must match for a data tuple to\n",
" fall within the selection.\n",
"on : string (optional)\n",
" A Vega event stream (object or selector) that triggers the selection.\n",
" For interval selections, the event stream must specify a start and end.\n",
"clear : string or boolean (optional)\n",
" Clears the selection, emptying it of all values. This property can\n",
" be an Event Stream or False to disable clear. Default is 'dblclick'.\n",
"resolve : enum('global', 'union', 'intersect') (optional)\n",
" With layered and multi-view displays, a strategy that determines\n",
" how selections' data queries are resolved when applied in a filter\n",
" transform, conditional encoding rule, or scale domain.\n",
" One of:\n",
"\n",
" * 'global': only one brush exists for the entire SPLOM. When the\n",
" user begins to drag, any previous brushes are cleared, and a\n",
" new one is constructed.\n",
" * 'union': each cell contains its own brush, and points are\n",
" highlighted if they lie within any of these individual brushes.\n",
" * 'intersect': each cell contains its own brush, and points are\n",
" highlighted only if they fall within all of these individual\n",
" brushes.\n",
"\n",
" The default is 'global'.\n",
"toggle : string or boolean (optional)\n",
" Controls whether data values should be toggled (inserted or\n",
" removed from a point selection) or only ever inserted into\n",
" point selections.\n",
" One of:\n",
"\n",
" * True (default): the toggle behavior, which corresponds to\n",
" \"event.shiftKey\". As a result, data values are toggled\n",
" when the user interacts with the shift-key pressed.\n",
" * False: disables toggling behaviour; the selection will\n",
" only ever contain a single data value corresponding\n",
" to the most recent interaction.\n",
" * A Vega expression which is re-evaluated as the user interacts.\n",
" If the expression evaluates to True, the data value is\n",
" toggled into or out of the point selection. If the expression\n",
" evaluates to False, the point selection is first cleared, and\n",
" the data value is then inserted. For example, setting the\n",
" value to the Vega expression True will toggle data values\n",
" without the user pressing the shift-key.\n",
"\n",
"nearest : boolean (optional)\n",
" When true, an invisible voronoi diagram is computed to accelerate\n",
" discrete selection. The data value nearest the mouse cursor is\n",
" added to the selection. The default is False, which means that\n",
" data values must be interacted with directly (e.g., clicked on)\n",
" to be added to the selection.\n",
"**kwds :\n",
" Additional keywords to control the selection.\n",
"\n",
"Returns\n",
"-------\n",
"parameter: Parameter\n",
" The parameter object that can be used in chart creation.\n",
"\u001b[0;31mFile:\u001b[0m ~/anaconda3/envs/DataVizPL/lib/python3.8/site-packages/altair/vegalite/v5/api.py\n",
"\u001b[0;31mType:\u001b[0m function"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"alt.selection_point?"
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "56195c43",
"metadata": {},
"outputs": [],
"source": [
"slider = alt.binding_range(min=corg_melt['year_int'].min(), \n",
" max=corg_melt['year_int'].max(), step=1, name='Max year:')\n",
"#selector = alt.selection_single(name=\"SelectorName\", fields=['cutoff'],\n",
"# bind=slider, init={'cutoff': 2000})\n",
"# selector = alt.selection_single(name=\"SelectorName\", fields=['year_int'],\n",
"# bind=slider, init={'year_int': 2000})\n",
"selector = alt.selection_point(name=\"SelectorName\", fields=['year_int'],\n",
" bind=slider, value=2000)"
]
},
{
"cell_type": "code",
"execution_count": 121,
"id": "1ded3b87",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
""
],
"text/plain": [
"alt.LayerChart(...)"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"geo = alt.topo_feature(data.world_110m.url, feature='countries')\n",
"\n",
"# US states background\n",
"world = alt.Chart(geo).mark_geoshape(\n",
" fill='lightgray',\n",
" stroke='white'\n",
").properties(\n",
" width=800,\n",
" height=500\n",
").project('equirectangular') # note we have a few projections we can use!\n",
"\n",
"points = alt.Chart(corg_melt).mark_circle().encode(\n",
" longitude='Longitude:Q',\n",
" latitude='Latitude:Q',\n",
" #size=alt.Size('Total Corg:Q',scale=alt.Scale(type='log')),\n",
" size=alt.condition(\n",
" #((alt.datum.year_int < selector.cutoff-10)&(alt.datum.year_int >= selector.cutoff)),\n",
" #(alt.datum.year_int == selector.cutoff),\n",
" #\"datum.year_int == selector.cutoff\",\n",
" #alt.expr.datum['year_int'] < selector.cutoff,\n",
" selector,\n",
" alt.Size('cumulative_sum:Q',scale=None), alt.value(0)\n",
" ),\n",
" tooltip='Country',\n",
"# ).add_selection(\n",
"# selector\n",
"# )\n",
").add_params(\n",
" selector\n",
")\n",
"\n",
"world + points"
]
},
{
"cell_type": "markdown",
"id": "24ddbe0b",
"metadata": {},
"source": [
"One final thing, let's add in some info about what each dot means in our tooltip:"
]
},
{
"cell_type": "code",
"execution_count": 122,
"id": "1fdbcbdf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
""
],
"text/plain": [
"alt.LayerChart(...)"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"geo = alt.topo_feature(data.world_110m.url, feature='countries')\n",
"\n",
"# US states background\n",
"world = alt.Chart(geo).mark_geoshape(\n",
" fill='lightgray',\n",
" stroke='black'\n",
").properties(\n",
" width=800,\n",
" height=500\n",
"#).project('mercator') # note we have a few projections we can use!\n",
").project('equirectangular') # note we have a few projections we can use!\n",
"\n",
"points = alt.Chart(corg_melt).mark_circle().encode(\n",
" longitude='Longitude:Q',\n",
" latitude='Latitude:Q',\n",
" #size=alt.Size('Total Corg:Q',scale=alt.Scale(type='log')),\n",
" size=alt.condition(\n",
" #((alt.datum.year_int < selector.year_int-10)&(alt.datum.year_int >= selector.year_int)),\n",
" #(alt.datum.year_int < selector.year_int),\n",
" #\"datum.year_int == selector.year_int\",\n",
" #alt.expr.datum['year_int'] < selector.year_int,\n",
" selector,\n",
" alt.Size('cumulative_sum:Q',scale=None), alt.value(0)\n",
" ),\n",
" tooltip=['Country','cumulative_sum'],\n",
").add_params(\n",
" selector\n",
")\n",
"\n",
"world + points"
]
},
{
"cell_type": "markdown",
"id": "2c24cdbf",
"metadata": {},
"source": [
"Looks nice! Let's save it:"
]
},
{
"cell_type": "code",
"execution_count": 123,
"id": "0e4e3f7e",
"metadata": {},
"outputs": [],
"source": [
"chart_out = world + points\n",
"\n",
"chart_out.properties(width='container').save(myJekyllDir+\"corgis_dotchart_world.json\") "
]
},
{
"cell_type": "markdown",
"id": "3794475b",
"metadata": {},
"source": [
"We note that when we run this though, we get a few artifacts. We can try to \"smooth\" the transitions with a bit of interpolation:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"id": "65c23b3d",
"metadata": {},
"outputs": [
],
"source": [
"alt.selection_point?"
]
},
{
"cell_type": "code",
"execution_count": 125,
"id": "d0c34d9d",
"metadata": {},
"outputs": [],
"source": [
"slider = alt.binding_range(min=corg_melt['year_int'].min(), \n",
" max=corg_melt['year_int'].max(), step=1, name='Max year:')\n",
"# selector = alt.selection_single(name=\"SelectorName\", fields=['year_int'],\n",
"# bind=slider, init={'year_int': corg_melt['year_int'].min()},\n",
"# nearest=True)\n",
"\n",
"selector = alt.selection_point(name=\"SelectorName\", fields=['year_int'],\n",
" bind=slider, value=corg_melt['year_int'].min(), #init={'year_int': corg_melt['year_int'].min()},\n",
" nearest=True)"
]
},
{
"cell_type": "code",
"execution_count": 126,
"id": "4467f02f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
""
],
"text/plain": [
"alt.LayerChart(...)"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"geo = alt.topo_feature(data.world_110m.url, feature='countries')\n",
"\n",
"# US states background\n",
"world = alt.Chart(geo).mark_geoshape(\n",
" fill='lightgray',\n",
" stroke='black'\n",
").properties(\n",
" width=800,\n",
" height=500\n",
"#).project('mercator') # note we have a few projections we can use!\n",
").project('equirectangular') # note we have a few projections we can use!\n",
"\n",
"points = alt.Chart(corg_melt).mark_circle().encode(\n",
" longitude='Longitude:Q',\n",
" latitude='Latitude:Q',\n",
" #size=alt.Size('Total Corg:Q',scale=alt.Scale(type='log')),\n",
" size=alt.condition(\n",
" #((alt.datum.year_int < selector.year_int-10)&(alt.datum.year_int >= selector.year_int)),\n",
" #(alt.datum.year_int < selector.year_int),\n",
" #\"datum.year_int == selector.year_int\",\n",
" #alt.expr.datum['year_int'] < selector.year_int,\n",
" selector,\n",
" alt.Size('cumulative_sum:Q',scale=None), alt.value(0)\n",
" ),\n",
" tooltip=['Country','cumulative_sum'],\n",
").add_params(\n",
" selector\n",
"#).add_selection(\n",
"# selector\n",
")\n",
"\n",
"world + points"
]
},
{
"cell_type": "code",
"execution_count": 127,
"id": "925dd8fc",
"metadata": {},
"outputs": [],
"source": [
"chart_out = world + points\n",
"\n",
"chart_out.properties(width='container').save(myJekyllDir+\"corgis_dotchart_world_smooth.json\") "
]
},
{
"cell_type": "markdown",
"id": "240f2707",
"metadata": {},
"source": [
"Groovy!\n",
"\n",
"One thing to note here is how much of the data cleaning and transformation we ended up doing in Python. In theory one probably *could* do this in Altair/vega-lite, but not without a lot of headache and in Python, we have the option of checking each \"stage\" of our data transformation so we can make sure it makes sense -- in vega-lite/Altair, we don't really have this option (as easily)."
]
},
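{
"cell_type": "markdown",
"id": "5d2e8f1a",
"metadata": {},
"source": [
"Here is a minimal sketch of what checking each stage can look like. It uses a tiny made-up table rather than the corgi data, so the `births` and `year` columns and all of the values are assumptions for illustration only (just `year_int` and `cumulative_sum` mirror the names used above):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy wide table: one row per country, one column per year (made-up values)\n",
"wide = pd.DataFrame({'Country': ['A', 'B'], '2000': [1, 0], '2001': [2, 3]})\n",
"\n",
"# stage 1: wide -> long, then check we got one row per country per year\n",
"long_df = wide.melt(id_vars='Country', var_name='year', value_name='births')\n",
"assert len(long_df) == len(wide) * 2\n",
"\n",
"# stage 2: year strings -> integers, then check the range\n",
"long_df['year_int'] = long_df['year'].astype(int)\n",
"assert long_df['year_int'].between(2000, 2001).all()\n",
"\n",
"# stage 3: running total per country, then check it ends at the country total\n",
"long_df = long_df.sort_values(['Country', 'year_int']).reset_index(drop=True)\n",
"long_df['cumulative_sum'] = long_df.groupby('Country')['births'].cumsum()\n",
"last = long_df.groupby('Country')['cumulative_sum'].last()\n",
"total = long_df.groupby('Country')['births'].sum()\n",
"assert last.tolist() == total.tolist()\n",
"```\n",
"\n",
"If one of the `assert` lines fails, we know which stage went wrong, and that kind of check is hard to do inside a vega-lite spec."
]
},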
{
"cell_type": "markdown",
"id": "e385a097",
"metadata": {},
"source": [
"### Corgi data and choropleth\n",
"\n",
"One final thing (well, not final final, there are infinite things we can do!) is to instead of plotting points on a map, we can color the a map of the world by the population of corgis at a particular time. \n",
"\n",
"This is called a [choropleth map](https://altair-viz.github.io/gallery/choropleth.html), and this is probably the last time I will EVER spell that correctly :D \n",
"\n",
"These can be [a little tricky in Altair](https://altair-viz.github.io/altair-tutorial/notebooks/09-Geographic-plots.html#colored-choropleths) since you have to map between pre-determined names of countries (as stored in the vegadataset world map) and however your data is stored."
]
},
{
"cell_type": "markdown",
"id": "a5583f3d",
"metadata": {},
"source": [
"Let's start with just coloring our mappable data based on the total corgis born. First, based on the [documentation about how to do this](https://altair-viz.github.io/altair-tutorial/notebooks/09-Geographic-plots.html#colored-choropleths) we know that we have to match up the world-map ID with whatever ID for each country as listed in our dataset. There are also some [other transformation-related things to be aware of](https://stackoverflow.com/questions/59224026/how-to-add-a-slider-to-a-choropleth-in-altair) that we'll cover after we deal with the ID look up stuff.\n",
"\n",
"\n",
"Let's dig a bit deeper with geopandas:"
]
},
{
"cell_type": "code",
"execution_count": 201,
"id": "e2d642c1-cf9d-4473-8af9-771469e107b1",
"metadata": {},
"outputs": [],
"source": [
"# import fiona\n",
"# fiona.supported_drivers"
]
},
{
"cell_type": "code",
"execution_count": 202,
"id": "4393182c",
"metadata": {},
"outputs": [],
"source": [
"import geopandas"
]
},
{
"cell_type": "code",
"execution_count": 203,
"id": "f350e176",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/world-110m.json'"
]
},
"execution_count": 203,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.world_110m.url"
]
},
{
"cell_type": "code",
"execution_count": 204,
"id": "3b7784d1",
"metadata": {},
"outputs": [],
"source": [
"#gdf = geopandas.read_file(data.world_110m.url,include_fields=['name'],layer='countries', driver='GeoJSON')\n",
"gdf = geopandas.read_file(data.world_110m.url,layer='countries')"
]
},
{
"cell_type": "code",
"execution_count": 205,
"id": "822936ff",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
id
\n",
"
geometry
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
4
\n",
"
POLYGON ((61.20961 35.64925, 62.23202 35.27011...
\n",
"
\n",
"
\n",
"
1
\n",
"
24
\n",
"
MULTIPOLYGON (((23.91324 -10.92658, 24.01764 -...
\n",
"
\n",
"
\n",
"
2
\n",
"
8
\n",
"
POLYGON ((20.59041 41.85586, 20.46440 41.51565...
\n",
"
\n",
"
\n",
"
3
\n",
"
784
\n",
"
POLYGON ((51.57952 24.24479, 51.75592 24.29387...
\n",
"
\n",
"
\n",
"
4
\n",
"
32
\n",
"
MULTIPOLYGON (((-66.95887 -54.89756, -67.56368...
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
172
\n",
"
548
\n",
"
MULTIPOLYGON (((167.51508 -16.59835, 167.18027...
\n",
"
\n",
"
\n",
"
173
\n",
"
887
\n",
"
POLYGON ((52.38592 16.38285, 52.19152 15.93771...
\n",
"
\n",
"
\n",
"
174
\n",
"
710
\n",
"
POLYGON ((28.21888 -32.77244, 27.46287 -33.227...
\n",
"
\n",
"
\n",
"
175
\n",
"
894
\n",
"
POLYGON ((32.75853 -9.23064, 33.23013 -9.67747...
\n",
"
\n",
"
\n",
"
176
\n",
"
716
\n",
"
POLYGON ((31.19251 -22.25149, 30.65971 -22.151...
\n",
"
\n",
" \n",
"
\n",
"
177 rows × 2 columns
\n",
"
"
],
"text/plain": [
" id geometry\n",
"0 4 POLYGON ((61.20961 35.64925, 62.23202 35.27011...\n",
"1 24 MULTIPOLYGON (((23.91324 -10.92658, 24.01764 -...\n",
"2 8 POLYGON ((20.59041 41.85586, 20.46440 41.51565...\n",
"3 784 POLYGON ((51.57952 24.24479, 51.75592 24.29387...\n",
"4 32 MULTIPOLYGON (((-66.95887 -54.89756, -67.56368...\n",
".. ... ...\n",
"172 548 MULTIPOLYGON (((167.51508 -16.59835, 167.18027...\n",
"173 887 POLYGON ((52.38592 16.38285, 52.19152 15.93771...\n",
"174 710 POLYGON ((28.21888 -32.77244, 27.46287 -33.227...\n",
"175 894 POLYGON ((32.75853 -9.23064, 33.23013 -9.67747...\n",
"176 716 POLYGON ((31.19251 -22.25149, 30.65971 -22.151...\n",
"\n",
"[177 rows x 2 columns]"
]
},
"execution_count": 205,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf"
]
},
{
"cell_type": "markdown",
"id": "2f520b97",
"metadata": {},
"source": [
"So, here we see that there is this ID -- this matches up with each country, for example:"
]
},
{
"cell_type": "code",
"execution_count": 206,
"id": "773ec570",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 206,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.iloc[0]['geometry']"
]
},
{
"cell_type": "code",
"execution_count": 207,
"id": "894e919d",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 207,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.iloc[1]['geometry']"
]
},
{
"cell_type": "markdown",
"id": "229e1e69",
"metadata": {},
"source": [
"Are both countries! But how to find out which ones?"
]
},
{
"cell_type": "markdown",
"id": "f59b2829",
"metadata": {},
"source": [
"To do that, we have to map the ID's to their [world country codes](https://documentation-resources.opendatasoft.com/explore/dataset/natural-earth-countries-110m/information/). Luckily, that is [already done for us](https://github.com/alisle/world-110m-country-codes)."
]
},
{
"cell_type": "code",
"execution_count": 208,
"id": "33b6169f",
"metadata": {},
"outputs": [],
"source": [
"country_codes = pd.read_json('https://raw.githubusercontent.com/alisle/world-110m-country-codes/master/world-110m-country-codes.json')"
]
},
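{
"cell_type": "markdown",
"id": "8a3b6c4d",
"metadata": {},
"source": [
"Attaching names to shapes is then a join on `id`. Below is a rough sketch on toy data, assuming the lookup table has `id`, `code`, and `name` columns. The `shapes` and `codes` tables, the placeholder geometry strings, and the ID `999` are all made up for illustration (the IDs 4 and 24 are the ISO numeric codes for Afghanistan and Angola):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# stand-in for the geometry table: strings instead of real polygons\n",
"shapes = pd.DataFrame({'id': [4, 24, 999],\n",
"                       'geometry': ['poly_a', 'poly_b', 'poly_c']})\n",
"\n",
"# stand-in for the country-code lookup table\n",
"codes = pd.DataFrame({'id': [4, 24],\n",
"                      'code': ['AF', 'AO'],\n",
"                      'name': ['Afghanistan', 'Angola']})\n",
"\n",
"# left join keeps every shape; indicator=True adds a _merge column\n",
"combined = shapes.merge(codes, on='id', how='left', indicator=True)\n",
"\n",
"# shapes that found no name in the lookup table\n",
"unmatched = combined.loc[combined['_merge'] == 'left_only', 'id'].tolist()\n",
"print(unmatched)\n",
"```\n",
"\n",
"The `_merge` column is a quick way to spot shapes that ended up without a name (which show up as `NaN` in `name`)."
]
},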
{
"cell_type": "code",
"execution_count": 209,
"id": "67f8b834",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" id geometry code \\\n",
"168 840 MULTIPOLYGON (((-155.68896 18.91661, -155.9373... US \n",
"\n",
" name \n",
"168 United States "
]
},
"execution_count": 221,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf_comb.loc[gdf_comb['name']=='United States']"
]
},
{
"cell_type": "code",
"execution_count": 222,
"id": "69131f0a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Russia Russian Federation\n",
"found ['Russian Federation'] for Russia\n",
"Vietnam Viet Nam\n",
"found ['Viet Nam'] for Vietnam\n",
"no match Korea, North\n",
"no match Kosovo\n"
]
}
],
"source": [
"for c in corg_clean2['Country'].unique():\n",
" if c not in gdf_comb['name'].values: # if not in there, look for fuzzy\n",
" #print('no',c)\n",
" country_match = []\n",
" for cc in gdf_comb['name'].values:\n",
" #if c in cc: # there is an NaN\n",
" if type(cc)==str:\n",
" c2 = \"\".join(c.split()).lower()\n",
" cc2 = \"\".join(cc.split()).lower()\n",
" if c2 in cc2:\n",
" country_match.append(cc)\n",
" print(c,cc)\n",
" if len(country_match) >0:\n",
" print('found', country_match, 'for',c)\n",
" else:\n",
" print('no match',c)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 223,
"id": "c4cdfe95",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"found ['Russian Federation'] for Russia\n",
"found ['Viet Nam'] for Vietnam\n",
"no match Korea, North\n",
"no match Kosovo\n"
]
}
],
"source": [
"# store these names to be the same as in our dataset\n",
"for c in corg_clean2['Country'].unique():\n",
" if c not in gdf_comb['name'].values: # if not in there, look for fuzzy\n",
" #print('no',c)\n",
" country_match = []\n",
" for cc in gdf_comb['name'].values:\n",
" #if c in cc: # there is an NaN\n",
" if type(cc)==str:\n",
" c2 = \"\".join(c.split()).lower()\n",
" cc2 = \"\".join(cc.split()).lower()\n",
" if c2 in cc2:\n",
" country_match.append(cc)\n",
" #print(c,cc)\n",
" if len(country_match) >0:\n",
" print('found', country_match, 'for',c)\n",
" if len(country_match)==1: # only one\n",
" gdf_comb.loc[gdf_comb['name']==country_match[0],'name'] = c # replace\n",
" else:\n",
" print('no match',c)"
]
},
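{
"cell_type": "markdown",
"id": "e41f07b9",
"metadata": {},
"source": [
"The substring matching in the loops above can also be sketched as a small helper, which makes it easier to see why some names never match. The function names `normalize` and `find_matches` and the `candidates` list are made up for this illustration and are not the actual contents of the country-code table:\n",
"\n",
"```python\n",
"def normalize(name):\n",
"    # drop all whitespace and lowercase, same idea as in the loops above\n",
"    return ''.join(name.split()).lower()\n",
"\n",
"def find_matches(name, candidates):\n",
"    # keep string candidates (skipping NaN) that contain the normalized name\n",
"    target = normalize(name)\n",
"    return [c for c in candidates\n",
"            if isinstance(c, str) and target in normalize(c)]\n",
"\n",
"# hypothetical candidate names, including a missing value\n",
"candidates = ['Russian Federation', 'Viet Nam', 'North Korea', float('nan')]\n",
"\n",
"for name in ['Russia', 'Vietnam', 'Korea, North']:\n",
"    print(name, '->', find_matches(name, candidates))\n",
"```\n",
"\n",
"Note that `Korea, North` finds nothing here even with `North Korea` in the list, since the comma and the word order break the substring test. Names like that need a manual mapping."
]
},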
{
"cell_type": "markdown",
"id": "7a2cca7a",
"metadata": {},
"source": [
"Missing ids for North Korea and Kosovo, in our original corgi dataset, but let's add the IDs that we can to our corgi dataset:"
]
},
{
"cell_type": "code",
"execution_count": 224,
"id": "f4485747-5bdf-4e6a-9700-8b707f47418e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"