Geospatial Data Science: Logging Interactive Charts in Neptune with Plotly [Guide]

Geospatial data science is becoming an essential part of the data science landscape. Almost every event can be mapped to the surface of the earth. As a result, the field aims to answer detailed questions about location and to explain why features are located where they are. Like general data science, geospatial data science builds on a baseline of computing skills, statistical concepts, data wrangling, and visualization. This article is a brief guide to geospatial data analysis and visualization. We’re going to demystify the terminology, highlight the best parts of this area, explain what skills and technologies are used, and walk through a Python use case.

What is Geospatial Data Science?

There are a lot of buzzwords in data science, and too much buzz leads to misconceptions. So, let’s clarify some Geospatial Data Science terms:

  • Spatial comes from the Latin word “spatium”, meaning space. ‘Spatial’ relates to space and covers features and phenomena distributed in a dimensional continuum.
  • Geography is the investigation of Earth’s physical features, phenomena, atmosphere, and the effects of human activity on the planet.
  • Geospatial is a word of mixed origin, from the Greek “gaia” and the Latin “spatium”, meaning earth-space. Geospatial implies that a geographic entity’s location is referenced with a coordinate system.
  • Geographic Information System (GIS) is a system for gathering, managing, manipulating, analyzing, storing, and visualizing geospatial data (data with geographic components).

Geospatial Data Science is about using location to find patterns and solve complex problems. The scope goes beyond locality, to uncovering hidden patterns, understanding phenomena, finding complex relationships, and driving decision-making. This field explores data across time and space to learn about geographic events in terms of location, distance, and spatial interactions.

Why is Geospatial Data Science important?

Geospatial Data Science makes spatial decision-making easier. You can extract profound insights from geospatial data using a set of analytical methods and algorithms, including machine learning and deep learning techniques. The landscape of Geospatial Data Science includes:

  • Data engineering: reading, transforming, formatting, cleaning, and enriching geospatial data.
  • Visualization and exploration: guiding investigation visually with maps and dynamic charts.
  • Geospatial analysis: studying geospatial entities using various approaches and techniques.
  • Machine learning and deep learning: building predictive models that solve complex geospatial problems.
  • Big data analytics: reducing large volumes of geospatial data to a manageable size for easier and better analysis.
  • Modeling and scripting: automating processes and extending the functionality of geospatial systems.

On the business side of things, according to ‘State of Geospatial Data Science in Enterprise 2020’, there’s been an increase of 68% in geospatial analysis investments across numerous industries. It’s a very promising field because there are still few geospatial data scientists out there — only 1 in 3 data scientists claim to be experts in geospatial analysis.

Where can Geospatial Data Science be used?

Geospatial Data Science is revolutionary because nearly everything can be georeferenced. This means that geospatial data science can be used in almost any industry, including:

  • Healthcare
  • Telecommunications
  • Urban planning/development
  • Marketing
  • Social services
  • Military
  • Natural resource exploration and exploitation
  • Transportation
  • Education
  • Weather
  • Agriculture

Geospatial Data Science toolkits

Let’s discuss programming languages, databases, and tools that come in handy for Geospatial Data Science.

R and Python are popular programming languages for this work, as they provide numerous libraries for data science operations.

Geospatial data is stored and managed with databases and related tools; some examples here are PostGIS, GRASS GIS, PostgreSQL, SQL, and pgAdmin.

Some tools are designed to capture, store, manipulate, analyze, manage, and present geospatial data, for example, Esri, QGIS, Mapbox, CARTO, Google Earth, SpatialKey, GeoSpark, Alteryx, FME, Elasticsearch, Oracle, AWS, and Tableau.

Types of geospatial data

As Carly Fiorina said, “The goal is to turn data into information and information into insights”, so it’s important to know and understand different data types.

Geospatial data can be divided into two types:

  1. Geospatial-referenced data: represented by vector and raster formats.
  2. Attribute data: defined by tabular formats.

Vector data: Datasets stored as pairs of longitude and latitude coordinates. The basic units of this form of data are points (0 dimensions), lines (1 dimension), and polygons (2 dimensions). Each of these is a series of one or more coordinate points: a sequence of points forms a line, and lines joined into a closed ring form a polygon.

  • Points: Used to describe discrete, nonadjacent features. Points are 0-dimensional, so they have no length or area. Examples of point data are points of interest (POI), i.e. features such as schools, shopping malls, volcanoes, or hospitals.
New York Hospital point data map made by Aboze Brain | Kepler.gl
  • Lines/Arcs: Used to describe a set of ordered coordinates representing linear features. Lines are 1-dimensional: the length can be measured (a line has a starting and an ending point), but not the area. Examples of line data are rivers, roads, contours, and boundary lines.
New York Centerline map made by Aboze Brain | Kepler.gl
  • Polygons: Used to describe areas, which are 2-dimensional in nature. Polygons are defined by the lines that make up their boundaries and a point inside for identification and overview. Examples of polygons are administrative blocks and the boundaries of cities or forests.
New York Airport Polygons map made by Aboze Brain | Kepler.gl
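
To make the three vector units concrete, here is a minimal sketch that writes one of each as a GeoJSON-style dictionary using only the standard library. The coordinates are illustrative values around Lagos, and the helper names `dimension` and `is_closed_ring` are hypothetical, not part of any library.

```python
# GeoJSON orders every position as [longitude, latitude].
# All coordinates below are illustrative values, not real features.
point = {"type": "Point", "coordinates": [3.3087, 6.5355]}

line = {
    "type": "LineString",
    "coordinates": [[3.3711, 6.5095], [3.3515, 6.6018], [3.5852, 6.4698]],
}

polygon = {
    "type": "Polygon",
    # One exterior ring; the first and last positions are identical to close it.
    "coordinates": [[
        [3.29, 6.49], [3.35, 6.49], [3.35, 6.54], [3.29, 6.54], [3.29, 6.49],
    ]],
}

DIMENSIONS = {"Point": 0, "LineString": 1, "Polygon": 2}


def dimension(geometry):
    """Return the dimension (0, 1, or 2) of a GeoJSON geometry dictionary."""
    return DIMENSIONS[geometry["type"]]


def is_closed_ring(ring):
    """A polygon ring needs at least four positions and must end where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


for geometry in (point, line, polygon):
    print(geometry["type"], "has dimension", dimension(geometry))
```

Libraries such as Geopandas wrap the same ideas in richer geometry objects, so this sketch is only meant to show what sits underneath.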

Raster data: This data type represents surfaces; it’s also known as grid data. The ‘grid’ consists of a matrix of cells (pixels) organized in rows and columns, where each cell holds a value. Raster data comes in two forms: discrete raster data, such as population density distribution, and continuous raster data, such as temperature or elevation. There are also three types of raster datasets: thematic data, spectral data, and imagery.

  1. Thematic maps show the distribution of human or natural features and are mainly based on discrete data.
  2. Spectral maps show specific wavelengths of the electromagnetic spectrum and are mostly based on continuous data.
  3. Imagery is simply satellite or aerial photographs.
New York Elevation data analysis made by Aboze Brain | Kepler.gl
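
As a rough illustration of the grid idea, the sketch below stores a tiny continuous raster as a list of rows and looks up the cell that contains a given coordinate. The elevation values, the grid origin, and the cell size are all made up for the example, and `cell_value` is a hypothetical helper.

```python
import math

# A made-up 3 x 3 elevation raster in metres.
# Rows run north to south, columns run west to east.
elevation = [
    [12.0, 15.5, 18.0],
    [10.0, 11.5, 14.0],
    [8.5, 9.0, 10.5],
]

# Hypothetical georeferencing: top-left corner of the grid and cell size in degrees.
origin_lon, origin_lat = 3.0, 7.0
cell_size = 0.1


def cell_value(lon, lat):
    """Return the raster value of the cell that contains (lon, lat)."""
    col = math.floor((lon - origin_lon) / cell_size)
    row = math.floor((origin_lat - lat) / cell_size)
    if not (0 <= row < len(elevation) and 0 <= col < len(elevation[0])):
        raise ValueError("coordinate falls outside the raster extent")
    return elevation[row][col]


print(cell_value(3.15, 6.85))  # falls in the middle cell of the grid
```

Real raster formats store the same ingredients (a value matrix plus georeferencing metadata), only at a much larger scale.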

All the datasets used for the examples above were sourced from the NYC Open Data portal here.

Attribute data: This is tabular data that describes geospatial features. Table fields can hold the following data types:

  • Integer values
  • Floating values (decimal numbers)
  • Character values (strings/text)
  • Date values
  • Binary Large Object (BLOB) for storing information such as images, multimedia, or bits of code

Formats of geospatial data

Both vector and raster data come in various file formats, distinguished by their file extensions. To run geospatial analysis with this data across multiple platforms, you need to know which file formats each platform accepts.

Here’s an overview of file formats for both vector and raster data:

Vector data formats:

Raster data formats:

Geospatial data sources

We know the types of geospatial data, but where can we source it from? There are plenty of websites; Wikipedia has published a list of reliable geospatial data sources here.

Use case: Python as a tool for Geospatial Data Science

Let’s use some Python libraries for geospatial analysis. We’ll be using Geopandas and Plotly for data wrangling and visualization, and Neptune AI to log interactive maps, to achieve the following:

  1. Scatter Plots on Map
  2. Choropleth Map
  3. Density Heatmap
  4. Lines on Map
  5. Study areas on Map

Prerequisites:

  • Python 3.7
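  • Python libraries, installed with pip from PyPI:

– Geopandas

pip install geopandas

– Plotly (Plotly Express ships with it)

pip install plotly

– Neptune Client

pip install --upgrade --quiet neptune-client

– python-dotenv (needed for the dotenv import used when initializing Neptune)

pip install python-dotenv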
  • Alternatively, these libraries can be installed with Conda:

– Geopandas

conda install geopandas

– Plotly (Plotly Express ships with it)

conda install -c plotly plotly

– Neptune Client

conda install -c conda-forge neptune-client

– python-dotenv

conda install -c conda-forge python-dotenv

Data Source: I sourced data from Grid3 (Geo-Referenced Infrastructure and Demographic Data for Development). This initiative provides high-resolution population, infrastructure, and other reference data supporting national sectoral development priorities, humanitarian efforts, and the United Nations’ Sustainable Development Goals (SDGs) in Nigeria.

The scope of analysis for this publication is focused on Lagos, Nigeria. Lagos is one of the fastest-growing cities globally and a major financial center for all of Africa.

The link to the code, datasets and other resources used in this article can be found here.

Geospatial data - Lagos map
Source: Google Maps

To log interactive plots in Neptune AI, you need to sign up and create a new project. Creating an account gives you an API token that you use to integrate Neptune AI’s various features.

Geospatial data - logging to Neptune

With your Neptune AI account set up and your API token at hand, we can import and initialize Neptune.

In your code editor, open a new file named .env (note the leading dot) and add the following credentials:

API_KEY=<Your API key>

This is important for security purposes, as you should never hardcode your secrets into your application. Create a .gitignore file and add .env to it so the file is never committed.

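import os

import neptune.new as neptune
from dotenv import load_dotenv  # provided by the python-dotenv package

# Load the variables defined in .env into the process environment
load_dotenv()

# The name must match the key used in the .env file
API_token = os.getenv("API_KEY")

run = neptune.init(project='codebrain/Geospatial-article',
                   api_token=API_token) # your credentials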

Note: python-dotenv is a package that loads environment variables from a .env file into the process environment.
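
A missing or misnamed variable makes `os.getenv` return `None` without any warning, which only surfaces later as an authentication error. As a small safeguard, you could wrap the lookup in a helper that fails early. This is a sketch using only the standard library; `require_env` is a hypothetical helper name, not part of python-dotenv or Neptune.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Usage, after load_dotenv() has run:
# api_token = require_env("API_KEY")
```

The variable name passed to the helper has to match the key in your .env file exactly, including case.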

Executing this code block will give you a custom link that connects your project to Neptune, like this: https://app.neptune.ai/codebrain/Geospatial-analysis/e/GEOS-2

Next, we import our base libraries and read the files. The file formats to be read in are all in JSON. The data frames read correspond to health facilities, administrative boundaries, and population datasets.

# importing libraries
import pandas as pd
pd.set_option('display.max_columns', None)
import geopandas as gpd
import plotly.express as px

#reading datasets
health_df = gpd.read_file('../Datasets/health-care.geojson')
adminstrative_df = gpd.read_file('../Datasets/lga.geojson')
pop_df = gpd.read_file('../Datasets/NGA_population.json')

It’s essential to get an overview of each dataset feature. This way you learn about the various variables and their contextual meaning and datatypes.

health_df.columns
adminstrative_df.columns
pop_df.columns

Let’s clean up the various data frames with features relevant to our analysis:

health_df = health_df[['latitude', 'longitude','functional_status','type', 'lga_name','state_name', 'geometry']]
adminstrative_df = adminstrative_df[['lga_name','state_name','geometry']] 
pop_df = pop_df[['lganame','mean', 'statename','geometry']]
pop_df = pop_df.replace(to_replace=['Ajeromi Ifelodun', 'Ifako Ijaye','Oshodi Isolo' ], value=['Ajeromi/Ifelodun','Ifako/Ijaye', 'Oshodi/Isolo']).reset_index()
pop_df = pop_df.rename(columns={'lganame':'lga_name'})
pop_df.drop('index', axis=1, inplace=True)
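
The name replacement in the cleanup step matters because the datasets are later joined on the LGA name, and any spelling difference silently drops rows from the join. A quick way to spot such differences is to compare the two sets of names before merging. The sketch below uses plain Python sets on made-up sample lists; `unmatched_keys` is a hypothetical helper.

```python
def unmatched_keys(left_names, right_names):
    """Return the names found only on the left and only on the right, sorted."""
    left, right = set(left_names), set(right_names)
    return sorted(left - right), sorted(right - left)


# Made-up samples that mimic the spelling differences between the two datasets.
health_names = ["Ajeromi/Ifelodun", "Ikeja", "Oshodi/Isolo"]
population_names = ["Ajeromi Ifelodun", "Ikeja", "Oshodi Isolo"]

only_in_health, only_in_population = unmatched_keys(health_names, population_names)
print("Only in health data:", only_in_health)
print("Only in population data:", only_in_population)
```

With the real data frames you would pass the `lga_name` columns instead of the sample lists; two empty results mean every name has a partner.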

For this guide, we want to analyze the distribution of health care facilities by functional status, the number of health care facilities per 100,000 people, and the distribution of health care facility types (primary, secondary, and tertiary). So, let’s create these features from the base data:

health_pop = health_df.merge(pop_df, how='left', on='lga_name')
health_pop.drop(columns=['statename','geometry_x'],axis=1, inplace=True)
health_pop.rename(columns={'geometry_y':'geometry'}, inplace=True)
total_hospital_count=health_pop.groupby('lga_name')['geometry'].count().reset_index()
hosp_per_x = total_hospital_count.merge(pop_df, how='left', on='lga_name')
hosp_per_x.rename(columns={'geometry_x':'count', 'mean':'mean_pop', 'geometry_y':'geometry'}, inplace=True)
hosp_per_x['Health_facilities_per_100000'] = (hosp_per_x['count']/hosp_per_x['mean_pop'])*100000
hosp_per_x.drop(columns=['count','mean_pop', 'statename'], axis=1, inplace=True)
hosp_per_100000 = hosp_per_x[['lga_name','Health_facilities_per_100000']]
 
hosp_type_df = health_df.replace(to_replace=['Primary', 'Secondary','Tertiary' ],
                      value=[1,2,3]).reset_index()
hosp_type_df.drop('index', axis=1, inplace=True)
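
The per-100,000 rate is simply the facility count divided by the population, scaled by 100,000. Here is a toy version of that arithmetic in plain Python so the numbers can be checked by hand. The facility list and population figures are invented for illustration and are not taken from the GRID3 data.

```python
from collections import Counter

# Invented records: one entry per facility, labelled with its LGA.
facility_lgas = ["Ikeja", "Ikeja", "Ikeja", "Epe", "Epe"]

# Invented population figures per LGA.
population = {"Ikeja": 600_000, "Epe": 250_000}

# Count facilities per LGA, then scale to a rate per 100,000 people.
facility_counts = Counter(facility_lgas)
facilities_per_100k = {
    lga: facility_counts[lga] * 100_000 / population[lga]
    for lga in population
}

for lga, rate in facilities_per_100k.items():
    print(f"{lga}: {rate:.2f} facilities per 100,000 people")
```

In this toy data Ikeja has more facilities in absolute terms, but Epe has the higher rate, which is exactly the kind of difference the normalized feature is meant to reveal.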

We’ll use Mapbox maps with Plotly for the analysis. Mapbox is a mapping and location cloud platform for developers that provides endpoints for different applications. We’ll use a custom Mapbox base layer to improve the look of our maps and to get access to customization features. The maps in this guide use the Frank style from the Mapbox gallery.

To connect these two platforms for epic results, you need a Mapbox account and a public Mapbox Access Token. You can sign up here and get your public access token as follows:

Geospatial data - mapbox
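
The plotting code in the next section refers to two variables, `access_token` and `style_url`. One way to define them is sketched below, under a few assumptions: the environment variable name `MAPBOX_TOKEN` is hypothetical (use whatever key you put in your .env file), and the username and style ID are placeholders you need to replace with the values from your own Mapbox account.

```python
import os

# Assumption: the public token sits in your .env file under this hypothetical name.
access_token = os.getenv("MAPBOX_TOKEN")


def build_style_url(username, style_id):
    """Build a Mapbox style URL of the form mapbox://styles/<username>/<style_id>."""
    return f"mapbox://styles/{username}/{style_id}"


# Placeholders: replace both values with your own account name and style ID.
style_url = build_style_url("your-username", "your-style-id")
print(style_url)
```

Keeping the token in the environment follows the same reasoning as the Neptune API key: it stays out of the code you commit.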

Building interactive maps

Scatter plots

This plot shows the distribution of point data based on coordinates. The main objective here is to plot the distribution of health care facilities based on their functional status.

fig1 = px.scatter_mapbox(health_df, lat="latitude", lon="longitude", color="functional_status", hover_data=["type", "lga_name"],
                       zoom=8, height=300,
                      labels={'functional_status':'Functional status of Health Facilities'},
                      center={'lat': 6.5355, 'lon': 3.3087}
)
# style_url and access_token hold your Mapbox style URL and public access token
fig1.update_layout(mapbox_style=style_url, mapbox_accesstoken=access_token)
fig1.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
run['interactive__scatter_plot_img'] = neptune.types.File.as_html(fig1)

Note: The runs and plots are saved to your Neptune AI project.

Each run tracks the results and monitors compute resources such as CPU and memory usage.

Interactive scatter plot:

Choropleth map

This is a map composed of colored polygons. It’s used to represent spatial variations of a quantity. Here, the objective is to show the number of health care facilities per 100,000 people in the subdivisions of the case study area, known as Local Government Areas (LGAs) in Lagos.

import json

# Load the LGA boundaries as a GeoJSON dictionary; the file is closed automatically
with open('/content/drive/MyDrive/Geospatial-article/Datasets/lga.geojson') as f:
    geo_json_file = json.load(f)
fig2 = px.choropleth_mapbox(hosp_per_100000,
                    geojson=geo_json_file,
                    locations='lga_name',
                    color='Health_facilities_per_100000',
                    featureidkey="properties.lga_name", 
                    range_color=(0, 100),
                    labels={'Health_facilities_per_100000':'Health Facilities_per_100000'},
                    zoom=8.5,
                    center={'lat': 6.5355, 'lon': 3.3087}
                         )
fig2.update_layout(mapbox_style=style_url, mapbox_accesstoken=access_token)
 
fig2.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
run['interactive__choropleth_map_img'] = neptune.types.File.as_html(fig2)
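
A choropleth only colors the polygons whose feature property matches a value in the `locations` column, so a name that is spelled differently in the GeoJSON leaves a blank area on the map. The sketch below checks for such gaps before plotting. It runs on a made-up FeatureCollection, and `missing_locations` is a hypothetical helper, not a Plotly function.

```python
def missing_locations(geojson, property_name, locations):
    """Return the location values that have no matching GeoJSON feature."""
    feature_ids = {
        feature["properties"].get(property_name)
        for feature in geojson["features"]
    }
    return sorted(set(locations) - feature_ids)


# A made-up FeatureCollection; geometries are omitted to keep the example short.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"lga_name": "Ikeja"}, "geometry": None},
        {"type": "Feature", "properties": {"lga_name": "Oshodi/Isolo"}, "geometry": None},
    ],
}
sample_locations = ["Ikeja", "Oshodi Isolo"]

print(missing_locations(sample_geojson, "lga_name", sample_locations))
# prints ['Oshodi Isolo']
```

With the real data you would pass the loaded GeoJSON dictionary, the property named in `featureidkey` (without the `properties.` prefix), and the `lga_name` column.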

Interactive plot of choropleth map:

Density heatmap

This map shows the magnitude of a phenomenon with colors, with obvious visual cues on how the phenomenon is clustered or varies over space. Here, we can try to visualize clusters of types of health care facilities in the case study.

Keys:

  • 1: Primary health care facilities
  • 2: Secondary health care facilities
  • 3: Tertiary health care facilities
fig3 = px.density_mapbox(hosp_type_df, lat='latitude', lon='longitude', z='type', radius=10,
                       center={'lat': 6.5355, 'lon': 3.3087}, zoom=8.5,
                       labels={'type':'Health Facilities type'},
                       )
fig3.update_layout(mapbox_style=style_url, mapbox_accesstoken=access_token)
fig3.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
run['interactive__heatmap_map_img'] = neptune.types.File.as_html(fig3)

Interactive plot of density heatmap:

Line on the map

Sometimes your analysis requires drawing lines on a map, for example, to show a distance or a route. This can be easily done as follows:

import plotly.graph_objects as go
fig4 = go.Figure(go.Scattermapbox(
   mode = "markers+lines",
   lat = [6.5095,6.6018, 6.4698],
   lon = [3.3711,3.3515, 3.5852],
   marker = {'size': 10}))
fig4.update_layout(mapbox_style=style_url, mapbox_accesstoken=access_token)
fig4.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig4.update_layout(mapbox= dict(
   center={'lat': 6.5355, 'lon': 3.3087},zoom=8.5 ))
run['interactive__line_on_map_img'] = neptune.types.File.as_html(fig4)
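
If the line is meant to communicate distance, it helps to compute that distance as well. The sketch below applies the haversine formula to the same three coordinates used for the line. It treats the Earth as a sphere with a mean radius of 6371 km, so the result is an approximation, and both function names are hypothetical helpers.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def path_length_km(lats, lons):
    """Sum the great-circle distances between consecutive points of a path."""
    points = list(zip(lats, lons))
    return sum(haversine_km(*start, *end) for start, end in zip(points, points[1:]))


# The same coordinates as the line drawn on the map.
lats = [6.5095, 6.6018, 6.4698]
lons = [3.3711, 3.3515, 3.5852]

total_km = path_length_km(lats, lons)
print(f"Path length: {total_km:.1f} km")
```

This is a straight-line (great-circle) length, not a driving distance, so treat it as a lower bound for any real route.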

Interactive plot of line on map:

Study area on the map

As with lines, sometimes we want to isolate certain areas on a map to study them further. The isolated area can take any polygon shape, depending on the coordinates you choose. This can be done as follows:

fig5 = go.Figure(go.Scattermapbox(
   fill = "toself",
   lon = [3.297806, 3.295470, 3.349685, 3.346413], lat = [6.539536,6.488922, 6.488922, 6.542322],
   marker = { 'size': 10, 'color': "red" }))
fig5.update_layout(mapbox_style=style_url, mapbox_accesstoken=access_token)
fig5.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig5.update_layout(mapbox= dict(
   center={'lat': 6.5355, 'lon': 3.3087},zoom=10))
run['interactive__study_area_img'] = neptune.types.File.as_html(fig5)
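
Once a study area is drawn, a natural next step is to find out which points fall inside it. A minimal way to do that is the ray casting test sketched below, applied to the same four corner coordinates as the polygon. It treats longitude and latitude as flat x and y values, which is a reasonable simplification for an area this small, and `point_in_polygon` is a hypothetical helper.

```python
def point_in_polygon(lon, lat, poly_lons, poly_lats):
    """Ray casting test: True if (lon, lat) lies inside the polygon."""
    inside = False
    n = len(poly_lons)
    j = n - 1  # index of the previous vertex, starting with the closing edge
    for i in range(n):
        xi, yi = poly_lons[i], poly_lats[i]
        xj, yj = poly_lons[j], poly_lats[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


# The same corner coordinates as the study area polygon.
area_lons = [3.297806, 3.295470, 3.349685, 3.346413]
area_lats = [6.539536, 6.488922, 6.488922, 6.542322]

print(point_in_polygon(3.32, 6.51, area_lons, area_lats))  # a point within the area
print(point_in_polygon(3.50, 6.50, area_lons, area_lats))  # a point east of the area
```

For real workloads, the geometry objects that Geopandas builds on offer equivalent containment checks; the sketch only shows the idea behind them.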

Interactive plot of study area:

Conclusion

And that’s it! If you made it this far, thank you for reading. I hope this article gave you a comprehensive overview of how to get started in the emerging field of geospatial data science.

