Mulling over boxplots and scatterplots.

I was searching for a way to analyse radar image data using boxplots so that I could visualise the distribution of backscatter values of different forest cover types for a given polarisation. The answer was provided by R, perhaps the most powerful free software tool there is for statistical computing and graphics, which enabled me to generate the boxplots for my purpose. It took a bit of time to produce as I had to learn R from scratch, which became possible thanks to available online resources. R also gave me the flexibility to manipulate and visualise the data the way I wanted to, which could not have been possible in Excel, and it generated more visually appealing graphics.

Boxplot showing radar backscatter values of forest types for ALOS/PALSAR HV polaristion.

Boxplot showing radar backscatter values of forest types for ALOS/PALSAR HV polarisation.

The boxplot shown above was generated through R via RStudio, an integrated development environment that provided a powerful interface for R, and the ggplot2 library package, a plotting system for R based on the grammar for graphics developed by Dr Hadley Wickham. From this boxplot, for example, it can be said that forest types closed mixed forests (CFM) and mangrove forests (MF) can be distinguished apart from each using just ALOS/PALSAR’s HV polarisation. On the other hand, the radar backscatter values of open forests (i.e., OFB, OFC, OFM) regardless whether these are broadleaved, coniferous, or mixed overlap significantly; hence, these types could be hardly distinguished from each other using the same image.

After the boxplot, the next step was to see the relationship and distribution of the backscatter values between two variables (against both HH and HV polarisation) by displaying them in a scatterplot, which would also tell me whether the backscatter values of these forest types could be separated from one another. I finally figured it out after much trial and error and drinking several mugs of coffee. I documented the process for creating the scatterplot using R below:

At the R console using RStudio IDE, I loaded the ggplot2 package:

> library(ggplot2)

Next, I loaded my data which was contained in a CSV file using the following command:

> SPHHHV <- read.csv(file="/filepath/SPHHHV.csv", header=TRUE, sep=",")

This meant that I was telling R that the values in the file were separated by a comma (sep=”,”) and a column heading was present (header=TRUE). The data was also captured in a data frame that was stored in the variable SPHHHV. Then I used the head command to display the first several rows of the matrix or data frame SPHHHV as follows:

> head (SPHHHV)
  ForestType   Mean.HH    SD.HH   Mean.HV    SD.HV
1        CFB -7.477564 1.525228 -12.81651 1.673345
2        CFC -7.031030 1.938132 -12.09642 1.563122
3        CFM -6.595219 0.827661 -11.16381 0.686749
4        FPB -7.007717 1.937744 -13.31794 2.057441
5         MF -8.146819 1.136936 -14.70111 1.624990
6        OFB -7.414856 1.523413 -13.03388 1.754264

Next, I created another data frame q to contain the ggplot using aesthetics mapping, which specified ggplot would access the data frame SPHHHV and that the mean HH and mean HV values were to be plotted along x- and y-axes, respectively, and these values would be grouped or colored according to forest type. I went on to construct data frames to define the limits for the error bars, specifically one each for HH and HV.

> q <- ggplot(SPHHHV, aes(colour=ForestType, y=Mean.HV, x=Mean.HH))
> limitsHH <- aes(xmax = Mean.HH + SD.HH, xmin = Mean.HH - SD.HH)
> limitsHV <- aes(ymax = Mean.HV + SD.HV, ymin = Mean.HV - SD.HV)

It was time to test whether the plot would work. I executed the following command, which generated the subsequent scatterplot. Just to note: geom_point() plots the points contained within the data frame q, and then produces error bars along the x-axis (geom_errorbarh) and y-axis (geom_errorbar) defined by the limits (limitsHH, limitsHV):

> q + geom_point() + geom_errorbarh(limitsHH, height=0.2) + geom_errorbar(limitsHV, width=0.2)
Initial scatterplot test showing HH and HV backscatter values of forest types along the x- and y-axes, respectively.

Initial scatterplot test showing HH and HV backscatter values of forest types along the x- and y-axes, respectively.

Looks like it worked. Finally, the finishing touches were made by adding the following commands to include the axis labels, plot title, and some plot enhancements:

Increasing the point size to enhance the visibility of the mean values:

+ geom_point(size=5)

Adding axes labels and and the plot title:

+ xlab(label='HH (dB)') + ylab(label='HV (dB)') + ggtitle(label='Mean Radar Backscatter of Forest Types for ALOS/PALSAR HH & HV Polarization')

Expanding the limits of the plot area to add a bit more space:

+ coord_cartesian(xlim = c(-4.50,-9.50),ylim = c(-10,-17))

The complete command to generate the final scatterplot is as follows:

> q + geom_point(size=5) + xlab(label='HH (dB)') + ylab(label='HV (dB)') + ggtitle(label='Mean Radar Backscatter of Forest Types for ALOS/PALSAR HH & HV Polarization') + coord_cartesian(xlim = c(-4.50,-9.50),ylim = c(-10,-17)) + geom_errorbarh(limitsHH, height=0.2) + geom_errorbar(limitsHV, width=0.2)
Final scatterplot showing HH and HV backscatter values of forest types along the x- and y-axes, respectively.

Final scatterplot showing HH and HV backscatter values of forest types along the x- and y-axes, respectively.

The scatterplot shows that mean backscatter values of most forest types are clustered and their plots overlap, which would make it difficult, if not impossible, to distinguish them from each other. Perhaps an exception could be mangrove forests (MF) can be discriminated from closed mixed forest (CFM), which are situated at the lower-left and upper-right areas of the plot.

Now, time to move on to the next item on the list of analyses to be completed.

In the footsteps of revolutionaries.

Last November 30th Filipinos commemorated the birth of one of its great national heroes, Andres Bonifacio, on his sesquicentennial year. Andres Bonifacio is well-known as the Supremo of the revolutionary society called the Katipunan, whose primary goal was to pursue Philippine independence from Spanish colonial rule through revolution, and is also widely considered as the father of the Philippine Revolution. Unlike Dr Jose Rizal, who embraced a reformist perspective and peaceful political advocacy through assimilation with Spain as the way forward (and of whom we also commemorated his martyrdom last December 30th), Bonifacio took a more radical approach and inspired resistance and armed revolution to attain independence. He met his untimely death still in pursuit of this ideal, although quite sadly at the hands of Emilio Aguinaldo, a fellow compatriot.

Celebrating Bonifacio Day also reminded me of another prominent figure, an English physician by the name of Dr John Snow, a revolutionary in his own way who changed the science of epidemiology forever, and made his imprint in the history of public health and geography. His work involved tracing the source of the cholera outbreak in London in 1854 by mapping the cluster of cholera cases around a water pump, which had been drawing sewage water from the Thames and then delivered to households. The map he developed led him to investigate the spatial patterns and examine the contaminated water samples, and then to finally establish the connection between polluted water and the increased incidence of cholera. The ending of the outbreak is attributed to his findings.

The map of the 1854 London cholera outbreak led to Dr John Snow’s visionary work in deducing the mode of transmission of epidemic cholera, which is why he is regarded as father of modern epidemiology. He is also considered as the ‘original’ father of geographic information systems due to his demonstration of one of the earliest uses of spatial analysis, and simply because the map made a huge impact through data visualisation. In fact, his cholera map changed the world by changing the way people viewed disease.

Dr John Snow's 1854 London cholera outbreak map. This map shows the locations of the 13 public water sources in and around Soho district, and shows cholera deaths by home address, marked as black bars stacked perpendicular to the streets. (Image source: University of Delaware)

Dr John Snow’s 1854 London cholera outbreak map. This map shows the locations of the 13 public water sources in and around Soho district, and shows cholera deaths by home address, marked as black bars stacked perpendicular to the streets. (Image source: University of Delaware)

Not so long ago, I realised there was a suite of available geospatial tools and models that when put together would redefine our approach to forest conservation. These tools provided a way to express data from forest inventory plots and biodiversity transects in a more spatially explicit manner.  Using both forest inventory field data and earth observation data such as synthetic aperture radar, spatially explicit estimates of above ground biomass and forest carbon stock can be computed and mapped. Point localities of species observations from biodiversity transects can be modeled together with environmental thematic data to produce  species distribution maps. These tools have actually been around for a while, and are still active areas of research. However these have not yet been utilised or applied more widely, particularly in the Philippines, for planning and for informed decision-making. The following maps of Mt Kitanglad Range Natural Park below show an example of how these approaches can be applied.

Species distribution map of some threatened birds in Mt Kitanglad Range Natural Park modeled using maximum entropy.

Species distribution map of some threatened birds in Mt Kitanglad Range Natural Park (including Philippine eagle, Philippine cockatoo, Celestial monarch, Philippine dwarf kingfisher, Whiskered pitta, and Philippine hawk-eagle) modeled using maximum entropy. Red pixels denote high congruence of species occurrence in those areas; orange to yellow as moderate; and green to cyan to blue as low. Red lines show protected area boundaries; black lines are municipal boundaries.

Using species distribution models, for example, can show the probability of occurrence or the most likely suitable habitat of a particular species. But combining several models of species, say of threatened or trigger species, could show species richness such that areas of high congruence of these species in a given pixel can be delineated into an appropriate management zone (such as a strict protection zone). The species richness map resulting from the combined distribution models can also support a more explicit delineation of high conservation value areas. In fact, high species occurrence (red pixels) can be seen within Mt Kitanglad Range Natural Park, which could be deemed as adequately contained or ‘protected’ within the confines of the protected area. However, another area directly south of the protected area can be seen as areas with high congruence of species occurrence—hence a high conservation value area—perhaps of similar intensity as the Park, but is currently not identified or delineated as a protected area. The species distribution models can also then support the delineation, refinement, or even redefinition, of boundaries of identified key biodiversity areas and protected areas.

Forest carbon density map of Mt Kitanglad Range Natural Park generated by Saatchi et al. 2011 using a data fusion model.

A 1-km resolution forest carbon density map (c.2000) of Mt Kitanglad Range Natural Park generated by Saatchi et al. 2011 using a data fusion model based on maximum entropy approach and spatial imagery from multiple sensors to extrapolate aboveground biomass from ground-based forest inventory plots and LIDAR-based height data. Red pixels show areas of high aboveground biomass; orange to yellow as moderate; and green to cyan to blue as low. Red lines show protected area boundaries; black lines are municipal boundaries.

The next map shows a 1-km pixel resolution forest carbon density map (c.2000) generated by Saatchi et al. 2011 using a data fusion model based on the maximum entropy approach and spatial imagery from multiple sensors, including MODIS, SRTM, and QSCAT to extrapolate aboveground biomass measurements from forest inventory plots and ICESat GLAS data to the entire landscape. Here we can see that forest carbon density within Mt Kitanglad Range Natural Park is dominantly at, say, moderate levels (orange to yellow pixels). In fact, areas of high forest carbon density (red pixels) are outside the protected area, particularly located to the west. Although, not shown here on a map, forest cover within Mt Kitanglad Range Natural Park is extensive and almost covers the entire Park. Forests within the Park, while extensive and covers the entire protected area, are not forests with high biomass or carbon density, which of course correspond to trees with large dbh and height values. Perhaps these forests are composed of trees with smaller dbh and tree heights?

An interesting story between these two maps is that areas of high congruence for identified threatened species do not necessarily correspond with areas of high aboveground biomass. Or in other words, high conservation value areas are not necessarily congruent with high forest carbon density areas, at least in this example. (At the moment, it’s too bad I don’t have a way yet to depict information from two maps into just one singular map by displaying the legends of two raster maps as a color matrix. But the good news is a colleague from UNEP WCMC has developed a way to make this happen, so I’ll be seeking her help to be able to generate that combined map.) One implication of these maps in terms of REDD+, particularly in the Philippines where the national REDD+ strategy puts equal weights on forest carbon and biodiversity, points out that areas identified as important for carbon under a performance-based financial incentive mechanism like REDD, may not be the same areas that are deemed important for biodiversity conservation.

At FFI we’re currently exploring these tools and approaches, and our hope is to utilise these more extensively and share these with other conservation practitioners.  I would like to think that we are following the footsteps of great revolutionaries that have gone before us like Snow or Bonifacio—that of pursuing revolutionary ideas. Will it change the way we do forest conservation or the way we look at conservation planning? Only time will tell.

PhilGEOS 2013.

I wish to invite you to participate in PhilGEOS 2013 to be held on 28 to 29 November 2013 at the National Institute of Physics Auditorium at the University of the Philippines, Diliman, Quezon City. The theme of the 2nd Philippines Geomatics Symposium, Geomatics for a Resilient Agriculture and Forestry, aims to gather not only professionals working in the geomatics industry, but practitioners who are utilising and advancing geospatial technologies in the fields of agriculture and forestry.

PhilGEOS 2013

I am honored to be one of the co-authors of a paper entitled, Satellite and Airborne Mapping and Data Sets for Forest Monitoring and Assessment of Carbon Stocks in the Philippines: An Expository Review, which will be presented by Dr G Mendoza, our lead author, at the symposium. Other interesting topics (at least for me) include oral presentations on assessing tree species diversity from permanent monitoring plots in natural forests by Oplenda et al.; on mapping forest canopy density for assessing carbon emissions or removals by Urizza et al.; and assessing the utility of various GNSS receivers under forest canopies by Balicanta et al. You can read more about the list of accepted abstracts from the PhilGEOS 2013 website.