By Andy May
R is an extremely powerful programming language for processing, analyzing and displaying data from large datasets. As discussed in the first post of this series on analyzing IGRA2 radiosonde data with R, the language has improved considerably in recent years. Surprisingly it is free and can be downloaded here. This post will cover some necessary R programming techniques for those interested in reading and plotting IGRA2 data. The IGRA2 measurements have had minimal processing and are as close to raw data as possible, unlike RICH or ROABCORE data, thus it is a useful check on climate model output. The IGRA2 data can be downloaded here or from its ftp site: ‘ftp.ncei.noaa.gov/pub/data/igra.’ It is well formed but requires some manipulation to make it useable. Once the data are prepared, some further tricky programming is needed to plot it. I’ll briefly introduce the key programming techniques here. The full suite of complete R programs that I used to analyze the IGRA2 data (May, 2025) can be downloaded here (warning the file is 658 MB and processing it will require > 32 GB of RAM). For a simple and brief list of the programs and what they do, download this pdf.
If you simply want to use the code described here to write your own programs, I recommend using grok.com as a programming aid, it is very helpful. R is such a rich language, programming in it is difficult without help. As a friend once told me, when programming in R, good groking is as important as good coding.
We will discuss two R programs, the first reads the downloaded IGRA2 text files and formats the data properly for R processing, the second makes binned tropospheric profile plots of selected values. The bins average radiosonde ascents by latitude and altitude to make interpretation easier. A third program, covered in the next post, makes global maps of selected 1990-2025 values averaged by station.
Reading the IGRA2 data
Once the data are downloaded, you will immediately realize how massive it is. Downloading all the monthly data takes up over 100 GB of storage in 2,923 files (as of August 2025). Each file is a station, and each station has two or more weather balloon ascents every day. I’ve tried processing the data and preparing it for analysis in one pass twice and never succeeded, it is just too much data. Preparing the data must be done in several passes, in this post we will just describe the initial program that reads the downloaded IGRA2 files. As mentioned above, all the programs I used and their documentation can be downloaded as part of the supplementary materials (May, 2025b) to my most recent paper (May, 2025). In this post I skip some of the preparatory steps, so the output from the first program shown in this post is not the input for the second one.
Downloading the IGRA2 radiosonde data using a browser requires you to download the files one at a time. You can log into the IGRA ftp site using an ftp app like filezilla with a user id of ‘anonymous,’ your email address as a password, and download all the files in one go.
The complete programs discussed in this post and the next one can be downloaded here. The program that reads the IGRA2 files is called IGRA_station_proc.r. It assumes that all the files are in one directory and the file names are listed in a text file called dir.lst. Each file contains all the ascents for one radiosonde station, and each ascent has a header record followed by the data collected by pressure level. The program accepts and processes all the ascents that have data between 1990 and 2025 with more than 90 pressure levels, there are 1,136 of these.
IGRA_station_proc.r first reads the list of IGRA2 files, then begins to read the files one-by-one. After reading an ascent header record, containing the station name, date and time of the ascent, latitude and longitude, it reads the data for each pressure level. The text records are then converted into the appropriate R data type, character, integer, or floating point. The most critical data items are ‘temp’ (temperature in degrees C), ‘hPa’ (pressure in hectopascals), ‘rh’ (relative humidity in percent), ‘gph’ (geopotential height in meters), ‘wdir’ (the ‘from’ wind direction as an azimuth), and ‘wspd’ (wind speed in m/sec). Then the following calculations are performed:
# Molar Density (mol/m^3)
mol_den <- if (is.na(hPa) | is.na(temp)) NA else (hPa * 100) / ((temp + 273.15) * 8.3145)
# rh from dpd (dew point depression) if rh is NA (null)
# The dewpoint = the temperature(temp) - dewpoint depression (dpd)
if (is.na(rh) && !is.na(dpd) && !is.na(temp)) {
rh <- dewpoint.to.humidity(t = temp, dp = (temp - dpd),
temperature.metric = 'celsius')
}
# Compute q_gkg (specific humidity)
t_k = temp + 273.15 # Convert to K
es = 6.1078 * exp((17.269388 * (t_k - 273.16)) / (t_k - 35.86)) #Magnus-Tetens eq. per Murray, 1967
e = (rh / 100) * es
q_gkg = ifelse(hPa > e & !is.na(temp) &
!is.na(rh) & !is.na(hPa),
0.622 * e / (hPa - e) * 1000, NA_real_)
The R code above first computes the molar density (Connolly & Connolly, 2014a), then if the relative humidity (rh) is missing and dew-point depression (dpd) has a valid value it computes the rh from the dpd. Finally, it computes the specific humidity (q_gkg) from the rh, temp, hPa, and the Magnus-Tetens equation (Murray, 1967).
Due to the volume of data, there are several more steps before the binned plots described below can be made, these are described here. This post is just a summary of the critical code needed to plot the IGRA2 data.
Making binned plots
Binned plots divide the vertical section into 10 hPa pressure bins and then average all the values in that bin for the variable being profiled over a selected subset of the data. In our case we subset by ten-degree latitude slices and by month, as shown in figure 1 for the latitude slice from 30°N to 40°N in December. The plot is of the binned molar density and shows the best fit lines from above 125 hPa (in blue) and between 300 and 750 hPa (in green) and their intersection as a blue dashed horizontal line.
The program that made figure 1 is Lat_split_bin_plot_avg_V2.r. The code to compute molar density (mol_den) was shown previously. The code to compute the best fit lines shown and the molar density intersection, which is just under the WMO tropopause (May, 2025), is shown below.
First, create subsets of the current month and latitude slice (‘cur_lat’) for the upper troposphere (above 125 hPa) and lower troposphere (between 300 and 750 hPa). The lower limit of 750 hPa is meant to exclude the common weather events below 750 hPa (about 2.5 km) so the line is not affected as much by surface events. Finally clear out any values in the slopes and intercepts.
lower <- cur_lat[hPa > 300 & hPa < 750]
upper <- cur_lat[hPa < 125]
lower_slope <- lower_int <- upper_slope <- upper_int <- NA_real_
Then fit the least squares lines using the R function “lm” and extract the slope and intercept of the lines from the lm function’s regression results using “coef,” as long as there are more than 2 points in the subset.
if (nrow(lower) > 2) {
lm_lower <- lm(hPa ~ mol_den, data = lower)
lower_slope <- coef(lm_lower)[2]
lower_int <- coef(lm_lower)[1]
}
if (nrow(upper) > 2) {
lm_upper <- lm(hPa ~ mol_den, data = upper)
upper_slope <- coef(lm_upper)[2]
upper_int <- coef(lm_upper)[1]
}
Find the intersection of the two lines using the R “matrix” and “solve” functions.
# === Find intersection of upper and lower fits ===
int_hPa <- int_mol_den <- NA_real_
if (all(!is.na(c(upper_slope, upper_int, lower_slope, lower_int)))) {
# Solve: hPa = m1 * mol_den + b1
# hPa = m2 * mol_den + b2
A <- matrix(c(1, -upper_slope, 1, -lower_slope), nrow = 2, byrow = TRUE)
B <- c(upper_int, lower_int)
sol <- solve(A, B)
int_hPa <- sol[1]
int_mol_den <- sol[2]
}
Now that we have the intersection, find the values of other key variables at this intersection pressure using an R function we create called “interp_at_int.” This function uses the R function “approx,” which interpolates a value at a specific point.
# === Interpolate key variables at intersection pressure ===
interp_at_int <- function(var) {
if (!is.null(cur_lat[[var]]) && sum(!is.na(cur_lat[[var]])) > 2 && !is.na(int_hPa) && int_hPa > 0) {
approx(cur_lat$hPa, cur_lat[[var]], xout = int_hPa, rule = 1)$y
} else NA_real_
}
int_temp <- interp_at_int("temp")
int_gph <- interp_at_int("gph")
int_wspd <- interp_at_int("wspd")
int_wdir <- interp_at_int("wdir")
int_rh <- interp_at_int("rh")
int_q_gkg <- interp_at_int("q_gkg")
Although the molar density intersection is not the same as the WMO tropopause (WMO & Ashford, 1957), it is just under it and, most importantly, locating it can be automated (May, 2025).
Now that the critical values are computed, the following code makes the plot in figure 1. There are many possible plots, those to be plotted are identified in the vector named “plot_indices.”
Loop through the list of plots stored in “plot_indices.”
# === Plotting loop ===
for (iplot in plot_indices) {
Store the variable to be plotted on the x axis in “x_var.” If the variable name is not in cur_lat, skip this plot.
x_var <- plot_scales$Name[iplot]
if (!x_var %in% names(cur_lat)) next
The R line of code below uses functions in the data.table package to create a new data.table “plot_dat”. The first part ([!is.na(get(x_var)) & !is.na(hPa)) ensures that only observations that have x_var and pressure (hPa) values are kept. The second part (.(hPa, x_val = get(x_var))]) identifies the variables to keep, hPa and x_var. x_var is a character variable that contains the name of the column we want to keep and plot. The period in front of the second part is a data.table alias for “list”; it simply means we are providing a list of columns (or variables) to carry from “cur_lat” to “plot_dat.”
# Prepare data for binning
plot_dat <- cur_lat[!is.na(get(x_var)) & !is.na(hPa), .(hPa, x_val = get(x_var))]
Create the pressure bins every 10 hPa from 10 to 1030 hPa with the ‘seq’ function, which is part of base R. The data.table special operator “:=” adds or modifies columns in place efficiently without copying the whole table. The base R function “cut” breaks a vector into intervals defined by breaks. The other parameters are to include the lowest values and will not label each bin.
# Pressure bins: 10 hPa intervals
breaks <- seq(10, 1030, by = 10)
plot_dat[, press_bin := cut(hPa, breaks = breaks, include.lowest = TRUE, labels = FALSE)]
Average the pressure values and the x variable in each bin. This statement uses data.table syntax to bin the data by pressure levels. It first excludes bins with no values, then averages pressure (hPa) and the x variable (x_val) for each bin. “N_obs” is the number of observations in each bin, “.N” is a special data.table variable that provides the number rows in the current group. “binned” becomes a data.table with four variables, “press_bin,” “avg_hPa,” “avg_x,” and “n_obs.”
binned <- plot_dat[!is.na(press_bin),
.(avg_hPa = mean(hPa),
avg_x = mean(x_val),
n_obs = .N),
by = press_bin
][n_obs > 0]
Create the file name for the plot using the platform-independent function “file.path” from the month abbreviation (“month_abbs” is an vector of month abbreviations from the R standard variable “month.abb”), latitude (“lat_val”), and the x variable name(“x_var”). “%s” means string and “%d” means an integer.
# File and title
plotname <- file.path(output_dir,
sprintf("%s_Lat_slice_%d_%s.png",
month_abbs[imonth], lat_val, x_var))
Using the month abbreviations, latitude, and current title from “plot_scales” this statement builds a nice looking “title” for the plot with the string building function “sprintf.”
title <- sprintf("%s, Latitude %d\u00b0 to %d\u00b0, %s",
month_abbs[imonth], lat_val, lat_val + 10, plot_scales$Title[iplot])
Use “ggplot” to make the graph. First set x=avg_x and y=pressure (hPa).
p <- ggplot(binned, aes(x = avg_x, y = avg_hPa)) +
Set the size of the dots on the graph.
geom_point(size = 1.5) +
Specify the x and y scales and axis titles using the values from plot_scales, which contains the contents of ‘plot_scales_title.csv.’ The y axis is reversed to place 1000 hPa on the bottom because it is the surface. X and Y axis labels are built from values in plot_scales.
scale_y_reverse(limits = c(plot_scales$base[1], plot_scales$top[1])) +
scale_x_continuous(limits = c(plot_scales$top[iplot], plot_scales$base[iplot])) +
labs(title = title,
x = paste0(plot_scales$Title[iplot], " (", plot_scales$Units[iplot], ")"),
y = "Pressure (hPa)") +
Use “theme_bw,” this is a black and white professional looking plot theme. The font is specified for the title and axis labels.
theme_bw(base_size = 12) +
theme(plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14))
Add the dashed blue horizontal line marking the molar density intersection.
# Add intersection line
if (!is.na(int_hPa)) {
p <- p + geom_hline(yintercept = int_hPa, color = "blue", linetype = "dashed")
}
For the molar density plots we need the upper and lower best fit lines. The slopes and intercepts are made negative because the y axis is reversed.
# Special overlays
if (iplot == 2) { # molar density plot
if (!is.na(upper_slope)) {
p <- p + geom_abline(slope = -upper_slope, intercept = -upper_int, color = "blue")
}
if (!is.na(lower_slope)) {
p <- p + geom_abline(slope = -lower_slope, intercept = -lower_int, color = "green")
}
}
For the lapse rate plots we need a vertical line at the WMO cutoff of 2.
if (iplot == 5) { # lapse rate - WMO tropopause cutoff
p <- p + geom_vline(xintercept = 2, color = "blue", linetype = "dashed")
}
For the North/South mass flux plot we need a zero vertical line.
if (iplot %in% c(8, 10, 11)) { # u, v, NS_flux
p <- p + geom_vline(xintercept = 0, color = "blue", linetype = "dashed")
}
Uncomment this line to see the plot on the screen
# print(p)
Save the plot as a 300 dpi page sized plot suitable for publication.
ggsave(plotname, plot = p, width = 8, height = 6, dpi = 300, bg = "white")
} # end plot loop
Discussion
Figure 1 is just an example of many plots that can be made by the full R code in the supplementary materials. The code plots temperature, geopotential height, and lapse rate as discussed in May, 2025. We also plot wind speed and direction as a function of pressure, examples and discussion of these plots can be seen here.
As May 2025 and Connolly, 2025 point out, the Hadley circulation does not work as shown in textbook cartoons, like in Hadley Cell Dynamics (Showman, 2009). Although as Showman carefully explains in his unpublished chapter, a true north-south Hadley cell cannot exist on a rotating planet, the angular momentum in the tropics is too high, thus a complex circulation that is mostly east to west, with a small north/south component is set up. This tropical circulation is comprised of easterly winds at the surface and westerly winds high in the troposphere, separated by a low wind speed zone where the wind direction reverses that begins above 300 hPa at the equator, and disappears at Earth’s surface between 20 and 30 degrees North/South, as shown in figure 5 here and in (May, 2025).
One final note, there have been many changes in radiosonde instruments over the past few decades and regionally, further some radiosonde readings are affected by time of day. Corrections for these problems are attempted in the RICH and RAOBCORE datasets which have been “homogenized” to try and make the points more mappable and uniform in their response than IGRA2. In the next post in this series, we will describe how to map the nearly raw IGRA2 measurements, and the values computed from them.
I was recently made aware of another R package that can be used to read IGRA2 data. I’ve not tried it, but if you are interested in it, download it here.
Works Cited
Connolly, M. (2025). 20 Million weather balloons: How this data shows that all the climate models are based on wrong assumptions. Retrieved from https://www.youtube.com/watch?v=48Hp9CqSlMQ&t=1026s
Connolly, M., & Connolly, R. (2014a). The physics of the Earths atmosphere I. Phase change associated with tropopause. Open Peer Rev. J. Retrieved from https://ronanconnollyscience.com/wp-content/uploads/2020/10/oprj-article-atmospheric-science-19.pdf
Dima, I. M., & Wallace, J. M. (2003). On the Seasonality of the Hadley Cell. Journal of the Atmospheric Sciences, 60(12), 1522 – 1527. https://doi.org/10.1175/1520-0469(2003)060<1522:OTSOTH>2.0.CO;2
May, A. (2025). The Molar Density Tropopause Proxy and its relation to the ITCZ and Hadley Circulation. OSF. https://doi.org/10.17605/OSF.IO/KBP9S
May, A. (2025b, November 28). Supplementary Materials: The Molar Density Tropopause Proxy and Its Relation to the ITCZ and Hadley Circulation. https://doi.org/10.5281/zenodo.17752293
Murray, F. W. (1967). On the Computation of Saturation Vapor Pressure. Defense Technical Information Center. Retrieved from https://apps.dtic.mil/sti/html/tr/AD0637612/
Showman, A. (2009). Hadley Cell Dynamics. Retrieved from https://courses.seas.harvard.edu/climate/eli/Courses/EPS281r/Sources/Hadley-cell/2-Showman-2009.pdf
WMO, & Ashford, O. M. (1957, October). Meteorology – A three-dimensional science, second session of the commission for aerology. WMO Bulletin, 6(4), 134-138.
