The Data

Introducing the Data


Welcome!

This portfolio shows the development of three different visualizations of the Star Wars universe. The first visualization shows the average age of characters in the universe, split by gender. The second visualization demonstrates how starship classes differ in terms of length. The third visualization illustrates the distance of Star Wars planets from their host stars and animates their orbital periods. Planets from our solar system are included for easy comparison.

All of the data for the visualizations are from Jenny Bryan’s repurrrsive package. The repurrrsive package includes various datasets to assist in the teaching of R. The Star Wars dataset is originally from SWAPI (i.e., the Star Wars API).

The People Data


To get the people data into the correct format for our visualization, I first made a tibble of the data from repurrrsive’s sw_people dataset. I selected only the columns of interest and unlisted them (all of the data in repurrrsive is, as the name would suggest, stored in recursive lists). I then converted the variables to the correct types and replaced “n/a” with more meaningful values. I also used parse_number to extract the characters’ ages from their birth year. Although the number actually refers to how many years before or after the Battle of Yavin a character was born, all characters that had values for birth_year were born before the Battle of Yavin. Therefore, the result of parse_number gave each character’s age at the Battle of Yavin.
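As a minimal sketch of the birth-year step, assuming SWAPI-style strings such as "19BBY" (the sample values below are illustrative and not pulled from sw_people), parse_number drops the era suffix and keeps the number. Treating "unknown" as a missing value is an assumption about how the raw data marks gaps.

```r
library(readr)

# Illustrative birth_year strings; "BBY" = Before the Battle of Yavin
birth_year <- c("19BBY", "112BBY", "41.9BBY", "unknown")

# parse_number() keeps the first number in each string and drops the rest;
# values listed in `na` are returned as NA instead of raising a parsing problem
age_at_yavin <- parse_number(birth_year, na = c("", "NA", "unknown"))

age_at_yavin
```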

The Starships Data


To get the data into the correct format for the starship visualization, I again created a tibble of the data from repurrrsive’s sw_starships dataset and unlisted the columns that were stored as lists. I converted all variables holding numbers to the numeric type. Consumables (i.e., the amount of food and life-preserving resources the ship can hold) had a rather odd format: the variable had both a number and a unit (e.g., 1 month, 3 years, 1 week, 5 days). To address this, I used separate to first split the integer (e.g., 1, 3, 1, 5) from the unit of time (month, years, week, days). I then converted the unit to the number of days it represented using a combination of case_when and str_detect. I used str_detect to account for singulars and plurals in the units of time (e.g., 1 day vs 15 days). Finally, I multiplied the integer by the number of days in its unit to get the number of days the starship had consumables for.
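The consumables conversion described above could look roughly like the sketch below. The column names (amount, unit, days_per_unit, consumable_days) and the 30-day month and 365-day year are assumptions made for illustration, not values confirmed by the original code.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Illustrative consumables values in the format described above
ships <- tibble(consumables = c("1 month", "3 years", "1 week", "5 days"))

ships_days <- ships %>%
  # split "3 years" into amount = 3 and unit = "years"
  separate(consumables, into = c("amount", "unit"), sep = " ", convert = TRUE) %>%
  mutate(
    # str_detect() matches singular and plural forms alike ("day" and "days")
    days_per_unit = case_when(
      str_detect(unit, "day")   ~ 1,
      str_detect(unit, "week")  ~ 7,
      str_detect(unit, "month") ~ 30,  # assumption: 30-day month
      str_detect(unit, "year")  ~ 365  # assumption: 365-day year
    ),
    consumable_days = amount * days_per_unit
  )

ships_days
```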

The Planets Data


Again, I converted repurrrsive’s sw_planets dataset to a tibble, unlisted the list columns, and converted the variables holding numbers to the numeric type.

When I cleaned the variables, I wasn’t exactly sure which variables I was going to use for my plot, so I ended up creating a lot of additional variables that I thought could be interesting to plot. Directly applicable to my final plot, I calculated the planets’ distances from their stars using Kepler’s Third Law (i.e., “The squares of the sidereal periods (of revolution) of the planets are directly proportional to the cubes of their mean distances from the Sun” - Encyclopædia Britannica). Essentially, the longer a planet’s year, the farther it is from its star (the distance grows as the two-thirds power of the orbital period). In this case, I used Astronomical Units (AU; ~150,000,000 km) to represent the distance from the star to the planet. Astronomical units are helpful here because they are far more manageable and, since the distance from the Earth to the Sun is approximately 1 AU, far more interpretable. That said, if one is able to hold the thought of 1,636,025,600 football fields in their head, that would also be a viable option.
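As a rough sketch of the distance calculation: in solar units (period in Earth years, distance in AU), Kepler’s Third Law reduces to a^3 = T^2, so a = T^(2/3). The function name period_to_au and the 365.25-day year are hypothetical, and the shortcut assumes every host star has roughly the Sun’s mass, which the Star Wars data cannot confirm.

```r
# Kepler's Third Law in solar units: a^3 = T^2, so a = T^(2/3),
# with a in AU and T in Earth years (assumes a star of one solar mass)
period_to_au <- function(period_days, days_per_year = 365.25) {
  (period_days / days_per_year)^(2 / 3)
}

earth_au  <- period_to_au(365.25)  # one Earth year gives 1 AU
bespin_au <- period_to_au(5110)    # Bespin's 5110-day orbit
```

Under these assumptions, a planet with a one-year orbit sits at 1 AU, and longer orbital periods map to larger distances.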

I also added a column indicating whether the planet was featured in the original trilogy using data from Wikipedia. Finally, I used data from NASA and Ask an Astronomer to add the first five planets in the Solar System to the dataset, including their AUs, orbital periods, and their radii.

The People

Version One


Intended Message

At its most basic, the plot to the left was intended to illustrate how the ages of characters in the Star Wars universe differ by gender. Beyond conveying the average difference in ages among genders in the Star Wars universe, I think it is important to consider that the mean age of the female characters is so young compared to that of the male characters. It seems the recent movies have taken both incidental (e.g., Carrie Fisher) and deliberate (e.g., Laura Dern) steps to include older women. Star Wars would certainly benefit from being more inclusive when it comes to casting.

Intended Audience

The intended audience for this visualization is fans of Star Wars and the general public. No advanced statistical knowledge is required to interpret it.

Description of Version One

I refer to this first attempt as the LEGO® Star Wars version of this plot, owing mostly to the simplistic design of the lightsaber¹ hilt. Despite its evident simplicity, the code is likely the most complex of all of the versions of this plot. The code for creating the hilt was included in the plotting code (rather than simply calling a function), and I repeated the code to generate the hilt for each lightsaber. I also included the code to create the blade and blade aura in the main ggplot code chunk, requiring the inclusion of six separate geom_col layers with decreased alpha and increased width for each additional call.

Version Two


Description of Version Two

For the second version, I removed all the code for creating the hilts and the blades from the ggplot code chunk. Instead, I saved them in separate scripts that are sourced at the beginning of the R Markdown document used to deploy this dashboard. Each function is essentially just a list with multiple geom_*s inside of it. The geom_saberhilt function takes one argument (i.e., column), which simply tells the function which column to make the hilt for and adjusts its position accordingly. In total, each hilt is made up of eight separate rectangles. I added a black outline to make the separate components of the hilt pop.
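The "list of geoms" pattern can be sketched as follows. This is not the original geom_saberhilt: the function name geom_saberhilt_sketch, the two rectangles (instead of eight), their dimensions, and the placeholder means are all assumptions made for illustration. The point is only that ggplot2 accepts a list of layers on the right-hand side of +.

```r
library(ggplot2)

# Hypothetical simplified hilt: two outlined rectangles centred on a column position
geom_saberhilt_sketch <- function(column) {
  list(
    # grip
    annotate("rect", xmin = column - 0.10, xmax = column + 0.10,
             ymin = -20, ymax = 0, fill = "grey60", colour = "black"),
    # emitter collar where the blade begins
    annotate("rect", xmin = column - 0.13, xmax = column + 0.13,
             ymin = -3, ymax = 0, fill = "grey30", colour = "black")
  )
}

# Placeholder means, not the values from the plot
ages <- data.frame(gender = c("Female", "Male"), mean_age = c(40, 60))

p <- ggplot(ages, aes(x = gender, y = mean_age, fill = gender)) +
  geom_col(width = 0.1) +
  geom_saberhilt_sketch(1) +  # hilt for the first column
  geom_saberhilt_sketch(2)    # hilt for the second column
```

Because each call returns a plain list, the hilt code lives in one place and the plotting chunk stays short.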

I also dropped the droid lightsaber from this plot. It seemed to be distracting from the intended message of the plot.

Following the advice of Daniel Anderson (AKA datalorax), I dropped the legend, as it was completely redundant with the x-axis of the plot.

For the second version, I also added the ends of the blades using geom_point. As is clear from the plot, there is a size discrepancy between the ends of the blades and the blades themselves. This primarily stems from the width of geom_col being set in data units, so it scales with the size of the plot, while the size of geom_point is absolute (in millimeters) and does not.

Version Three


Description of Version Three

I like to think of this version of the plot as the How to Draw A Horse version, given its massive improvement over the previous plot.

Again following the advice of Daniel Anderson, I used coord_flip to flip the x- and y-axes, drastically improving the look of the lightsabers. I assume the improvement is due partially to some innate preference for landscape orientation over portrait orientation, but, in any case, it allows the plot to neatly fit on a computer screen.

Daniel also suggested I add annotations, which I added in the form of the means provided slightly below the blades. To achieve this, I used position_nudge(x = -.09, y = -8) to move the geom_text labels slightly below and slightly to the left of the actual means of the two groups.
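A minimal sketch of the label placement, using placeholder means rather than the values from the plot: position_nudge shifts the text in data coordinates before coord_flip is applied, so the x nudge (the gender axis, drawn vertically after the flip) moves the label slightly down and the y nudge (age, drawn horizontally after the flip) moves it to the left.

```r
library(ggplot2)

# Placeholder group means, not the values from the plot
ages <- data.frame(gender = c("Female", "Male"), mean_age = c(40, 60))

p <- ggplot(ages, aes(x = gender, y = mean_age)) +
  geom_col(width = 0.1) +
  geom_text(
    aes(label = mean_age),
    # x is the gender axis (vertical after the flip); y is age (horizontal after the flip)
    position = position_nudge(x = -.09, y = -8)
  ) +
  coord_flip()
```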

I also created a simple black and pink (specifically “deeppink2”) theme in a separate function script that I sourced at the beginning of the R Markdown document and added to this plot. The theme also bolds titles, axis titles, axis text, and captions. I also customized the major grid lines and completely removed the minor grid lines. Finally, the theme suppresses the legend by default using legend.position = "none". I still customized the legend to be consistent with the rest of the theme in case I decided to include it (e.g., visualization 3).

Version Four


Description of Version Four

In the (current) final plot of the characters’ ages, I followed the advice of Jon Rochelle and Maria Schweer-Collins and changed the color of the lightsaber for males from red to blue. They suggested that the red color would imply that the plot was representing the average age of male characters who had also succumbed to the Dark Side. I was initially wary of using blue to represent males, as I usually try to avoid using the conventional colors to represent women and men, but I think in this case the use of blue is far less confusing.

As noted by Andrew Edelblum, Yoda is a major outlier at 896 years old. The only character who would seem to be even close to Yoda’s age was Jabba Desilijic Tiure at 600 years old, but Jabba is hermaphroditic and was not included in the plot. In order to address Yoda’s age, I simply dropped him from the plot and noted in the plot’s subtitle that he was excluded. I did consider including a point to indicate where he was on the plot, but that would require an x-axis that spanned to around 900, which would seem to obscure any difference between the female and male lightsabers.

Andrew and Maria also suggested that the title should read “age by gender” rather than “age and gender”. They were completely correct, and I made the corresponding change.

Finally, I switched the title away from title case and adjusted the x-axis limits using scale_y_continuous (scale_y because the coordinates were flipped).

The Starships

Version One


Intended Message

Overall, this plot is aimed at illustrating how starships from Star Wars differ in size. Specifically, I was trying to illustrate that Yachts are, by far, the largest ships with assault starfighters being slightly larger than basic starfighters. Yachts also have the most variable sizes among all of the starships.

Intended Audience

Although the audience for this plot is generally the same as for the previous plot (i.e., fans of Star Wars), I believe this plot is skewed far more towards those who are knowledgeable enough about statistics to interpret error bars.

Description of Version One

Points and standard error bars have always looked like Tie Fighters to me, so I decided I would try to plot the length of a starship by starship class (i.e., Starfighters, Assault Starfighters, and Yachts). I excluded any class that only had one ship type in the dataset (e.g., Star Destroyers). To achieve the look I was going for, I plotted an error bar using geom_errorbar and then plotted a black geom_point for the ships’ hulls and a grey, shape-13 geom_point for the ships’ windows.

Version Two


Description of Version Two

For the second version, I changed the ships’ wings and hulls to a dark gray to more accurately reflect the look of a Tie Fighter. I also added a black geom_point between the windows’ frames and the hulls to give the impression of glass.

At this stage, I also created random data to represent stars in the plot. To do this, I used sample (without replacement) to pick 500 values between the x-axis limits of 0 and 4, 500 values between the y-axis limits of 0 and 80, and 500 transparency values between .1 and .9. I then simply plotted these points in white using geom_point, mapping the first set of values to x, the second to y, and the transparency values to alpha.
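The star field could be generated along these lines. The seed, the grid of 5,000 candidate values per axis, and the name stars are assumptions; the text only specifies 500 draws without replacement within the stated ranges.

```r
set.seed(1977)  # arbitrary seed so the star field is reproducible

n_stars <- 500

stars <- data.frame(
  # sample() draws without replacement by default, so no candidate value repeats
  x     = sample(seq(0, 4, length.out = 5000), n_stars),
  y     = sample(seq(0, 80, length.out = 5000), n_stars),
  alpha = sample(seq(0.1, 0.9, length.out = 5000), n_stars)
)

# The points could then be drawn in white on a black background, e.g.:
# ggplot2::geom_point(data = stars, ggplot2::aes(x = x, y = y, alpha = alpha),
#                     colour = "white", inherit.aes = FALSE)
```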

To show the stars in this version, I retroactively added in a black background. The original plot had used the new_retro theme from the vapoRwave package by Matthew Oldach. Unfortunately, that theme used fonts that drastically limited reproducibility, which was part of the impetus behind my creating my own theme.

Version Three


Description of Version Three

Version 3 involved mostly minor appearance changes. I added my_theme to make the theme consistent with my other visualizations and made the size of the ships’ bodies larger to be more consistent with actual (fictional) Tie Fighters. I also added a \n between “Assault” and “Starfighter” in the y-axis labels so that “Assault Starfighter” did not take up substantially more horizontal space than the other labels.

Version Four


Description of Version Four

In this version of the starship plot, I completely reworked the body of the starships to more accurately reflect the look of Tie Fighters. I did this by adding three additional geom_point layers. In the end the body of the ship comprised (1) a large light gray layer to outline the hulls, (2) a slightly smaller dark gray layer to represent the hulls, (3) a black layer to represent the annular windows, (4) a light gray asterisk layer to represent the radial window frames, (5) a light gray layer to represent the outer window frame, (6) a light gray layer to represent the inner window frames, and (7) a black layer to represent the center windows.

As hinted at above and as per Jon’s advice, I also added a thin layer of light gray around the ships’ bodies and wings to make the Tie Fighters pop from the starry background.

I did consider dropping the “class” y-axis title because the y-axis labels seemed intuitive to me; ultimately, I decided that the title might be helpful in aiding comprehension. Instead, I shifted the y-axis title away from the y-axis labels using axis.title.y = element_text(vjust = 5) so it wouldn’t look as cramped. I also increased the plot margin on the right side of the plot so the text wasn’t right up against the plot border.

Following Maria’s advice, I also attempted to plot the mean and standard error of each ship class as text on the plot, but it was hard to find an appropriate space to put them. One would imagine immediately underneath the ships would look good, but, given the differing width of the starships’ wings, it ended up looking awkward. Instead, I added subtitles describing what the wings and the hull represent.

The Planets

Version One


Intended Message

In the original version of this plot, I intended to illustrate that as a planet gets farther from its star the amount of surface water decreases. This did seem to be the case, but I became more interested in the size of planets, orbital periods, and the distance of planets from their stars. In the end, the intent of the plot became illustrating that planets farther from their stars are both larger and orbit more slowly than those closer to their stars (and that this holds for planets in the Star Wars universe).

Intended Audience

Again, those interested in Star Wars would likely enjoy this plot. A general background in astronomy would also be helpful in interpreting the plot, but would not seem to be necessary.

Description of Version One

The first version of this plot used the Star Wars colors (black and yellow) to show the relationship between the amount of water on a planet and the distance of those planets from their host stars. This relationship was illustrated using geom_smooth with the method set to lm. The planets were represented using geom_point and point size was mapped to the radii of the planets.

Version Two


Description of Version Two

For the second plot, I set aside the amount of surface water each planet had and instead used coord_polar to show a top-down version of a star system with planets arbitrarily arranged around a star. As can be gleaned from the plot, it does seem that large planets (in this case, Bespin) are farther from their host stars than smaller planets.

I used a number of geom_text layers to add where solar system planets would sit on the plot for easy comparison. I also suppressed all axis labels and used geom_text to inset the axis labels inside the plot, rather than having them float outside the plot. Using geom_label_repel I added non-overlapping labels for the planets that appeared in the original trilogy.

In a final step, I added my theme to this plot to make it consistent with the other plots. I also changed the color of the points to better complement the theme.

Version Three


Description of Version Three

Following Maria’s advice, I dropped all the planets that are not in the original trilogy. This vastly simplified the plot and seemed to make it far easier to comprehend. Making it slightly more busy (but vastly aiding comparison) I also added the solar system planets as their own points, rather than having them represented by text.

Andrew made the excellent suggestion of an interactive plot; although I did not go that far, I animated the plot using transition_time from Thomas Pedersen’s gganimate. I believe the animation effectively illustrates the orbital period of the planets in an easily interpretable way. I also added a subtitle to show the number of days that pass, ranging from 0 to 5110 (i.e., the length of Bespin’s orbit).

In order to have planets with orbits less than 5110 days continue orbiting once they had completed their orbit, I made a function called orbital_slice to generate the position of each planet at 511 time points during Bespin’s 5110-day orbit. However, this presented a major problem: Once a planet had finished its orbit, instead of continuing forward from 359 degrees to 0 degrees, it would orbit backwards, returning to 0, before it continued forward again.
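The idea behind orbital_slice, and the wrap-around problem it runs into, can be sketched as below. The function name orbital_slice_sketch, its arguments, and the 365-day example period are hypothetical; the original function is not shown in this write-up.

```r
# Hypothetical re-creation of the idea: one row per time slice, with the
# planet's angular position wrapped back into [0, 360) after each full orbit
orbital_slice_sketch <- function(period_days, total_days = 5110, n_slices = 511) {
  day <- seq_len(n_slices) * total_days / n_slices  # 10, 20, ..., 5110
  data.frame(
    day   = day,
    angle = (day / period_days * 360) %% 360
  )
}

# A planet with a 365-day year completes 14 orbits in 5110 days
earth <- orbital_slice_sketch(365)

# At each wrap the angle drops from just under 360 to just above 0; a tween
# between those two rows interpolates downwards, which reads as a backwards orbit
wraps <- which(diff(earth$angle) < 0)
```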

Version Four


Description of Version Four

As noted in the description of the previous plot, a major problem with using transition_time with repeating orbital data is that each planet, at the end of its orbit, will travel backwards from 359 degrees to 0 degrees. My first solution was to insert an NA in the data; the planet would briefly disappear at 359 degrees and then reappear at 0 degrees. However, even at an incredibly high number of slices, the flickering of the planets was noticeable.

My solution came when I switched from using transition_time to using transition_manual. I still had to use a large number of slices, since I was manually creating the movement from one frame to the next, but the effect was far smoother and no flickering occurred.
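A minimal sketch of the transition_manual approach, assuming positions have already been computed for every time slice. The data frame orbits, its column names, and the two example planets are illustrative; Bespin’s 5.8 AU is an approximation from Kepler’s Third Law under a solar-mass assumption rather than a value taken from the original data.

```r
library(ggplot2)
library(gganimate)

# Hypothetical pre-computed positions: one row per planet per time slice
days <- seq(10, 5110, by = 10)
orbits <- rbind(
  data.frame(planet = "Earth",  au = 1,   day = days, angle = (days / 365  * 360) %% 360),
  data.frame(planet = "Bespin", au = 5.8, day = days, angle = (days / 5110 * 360) %% 360)
)

anim <- ggplot(orbits, aes(x = angle, y = au, colour = planet)) +
  geom_point(size = 3) +
  scale_x_continuous(limits = c(0, 360)) +
  coord_polar() +
  # transition_manual() shows each day's rows as their own frame with no tweening,
  # so a jump from ~359 to ~0 degrees is never interpolated backwards
  transition_manual(day) +
  labs(subtitle = "Day {current_frame}")

# animate(anim) would render the frames
```

Because nothing is interpolated between frames, the smoothness of the motion depends entirely on how many slices are supplied.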

For this plot, I also ended up having to drop “Alderaan”; it was orbiting at the same speed and at the same distance from its star as the Earth. I decided that the Earth was a far more important point of comparison than Alderaan.

Finally, to increase interpretability, I added a geom_label, inset in the plot, describing the number of kilometers that an AU represents, and a legend describing which color marks planets from the Solar System and which marks planets from the Star Wars universe.


  1. Which, in the ggplot2 universe, would be “lightsabre”.