My obsession with data processing, analysis and visualization (especially using R) started in academia, and now it occupies both my work and hobby time. Come learn with me (or even teach me) as I figure out how to tackle tough data problems at work, or more likely than not try to use R to do some for-fun analyses! Matthew is a DZone MVB, is not an employee of DZone, and has posted 10 posts at DZone.

Estimating Age from First Name, Part 1

11.19.2013

Today I read a cute post from Flowing Data on the trendiest names in US history. What caught my attention was a link in the article to the source data: yearly lists of baby names registered with the US Social Security Administration since 1880 (see here). I thought it might be good to compile and use these lists at work, for two reasons:

(1) I don’t have experience handling file input programmatically in R (i.e., working with a bunch of files in a directory instead of manually loading one or two), and
(2) it could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe in which each row contains a name, the sex it was registered under, the number of people registered as having been born with that name in a given year, the year, the total number of registered births for that year (the "total population"), and the relative proportion of people with that name in that year.

Once I had that dataframe, I built a function that queries it for the year when a given name was most popular, an estimated age based on that year, and the relative proportion of people born with that name in that year.

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this up so that age estimates can be calculated for every element in a vector of first names. As the code stands below, the function filters the full dataframe on every call, which takes too long to scale up effectively.

I’m wondering what the best approach is. Some ideas I have so far follow:

  • Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
  • Possibly convert the above dataframe into a data table and then build a function to query the data table instead.
  • Failing the efficacy of the above two ideas, load the popularity data into Python, and make a function to query it there instead.
Does anyone have any better ideas for me?
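
To make the first idea concrete, here is a rough sketch in base R. It is only an illustration under some assumptions: the small name_data below is a made-up stand-in with invented numbers (not SSA figures), the names peak_years and estimate_ages are hypothetical, and it ignores both sex and the 1921 cutoff for brevity.

```r
# Toy stand-in for name_data (invented numbers, same column names as the real one)
name_data = data.frame(
  Year    = c(1950, 1950, 1990, 1990, 1990),
  Name    = c("Linda", "Jason", "Linda", "Jason", "Ashley"),
  Sex     = c("F", "M", "F", "M", "F"),
  Rel_Pop = c(0.040, 0.001, 0.002, 0.020, 0.030),
  stringsAsFactors = FALSE
)

# Sort so each name's most popular year comes first, then keep one row per name.
ordered = name_data[order(name_data$Name, -name_data$Rel_Pop), ]
peak_years = ordered[!duplicated(ordered$Name), c("Name", "Year", "Rel_Pop")]

# Vectorized lookup: match() finds the row for every input name in one pass.
# Names that are not in the table come back as NA.
estimate_ages = function (input_names, current_year = as.numeric(format(Sys.Date(), "%Y"))) {
  idx = match(input_names, peak_years$Name)
  data.frame(Name = input_names,
             year_of_birth = peak_years$Year[idx],
             age = current_year - peak_years$Year[idx],
             relative_pop = peak_years$Rel_Pop[idx],
             stringsAsFactors = FALSE)
}

estimate_ages(c("Jason", "Linda", "Zzyzx"), current_year = 2013)
```

The smaller table is built once, so each lookup only has to search the unique names rather than every name-year row. Handling sex would probably mean keeping one row per name and sex combination and matching on both.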

library(stringr)
library(plyr)
 
# We're assuming you've downloaded the SSA files into your R project directory.
 
# Pick the yearly files by name rather than by position in the directory listing.
# This assumes the files keep their SSA names (e.g. yob1880.txt).
file_listing = list.files(pattern = "^yob[0-9]{4}\\.txt$")
read_year = function (f) { # read one yearly file and tag each row with its year
  year_data = read.csv(f, header=FALSE, col.names=c("Name", "Sex", "Pop"))
  year_data$Year = as.integer(str_extract(f, "[0-9]{4}"))
  year_data
}
# Bind all years in one go; growing the dataframe with rbind inside a loop gets slower on every pass.
name_data = do.call(rbind, lapply(file_listing, read_year))
 
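# Total registered births per year; ddply names the summed column V1.
year_pop_totals = ddply(name_data, .(Year), function (x) sum(x$Pop))
name_data = merge(name_data, year_pop_totals, by="Year", all.x=TRUE)
# Share of each year's registered births that went to each name.
name_data$Rel_Pop = name_data$Pop/name_data$V1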
 
estimate_age = function (input_name, sex = NA, min_year = 1921) {
  # min_year defaults to 1921, a year I chose arbitrarily. Change how you like.
  name_subset = subset(name_data, Name == input_name & Year >= min_year)
  if (!is.na(sex)) {
    name_subset = subset(name_subset, Sex == sex)
  }
  if (nrow(name_subset) == 0) {
    return(NULL) # the name does not appear in the lists
  }
  # which.max keeps a single row even if several years tie for the peak
  peak = name_subset[which.max(name_subset$Rel_Pop), ]
  birth_year = as.numeric(as.character(peak$Year))
  current_year = as.numeric(format(Sys.Date(), "%Y"))
  return(list(year_of_birth=birth_year, age=current_year - birth_year, relative_pop=sprintf("%1.2f%%", peak$Rel_Pop*100)))
}

I’ll also accept any suggestions for cleaning up my code as is :)



Published at DZone with permission of Matthew Dubins, author and DZone MVB. (source)
