DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Vibe Coding With GitHub Copilot: Optimizing API Performance in Fintech Microservices
  • Concourse CI/CD Pipeline: Webhook Triggers
  • Solid Testing Strategies for Salesforce Releases
  • Build a Stateless Microservice With GitHub Copilot in VSCode

Trending

  • Accelerating AI Inference With TensorRT
  • Mastering Advanced Aggregations in Spark SQL
  • Enhancing Security With ZTNA in Hybrid and Multi-Cloud Deployments
  • Unlocking AI Coding Assistants: Generate Unit Tests
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Deployment
  4. R: Scraping the Release Dates of Github Projects

R: Scraping the Release Dates of Github Projects

By 
Mark Needham user avatar
Mark Needham
·
Jun. 29, 15 · Interview
Likes (0)
Comment
Save
Tweet
Share
1.3K Views

Join the DZone community and get the full member experience.

Join For Free

Continuing on from my blog post about scraping Neo4j’s release dates I thought it’d be even more interesting to chart the release dates of some github projects.

In theory the release dates should be accessible through the github API but the few that I looked at weren’t returning any data so I scraped the data together.

We’ll be using rvest again and I first wrote the following function to extract the release versions and dates from a single page:

library(dplyr)
library(rvest)
 
process_page = function(releases, session) {
  rows = session %>% html_nodes("ul.release-timeline-tags li")
 
  for(row in rows) {
    date = row %>% html_node("span.date")
    version = row %>% html_node("div.tag-info a")
 
    if(!is.null(version) && !is.null(date)) {
      date = date %>% html_text() %>% str_trim()
      version = version %>% html_text() %>% str_trim()
      releases = rbind(releases, data.frame(date = date, version = version))
    }  
  }
  return(releases)
}

Let’s try it out on the Cassandra release page and see what it comes back with:

> r = process_page(data.frame(), html_session("https://github.com/apache/cassandra/releases"))
> r
           date               version
1  Jun 22, 2015       cassandra-2.1.7
2  Jun 22, 2015      cassandra-2.0.16
3   Jun 8, 2015       cassandra-2.1.6
4   Jun 8, 2015   cassandra-2.2.0-rc1
5  May 19, 2015 cassandra-2.2.0-beta1
6  May 18, 2015      cassandra-2.0.15
7  Apr 29, 2015       cassandra-2.1.5
8   Apr 1, 2015      cassandra-2.0.14
9   Apr 1, 2015       cassandra-2.1.4
10 Mar 16, 2015      cassandra-2.0.13

That works pretty well but it’s only one page! To get all the pages we can use the follow_link function to follow the ‘Next’ link until there aren’t anymore pages to process.

We end up with the following function to do this:

find_all_releases = function(starting_page) {
  s = html_session(starting_page)
  releases = data.frame()
 
  next_page = TRUE
  while(next_page) {
    possibleError = tryCatch({  
      releases = process_page(releases, s)
      s = s %>% follow_link("Next") 
    }, error = function(e) { e })
 
    if(inherits(possibleError, "error")){
      next_page = FALSE
    }
  }
  return(releases)
}

Let’s try it out starting from the Cassandra page:

> cassandra = find_all_releases("https://github.com/apache/cassandra/releases")
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.8
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.0-rc1
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.0-beta2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.6
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.0-rc2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.7
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.6.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.5.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.4.0-final
 
> cassandra %>% sample_n(10)
            date               version
151 Mar 13, 2010   cassandra-0.5.0-rc2
25   Jul 3, 2014      cassandra-1.2.18
51  Jul 27, 2013       cassandra-1.2.8
21  Aug 19, 2014   cassandra-2.1.0-rc6
73  Sep 24, 2012 cassandra-1.2.0-beta1
158 Mar 13, 2010   cassandra-0.4.0-rc2
113 May 20, 2011     cassandra-0.7.6-2
15  Oct 24, 2014       cassandra-2.1.1
103 Sep 15, 2011 cassandra-1.0.0-beta1
93  Nov 29, 2011       cassandra-1.0.4

I want to plot when the different releases happened in time and in order to do that we need to create an extra column containing the ‘release series’ which we can do with the following transformation: 

series = function(version) {
  parts = strsplit(as.character(version), "\\.")  
  return(unlist(lapply(parts, function(p) paste(p %>% unlist %>% head(2), collapse = "."))))  
}
 
bySeries = cassandra %>%
  mutate(date2 = mdy(date), series = series(version),
         short_version = gsub("cassandra-", "", version),
         short_series = series(short_version))
 
> bySeries %>% sample_n(10)
            date               version      date2        series short_version short_series
3    Jun 8, 2015       cassandra-2.1.6 2015-06-08 cassandra-2.1         2.1.6          2.1
161 Mar 13, 2010 cassandra-0.4.0-beta1 2010-03-13 cassandra-0.4   0.4.0-beta1          0.4
62  Feb 15, 2013      cassandra-1.1.10 2013-02-15 cassandra-1.1        1.1.10          1.1
153 Mar 13, 2010 cassandra-0.5.0-beta2 2010-03-13 cassandra-0.5   0.5.0-beta2          0.5
37   Feb 7, 2014       cassandra-2.0.5 2014-02-07 cassandra-2.0         2.0.5          2.0
36   Feb 7, 2014      cassandra-1.2.15 2014-02-07 cassandra-1.2        1.2.15          1.2
29   Jun 2, 2014   cassandra-2.1.0-rc1 2014-06-02 cassandra-2.1     2.1.0-rc1          2.1
21  Aug 19, 2014   cassandra-2.1.0-rc6 2014-08-19 cassandra-2.1     2.1.0-rc6          2.1
123 Feb 16, 2011       cassandra-0.7.2 2011-02-16 cassandra-0.7         0.7.2          0.7
135  Nov 1, 2010 cassandra-0.7.0-beta3 2010-11-01 cassandra-0.7   0.7.0-beta3          0.7

Now let’s plot those releases and see what we get:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(!grepl("beta|rc", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015 06 23 22 59 19

An interesting thing we can see from this visualisation is what overlap the various series of versions have. Most of the time there are only two series of versions overlapping but the 1.2, 2.0 and 2.1 series all overlap which is unusual.

In this chart we excluded all beta and RC versions. Let’s bring those back in and just show the last 3 versions:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(grepl("2\\.[012]\\.|1\\.2\\.", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015 06 23 23 08 04

From this chart it’s clearer that the 2.0 and 2.1 series have recent releases so there will probably be three overlapping versions when the 2.2 series is released as well.

The chart is still a bit cluttered although less than before. I’m not sure of a better way of visualising this type of data so if you have any ideas do let me know!


Release (computing) R (programming language) GitHub

Opinions expressed by DZone contributors are their own.

Related

  • Vibe Coding With GitHub Copilot: Optimizing API Performance in Fintech Microservices
  • Concourse CI/CD Pipeline: Webhook Triggers
  • Solid Testing Strategies for Salesforce Releases
  • Build a Stateless Microservice With GitHub Copilot in VSCode

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!