As a package author, it’s nice to know how often your package is installed and used. This is not only good for staying motivated (assuming your package is getting some use), but it’s also important for building a portfolio in data science. After all, data science is all about measurement and quantification. What better evidence of your contributions to open source software than cold, hard usage numbers?
If your R package is published on CRAN, then install counts can be found in the CRAN download logs and accessed with the cranlogs package.
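For example, here's a quick way to total the last month of downloads with cranlogs (the package name below is just a placeholder):

```r
# install.packages("cranlogs")
library(cranlogs)

# Daily download counts from the CRAN mirror logs over the last month;
# "mypackage" is a placeholder for your package's name
downloads <- cran_downloads(packages = "mypackage", when = "last-month")
sum(downloads$count)
```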
If your package is published on GitHub, you can track repository clones instead. However, GitHub’s built-in analytics are limited to the most recent two weeks of data, which is no help if you want to track usage over time. So you have to roll your own script that continually harvests the analytics data from GitHub.
This is how I did it using R.
1/5 I wanted to track how many times my #rstats package was cloned on @github to see how many users it has / how useful it is. But the built-in @github analytics only has a history of 2 weeks. I made a simple system with #rstats to fetch/save stats on a regular basis. /thread
— Tyler Burleigh (@tylerburleigh) August 17, 2019
2/5 First I installed the Python library github-traffic-stats so I could fetch traffic stats using the @github API. This allowed me to do it all in one line of R code using a system command. I saved this as an R script ("get_traffic.R"): https://t.co/7wuFIreJ14
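That one-line fetch boils down to something like the sketch below. It assumes the `gts` command-line tool from the github-traffic-stats Python package is installed and can authenticate with GitHub; the working directory, username, and repo name are placeholders, and the exact `gts` invocation is an assumption:

```r
# get_traffic.R
# A sketch of the one-line fetch. Assumes the "gts" CLI from the
# github-traffic-stats Python package is installed
# (pip install github-traffic-stats) and can authenticate with GitHub.
# The folder, username, and repo name are placeholders.
setwd("C:/traffic-stats")                     # where the CSVs should land
system("gts tylerburleigh my-repo save_csv")  # fetch traffic stats, save CSVs
```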
3/5 I wanted this script to run on a regular basis, so I installed the taskscheduleR package and used taskscheduler_create() to make a scheduled task, the Windows version of a cron job. (The cronR package can do this on Linux systems!) Now it runs every Sunday at 9am.
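The scheduling call looks something like this (the script path is a placeholder):

```r
library(taskscheduleR)

# Run get_traffic.R every Sunday at 9am; the path is a placeholder
taskscheduler_create(
  taskname  = "github_traffic_stats",
  rscript   = "C:/traffic-stats/get_traffic.R",
  schedule  = "WEEKLY",
  days      = "SUN",
  starttime = "09:00"
)
```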
4/5 When it runs, it creates 3 csv files. I'm only interested in the files that end with "clone-stats.csv" because these tell me how many times the repository was cloned each day. I'll need to filter on those files (see the sketch after the thread).
5/5 The @github API gives 2 weeks at a time, but my script runs weekly, so this will create duplicate entries within the csv files. I use grepl to filter on the filenames, then I bind all the rows and filter on distinct dates. Finally, I sum all the clones. 52 clones so far!
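Put together, the filtering, de-duplicating, and summing steps look something like this sketch. The folder path is a placeholder, and the column names ("date", "clones") are assumptions about the CSV layout github-traffic-stats produces:

```r
library(dplyr)
library(readr)

# Folder where get_traffic.R saves its CSVs (placeholder path)
files <- list.files("C:/traffic-stats", full.names = TRUE)

# Keep only the per-day clone counts
clone_files <- files[grepl("clone-stats\\.csv$", files)]

# Stack the weekly pulls, keep one row per date, then total the clones.
# Column names ("date", "clones") are assumptions about the CSV layout.
clone_files %>%
  lapply(read_csv) %>%
  bind_rows() %>%
  distinct(date, .keep_all = TRUE) %>%
  summarise(total_clones = sum(clones))
```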