R is a joy if you treat it like awk

October 7th, 2019 - Derek Rodriguez

During my undergrad I took a data science class. It was a great class where we dove into topics such as EDA, lasso regression, SVMs, visualizations, and many other good things that other people across the Internet love to blog about ad nauseum. The one thing I had a terrible experience with was R. Let me clarify—R is a fantastic environment for statisticians to explore advanced computation methods on cleaned data. It is also a great environment for developing great insights on clean data using their DataFrame type. If you haven't caught the hint yet, R's true potential is so much more immediate if your data is already great.

The sad reality of most exciting data science projects is that they more often than not aren't based on a clean, well-studied dataset. Thank God Tensorflow changed their default tutorial away from digit recognition, because MNIST has quickly become the Wonderwall of computer vision. The reality is half of data science is just cleaning data, and the other half is complaining about how much of data science is just cleaning data. You can add another half's worth of time for debugging your neural net, if you were coaxed into using one that isn't off-the-shelf.

Half-assing something in Unix can produce great results as long you chose the right half of the ass. Thirty minutes of of some good cURL mixed with awk and/or some sed perform miracles on poorly-formatted plaintext source. Python's scientific libraries (PySpark, pandas, etc.) do a decent job of copying R's DataFrame magic and even allowed for it to scale to multi-node architectures. That being said, criticizing R for all the awkwardness that comes from a language developed by a bunch of a statisticians is absolutely not the point of the post.

I came to an important realization. R is a Unix tool. Throwing an R snippet into a command just like you would with awk or sed opens a whole new world of possibility. It blows my mind that isn't R isn't shown in this context more often. If R is bad at cleaning your data, you can just clean it in awk and then just pipe to R. I've actually reduced my R usage down to mostly just two aliases (written here in fish syntax):

    alias stats 'Rscript -e "summary(as.numeric(readLines(\"stdin\")))"'   
    alias boxplot 'Rscript -e "boxplot(as.numeric(readLines(\"stdin\")))"'

The first produces a quick numeric summary of a series of numbers on stdin. The second produces a boxplot with that same data in PDF format. For example, let's say I've just bought a new server and I want to see how long it takes to spin up a web app after a crash. I would write a script that simulates a crash, and then logs the recovery time after each crash and saves it in a file called output.log. If I wanted a boxplot of the total recovery times, all I would have to do is:

$   grep "Recovery time:" output.log | awk '{print $2}' | boxplot

No spinning up Rstudio. No messing with ggplot settings. No more oh-my-god-why-how-do-I-just-get-this-into-a-list. Just results that help you figure do some quick-and-dirty EDA. I'd love it if people e-mail me some R snippets like the one I shared here so I can add them to my running list.

Update: I got some really great feedback from this article on HN. I've taken a stab at rewriting the post a little to emphasize the original intent of recontextualizing R as a great command-line tool, instead of as the core language for a data science / statistics programming ecosystem.