RStudio

January 2020

I enter RStudio for the first time. For the uninitiated, RStudio is one of many data interrogation centres. In these places, students, academics and professional torturers, known as data scientists, cut, coerce, mutilate and perform cruel tests on their suspects to force confessions.

I came into Data Science for the Social Sciences eager to learn how to use RStudio to extract the secrets of human behaviour.

Two classes in, this eagerness turned to frustration. The tools failed in my hands. The suspects scoffed at my incompetence. Instead of answers, I got errors.

I was the hostage. Literally. When COVID was starting to ramp up in the US, I had the option to return to China, where life was returning to normal. I couldn’t go. I was too incompetent to meet my instructor’s expectations without Sabina by my side.

By May, we were expected to be able to merge, join, pivot and use simple functions to automate basic tasks. I left barely knowing how to turn on the tools or the proper protocol for bringing in a suspect.

August 2021

Emerging Markets. I am no longer in training. Professor Ales puts the class to work strong-arming the Penn World Tables to divulge the secrets of economic performance.

Hours of struggle ensue. table$(GDP/capital). Error. pwt$p.capita <- GDP/pop. Inspect column. NAs. Plot(x = log(GDP), y = year, data = PWT). Object logGDP not defined. Vector must be numeric.

When base R fails, I employ the tidyverse. Group by(country). Summarise(mean.kap= mean(capital). Error. Mean(tfp). Object tfp not found. NOT FOUND??? NOT FOUND??? That cursed Blue Grey building became the bane of my existence. It drained my energy. It stole my time. It turned my pristine desktop into a crowded holding cell.

Before I could drop the course, the only lesson from training that stuck kicked in. Outsource.

Thank you Owen and Pranay. Thank you Nick. 

I turned in 7 completed assignments for Emerging Markets. All 7 used borrowed code.

May 2021

“You cannot rely on others to do your data work,” Professor Olivola warns me after I confess that Owen wrote the code for my analysis. He was right. Owen wasn’t going to be there in two years. The data would only get more cunning. The tools needed would only get more complex.

We met again in July. I faced faces.csv, the results of a 300-person 5x2x2 study. Instead of checking Owen’s schedule, I checked Stack Exchange. Evidence collected from the data-splattered walls and piles of dissected data would have earned me a one-way ticket to the Hague. Nonetheless, after 40 hours and 583 lines of code in cleaning.rmd, faces.csv revealed its secrets. Turns out I bungled the study design. I didn’t care. I did it. I made mute data speak. It’s called ggplot.

March 2022

I meet my match when I bring in the monstrosity that is NHTSA Crash Reporting data. I was no longer working with individual suspects. This was a 27-member cabal. I wasn’t scared this time. I dragged Accidents, Vehicles, Persons, Distract and DRIMPAIR into Traffic.rmd. With a couple of deft keystrokes, I bent each suspect into shape. Mutate, pivot, filter. I squeezed out the means and standard deviations while saving lines by using tapply.

Even as my tools struggled with the sheer size of the dataset, I hacked and squeezed away. I ran it through multiple rounds of lfe regressions. The data confessed. There would be no threat of prosecution this time. I kept the workspace clean.

July 2022

A suspect dies before the interrogation is over. The AHA annual survey refused to speak. 465,790 NAs for nurse staffing ratio. 352,465 NAs for COVID infections. No amount of wrangling would make it useful. Maybe it was sunk costs. Maybe I was afraid of Professor Sadun. I didn’t want to give up. Alas, there was nothing to be done. I walked into Professor Sadun’s room anticipating reproach. She only reassured me that it was okay. For the rest of the summer, I was given more interesting work on a more interesting project. Sometimes the effort isn’t worth it. Some data would rather die than speak.

September 2022

I am applying to join the ranks of professional data interrogators as a predoc. PIs expect me to master all the techniques. They send requests for information. I begrudgingly drag recalcitrant datasets into rmd files to fulfill them. I hate this. Alas, I cannot leave RStudio.

Data interrogation will always be the worst part of my duties as a researcher. The reflexive panic when faced with a large dataset will likely never go away. There is still much to learn. However, looking back, it’s hard not to smile seeing how far I’ve come.
