Scripts, Notebooks & Abstraction

We’ve talked a bit about the differences between scripts and spreadsheet applications. This can be generalized to a comparison between programming instructions and graphical user interface (GUI) environments. A GUI limits reproducibility and also complicates repetitive tasks, which take a lot of time and are error prone when done by hand. Whenever reproducibility is important, we want to avoid using a GUI.

While we’ve mentioned scripts, we’ve almost exclusively been working in RMarkdown documents, not R scripts, and the difference warrants breaking down a bit. Each has its place within a research context and within the context of research data management (RDM). One of the most important pieces of RDM is documentation, and a key piece of documentation is a record of the decisions one makes while working through the data, whether in the cleaning or the analytical stage of the process. The room for such decisions has been referred to as ‘researcher degrees of freedom’: all the decisions a researcher is at liberty to make in this process, whether that is defining an outlier, rounding a variable, grouping variables (choosing a bin size for age ranges, for example), deciding to collect more data after having looked at the data, etc. See False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, available at https://doi.org/10.1177/0956797611417632.
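As a rough illustration of what recording these decisions can look like in code, here is a minimal sketch in R. The data frame `survey`, its values, the outlier cutoff, and the bin width are all invented for this example; they are assumptions, not recommendations. The point is only that each choice is named and stated next to the code that applies it.

```r
# Hypothetical example data; the values and column names are invented.
survey <- data.frame(
  id  = 1:8,
  age = c(19, 24, 31, 38, 45, 52, 67, 103)
)

# Decision 1 (assumed rule): flag ages above 100 as outliers.
# The cutoff is a researcher choice, so it is named and stated once.
age_cutoff <- 100
survey$age_outlier <- survey$age > age_cutoff

# Decision 2 (assumed rule): group age into 20-year bins starting at 0.
# A different bin width would produce different groups, so record it here.
bin_width <- 20
age_breaks <- seq(0, 120, by = bin_width)
survey$age_group <- cut(survey$age, breaks = age_breaks, right = FALSE)

# Count participants per age group, leaving out the flagged outliers.
table(survey$age_group[!survey$age_outlier])
```

Changing `age_cutoff` or `bin_width` changes the result, which is exactly why these choices are worth writing down where a reader can find them.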

An RMarkdown document, sometimes referred to as an electronic notebook, is an excellent platform for maintaining a record of these decisions: the steps you articulate in plain language in markdown sit directly in line with the code that processes your data. Earlier we talked about literate programming, the idea that programming instructions should be both human and machine interpretable and should tell a human what you want the computer to do. An RMarkdown document enhances this concept with its rich context. However, at a certain stage, this rich context may be more appropriately decoupled from the scripting process, especially as we move in the direction of confirmatory analyses, where we define in advance how we’ll be handling the data. In these instances, the context is more appropriately contained within some form of study registration. In such a situation, we can use a script, that is, a document that contains only our instruction set (R code), perhaps with a few comments, but no markdown.

Exploratory research is hypothesis generating, while confirmatory research is hypothesis testing. It is rare that confirmatory research is ever conducted fully independent of exploratory research, as confirmatory research often suggests other paths of inquiry. However, it is critical to clearly differentiate between the two and to document each process appropriately: generally, exploratory research is documented while doing the exploration, and confirmatory research is documented in advance.

Increased System Complexity

It’s important to note that as we move from a script, to an RMarkdown document, to a GUI, we increase the complexity of the system(s) that we’re working in. Essentially, there are more moving parts. We’ll talk about this a bit more when we talk about dependencies, but it’s worth noting here the concept of abstraction as it relates to balancing a system’s ease of use against replication. We’ll touch on this again when we look at building visualizations in R. The layers of abstraction beneath a programming language like R are, in simplified form:

0s and 1s -> machine code -> low-level programming -> high-level programming -> interactive programming (i.e., scientific computing, scripting) -> electronic notebooks -> GUI applications

This abstraction can make it difficult to understand exactly what’s happening, especially as a novice user, but an intuitive framework can help you move between systems. When we write code in R, it goes through several stages of processing. Often the first stage is just a check that the data will work with the function it’s being fed into. The actual computation then runs in routines written in a lower-level language (like C or Fortran). Those routines have already been compiled into machine code, an even more rudimentary instruction set, which is stored as binary values and executed by your CPU.
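To make these layers a little more concrete, the sketch below peeks at one of them from inside R. `sum()` is a base R function implemented as a primitive, meaning its work is done by compiled code rather than by R code. The wrapper `checked_total()` is a hypothetical function written for this example to mimic the "check first, then compute" pattern described above; it is not part of R.

```r
# sum() is a primitive: its work is done by compiled code inside R itself.
is.primitive(sum)

# Hypothetical wrapper showing the "check first, then compute" pattern.
checked_total <- function(x) {
  # Check that the input is something the underlying routine can handle.
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  # Hand the actual arithmetic to the compiled routine behind sum().
  sum(x)
}

checked_total(c(1.5, 2.5, 4))
```

Typing `sum` without parentheses at the console prints a `.Primitive("sum")` marker instead of R code, which is one visible sign that the work happens a layer below the language you are writing in.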
