Scripts, Notebooks & Abstraction
We’ve talked a bit about the differences between scripts and
spreadsheet applications. More generally, this is the difference between
writing programming instructions and working in a graphical user
interface (GUI) environment. A GUI limits reproducibility and also
complicates repetitive tasks, which take a lot of time and are error
prone when done by hand. Whenever possible, we want to avoid using a GUI
when reproducibility is important.
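As a small illustration of why scripted instructions suit repetitive work, the sketch below applies one calculation to several sets of measurements in a single step. The site names and values are invented for this example; they are assumptions, not course data.

```r
# Hypothetical measurements from three sites (values invented for illustration)
sites <- list(
  north = c(12.1, 11.8, NA, 12.6),
  south = c(9.4, 9.9, 10.2, NA),
  west  = c(14.0, NA, 13.5, 13.8)
)

# One instruction, applied identically to every site.
# na.rm = TRUE drops missing values before averaging.
site_means <- sapply(sites, mean, na.rm = TRUE)
site_means
```

Rerunning the script repeats exactly the same steps, whereas repeating them by hand in a GUI leaves no record of what was done to each site.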
While we’ve mentioned scripts, we’ve almost exclusively been working
in RMarkdown documents, not R scripts, and this distinction is worth
breaking down a bit. Each has its place within a research context and
within the context of research data management (RDM). One of the most
important pieces of RDM is documentation, and a key piece of
documentation is the record of the decisions one makes while working
through the data, whether in the cleaning or the analytical stage of the
process. This latitude has been referred to as ‘researcher degrees of
freedom’. That is, all
the decisions a researcher is at liberty to make in this process,
whether it be defining an outlier, rounding a variable, grouping
variables (choosing a bin size for age ranges for example), deciding you
need to collect more data after having looked at the data, etc. See
False-Positive Psychology: Undisclosed Flexibility in Data
Collection and Analysis Allows Presenting Anything as Significant
available at https://doi.org/10.1177/0956797611417632.
An RMarkdown document, sometimes referred to as an electronic
notebook, is an excellent platform for maintaining a record of
these decisions: as you articulate your steps in plain language in
markdown, your explanation sits directly alongside the code that
processes your data. Earlier we talked about literate programming, the
idea that programming instructions should be both human and machine
interpretable and should tell a human what you want the computer to do.
An RMarkdown document enhances this concept with its rich context.
However, at a certain stage, this rich context may be more appropriately
decoupled from the scripting process, especially as we move in the
direction of confirmatory analyses, where we define in advance how we’ll
be handling the data. In these instances, the context is more
appropriately contained within some form of study registration. In such
a situation, we can use a script: a document that contains only our
instruction set (R code), perhaps with a few comments, but no markdown.
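As a minimal sketch of what such a script might look like, the example below contains only R code and a few comments recording two decisions of the kind described above: an outlier rule and a bin size for age ranges. The data, the 1000 ms cutoff, and the 20-year bins are all hypothetical choices made for illustration.

```r
# Hypothetical data: ages and reaction times in ms (invented for illustration)
dat <- data.frame(
  age = c(19, 23, 31, 38, 44, 52, 67),
  rt  = c(410, 395, 430, 2100, 450, 470, 520)
)

# Decision 1 (assumed rule): reaction times above 1000 ms are outliers.
rt_cutoff <- 1000
dat_clean <- dat[dat$rt <= rt_cutoff, ]

# Decision 2 (assumed rule): group ages into 20-year bins.
# right = FALSE makes each bin include its lower bound, e.g. [20,40).
dat_clean$age_group <- cut(
  dat_clean$age,
  breaks = c(0, 20, 40, 60, 80),
  right = FALSE
)

# Count the remaining observations in each age group
table(dat_clean$age_group)
```

In an RMarkdown document, the reasoning behind each decision would sit in the surrounding text. In a script used for a confirmatory analysis, that reasoning would instead live in the study registration, and the comments only point to which rule is being applied.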
Exploratory research is hypothesis generating, while
confirmatory research is hypothesis testing. It is rare that
confirmatory research is ever conducted fully independent of
exploratory research, as confirmatory research often suggests other
paths of inquiry. However, it is critical to clearly differentiate
between the two and to document the two processes appropriately:
generally, exploratory research is documented while doing the
exploration, and confirmatory research is documented in advance.
Increased System Complexity
As we move from a script, to an RMarkdown document, to a GUI, we
increase the complexity of the system(s) that we’re working in.
Essentially, there are more moving parts. We’ll talk about this more
when we cover dependencies, but it’s worth introducing here the concept
of abstraction as it relates to balancing a system’s ease of use against
the ability to replicate results. We’ll touch on this again when we look
at building visualizations in R. The layers of abstraction behind a
programming language like R are, in simplified form:
0s and 1s -> machine code -> low level programming -> high
level programming -> interactive programming (i.e. scientific
computing, scripting) -> electronic notebooks -> GUI
applications
This abstraction can make it difficult to understand exactly what’s
happening, especially as a novice user, but an intuitive framework can
help you move between systems. When we write code in R, it goes through
several stages of processing. Often the first is simply a check that the
data will work with the function it’s being fed into. The work is then
often carried out by routines written in a lower level language (like C
or Fortran), which have been compiled into machine code, an even more
rudimentary instruction set. That machine code is stored on disk as
binary values and processed by your CPU.
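One way to glimpse these layers from within R is to look at how a familiar function is defined. The sketch below uses base R's mean.default and sum as examples. Typing a function's name without parentheses prints its R source, and the exact source shown may differ between R versions.

```r
# Print the R-level source of mean.default: it first checks its
# arguments in R code, then hands the calculation to internal code.
print(mean.default)

# Search the source text for the hand-off to R's internal (compiled) code
hands_off <- any(grepl(".Internal", deparse(mean.default), fixed = TRUE))
hands_off

# sum() has no R-level body at all: it is a "primitive",
# implemented directly in compiled code.
is.primitive(sum)
```

The checks at the top of mean.default correspond to the first stage described above, and the hand-off at the end is where the lower level, compiled code takes over.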