Data processing and analysis should be reproducible, independent of which software, programming language, or operating system you use. This is best achieved by choosing automated script-based workflows (over manual point-and-click procedures), supported by adequate documentation and shared code that allow others to regenerate results. Because analysis scripts are rerun many times as they are corrected and refined, automation is essential for both reproducibility and efficiency. This section outlines a recommended workflow for R users.
- Create a self-contained project folder. Include data, code, documentation, and outputs in a single structured environment, ensuring the project remains understandable, reproducible, and portable across systems and collaborators. If you use the free and open source software RStudio to manage your R project, your project directory (or folder) should contain a .Rproj file (see R tutorial). Use relative paths (i.e. “./subfolder”, where “.” represents the root of your .Rproj directory) or the library here, so the project stays portable to another environment.
- Use a standard folder structure. Your code repository should include a standard folder structure that makes sense for your type of research, ideally shared across your team members. You can, for instance, use our research project template.
- Stop clicking, start coding. Automate all possible steps, including data acquisition (see 2.1. Data Collection), data processing and transformation, data analyses, data visualization, and results reporting (see 3.2. Reporting Results).
- Structure, comment, and standardize your scripts. R scripts themselves should follow current standards to increase their readability (see Readable Code Lecture). Use meaningful names for variables, functions, and scripts. Add comments to your code explaining why you made a decision, any known limitations of your code, and citations of methods. Do not include sensitive information, such as credentials or the names of excluded patients, in code comments!
- Define your own functions rather than copying and pasting pieces of code, which makes the code hard to keep error-free. Functions are ‘self-contained’ sets of commands that accomplish a specific task. They usually ‘take in’ data or parameter values (these inputs are called ‘function arguments’), process them, and ‘return’ a result. See our R tutorial and data simulation tutorial for examples.
- Set seeds for random processes to enable exact replication. A seed is a number used to initialize a pseudorandom number generator algorithm. It serves as the starting point for a sequence of numbers that appear random but are actually produced by a deterministic, fixed algorithm. See e.g. our data simulation tutorial for examples.
- Follow accessibility standards when generating outputs (e.g. use a colorblind-friendly color scheme for figures).
- Follow a style guide to increase readability. Use automated styling tools (e.g. styler, lintr).
- Use the LRZ Compute Cloud for data-intensive analyses. LRZ Supercomputing provides virtual machines, high-performance computing, and storage to researchers of LMU Munich.
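Several of the recommendations above (relative paths, user-defined functions, and seeds) can be illustrated together in a few lines of base R. This is a minimal sketch, not a prescribed template: the folder names, the file name `measurements.csv`, the function `standardize()`, and the seed value are all hypothetical. If you use the here package, `here::here("data", "raw", "measurements.csv")` would play the same role as `file.path()` below.

```r
# Sketch only: folder, file, and function names are hypothetical.

# Build paths relative to the project root instead of hard-coding
# an absolute path that only exists on one computer.
raw_data_path <- file.path("data", "raw", "measurements.csv")

# A small reusable function instead of copy-pasted code:
# rescale a numeric vector to mean 0 and standard deviation 1.
standardize <- function(x) {
  stopifnot(is.numeric(x))
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Set a seed so that the simulated values are identical on every run.
set.seed(2024)
simulated_scores <- rnorm(n = 100, mean = 50, sd = 10)
z_scores <- standardize(simulated_scores)

# Resetting the same seed reproduces exactly the same numbers.
set.seed(2024)
identical(simulated_scores, rnorm(n = 100, mean = 50, sd = 10))
```

Because the rescaling lives in one function, a correction to it only has to be made in one place rather than in every script that repeats the calculation.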
Version control tracks changes to files over time. You can see what changed, when, and why. You can revert to previous versions. Collaborators can work without overwriting each other.
In a version controlled workflow, you back up your local Git repositories on the cloud-based platforms GitHub or LRZ GitLab and share access to the online version of your repositories with your collaborators.
Learn to create Git branches to collaborate on the same piece of code within a single repository. Branches are temporary copies where you can work without breaking the original source code, and which you later merge back into the main branch (see Advanced Git tutorial).
Maintain efficient communication to coordinate collaborative work. GitHub and GitLab facilitate collaboration through “issues” (precise descriptions of something to fix) and “discussions” (asynchronous threads for working out how to resolve a problem), and they help resolve “conflicts” (i.e. collaborators wanting to merge changes to the exact same line of a script). To complement this, LMU Munich offers LMU chat (Matrix), a secure open source chat service for all LMU members, to which you can also invite external collaborators.
- Git is a version control system that tracks changes in text files (e.g. CSV, plain text, R, Python). Git should be installed, and your Git repositories stored, in your local environment (i.e. on your computer, not on a network or cloud drive; see Git tutorial).
- GitHub is the most popular cloud-based platform for software development with Git. It is free to use but proprietary and US-based, and it provides collaboration features such as pull requests and issues (see GitHub tutorial). Do not store any sensitive information on GitHub, even in a private repository.
- LRZ GitLab is a hosting platform that works much like GitHub but is free and open source. It is installed on the LRZ servers for LMU Munich and can therefore be considered secure when the repository is private.
While your LRZ GitLab account is associated with your LMU Munich affiliation, your GitHub account can be associated with your private email, be included in your CV, and be used for public sharing of your data and code (see 4. Preserve & Share).
In a version controlled workflow, you back up your local Git repositories on either GitHub or LRZ GitLab through a secure SSH connection (see GitHub tutorial) and share access to these online repositories with your collaborators.
If you work with sensitive data, you must not include the raw or processed data in the version-controlled repository that will end up being shared publicly.
Instead, explicitly exclude the data directory using the .gitignore file from the start, or, at the time of sharing, create a new local repository that contains all project files except the data.
Importantly, if data are removed from an existing repository, they may still remain accessible in the repository’s history, since previous states of the project can be restored. If sensitive data are accidentally committed and pushed, it is possible to rewrite the repository history to remove them retrospectively. However, this process is complex and error-prone, so it is best avoided by ensuring that sensitive data are excluded from version control from the outset.
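Excluding a data directory from the start can be done by hand in a text editor, or from R. The sketch below is one possible way to do it in base R, under stated assumptions: `add_to_gitignore()` is a hypothetical helper written for this example, and `data/` and `.Renviron` are hypothetical names for a data folder and a file holding credentials. The demonstration runs in a temporary folder so that no real project is modified.

```r
# Sketch only: add_to_gitignore() is a hypothetical helper, and the
# ignored names ("data/", ".Renviron") must be adapted to your project.
add_to_gitignore <- function(rules, project_dir = ".") {
  ignore_file <- file.path(project_dir, ".gitignore")
  existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character(0)
  # Only add rules that are not already listed, to avoid duplicates.
  new_rules <- setdiff(rules, existing)
  writeLines(c(existing, new_rules), ignore_file)
  invisible(new_rules)
}

# Demonstration in a temporary folder standing in for a project root.
demo_dir <- tempfile("demo_project_")
dir.create(demo_dir)
add_to_gitignore(c("data/", ".Renviron"), project_dir = demo_dir)

# Show the resulting ignore rules, one per line.
readLines(file.path(demo_dir, ".gitignore"))
```

Note that a .gitignore rule only affects files Git is not yet tracking, which is one more reason to set it up before the first commit.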
Create an LRZ GitLab “organization” for the team. This allows repositories, permissions, and project resources to be managed centrally rather than under individual accounts. It also ensures continuity when team members leave, as ownership can be transferred to, e.g., the PI and other administrators of the organization.
Manage your computational environment by explicitly recording the software, package versions, and dependencies required for your analyses, ensuring results can be reproduced across systems and over time. Tools such as package managers (e.g. renv for R packages, Conda for Python packages) or broader containers (e.g. Docker or Binder) help stabilize workflows and prevent inconsistencies caused by package or software updates.
For an R project repository:
- Activate renv to keep track of all package versions (see our renv tutorial). This way, you or someone else can reproduce your results on another computer or at a later time using the same R package versions.
Before publishing your project (see 4. Preserve & Share):
- Record your dependencies in your README file for possible reconstruction with repo2docker or Binder (see Code Publishing tutorial).
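As a lightweight complement to the lockfile that renv maintains, a human-readable summary of the R version and package versions can be generated with base R and pasted into the README. The following is only a sketch under stated assumptions: the package list `used_packages` and the output file name `dependencies.txt` are hypothetical, and the summary does not replace renv.

```r
# Sketch only: the package list and output file name are hypothetical.
# Replace the entries with the packages your scripts actually load.
used_packages <- c("stats", "utils")

# Look up the installed version of each package as a character string.
package_versions <- vapply(
  used_packages,
  function(pkg) as.character(utils::packageVersion(pkg)),
  character(1)
)

# One line for the R version, then one bullet per package.
dependency_lines <- c(
  paste("R version:", as.character(getRversion())),
  paste0("- ", names(package_versions), " ", package_versions)
)

# Write the summary to a text file (here in a temporary folder)
# whose contents can be copied into the README.
dependency_file <- file.path(tempdir(), "dependencies.txt")
writeLines(dependency_lines, dependency_file)
cat(dependency_lines, sep = "\n")
```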
As with all documentation, your project repository’s documentation should be written early: initially for your near-future self to support efficient re-engagement after interruptions, then revised for internal team review, and ultimately expanded and refined for public sharing (see 4.2. Open Source Code).
- Create a README (e.g. a .md or .txt file) early and update it as you go. Your README is the entry point to your project. A good README answers the essential questions: who created the scripts, what they contain, how they relate to each other and in which order they should be run, what the dependencies of the project are, how to obtain or access the input data, and whether the code can be reused.
- Annotate your code, explaining why you made each decision. All parameter values used as input for a function, and other analytical decisions, should be briefly justified in code comments so that the justification can later be included in your manuscript. Do not include sensitive information, such as credentials or the names of excluded patients, in code comments!
- Update your data dictionary and README files. Your data documentation should be updated to include all data exclusions, changes in the range of possible values, etc. (see 2.2.4. Documentation).
Example README to allow team members to review your code:
# Analysis of Treatment Effects
## Requirements
- R version 4.3+
- Packages listed in renv.lock
## Running the Analysis
1. Install dependencies: `renv::restore()`
2. Run scripts in order: 01_preprocessing.R, 02_analysis.R
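
The "run scripts in order" step of such a README can itself be automated with a small driver script. The sketch below assumes a hypothetical helper `run_pipeline()` (not part of any package) and assumes that the scripts pass results to each other through files rather than through objects left in memory.

```r
# Sketch only: run_pipeline() is a hypothetical helper written for this example.
run_pipeline <- function(scripts, script_dir = ".") {
  script_paths <- file.path(script_dir, scripts)

  # Fail early, before running anything, if a script is missing.
  missing_scripts <- script_paths[!file.exists(script_paths)]
  if (length(missing_scripts) > 0) {
    stop("Missing scripts: ", paste(missing_scripts, collapse = ", "))
  }

  for (path in script_paths) {
    message("Running ", path)
    # Each script gets a fresh environment, so objects left over from
    # one step cannot silently affect the next step.
    source(path, local = new.env(parent = globalenv()))
  }
  invisible(script_paths)
}
```

From the project root, calling `run_pipeline(c("01_preprocessing.R", "02_analysis.R"))` would then run both scripts in the documented order.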