R Programming

4 programming

4 programming

R Programming

Programming in R

Introduction

Format of this Manual

The R & BioConductor manual provides a general introduction to the usage of the R environment and its basic command syntax.

Code Editors for R

Programming in R using Vim or Emacs Programming in R using RStudio

Integrating R with Vim and Tmux

Users interested in integrating R with vim and tmux may want to consult the Vim-R-Tmux configuration page.

Finding Help

Control Structures

Conditional Executions

Comparison Operators

  • equal: ==
  • not equal: !=
  • greater/less than: > <
  • greater/less than or equal: >= <=

Logical Operators

  • and: &
  • or: |
  • not: !

If Statements

Ifelse Statements

[1] 1 2 3 4 0 0 0 0 9 10

The most commonly used loop structures in R are for , while and apply loops. Less common are repeat loops. The break function is used to break out of loops, and next halts the processing of the current iteration and advances the looping index.

myve <- NULL # Creates empty storage container

for(i in seq(along=mydf[,1])) <

myve <- c(myve, mean(as.numeric(mydf[i, 1:3]))) # Note: inject approach is much faster than append with ‘c’. See below for details.

[1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000

[9] 2.900000 3.166667 3.533333 3.266667 3.066667 2.800000 3.666667 3.866667

for(i in seq(along=x)) <

[1] 0 1 2 3 1 1 1 1 1 1

Example: stop on condition and print error message

for(i in seq(along=x)) <

stop(“values need to be <5”)

Error: values need to be <5

While Loop

Apply Loop Family

For T wo-Dimensional Data Sets: apply

X : array, matrix or data.frame; MARGIN : 1 for rows, 2 for columns, c(1,2) for both; FUN : one or more functions; ARGs : possible arguments for function

apply(iris[,1:3], 1, mean)

[1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000

apply(as.matrix(x), 1, test) # Returns same result as previous for loop*

[1] 0 1 2 3 1 1 1 1 1 1

[1] 0 1 2 3 1 1 1 1 1 1

For R agged Arrays: tapply

tapply(as.vector(iris[,4]), factor(iris[,5]), mean)

setosa versicolor virginica

aggregate(iris[,1:4], list(iris$Species), mean)

Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width

1 setosa 5.006 3.428 1.462 0.246

2 versicolor 5.936 2.770 4.260 1.326

3 virginica 6.588 2.974 5.552 2.026

For Vectors and Lists: lapply and sapply

Both apply a function to vector or list objects. The function lapply returns a list, while sapply attempts to return the simplest data object, such as vector or matrix instead of list .

Sepal.Length Sepal.Width Petal.Length

Other Loops

Repeat Loop

Loop is repeated until a break is specified. This means there needs to be a second statement to test whether or not to break from the loop.

Improving Speed Performance of Loops

system.time(for(i in seq(along=myMA[,1])) results <- c(results, mean(myMA[i,])))

user system elapsed

39.156 6.369 45.559

system.time(for(i in seq(along=myMA[,1])) results[i] <- mean(myMA[i,]))

user system elapsed

user system elapsed

user system elapsed

user system elapsed

0.8505795 1.3419460 1.3768646 1.3005428

user system elapsed

0.8505795 1.3419460 1.3768646 1.3005428

myList <- tapply(colnames(myMA), c(1,1,1,2,2,2,3,3,4,4), list)

myMAmean <- sapply(myList, function(x) rowMeans(myMA[,x]))

colnames(myMAmean) <- sapply(myList, paste, collapse=”_”)

C1_C2_C3 C4_C5_C6 C7_C8 C9_C10

1 0.0676799 -0.2860392 0.09651984 -0.7898946

2 -0.6120203 -0.7185961 0.91621371 1.1778427

3 0.2960446 -0.2454476 -1.18768621 0.9019590

4 0.9733695 -0.6242547 0.95078869 -0.7245792

## Alternative to achieve the same result with similar performance, but in a much less elegant way

myselect <- c(1,1,1,2,2,2,3,3,4,4) # The colums are named according to the selection stored in myselect

myList <- tapply(seq(along=myMA[1,]), myselect, function(x) paste(“myMA[ ,”, x, “]”, sep=””))

myList <- sapply(myList, function(x) paste(“(“, paste(x, collapse=” + “),”)/”, length(x)))

myMAmean <- sapply(myList, function(x) eval(parse(text=x)))

colnames(myMAmean) <- tapply(colnames(myMA), myselect, paste, collapse=”_”)

C1_C2_C3 C4_C5_C6 C7_C8 C9_C10

1 0.0676799 -0.2860392 0.09651984 -0.7898946

2 -0.6120203 -0.7185961 0.91621371 1.1778427

3 0.2960446 -0.2454476 -1.18768621 0.9019590

4 0.9733695 -0.6242547 0.95078869 -0.7245792

A very useful feature of the R environment is the possibility to expand existing functions and to easily write custom functions. In fact, most of the R software can be viewed as a series of R functions.

Example: Function basics

myfct # prints definition of function

myfct(x1=2, x2=5) # applies function to values 2 and 5

myfct(x1=2) # does the same as before, but the default value ‘5’ is used in this case

cat(“my function returns:”, “\n”)

myfct2(x1=5) # performs calculation on default vector (z1) that is defined in the function body

my function returns:

[1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0

my function returns:

[1] 6.0 5.8 5.6 5.4 5.2 5.0 4.8 4.6 4.4 4.2 4.0

Control utilities for functions: return , warning and stop

if (x1>=0) print(x1) else stop(“This function did not finish, because x1 < 0”)

warning(“Value needs to be > 0”)

In myfct(x1 = 2) : Value needs to be > 0

Error in myfct(x1 = -2) : This function did not finish, because x1 < 0

Useful Utilities

Debugging Utilities

Regular Expressions

The grep function can be used for finding patterns in strings, here letter A in vector month.name .

x <- x[c(grep(“^J”, as.character(x), perl = TRUE))]

Interpreting Character String as Expression

mylist[1] # prints name of 1st entry in vector but does not execute it as expression that returns values of 10th object

get(mylist[1]) # uses 1st entry name in vector and executes it as expression

eval(parse(text=mylist[1])) # alternative approach to obtain similar result

Time, Date and Sleep

user system elapsed

[1] “Wed Dec 11 15:31:17 2012”

Calling External Software with System Command

Related utilities on Windows operating systems

shell.exec(“C:/Documents and Settings/Administrator/Desktop/my_file.txt”) # opens file with associated program

Miscellaneous Utilities

(1) Batch import and export of many files.

In the following example all file names ending with *.txt in the current directory are first assigned to a list (the ‘ $ ‘ sign is used to anchor the match to the end of a string). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the assign function. Consult help with ?read.table to understand arguments row.names=1 and comment.char = “A” . Third, the data frames are exported using their names for file naming and appending the extension *.out .

x <- read.table(i, header=TRUE, comment.char = “A”, sep=”\t”)

write.table(x, paste(i, c(“.out”), sep=””), quote=FALSE, sep=”\t”, col.names = NA)

(2) Running Web Applications (basics on designing web client/crawling/scraping scripts in R)

Example for obtaining MW values for peptide sequences from the EXPASY’s pI/MW Tool web page.

for(i in myentries) <

myurl <- paste(“http://ca.expasy.org/cgi-bin/pi_tool?protein=”, i, “&resolution=monoisotopic”, sep=””)

mylines <- res[grep(“Theoretical pI/Mw:”, res)]

myresult <- c(myresult, as.numeric(gsub(“.*/ “, “”, mylines)))

Sys.sleep(1) # halts process for one sec to give web service a break

final <- data.frame(Pep=myentries, MW=myresult)

cat(“\n The MW values for my peptides are:\n”)

1 MKWVTFISLLFLFSSAYS 2139.11

2 MWVTFISLL 1108.60

3 MFISLLFLFSSAYS 1624.82

Running R Programs

(2.1) Syntax for running R programs from the command-line. Requires in first line of my_script.R the following statement: #!/usr/bin/env Rscript

All commands starting with a ‘$’ sign need to be executed from a Unix or Linux shell.

(2.2) Alternatively, one can use the following syntax to run R programs in BATCH mode from the command-line.

The output file lists the commands from the script file and their outputs. If no outfile is specified, the name used is that of infile and .Rout is appended to outfile. To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: options(echo=FALSE) . If the command is run like this R CMD BATCH –no-save my_script.R , then nothing will be saved in the .Rdata file which can get often very large. More on this can be found on the help pages: $ R CMD BATCH –help or ?BATCH .

(2.3) Another alternative for running R programs as silently as possible.

Argument –slave makes R run as ‘quietly’ as possible.

Create an R script, here named test.R , like this one:

In the given example the number 10 is passed on from the command-line as an argument to the R script which is used to return to STDOUT the first 10 rows of the iris sample data. If several arguments are provided, they will be interpreted as one string that needs to be split it in R with the strsplit function.

(4) Submitting R script to a Linux cluster via Torque

Create the following shell script my_script.sh

R CMD BATCH –no-save my_script.R

This script doesn’t need to have executable permissions. Use the following qsub command to send this shell script to the Linux cluster from the directory where the R script my_script.R is located. To utilize several CPUs on the Linux cluster, one can divide the input data into several smaller subsets and execute for each subset a separate process from a dedicated directory.

(5) Submitting jobs to Torque or any other queuing/scheduling system via the BatchJobs package. This package provides one of the most advanced resources for submitting jobs to queuing systems from within R. A related package is BiocParallel from Bioconductor which extends many functionalities of BatchJobs to genome data analysis. Useful documentation for BatchJobs : Technical Report, GitHub page, Slide Show, Config samples.

## Loads configuration file. Here .BatchJobs.R containing just this line:

## The template file torque.tmpl is expected to be in the current working

## director. It can be downloaded from here:

reg <- makeRegistry(id=”BatchJobTest”, work.dir=”results”)

## Constructs a registry object. Output files from R will be stored under directory “results”,

## while the standard objects from BatchJobs will be stored in the directory “BatchJobTest-files” .

ids <- batchMap(reg, fun=f, 1:10)

done <- submitJobs(reg, resources=list(walltime=3600, nodes=”1:ppn=4″, memory=”4gb”))

Object-Oriented Programming (OOP)

R supports two systems for object-oriented programming (OOP). An older S3 system and a more recently introduced S4 system. The latter is more formal, supports multiple inheritance, multiple dispatch and introspection. Many of these features are not available in the older S3 system. In general, the OOP approach taken by R is to separate the class specifications from the specifications of generic functions (function-centric system). The following introduction is restricted to the S4 system since it is nowadays the preferred OOP method for R. More information about OOP in R can be found in the following introductions: Vincent Zoonekynd’s introduction to S3 Classes, S4 Classes in 15 pages, Christophe Genolini’s S4 Intro, The R.oo package, BioC Course: Advanced R for Bioinformatics, Programming with R by John Chambers and R Programming for Bioinformatics by Robert Gentleman.

Define S4 Classes

(A) Define S4 Classes with setClass() and new()

prototype=prototype(a=y[1:2,]), # Defines default value (optional)

return(paste(“expected matrix, but obtained”, class(object@a)))

  • Class : the name of the class
  • representation : the slots that the new class should have and/or other classes that this class extends.
  • prototype : an object providing default data for the slots.
  • contains : the classes that this class extends.
  • validity , access , version : control arguments included for compatibility with S-Plus.
  • where : the environment to use to store or remove the definition as meta data.

(B) The function new creates an instance of a class (here myclass )

An object of class “myclass”

[1,] 1 11 21 31 41

[2,] 2 12 22 32 42

Error in validObject(.Object) :

invalid class “myclass” object: expected matrix, but obtained data.frame

  • Class : the name of the class
  • . : Data to include in the new object with arguments according to slots in class definition.

(C) A more generic way of creating class instances is to define an initialization method (details below)

new(“myclass”, a = y)

new(“myclass”, a = y)> new(“myclass”, a = y)

An object of class “myclass”

(D) Usage and helper functions

initialize(.Object=myobj, a=as.matrix(cars[1:3,])) # Creates a new S4 object from an old one.

# removeClass(“myclass”) # Removes object from current session; does not apply to associated methods.

(E) Inheritance: allows to define new classes that inherit all properties (e.g. data slots, methods) from their existing parent classes

setClass(“myclass2”, representation(c = “numeric”, d = “numeric”))

setClass(“myclass3”, contains=c(“myclass1”, “myclass2”))

new(“myclass3”, a=letters[1:4], b=letters[1:4], c=1:4, d=4:1)

An object of class “myclass3”

Class “myclass1” [in “.GlobalEnv”]

Class: character character

Class “myclass2” [in “.GlobalEnv”]

Class: numeric numeric

Class “myclass3” [in “.GlobalEnv”]

Class: character character numeric numeric

The argument contains allows to extend existing classes; this propagates all slots of parent classes.

(F) Coerce objects to another class

(G) Virtual classes are constructs for which no instances will be or can be created. They are used to link together classes which may have distinct representations (e.g. cannot inherit from each other) but for which one wants to provide similar functionality. Often it is desired to create a virtual class and to then have several other classes extend it. Virtual classes can be defined by leaving out the representation argument or including the class VIRTUAL :

setClass(“myVclass”, representation(a = “character”, “VIRTUAL”))

(H) Functions to introspect classes

  • getClass(“myclass”)
  • getSlots(“myclass”)
  • slotNames(“myclass”)
  • extends(“myclass2”)

      Assign Generics and Methods

      setMethod(f=”acc”, signature=”myclass”, definition=function(x) <

      [1,] 1 11 21 31 41

      [2,] 2 12 22 32 42

      (B.1) Replacement method using custom accessor function ( acc <- )

      setReplaceMethod(f=”acc”, signature=”myclass”, definition=function(x, value) <

      ## After this the following replace operations with ‘acc’ work on new object class

      acc(myobj)[1,1] <- 999 # Replaces first value

      colnames(acc(myobj)) <- letters[1:5] # Assigns new column names

      rownames(acc(myobj)) <- letters[1:10] # Assigns new row names

      An object of class “myclass”

      a 999 11 21 31 41

      b 2 12 22 32 42

      (B.2) Replacement method using ” [ ” operator ( [<- )

      An object of class “myclass”

      a 999 999 21 31 41

      b 2 12 22 32 42

      (C) Define behavior of ” [ ” subsetting operator (no generic required!)

      definition=function(x, i, j, . drop) <

      myobj[1:2,] # Standard subsetting works now on new class

      An object of class “myclass”

      a 999 999 21 31 41

      b 2 12 22 32 42

      (D) Define print behavior

      cat(“An instance of “, “\””, class(object), “\””, ” with “, length(acc(object)[,1]), ” elements”, “\n”, sep=””)

      print(as.data.frame(rbind(acc(object)[1:2,], . =rep(“. “, length(acc(object)[1,])),

      myobj # Prints object with custom method

      An instance of “myclass” with 10 elements

      a 999 999 21 31 41

      b 2 12 22 32 42

      i 9 19 29 39 49

      j 10 20 30 40 50

      (E) Define a data specific function (here randomize row order)

      setMethod(f=”randomize”, signature=”myclass”, definition=function(x) <

      j 10 20 30 40 50

      b 2 12 22 32 42

      (F) Define a graphical plotting function and allow user to access it with generic plot function

      (G) Functions to inspect methods

      • showMethods(class=”myclass”)
      • findMethods(“randomize”)
      • getMethod(“randomize”, signature=”myclass”)
      • existsMethod(“randomize”, signature=”myclass”)

          Building R Packages

          To get familiar with the structure, building and submission process of R packages, users should carefully read the documentation on this topic available on these sites:

          Short Overview of Package Building Process

        (A) Automatic package building with the package.skeleton function:

        Note: this is an optional but very convenient function to get started with a new package. The given example will create a directory named mypackage containing the skeleton of the package for all functions, methods and classes defined in the R script(s) passed on to the code_files argument. The basic structure of the package directory is described here. The package directory will also contain a file named ‘ Read-and-delete-me ‘ with the following instructions for completing the package:

        • Edit the help file skeletons in man , possibly combining help files for multiple functions.
        • Edit the exports in NAMESPACE , and add necessary imports.
        • Put any C/C++/Fortran code in src .
        • If you have compiled code, add a useDynLib() directive to NAMESPACE .
        • Run R CMD build to build the package tarball.
        • Run R CMD check to check the package tarball.
        • Read Writing R Extensions for more information.

        (B) Once a package skeleton is available one can build the package from the command-line (Linux/OS X):

        Subsequently, the package tarball needs to be checked for errors with:

        $ R CMD INSTALL -l tempdir mypackage_1.0.tar.gz

        $ zip -r mypackage mypackage

        prompt(myfct) # writes help file myfct.Rd

        promptClass(“myclass”) # writes file myclass-class.Rd

        promptMethods(“mymeth”) # writes help file mymeth.Rd

          • The resulting *.Rd help files can be edited in a text editor and properly rendered and viewed from within R like this:

        Rd2txt(“./mypackage/man/myfct.Rd”) # renders *.Rd files as they look in final help pages

        checkRd(“./mypackage/man/myfct.Rd”) # checks *.Rd help file for problems

        (E) Submit package to a public repository

        Reproducible Research by Integrating R with Latex or Markdown

        R Programming Exercises

        Exercise Slides

        Sample Scripts

        Batch Operations on Many Files

        ## (2) Create some files as sample data

        for(i in month.name) <

        mydf <- data.frame(Month=month.name, Rain=runif(12, min=10, max=100), Evap=runif(12, min=1000, max=2000))

        write.table(mydf, file=paste(i , “.infile”, sep=””), quote=F, row.names=F, sep=”\t”)

        x <- read.table(files[i], header=TRUE, row.names=1, comment.char = “A”, sep=”\t”)

        x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) # calculates sum and mean for each data frame

        assign(files[i], x) # generates data frame object and names it after content in variable ‘i’

        print(files[i], quote=F) # prints loop iteration to screen to check its status

        write.table(x, paste(files[i], c(“.out”), sep=””), quote=FALSE, sep=”\t”, col.names = NA)

        name_df <- data.frame(Old_name=sort(files), New_name=sort(month.abb))

        for(i in seq(along=name_df[,1])) <

        x <- read.table(as.vector(name_df[i,1]), header=TRUE, row.names=1, comment.char = “A”, sep=”\t”)

        x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))

        assign(as.vector(name_df[i,2]), x) # generates data frame object and names it after ‘i’ entry in column 2

        write.table(x, paste(as.vector(name_df[i,2]), c(“.out”), sep=””), quote=FALSE, sep=”\t”, col.names = NA)

        all_files <- data.frame(files=NULL, Month=NULL, Gain=NULL , Loss=NULL, sum=NULL, mean=NULL) # creates empty data frame container

        for(i in seq(along=files)) <

        x <- read.table(files[i], header=TRUE, row.names=1, comment.char = “A”, sep=”\t”)

        x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))

        x <- data.frame(file=rep(files[i], length(x[,1])), x) # adds file tracking column to x

        all_files <- rbind(all_files, x) # appends data from all files to data frame ‘all_files’

        write.table(all_files, file=”all_files.xls”, quote=FALSE, sep=”\t”, col.names = NA)

        source(“my_script.R”) # execute from R console

        $ R CMD BATCH my_script.R # execute from shell

        Large-scale Array Analysis

        Sample script to perform large-scale expression array analysis with complex queries: lsArray.R . To demo what the script does, run it like this:

        Graphical Procedures: Feature Map Example

        Script to plot feature maps of genes or chromosomes: featureMap.R. To demo what the script does, run it like this:

        Sequence Analysis Utilities

        Pattern Matching and Positional Parsing of Sequences

        Functions for importing sequences into R, retrieving reverse and complement of nucleotide sequences, pattern searching, positional parsing and exporting search results in HTML format: patternSearch.R. To demo what the script does, run it like this:

        Identify Over-Represented Strings in Sequence Sets

        Functions for finding over-represented words in sets of DNA, RNA or protein sequences: wordFinder.R. To demo what the script does, run it like this:

        Translate DNA into Protein

        Script ‘translateDNA.R’ for translating NT sequences into AA sequences (required codon table). To demo what the script does, run it like this:

        Subsetting of Structure Definition Files (SDF)

        Script for importing and subsetting SDF files: sdfSubset.R. To demo what the script does, run it like this:

        Managing Latex BibTeX Databases

        Script for importing BibTeX databases into R, retrieving the individual references with a full-text search function and viewing the results in R or in HubMed: BibTex.R. To demo what the script does, run it like this:

        Loan Payments and Amortization Tables

        This script calculates monthly and annual mortgage or loan payments, generates amortization tables and plots the results: mortgage.R. To demo what the script does, run it like this:

        Course Assignment: GC Content, Reverse & Complement

        Apply the above information to write a function that calculates for a set of DNA sequences their GC content and generates their reverse and complement. Here are some useful commands that can be incorporated in this function:

        x <- as.integer(runif(20, min=1, max=5))

        paste(x, sep = “”, collapse =””)

        z1 <- data.frame(ID=seq(along=z1), Seq=z1)

        ## Use ‘apply’ or ‘for loop’ to apply the above operations to all sequences in sample data frame ‘z1’

        ## Calculate in the same loop the GC content for each sequence using the following command

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.