As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This means dealing with many variables that hold text: sequences of characters that may also contain spaces or digits, such as gene names, functional categories or amino acid change annotations. In programming languages, this type of data is called a string.
Finding matches is one of the most common tasks involving strings. Along the way, it is often necessary to format or recode these variables, as well as to search for patterns.
Some R functions I have found quite useful when handling this kind of data:
- colsplit() in the reshape package. It splits a column into several columns based on a regular expression
- grepl() for subsetting based on string values that match a given pattern. Here again, a regular expression describes the pattern
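As a rough illustration of the two functions above, here is a minimal sketch. The data frame, its column names and the annotation values are made up for this example, and the colsplit() call is skipped unless the reshape package happens to be installed.

```r
# Toy annotation table; the values are illustrative only, not real results.
variants <- data.frame(
  annotation = c("BRCA1:p.V600E", "TP53:p.R175H", "BRCA2:p.K3326X"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per element, so the result can index rows.
# "^BRCA[0-9]+" is a regular expression: "BRCA" at the start of the string,
# followed by one or more digits.
is_brca <- grepl("^BRCA[0-9]+", variants$annotation)
brca_variants <- variants[is_brca, , drop = FALSE]
print(brca_variants)

# colsplit() splits one column into several at each match of the separator.
# The separator is passed by position; only run if reshape is available.
if (requireNamespace("reshape", quietly = TRUE)) {
  parts <- reshape::colsplit(variants$annotation, ":",
                             names = c("gene", "aa_change"))
  print(cbind(variants, parts))
}
```

Because grepl() gives back a logical vector, the same result can also be negated with ! to keep only the rows that do not match.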
As the arguments of these functions suggest, it pays to get comfortable with regular expressions when manipulating strings. More information on building patterns with regular expressions can be found in this tutorial and in the regex R documentation.
Some other useful functions are included in the stringr package which, as its title says, aims to “Make it easier to work with strings”:
- str_detect() detects the presence or absence of a pattern in a string. It was originally a wrapper around the grepl function listed above (newer versions of stringr are built on the stringi package instead)
- fixed() marks a pattern to be matched as a fixed sequence of characters, instead of as a regular expression
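To make the difference between a regular expression and a fixed pattern concrete, here is a small sketch. The amino acid change strings are invented for the example. The base R calls run anywhere; the stringr calls only run if the package is installed.

```r
# Made-up strings; the second one has "X" where the others have a dot.
changes <- c("p.V600E", "pXV600E", "p.R175H")

# As a regular expression, "." matches any single character,
# so "pXV600E" matches "p.V" as well.
regex_hits <- grepl("p.V", changes)

# With fixed = TRUE the pattern is taken literally: "." is just a dot.
fixed_hits <- grepl("p.V", changes, fixed = TRUE)

print(regex_hits)
print(fixed_hits)

# The stringr equivalents. Note the argument order: string first, then pattern.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(changes, "p.V"))
  print(stringr::str_detect(changes, stringr::fixed("p.V")))
}
```

Fixed matching is handy for annotations like these, where characters such as "." or "*" are part of the data and not meant as regular expression operators.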
This is just a brief summary of some of the options available in R. Any other tips on string handling?