# Some tools for writing a report (with LaTeX)

In a previous post I talked about topics not directly related to statistics. This time I will write about preparing reports by combining R and LaTeX: specifically, how to export tables to LaTeX and how to insert LaTeX typesetting in R figures.

Obviously our technical work is very important, but the final presentation should also be taken care of. Sometimes the problem is moving our results from the statistical software to the text editor, so in this post I will talk about my favourite tools for that task.

When looking for a text editor to write a report, there are many options to choose from (MS Word, LaTeX, knitr/Sweave, Markdown, etc.). I pick LaTeX over MS Word because it makes it easy to include mathematical notation. The connection between knitr/Sweave or Markdown and R is straightforward (both embed R code); I will talk about them in future posts.

R tables to LaTeX

As I said, I first work in R and then move the results to LaTeX. One of the simplest and most effective tools is the function xtable(), found in the package of the same name, xtable.

• xtable(): this function generates, from an R table or data frame, the LaTeX code that we can paste into our document.
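A minimal sketch of how this works (the data frame and its values here are made up for illustration; assumes the xtable package is installed):

```r
library(xtable)

# A small example data frame (made-up values)
results <- data.frame(
  estimate = c(1.20, -0.35),
  p.value  = c(0.01, 0.48),
  row.names = c("age", "sex")
)

# Build an xtable object and print the LaTeX code to the console;
# copy it into your .tex file, or write it directly with the file= argument
print(xtable(results, caption = "Model results", label = "tab:results"))
```

The printed output is a complete table environment, ready to be pasted into (or \input into) the LaTeX document.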

Other useful R packages with extra functionality to convert R objects into LaTeX code are:

• Hmisc: the function latex() creates a .tex file from an R object.
• texreg: highly configurable, and designed to convert model summary outputs.
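For instance, texreg can turn several fitted models into one comparison table; a short sketch (assumes the texreg package is installed, and uses the built-in mtcars data):

```r
library(texreg)

# Fit two simple models on the built-in mtcars data set
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# texreg() prints a LaTeX table with both models side by side
texreg(list(m1, m2), caption = "Two regression models")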

Figures

Many people will know LaTeX's native graphics language, TikZ (see the examples page), but what is less well known is that it allows a wonderful connection between R and LaTeX: formulas, numbers, labels, etc. in an R figure are displayed in the format of your LaTeX document. To do this we simply have to follow these steps:

1. In R: create the TikZ image with the tikz() function from the tikzDevice package ( install.packages("tikzDevice", repos="http://R-Forge.R-project.org") ). This function creates a .tex file that can be compiled into a PDF on its own, but all you really need to do is include that .tex file in your LaTeX document.
2. In LaTeX: insert the figures into the document with \input{normal.tex} and \input{symbol-regression.tex}; the R figure then directly uses the formatting of your LaTeX document. You also need \usepackage{tikz} in the preamble.
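The R side of these steps might look like this (the file name and the plotted curve are just an illustration; assumes tikzDevice is installed):

```r
library(tikzDevice)

# Step 1 (R): open a tikz device instead of pdf()/png();
# LaTeX math in labels is passed through to the .tex file untouched
tikz("normal.tex", width = 4, height = 3)
x <- seq(-3, 3, length.out = 100)
plot(x, dnorm(x), type = "l",
     xlab = "$x$", ylab = "$\\phi(x)$",
     main = "$X \\sim N(0,1)$")
dev.off()
```

For step 2, add \usepackage{tikz} to the LaTeX preamble and \input{normal.tex} where the figure should appear; the axis labels and title are then typeset with the document's own fonts.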

Will you tell us your “tricks” when making reports?

# Dealing with strings in R

As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This implies dealing with several variables that are text data, or sequences of characters that might also contain spaces or numbers, e.g. gene names, functional categories or amino acid change annotations. In programming languages this type of data is called a string.

Finding matches is one of the most common tasks involving strings. In doing so, it is sometimes necessary to reformat or recode these variables, as well as to search for patterns.

Some R functions I have found quite useful when handling this kind of data are:

• colsplit() in the reshape package: splits up a column based on a regular expression
• grepl(): for subsetting based on string values that match a given pattern; here again we use regular expressions to describe the pattern
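A short sketch of both functions together (the annotation strings and column names are hypothetical; assumes the reshape package is installed):

```r
library(reshape)

# Made-up annotation strings of the form "gene_group"
ann <- data.frame(label = c("BRCA1_tumour", "TP53_tumour", "EGFR_control"),
                  stringsAsFactors = FALSE)

# colsplit(): split one column into several, using "_" as the split pattern
ann <- cbind(ann, colsplit(ann$label, split = "_", names = c("gene", "group")))

# grepl(): logical vector for subsetting by a regular expression
ann[grepl("^BRCA", ann$label), ]
```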

As you can see from the arguments of these functions, when manipulating strings it pays to get comfortable with regular expressions. More information on regular expressions for building patterns can be found in this tutorial and in the regex R documentation.

Some other useful functions are included in the stringr package. As the title of the package says: “Make it easier to work with strings”:

• str_detect() detects the presence or absence of a pattern in a string; it is based on the grepl() function listed above
• fixed(): this function looks for matches based on fixed characters, instead of regular expressions
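For example (the amino acid change strings are made up; assumes the stringr package is installed):

```r
library(stringr)

# Made-up amino acid change annotations
aa_changes <- c("p.Gly12Asp", "p.Val600Glu", "synonymous")

# str_detect(): TRUE/FALSE for each element matching a regular expression
str_detect(aa_changes, "^p\\.")

# fixed(): match the literal characters "p.", so "." is not a regex wildcard
str_detect(aa_changes, fixed("p."))
```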

stringr is, once again, a Hadley Wickham package, like reshape and plyr; all three contain a set of helpful features for handling data frames and lists.

This is just a brief summary of some options available in R. Any other tips on string handling?

# Interview with… Moisés Gómez Mateu

Moisés Gómez Mateu is a PhD student at the Universitat Politècnica de Catalunya (UPC), where he works as a research assistant.

Contact Moisés

1. Why do you like Biostatistics?

I have been very curious since I was a child; that’s why I like statistics. Moreover, statistics related to biology and medicine allows helping people to improve their quality of life.

2. Could you give us some insight in your current field of research?

My thesis focuses on survival analysis, especially on the issue of composite endpoints in clinical trials. The main aim is to analyze which primary endpoint is best to use, to extend the statistical theory, and to make practical tools available to researchers by means of an R library, an on-line platform, etc.

3. Did you find it difficult to move from the private sector to the University?

No. In fact, I left my job as a consultant in a marketing research company to study the MSc in Statistics at the UPC, and I think it was a very good decision.

4. Which are, in your opinion, the main advantages of being a researcher?

It is very satisfying. The results you get and the research you conduct have nothing to do with the private sector. One usually investigates issues related to things you like, and helps to improve science in general, not only to earn money.

5. What do you think of the situation of young biostatisticians in Spain?

The reality is that several colleagues and friends are working-studying abroad or looking for opportunities …

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Curiosity, Analytical skills and Creativity.

7. Which do you think are the main qualities of a good mentor?

Expertise, Modesty and Open-mindedness.

Selected publications:

• Gómez G, Gómez-Mateu M, Dafni U. Informed Choice of Composite Endpoints in Cardiovascular Trials. Submitted.
• Gómez G, Gómez-Mateu M. The Asymptotic Relative Efficiency and the ratio of sample sizes when testing two different null hypotheses. Submitted.

# Hierarchical models: special structures within repeated measures models.

Living creatures tend to organize their lives within structured communities such as families, schools, hospitals, towns or countries. For instance, students of the same age living in a town could be grouped into different classes according to their grade level, family income, school district and other features of interest. Other examples related with health care workers and patients show clear hierarchical data structures as well.

Hierarchical or nested structures, usually analyzed with hierarchical linear models (HLM), are very common throughout many research areas. The starting point of this data pattern was in the social sciences. Most research studies in that area focused on educational data, where the main interest was to examine the relationship between inputs such as students, teachers or school resources, and student outcomes (academic achievement, self-concept, career aspirations, …). Under this scenario, researchers emphasized that individuals drawn from the same institution (classroom, school, town, …) will be more homogeneous than subjects randomly sampled from a larger population: students belonging to the same classroom share the same environment (places, district, teachers, …) and experiences. Because of this, observations on these individuals are not fully independent.

As noted, hierarchical models account for the dependency among observations within the same study unit. Until recent decades, owing to the lack of software development, ordinary least squares regression (OLSR), i.e. classical regression, was used to estimate the aforementioned relationships. As a consequence, results obtained from OLSR show standard errors that are too small, leading to a higher probability of rejecting the null hypothesis than if (1) an appropriate statistical analysis had been performed, or (2) the data had included truly independent observations. It is clear that the main issue researchers must address is the non-independence of the observations.

Hierarchical modeling is similar to OLSR; it can be seen as an extension of classical regression in which at least two levels are defined in the predictive model. At the base level (also called the individual level, referred to as level 1), the analysis is similar to OLSR: an outcome variable is defined as a linear combination of one or more independent level 1 variables:

$Y_{ij} = \beta_{0j}+\beta_{1j}X_{1} +\ldots+\beta_{kj}X_{k}+\epsilon_{ij}$

where $Y_{ij}$ is the value of the outcome variable for the $i$th individual of group $j$, $\beta_{0j}$ is the intercept of group $j$, and $\beta_{1j}$ is the slope of variable $X_{1}$ in group $j$. At subsequent levels, the level 1 slopes and intercept become dependent variables, predicted from level 2 variables:

$\beta_{0j} = \delta_{00}+\delta_{01}W_{1} +\ldots+\delta_{0k}W_{k}+u_{0j}$
$\beta_{1j} = \delta_{10}+\delta_{11}W_{1} +\ldots+\delta_{1k}W_{k}+u_{1j}$

Through this process it is possible to model the effects of both level 1 and level 2 variables on the desired outcome. In the figure of this post, one can observe three main levels: patients (level 1) belong to hospitals (level 2), which in turn are located in certain neighborhoods (level 3).

This kind of modeling is essential to account for individual- and group-level variation when estimating group-level regression coefficients. However, in certain cases the classical and HLM approaches coincide: (1) when there is very little group-level variation, and (2) when the number of groups is small and, consequently, there is not enough information to accurately estimate group-level variation. In these settings HLM gains little over classical OLSR.
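A two-level model like the one above can be fitted in R, for example with the lme4 package (the variable and data frame names here are hypothetical):

```r
library(lme4)

# Random intercept per hospital: each hospital gets its own beta_0j;
# 'patients' is a hypothetical data frame with one row per patient
fit <- lmer(outcome ~ x1 + (1 | hospital), data = patients)

# Adding a random slope for x1 as well, so beta_1j also varies by hospital
fit2 <- lmer(outcome ~ x1 + (1 + x1 | hospital), data = patients)
summary(fit2)
```

The random-effects part of the summary reports the estimated group-level variation, which is exactly what decides whether HLM is worth the effort over OLSR.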

Now it is your turn: do you know of situations where it is worth the effort to apply HLM methods instead of classical regression?