Categoricals

Plotting bar charts in R, geom_bar vs geom_col

Plotting the Nightingale data made me realize that there is more to plotting a bar chart than first meets the eye. While a histogram visualizes the distribution of a numerical variable, a bar plot visualizes the relationship between a categorical variable and a numerical variable. ggplot2 has two functions for plotting bar charts, geom_bar() and geom_col(). In short, geom_bar() counts the observations in each category for you, while geom_col() takes an already summarized numerical value as input.
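
To make the difference concrete, here is a minimal sketch that draws the same bar chart both ways. The data frame `raw` and its `group` column are made up for illustration and are not the Nightingale data.

```r
library(ggplot2)

# Hypothetical raw data: one row per observation
raw <- data.frame(group = c("A", "A", "B", "B", "B", "C"))

# geom_bar() counts the rows in each category itself
p_bar <- ggplot(raw, aes(x = group)) +
  geom_bar()

# The same data summarized beforehand: one row per category, count in `n`
summarized <- as.data.frame(table(group = raw$group), responseName = "n")

# geom_col() uses the supplied value as the bar height
p_col <- ggplot(summarized, aes(x = group, y = n)) +
  geom_col()
```

Printing `p_bar` and `p_col` should give bars of the same height, since the only difference is who does the counting.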

For loop for Multiple Trend in Proportions

When you have only one parameter to test, following my previous tutorial on the test for trend in proportions will be sufficient. However, if you have many independent variables to test across several dependent variables, doing them one by one quickly becomes tedious. Therefore, I wrote a for-loop that creates all the summary tables, performs the test for trend in proportions on each table, adds the test result to the count matrix and saves the output neatly in CSV/Excel format. Here I explain each step in the process.
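
The general shape of such a loop can be sketched as follows. This is a simplified illustration rather than the full loop from the post: the list `tables`, the variable names `gene_a` and `gene_b`, the counts and the output file name are all hypothetical, and only the p-value is collected.

```r
# Hypothetical input: for each variable, the event counts and group totals
# across three ordered groups
tables <- list(
  gene_a = list(events = c(5, 12, 20), totals = c(50, 50, 50)),
  gene_b = list(events = c(10, 11, 9), totals = c(50, 50, 50))
)

# One row per tested variable; p-values are filled in by the loop
results <- data.frame(variable = names(tables), p_value = NA_real_)

for (i in seq_along(tables)) {
  tab <- tables[[i]]
  # Chi-squared test for trend in proportions (base R, stats package)
  test <- prop.trend.test(tab$events, tab$totals)
  results$p_value[i] <- test$p.value
}

# Save the collected results (file name is a placeholder)
# write.csv(results, "trend_results.csv", row.names = FALSE)
```

Pre-allocating `results` and filling it by index keeps the loop simple and avoids growing a data frame row by row.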

How to ‘Pivot Wider’ when you have only character values

Reshaping data by pivoting from long to wide, or from wide to long, is a frequent step in data wrangling. The pivot_longer() and pivot_wider() functions from the tidyr package do this job excellently in most cases. I have, however, had some issues when reshaping data containing only character values. This is my solution to this issue.
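
As a rough sketch of one way character-only data can cause trouble, assume the problem is that several character values share the same row and column combination, so pivot_wider() cannot place a single value in each cell. The data frame `long` and its columns are hypothetical, and collapsing the values with `values_fn` is just one possible workaround, not necessarily the one used in the post.

```r
library(tidyr)

# Hypothetical long data with character values only;
# sample "s1" has two entries for TP53
long <- data.frame(
  sample = c("s1", "s1", "s1", "s2"),
  gene   = c("TP53", "TP53", "KRAS", "TP53"),
  status = c("Mutated", "Amplified", "Wild-type", "Wild-type")
)

# Collapse duplicate character values into one string per cell
wide <- pivot_wider(
  long,
  names_from  = gene,
  values_from = status,
  values_fn   = function(x) paste(x, collapse = "; ")
)
```

Without `values_fn`, tidyr would warn about non-unique values and return list-columns, which are awkward to export or print.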

Test for trend in proportions

The test for trend in proportions is also known as the Cochran-Armitage test. It performs a Chi-squared test for trend in proportions and is used to test whether the proportions change with a trend across ordered groups, taking the size of each group into account. It takes count data from contingency tables where one variable is nominal with two levels (e.g. “Mutated”, “Wild-type”) and the other is an ordinal variable with at least three naturally ranked levels.
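
In base R this test is available as prop.trend.test(), which takes the number of events and the total number of observations per ordered group. A minimal sketch with made-up counts (the vectors `mutated` and `total` are hypothetical):

```r
# Hypothetical counts across three ordered groups (e.g. low, medium, high)
mutated <- c(5, 12, 20)   # number of "Mutated" cases per group
total   <- c(50, 50, 50)  # group sizes ("Mutated" + "Wild-type")

# Chi-squared test for trend in proportions; by default the groups
# are scored 1, 2, 3 in the order given
res <- prop.trend.test(mutated, total)

res$statistic  # Chi-squared statistic
res$p.value    # p-value for the trend
```

The `score` argument can be set when the groups are not equally spaced, for example when they represent doses.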

Chi-square in R

The Chi-square test is used to test for an association between two or more categorical variables. All variables must be ordinal or nominal and summarized as a frequency table. It is a non-parametric test, meaning that it is also suitable for data that are not normally distributed. Some of the assumptions for performing a Chi-square test are that each observation is independent of all the others (one observation per subject), and that the categories are mutually exclusive so that a subject fits into only one of them.
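
A minimal sketch of the test in R, using chisq.test() on a frequency table. The table `tab`, its labels and its counts are invented for illustration.

```r
# Hypothetical 2x2 frequency table: rows are treatment, columns are outcome
tab <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    treatment = c("Drug", "Placebo"),
    outcome   = c("Improved", "Not improved")
  )
)

res <- chisq.test(tab)

res$expected  # expected counts under independence
res$p.value   # p-value of the test
```

Looking at `res$expected` is a quick way to check whether the expected counts are large enough for the Chi-square approximation to be reasonable.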

Fisher-Exact in R

Disregarding the problematic side of Fisher, the statistical methods he developed are still very useful. Read any clinical paper, and I guarantee you that a Fisher exact test has been performed. Fisher-Exact is a statistical test used for 2x2 contingency tables of categorical data. It is particularly useful for small sample sizes where other tests, like the Chi-square test, would be unsuitable. Fisher-Exact from a 2x2 table: First you need to enter your data, and I will use some real life examples.
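
Entering a 2x2 table and running the test can be sketched like this. The counts below are placeholders invented for illustration, not the real life examples from the post, and the labels are hypothetical.

```r
# Hypothetical 2x2 table with small counts
tab <- matrix(
  c(8, 2,
    1, 9),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    genotype = c("Mutated", "Wild-type"),
    response = c("Responder", "Non-responder")
  )
)

res <- fisher.test(tab)

res$p.value   # exact two-sided p-value
res$estimate  # conditional maximum likelihood estimate of the odds ratio
```

With counts this small, several expected cell counts fall below 5, which is the kind of situation where the exact test is usually preferred over the Chi-square test.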