Biologist's bioinformatics notes

One of the data formats we encounter is matching pair columns either alternating or as batches. For eg.

============================================================

x1 y1 x2 y2

1 2 3 4

============================================================

=============================================================

x1 x2 y1 y2

1 2 3 4

============================================================.

Expected format is

==============================================================

1 2

X 1 2

Y 3 4

==============================================================

We can imagine any such combinations with number and texts. In this note, we would learn how to change such data from wider format to longer format. Let us look at example data

You can download this data from here. Data has following characteristics:

There are 3 groups with columns.
Each column is preceded by entity names (for eg. gene names, dog names or cat names)
All the "Name" columns have same headers. While importing into R, R converts these names to unique by appending numbers to the duplicate headers.

Final output would be some thing like this:

Now this format looks like youtube/ebaumsworld before and after health photos. Please note that that missing data "NA" is converted to 0 (zero) which is not correct. Please change the code as per your convenience. Here is the code:

=================================================================

df %>%
    rename_at(vars(starts_with("Name")), ~paste("Name", seq(1:(ncol(df)/2)),sep=".")) %>%
    pivot_longer(everything(),names_to = c(".value", "set"),names_pattern = "(\\D+)(\\d)") %>%
    drop_na()%>%
    pivot_wider(names_from="set", values_from="Group.", names_prefix = "Group.", values_fill = 0) %>%
    arrange(Name.)

====================================================================

Explanation for the code:

==================================================================

df %>% -- loads df data
rename_at(vars(starts_with("Name")), ~paste("Name", seq(1:(ncol(df)/2)),sep=".")) %>% -- Change the name of columns starting with word "Name" to Name1, Name2, Name3
pivot_longer(everything(),names_to = c(".value", "set"),names_pattern = "(\\D+)(\\d)") %>% -- Convert the data to longer format. But break down the column names and store the later part of the split in a column named "set". Split pattern (\\D+)- any thing that is not digit and more of non-digit things (+). (\\d) - any thing that is digit.
drop_na()%>% -- Remove empty rows
pivot_wider(names_from="set", values_from="Group.", names_prefix = "Group.", values_fill = 0) %>% -- Change to wider format and replace missing values (NA) with zero.
arrange(Name.) -- sort the final data frame (tibble) on column "Name." (. is part of the column name. You should edit it.It may cause problems in future).

==================================================================

Note: This trick works only if alternating columns share same kind of naming pattern. For eg. Name.digits, Name_digits, Namedigits (x.1, y.1, x1,y1, x_1,y_1 etc).

Recent Posts

Links

Aug 26, 2020 - Clean data with altenating match pair columns