Biologist's bioinformatics notes

One of the posts on biostars on R is interesting. I am re-posting here as I have my own solution to this problem and also it is enough modified to be a different solution. In addition, OP included suggestions made by another user.

Example data:

=================================

df <-
    structure(
        list(
            Family = c("AKCO", "AKDC", "ALZC", "ARCT", "COCZ"),
            Env1_Females = c(2L, 3L, 0L, 1L, 1L),
            Env1_Males = c(4L,
                           3L, 0L, 5L, 2L),
            Env2_Females = c(8L, 3L, 5L, 4L, 1L),
            Env2_Males = c(2L,
                           7L, 4L, 6L, 2L),
            P1 = c("AK4", "AK6", "AL2", "AR3", "CO5"),
            P2 = c("CO1", "DC2", "CZ4", "CT4", "CZ2")
        ),
        class = "data.frame",
        row.names = c(NA,-5L)
    )

======================================

Requirement is to:

Expand (to wide format) the data frame
New data frame should have columns for Env and Sex with appropriate values.
Each member of the family should be represented unique and should have unique ID

First let us print the data frame

===============================================================

> df
Family Env1_Females Env1_Males Env2_Females Env2_Males P1 P2
1   AKCO            2          4            8          2 AK4 CO1
2   AKDC            3          3            3          7 AK6 DC2
3   ALZC            0          0            5          4 AL2 CZ4
4   ARCT            1          5            4          6 AR3 CT4
5   COCZ            1          2            1          2 CO5 CZ2

================================================================

Here is the solution and output:

================================================================

df %>%
    rename_with(~str_remove_all(.,"[em]*ales|Env")) %>%
    pivot_longer(names_sep="_",values_to="count", names_to=c("Env","sex"),-c(Family,P1,P2)) %>%
    uncount(count) %>%
    group_by(Family) %>%
    mutate(id = str_c(Family,"_",row_number())) %>%
    ungroup %>%
    select(-Family) %>%
    relocate(id)
# A tibble: 63 x 5
   id      P1    P2    Env   sex
   <chr>   <chr> <chr> <chr> <chr>
1 AKCO_1 AK4   CO1   1     F
2 AKCO_2 AK4   CO1   1     F
3 AKCO_3 AK4   CO1   1     M
4 AKCO_4 AK4   CO1   1     M
5 AKCO_5 AK4   CO1   1     M
6 AKCO_6 AK4   CO1   1     M
7 AKCO_7 AK4   CO1   2     F
8 AKCO_8 AK4   CO1   2     F
9 AKCO_9 AK4   CO1   2     F
10 AKCO_10 AK4   CO1   2     F
# … with 53 more rows

=============================================================

Now let's look at the code:

=============================================================

df %>% -- Loads the data frame
rename_with(~str_remove_all(.,"[em]*ales|Env")) %>% -- renames the columns. While renaming columns, we are deleting words "emales", "ales" and "ENV". [em]*ales is a regex for "emales" and "ales". * for zero or more than occurrences of em
pivot_longer(names_sep="_",values_to="count", names_to=c("Env","sex"),-c(Family,P1,P2)) %>% - Convert the data from wide to long format. Take the values from "count" column. Break the column names based on "_" and store the values in columns "Env" and "sex". Do not touch columns "Family", "P1", "P2"
uncount(count) %>% - Populate (duplicate) the rows as many times as in column "count"
group_by(Family) %>% - Internally, group the data by family column
mutate(id = str_c(Family,"_",row_number())) %>% - Create a new column by name ID. Column values are created by appending the family column values and number of rows with in the each family group.
ungroup %>% -- ungroup the data
select(-Family) %>% -- remove family column
relocate(id) -- move 'id' column to the first place in the data frame

Recent Posts

Links

Aug 1, 2020 - Uncount, multiple replacement and grouping using dplyr