One of the posts on biostars on R is interesting. I am re-posting here as I have my own solution to this problem and also it is enough modified to be a different solution. In addition, OP included suggestions made by another user.
Example data:
=================================
df <-
structure(
list(
Family = c("AKCO", "AKDC", "ALZC", "ARCT", "COCZ"),
Env1_Females = c(2L, 3L, 0L, 1L, 1L),
Env1_Males = c(4L,
3L, 0L, 5L, 2L),
Env2_Females = c(8L, 3L, 5L, 4L, 1L),
Env2_Males = c(2L,
7L, 4L, 6L, 2L),
P1 = c("AK4", "AK6", "AL2", "AR3", "CO5"),
P2 = c("CO1", "DC2", "CZ4", "CT4", "CZ2")
),
class = "data.frame",
row.names = c(NA,-5L)
)
structure(
list(
Family = c("AKCO", "AKDC", "ALZC", "ARCT", "COCZ"),
Env1_Females = c(2L, 3L, 0L, 1L, 1L),
Env1_Males = c(4L,
3L, 0L, 5L, 2L),
Env2_Females = c(8L, 3L, 5L, 4L, 1L),
Env2_Males = c(2L,
7L, 4L, 6L, 2L),
P1 = c("AK4", "AK6", "AL2", "AR3", "CO5"),
P2 = c("CO1", "DC2", "CZ4", "CT4", "CZ2")
),
class = "data.frame",
row.names = c(NA,-5L)
)
======================================
Requirement is to:
- Expand (to wide format) the data frame
- New data frame should have columns for Env and Sex with appropriate values.
- Each member of the family should be represented unique and should have unique ID
First let us print the data frame
===============================================================
> df
Family Env1_Females Env1_Males Env2_Females Env2_Males P1 P2
1 AKCO 2 4 8 2 AK4 CO1
2 AKDC 3 3 3 7 AK6 DC2
3 ALZC 0 0 5 4 AL2 CZ4
4 ARCT 1 5 4 6 AR3 CT4
5 COCZ 1 2 1 2 CO5 CZ2
Family Env1_Females Env1_Males Env2_Females Env2_Males P1 P2
1 AKCO 2 4 8 2 AK4 CO1
2 AKDC 3 3 3 7 AK6 DC2
3 ALZC 0 0 5 4 AL2 CZ4
4 ARCT 1 5 4 6 AR3 CT4
5 COCZ 1 2 1 2 CO5 CZ2
================================================================
Here is the solution and output:
================================================================
df %>%
rename_with(~str_remove_all(.,"[em]*ales|Env")) %>%
pivot_longer(names_sep="_",values_to="count", names_to=c("Env","sex"),-c(Family,P1,P2)) %>%
uncount(count) %>%
group_by(Family) %>%
mutate(id = str_c(Family,"_",row_number())) %>%
ungroup %>%
select(-Family) %>%
relocate(id)
# A tibble: 63 x 5
id P1 P2 Env sex
<chr> <chr> <chr> <chr> <chr>
1 AKCO_1 AK4 CO1 1 F
2 AKCO_2 AK4 CO1 1 F
3 AKCO_3 AK4 CO1 1 M
4 AKCO_4 AK4 CO1 1 M
5 AKCO_5 AK4 CO1 1 M
6 AKCO_6 AK4 CO1 1 M
7 AKCO_7 AK4 CO1 2 F
8 AKCO_8 AK4 CO1 2 F
9 AKCO_9 AK4 CO1 2 F
10 AKCO_10 AK4 CO1 2 F
# … with 53 more rows
rename_with(~str_remove_all(.,"[em]*ales|Env")) %>%
pivot_longer(names_sep="_",values_to="count", names_to=c("Env","sex"),-c(Family,P1,P2)) %>%
uncount(count) %>%
group_by(Family) %>%
mutate(id = str_c(Family,"_",row_number())) %>%
ungroup %>%
select(-Family) %>%
relocate(id)
# A tibble: 63 x 5
id P1 P2 Env sex
<chr> <chr> <chr> <chr> <chr>
1 AKCO_1 AK4 CO1 1 F
2 AKCO_2 AK4 CO1 1 F
3 AKCO_3 AK4 CO1 1 M
4 AKCO_4 AK4 CO1 1 M
5 AKCO_5 AK4 CO1 1 M
6 AKCO_6 AK4 CO1 1 M
7 AKCO_7 AK4 CO1 2 F
8 AKCO_8 AK4 CO1 2 F
9 AKCO_9 AK4 CO1 2 F
10 AKCO_10 AK4 CO1 2 F
# … with 53 more rows
=============================================================
Now let's look at the code:
=============================================================
- df %>% -- Loads the data frame
- rename_with(~str_remove_all(.,"[em]*ales|Env")) %>% -- renames the columns. While renaming columns, we are deleting words "emales", "ales" and "ENV". [em]*ales is a regex for "emales" and "ales". * for zero or more than occurrences of em
- pivot_longer(names_sep="_",values_to="count", names_to=c("Env","sex"),-c(Family,P1,P2)) %>% - Convert the data from wide to long format. Take the values from "count" column. Break the column names based on "_" and store the values in columns "Env" and "sex". Do not touch columns "Family", "P1", "P2"
- uncount(count) %>% - Populate (duplicate) the rows as many times as in column "count"
- group_by(Family) %>% - Internally, group the data by family column
- mutate(id = str_c(Family,"_",row_number())) %>% - Create a new column by name ID. Column values are created by appending the family column values and number of rows with in the each family group.
- ungroup %>% -- ungroup the data
- select(-Family) %>% -- remove family column
- relocate(id) -- move 'id' column to the first place in the data frame