Sometimes R makes life difficult. Here is a data frame in which the same individual is repeated on every row and most of the cells are NA.
> test
     indiv.ID X86912632 X86920881 X86922082 X86927699
1  Alxis_3702      CTGA      <NA>      <NA>      <NA>
2  Alxis_3702      TCTG      <NA>      <NA>      <NA>
3  Alxis_3702      <NA>         G      <NA>      <NA>
4  Alxis_3702      <NA>      <NA>         C      <NA>
5  Alxis_3702      <NA>      <NA>      <NA>      <NA>
6  Alxis_3702      <NA>      <NA>      <NA>      <NA>
7  Alxis_3702      <NA>      <NA>      <NA>      <NA>
8  Alxis_3702      <NA>      <NA>      <NA>      <NA>
9  Alxis_3702      <NA>      <NA>      <NA>      <NA>
10 Alxis_3702      <NA>      <NA>      <NA>      <NA>
Life would be easier if there were a function that dropped the NA cells and then, for each column, either appended the remaining values or applied some other function (sum, average, min, max, etc.), and finally converted the result to wide format with a suffix added to duplicate column names. Something like a "collapse data frame" function. But such a function doesn't exist.
OK, there is a roundabout way. Let us say we have imported the data as a data frame named test. Here is the code:
========================
library(tidyr); library(dplyr)  # gather()/spread() come from tidyr; mutate() and %>% from dplyr
test %>% gather(k, v, -indiv.ID) %>% na.omit() %>% mutate(k = make.unique(k)) %>% spread(k, v)
========================
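What the code does (one step per %>%):
- Start with the data frame test
- Convert the data to long format with gather(): one row per cell, with the column name in k and the value in v (this needs the tidyr library)
- Remove the rows where the value is NA
- Make any repeated names in k unique with make.unique(), which appends .1, .2, and so on, and write them back to the column with mutate() (this needs the dplyr library)
- Convert the data back to wide format with spread()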
Result:
    indiv.ID X86912632 X86912632.1 X86920881 X86922082
1 Alxis_3702      CTGA        TCTG         G         C
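To see what each step of the pipeline contributes, here is a minimal sketch that runs it one stage at a time. The data frame is rebuilt by hand from the printout above (only five of the ten rows, since the rest are all NA), and the intermediate names long and wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild a shortened copy of the example data (assumption: all columns are character)
test <- data.frame(
  indiv.ID  = rep("Alxis_3702", 5),
  X86912632 = c("CTGA", "TCTG", NA, NA, NA),
  X86920881 = c(NA, NA, "G", NA, NA),
  X86922082 = c(NA, NA, NA, "C", NA),
  X86927699 = NA_character_,
  stringsAsFactors = FALSE
)

# Step 1: long format, one row per cell (5 rows x 4 value columns = 20 rows)
long <- test %>% gather(k, v, -indiv.ID)

# Step 2: drop the cells that are NA (4 rows remain)
long <- na.omit(long)

# Step 3: the second X86912632 entry is renamed to X86912632.1
long <- long %>% mutate(k = make.unique(k))

# Step 4: back to wide format, one column per unique name in k
wide <- long %>% spread(k, v)
print(wide)
```

Note that make.unique() here numbers duplicates across the whole k column. If the real data holds several individuals, adding group_by(indiv.ID) before the mutate() step is one way to restart the suffixes for each individual; whether that is the desired layout is an assumption.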
Tip: How do you convert the text "<NA>" in a text file to NA values in the data frame? Set the na.strings argument when importing, as below:
=========
test <- read.csv("test2.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE, na.strings = "<NA>")
=========
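A quick way to check the effect of the tip is to read the same file twice, with and without na.strings. The sketch below writes a small tab-separated file to a temporary location (the file contents and the names path, raw and fixed are invented for illustration) and counts the NA cells in each result.

```r
# Write a tiny tab-separated file that contains the literal text <NA>
path <- tempfile(fileext = ".txt")
writeLines(c("indiv.ID\tX86912632\tX86920881",
             "Alxis_3702\tCTGA\t<NA>",
             "Alxis_3702\t<NA>\tG"), path)

# Default import: "<NA>" is kept as an ordinary string
raw <- read.csv(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# With na.strings: "<NA>" is converted to a real NA
fixed <- read.csv(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE,
                  na.strings = "<NA>")

print(sum(is.na(raw)))    # number of NA cells without the argument
print(sum(is.na(fixed)))  # number of NA cells with the argument
```

This matters for the pipeline above because na.omit() only removes real NA values; cells holding the string "<NA>" would be kept.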