Looping Through .dat Files In R And Extracting Only Specific Data As Columns
I have 900+ folders on my local drive, and each folder contains a single file with a .dat extension. I want to loop through each folder, access the file in it, and fetch only specific data …
Solution 1:
Another approach. In this case it only reads the file you provided, but it can read multiple files.
I added some intermediate results to show what the code is actually doing.
library(tidyverse)
library(data.table)
library(zoo)
# create a data.frame with the desired files
filenames <- list.files( path = getwd(), pattern = "\\.dat$", recursive = TRUE, full.names = TRUE )
# > filenames
# [1] "C:/Users/********/Documents/Git/udls2/test.dat"

# read in the files using data.table's fread; here I grep lines starting with
# UNIQUE-ID or TYPES: create your desired regex pattern
pattern <- "^UNIQUE-ID|^TYPES"
content.list <- lapply( filenames, function(x) fread( x, sep = "\n", header = FALSE )[grepl( pattern, V1 )] )
# > content.list
# [[1]]
#                        V1
# 1:  UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3:    UNIQUE-ID - URIDINE
# 4:     TYPES - Pyrimidine

# add all content to a single data.table
dt <- rbindlist( content.list )
# > dt
#                        V1
# 1:  UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3:    UNIQUE-ID - URIDINE
# 4:     TYPES - Pyrimidine

# split the text into a variable name and its content
dt <- dt %>% separate( V1, into = c("var", "content"), sep = " - ")
# > dt
#          var        content
# 1: UNIQUE-ID      CPD0-1108
# 2:     TYPES D-Ribofuranose
# 3: UNIQUE-ID        URIDINE
# 4:     TYPES     Pyrimidine

# add an increasing id for every UNIQUE-ID
dt[var == "UNIQUE-ID", id := seq.int( 1: nrow( dt[var=="UNIQUE-ID", ]))]
# > dt
#          var        content id
# 1: UNIQUE-ID      CPD0-1108  1
# 2:     TYPES D-Ribofuranose NA
# 3: UNIQUE-ID        URIDINE  2
# 4:     TYPES     Pyrimidine NA

# fill down the id for all variables found
dt[, id := na.locf( id )]
# > dt
#          var        content id
# 1: UNIQUE-ID      CPD0-1108  1
# 2:     TYPES D-Ribofuranose  1
# 3: UNIQUE-ID        URIDINE  2
# 4:     TYPES     Pyrimidine  2

# cast to wide format: one row per id
dcast(dt, id ~ var, value.var = "content")
#    id          TYPES UNIQUE-ID
# 1:  1 D-Ribofuranose CPD0-1108
# 2:  2     Pyrimidine   URIDINE
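If you also need to know which of the 900+ files each row came from, one option (a sketch, not part of the code above; the column name "source" is my own) is to name the list by file and let rbindlist() carry that name along via its idcol argument:

# name each element after its source file, then keep that name as a column
names( content.list ) <- basename( filenames )
dt <- rbindlist( content.list, idcol = "source" )
# the rest of the steps work the same; include "source" in the dcast formula
# if you want it kept, e.g. dcast(dt, source + id ~ var, value.var = "content")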
Solution 2:
One File
Break it up into a few logical actions:
text2chunks <- function(txt) {
  chunks <- split(txt, cumsum(grepl("^\\[Data Chunk.*\\]$", txt)))
  Filter(function(a) grepl("^\\[Data Chunk.*\\]$", a[1]), chunks)
}

chunk2dataframe <- function(vec, hdrs = NULL, sep = " - ") {
  s <- stringi::stri_split(vec, fixed = sep, n = 2L)
  s <- Filter(function(a) length(a) == 2L, s)
  df <- as.data.frame(setNames(lapply(s, `[[`, 2), sapply(s, `[[`, 1)),
                      stringsAsFactors = FALSE)
  if (!is.null(hdrs)) df <- df[names(df) %in% make.names(hdrs)]
  df
}
hdrs is an optional vector of column names that you want to keep; if it is not provided (or is NULL), all key/value pairs are returned as columns.
hdrs <- c("UNIQUE-ID", "TYPES", "COMMON-NAME")
Using the data (below), I have lines, which is a character vector from a single file:
head(lines)
# [1] "Authors:"# [2] "# Pallavi Subhraveti"# [3] "# Quang Ong"# [4] "# Please see the license agreement regarding the use of and distribution of this file."# [5] "# The format of this file is defined at http://bioinformatics.ai.sri.com"# [6] "# Version: 21.5"
str(text2chunks(lines))
# List of 2
#  $ 1: chr [1:5] "[Data Chunk 1]" "UNIQUE-ID - CPD0-1108" "TYPES - D-Ribofuranose" "COMMON-NAME - β-D-ribofuranose" ...
#  $ 2: chr [1:6] "[Data Chunk 2]" "// something out of place here?" "UNIQUE-ID - URIDINE" "TYPES - Pyrimidine" ...
str(lapply(text2chunks(lines), chunk2dataframe, hdrs=hdrs))
# List of 2
#  $ 1: 'data.frame': 1 obs. of 3 variables:
#   ..$ UNIQUE.ID  : chr "CPD0-1108"
#   ..$ TYPES      : chr "D-Ribofuranose"
#   ..$ COMMON.NAME: chr "β-D-ribofuranose"
#  $ 2: 'data.frame': 1 obs. of 3 variables:
#   ..$ UNIQUE.ID  : chr "URIDINE"
#   ..$ TYPES      : chr "Pyrimidine"
#   ..$ COMMON.NAME: chr "β-D-ribofuranose or something"
And the final product:
dplyr::bind_rows(lapply(text2chunks(lines), chunk2dataframe, hdrs=hdrs))
# UNIQUE.ID TYPES COMMON.NAME
# 1 CPD0-1108 D-Ribofuranose β-D-ribofuranose
# 2 URIDINE Pyrimidine β-D-ribofuranose or something
Since you want to iterate this over many files, it makes sense to create a convenience function for it:
text2dataframe <- function(txt) {
dplyr::bind_rows(lapply(text2chunks(txt), chunk2dataframe, hdrs=hdrs))
}
Many Files
Untested, but should work:
files <- list.files(path="C:/Users/robbie/Desktop/Organism_Data/",
pattern="compounds.dat", recursive=TRUE, full.names=TRUE)
alldata <- lapply(files, readLines)
allframes <- lapply(alldata, text2dataframe)
oneframe <- dplyr::bind_rows(allframes)
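If you also want to record which folder each row came from (my own addition, not part of the answer above; the column name "source" is illustrative), you can name the list before binding and let bind_rows() turn those names into a column:

# name each element by the folder that contained its file, then keep that
# name as a "source" column when binding everything together
names(allframes) <- basename(dirname(files))
oneframe <- dplyr::bind_rows(allframes, .id = "source")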
Notes:
- I'm using stringi::stri_split instead of strsplit simply for its convenience argument n=; doing the same in base R is not hard with a couple of extra lines of code (see the sketch after these notes).
- I'm using dplyr::bind_rows because it deals very well with missing columns and differing column order; base rbind.data.frame can be used with some extra effort/care.
- data.frame-izing things tends to nudge column names a little (e.g. UNIQUE-ID becomes UNIQUE.ID), just be aware.
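For reference, a minimal base-R stand-in for the n = 2L split (a sketch of my own; the helper name split2 is made up) could look like this:

# split each line at the FIRST occurrence of sep only, mimicking
# stringi::stri_split(..., fixed = sep, n = 2L); lines without the separator
# are dropped here, which also covers what Filter() does in chunk2dataframe()
split2 <- function(vec, sep = " - ") {
  pos  <- regexpr(sep, vec, fixed = TRUE)   # position of first separator, -1 if absent
  keep <- pos > 0
  Map(function(x, p) c(substr(x, 1, p - 1),
                       substr(x, p + nchar(sep), nchar(x))),
      vec[keep], pos[keep])
}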
Data:
# lines <- readLines("some_filename.dat")
fulltext <- 'Authors:
# Pallavi Subhraveti
# Quang Ong
# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com
# Version: 21.5
# File Name: compounds.dat
# Date and time generated: October 24, 2017, 14:52:45
# Attributes:
# UNIQUE-ID
# TYPES
[Data Chunk 1]
UNIQUE-ID - CPD0-1108
TYPES - D-Ribofuranose
COMMON-NAME - β-D-ribofuranose
DO-NOT-CARE - 42
[Data Chunk 2]
// something out of place here?
UNIQUE-ID - URIDINE
TYPES - Pyrimidine
COMMON-NAME - β-D-ribofuranose or something
DO-NOT-CARE - 43
'
lines <- strsplit(fulltext, '[\r\n]+')[[1]]
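To sanity-check the whole pipeline without touching the real folders, you could write this sample text to a throwaway .dat file and run it end to end (a sketch using the functions defined above):

# write the sample data to a temporary folder and run the "Many Files" steps on it;
# the result should match the two-row data.frame shown under "And the final product"
tmpdir <- tempfile("organism_")
dir.create(tmpdir)
writeLines(lines, file.path(tmpdir, "compounds.dat"))

files     <- list.files(path = tmpdir, pattern = "compounds.dat",
                        recursive = TRUE, full.names = TRUE)
alldata   <- lapply(files, readLines)
allframes <- lapply(alldata, text2dataframe)
oneframe  <- dplyr::bind_rows(allframes)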