
Looping Through .dat Files In R And Extracting Only Specific Data As Columns

I have 900+ folders in my local drive and each folder has a single .dat extension file. I want to loop through each folder to access the file in it to fetch only specific data and

Solution 1:

Another approach. In this case it only reads the single file you provided, but it can read multiple files.

I've added some intermediate results to show what the code is actually doing.

library(tidyverse)
library(data.table)
library(zoo)

# create a character vector with the paths of the desired files
# (pattern is a regular expression, not a glob, so the dot is escaped)
filenames <- list.files( path = getwd(), pattern = "\\.dat$", recursive = TRUE, full.names = TRUE )

# > filenames
# [1] "C:/Users/********/Documents/Git/udls2/test.dat"

# read in the files, using data.table's fread. Here I grep lines starting with
# UNIQUE-ID or TYPES. Create your desired regex pattern
pattern <- "^UNIQUE-ID|^TYPES"
content.list <- lapply( filenames, function(x) fread( x, sep = "\n", header = FALSE )[grepl( pattern, V1 )] )

# > content.list
# [[1]]
#                        V1
# 1:  UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3:    UNIQUE-ID - URIDINE
# 4:     TYPES - Pyrimidine

# add all content to a single data.table
dt <- rbindlist( content.list )

# > dt
#                        V1
# 1:  UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3:    UNIQUE-ID - URIDINE
# 4:     TYPES - Pyrimidine

# split the text into a variable name and its content
dt <- dt %>% separate( V1, into = c("var", "content"), sep = " - ")

# > dt
#          var        content
# 1: UNIQUE-ID      CPD0-1108
# 2:     TYPES D-Ribofuranose
# 3: UNIQUE-ID        URIDINE
# 4:     TYPES     Pyrimidine

# add an increasing id for every UNIQUE-ID
dt[var == "UNIQUE-ID", id := seq_len(.N)]

# > dt
#          var        content id
# 1: UNIQUE-ID      CPD0-1108  1
# 2:     TYPES D-Ribofuranose NA
# 3: UNIQUE-ID        URIDINE  2
# 4:     TYPES     Pyrimidine NA

# fill down the id for all variables found
dt[, id := na.locf( id )]

# > dt
#          var        content id
# 1: UNIQUE-ID      CPD0-1108  1
# 2:     TYPES D-Ribofuranose  1
# 3: UNIQUE-ID        URIDINE  2
# 4:     TYPES     Pyrimidine  2

# cast
dcast(dt, id ~ var, value.var = "content")

#    id          TYPES UNIQUE-ID
# 1:  1 D-Ribofuranose CPD0-1108
# 2:  2     Pyrimidine   URIDINE
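
The two-step id assignment above (number the UNIQUE-ID rows, then fill down with na.locf) can probably be collapsed into a single cumsum() call, which also removes the need for zoo. A minimal sketch, assuming every record starts with a UNIQUE-ID line; the four sample rows are copied from the intermediate output above:

```r
library(data.table)

# Same four rows as in the intermediate output above
dt <- data.table(
  var     = c("UNIQUE-ID", "TYPES", "UNIQUE-ID", "TYPES"),
  content = c("CPD0-1108", "D-Ribofuranose", "URIDINE", "Pyrimidine")
)

# Each UNIQUE-ID row starts a new record, so a running count of those
# rows gives every following line the id of the record it belongs to
dt[, id := cumsum(var == "UNIQUE-ID")]

# One row per record, one column per variable
wide <- dcast(dt, id ~ var, value.var = "content")
print(wide)
```

Note that any lines appearing before the first UNIQUE-ID would get id 0 with this approach, so they may need to be filtered out first.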

Solution 2:

One File

Break it up into a few logical actions:

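# split the lines of one file into chunks, each starting at a "[Data Chunk N]" line;
# anything before the first chunk header (e.g. the file preamble) is dropped
text2chunks <- function(txt) {
  chunks <- split(txt, cumsum(grepl("^\\[Data Chunk.*\\]$", txt)))
  Filter(function(a) grepl("^\\[Data Chunk.*\\]$", a[1]), chunks)
}
# turn the "KEY - value" lines of one chunk into a one-row data.frame
chunk2dataframe <- function(vec, hdrs = NULL, sep = " - ") {
  s <- stringi::stri_split(vec, fixed = sep, n = 2L)
  s <- Filter(function(a) length(a) == 2L, s)
  df <- as.data.frame(setNames(lapply(s, `[[`, 2), sapply(s, `[[`, 1)),
                      stringsAsFactors = FALSE)
  if (!is.null(hdrs)) df <- df[names(df) %in% make.names(hdrs)]
  df
}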

hdrs is an optional vector of column names that you want to keep; if not provided (or NULL), then all key/value pairs are returned as columns.

hdrs <- c("UNIQUE-ID", "TYPES", "COMMON-NAME")
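
The make.names() call inside chunk2dataframe matters because as.data.frame() rewrites keys such as UNIQUE-ID into syntactically valid column names. A small base R illustration of that renaming (the one-row list here is made up for the demonstration):

```r
hdrs <- c("UNIQUE-ID", "TYPES", "COMMON-NAME")

# make.names() replaces characters that are not valid in R names with "."
clean <- make.names(hdrs)
print(clean)

# as.data.frame() applies the same conversion to list names by default
df <- as.data.frame(list("UNIQUE-ID" = "CPD0-1108", "TYPES" = "D-Ribofuranose"))
print(names(df))

# so the column filter has to compare against the converted names
print(names(df) %in% clean)
```

This is why hdrs can be written with the hyphenated names from the file while the resulting columns use dots.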

Using the data (below), lines is a character vector holding the contents of a single file:

head(lines)
# [1] "Authors:"
# [2] "#    Pallavi Subhraveti"
# [3] "#    Quang Ong"
# [4] "# Please see the license agreement regarding the use of and distribution of this file."
# [5] "# The format of this file is defined at http://bioinformatics.ai.sri.com"
# [6] "# Version: 21.5"
str(text2chunks(lines))
# List of 2
#  $ 1: chr [1:5] "[Data Chunk 1]" "UNIQUE-ID - CPD0-1108" "TYPES - D-Ribofuranose" "COMMON-NAME - &beta;-D-ribofuranose" ...
#  $ 2: chr [1:6] "[Data Chunk 2]" "// something out of place here?" "UNIQUE-ID - URIDINE" "TYPES - Pyrimidine" ...
str(lapply(text2chunks(lines), chunk2dataframe, hdrs=hdrs))
# List of 2
#  $ 1:'data.frame':    1 obs. of  3 variables:
#   ..$ UNIQUE.ID  : chr "CPD0-1108"
#   ..$ TYPES      : chr "D-Ribofuranose"
#   ..$ COMMON.NAME: chr "&beta;-D-ribofuranose"
#  $ 2:'data.frame':    1 obs. of  3 variables:
#   ..$ UNIQUE.ID  : chr "URIDINE"
#   ..$ TYPES      : chr "Pyrimidine"
#   ..$ COMMON.NAME: chr "&beta;-D-ribofuranose or something"

And the final product:

dplyr::bind_rows(lapply(text2chunks(lines), chunk2dataframe, hdrs=hdrs))
#   UNIQUE.ID          TYPES                        COMMON.NAME
# 1 CPD0-1108 D-Ribofuranose              &beta;-D-ribofuranose
# 2   URIDINE     Pyrimidine &beta;-D-ribofuranose or something

Since you want to iterate this over many files, it makes sense to create a convenience function for this:

text2dataframe <- function(txt) {
  dplyr::bind_rows(lapply(text2chunks(txt), chunk2dataframe, hdrs=hdrs))
}

Many Files

Untested, but should work:

files <- list.files(path="C:/Users/robbie/Desktop/Organism_Data/",
                    pattern="compounds.dat", recursive=TRUE, full.names=TRUE)
alldata <- lapply(files, readLines)
allframes <- lapply(alldata, text2dataframe)
oneframe <- dplyr::bind_rows(allframes)
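
If it matters which folder each row came from, the source path can be carried along while looping. The following is a self-contained sketch under stated assumptions: the two temporary folders and their contents are invented for the demonstration, and parse_dat() is a hypothetical, simplified stand-in for text2dataframe that only handles "KEY - value" lines and one record per file.

```r
# Build two throw-away folders, each holding one compounds.dat (invented data)
root <- file.path(tempdir(), "Organism_Data_demo")
unlink(root, recursive = TRUE)
for (org in c("org_a", "org_b")) {
  dir.create(file.path(root, org), recursive = TRUE)
}
writeLines(c("# header", "UNIQUE-ID - CPD0-1108", "TYPES - D-Ribofuranose"),
           file.path(root, "org_a", "compounds.dat"))
writeLines(c("# header", "UNIQUE-ID - URIDINE", "TYPES - Pyrimidine"),
           file.path(root, "org_b", "compounds.dat"))

# Hypothetical stand-in for text2dataframe(): keeps the wanted "KEY - value"
# lines and returns them as a one-row data.frame
parse_dat <- function(txt, keys = c("UNIQUE-ID", "TYPES")) {
  hit <- regexpr(" - ", txt, fixed = TRUE)   # position of the first separator
  txt <- txt[hit > 0]
  hit <- hit[hit > 0]
  key <- substr(txt, 1, hit - 1)
  val <- substr(txt, hit + 3, nchar(txt))
  keep <- key %in% keys
  as.data.frame(as.list(setNames(val[keep], key[keep])),
                stringsAsFactors = FALSE)
}

files <- list.files(path = root, pattern = "compounds\\.dat$",
                    recursive = TRUE, full.names = TRUE)

frames <- lapply(files, function(f) {
  df <- parse_dat(readLines(f))
  df$folder <- basename(dirname(f))  # remember the source folder
  df
})
oneframe <- do.call(rbind, frames)
print(oneframe)
```

With the real data, text2dataframe would take the place of parse_dat(), and dplyr::bind_rows remains the safer choice when files differ in which keys they contain.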

Notes:

  • I'm using stringi::stri_split instead of strsplit simply for its convenience argument n=; doing the same in base R is not hard with a couple extra lines of code.
  • I'm using dplyr::bind_rows because it deals very well with missing columns and differing order; base rbind.data.frame can be used with some extra effort/care.
  • Converting to a data.frame runs the column names through make.names(), so for example UNIQUE-ID becomes UNIQUE.ID; just be aware.
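
The base R replacement for stri_split(..., n = 2L) mentioned in the first note could look roughly like this. It is only a sketch: split_first() is a name made up here, and it uses regexpr() to locate the first separator so that values which themselves contain " - " stay intact.

```r
# Split each string at the FIRST occurrence of `sep` only
# (hypothetical helper, not part of the original answer)
split_first <- function(vec, sep = " - ") {
  pos <- regexpr(sep, vec, fixed = TRUE)
  lapply(seq_along(vec), function(i) {
    if (pos[i] < 0) return(vec[i])   # no separator: return the string unsplit
    c(substr(vec[i], 1, pos[i] - 1),
      substr(vec[i], pos[i] + nchar(sep), nchar(vec[i])))
  })
}

s <- split_first(c("UNIQUE-ID - CPD0-1108",
                   "COMMON-NAME - a - b",
                   "[Data Chunk 1]"))
str(s)
```

Elements without a separator come back with length 1, so the existing Filter(function(a) length(a) == 2L, s) step in chunk2dataframe would still drop them.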

Data:

# lines <- readLines("some_filename.dat")
fulltext <- 'Authors:
#    Pallavi Subhraveti
#    Quang Ong
# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com
# Version: 21.5
# File Name: compounds.dat
# Date and time generated: October 24, 2017, 14:52:45
# Attributes:
#    UNIQUE-ID
#    TYPES
[Data Chunk 1]
UNIQUE-ID - CPD0-1108
TYPES - D-Ribofuranose
COMMON-NAME - &beta;-D-ribofuranose
DO-NOT-CARE - 42
[Data Chunk 2]
// something out of place here?
UNIQUE-ID - URIDINE
TYPES - Pyrimidine
COMMON-NAME - &beta;-D-ribofuranose or something
DO-NOT-CARE - 43
'
lines <- strsplit(fulltext, '[\r\n]+')[[1]]
