Convert unstructured csv file to a data frame

Tags:

I am learning R for text mining. I have a TV program schedule in form of CSV. The programs usually start at 06:00 AM and goes on until 05:00 AM the next day which is called a broadcast day. For example: the programs for 15/11/2015 start at 06:00 AM and ends at 05:00 AM the next day.

Here is a sample code showing how the schedule looks like:

 read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)

whose output is as follows:

  V1|V2
Sunday |  
01-Nov-15 |       
6 | Tom  
some information about the program |       
23.3 |  Jerry  
some information about the program |       
5 | Avatar  
some information about the program |       
5.3 | Panda  
some information about the program |       
Monday  |       
02-Nov-15|       
6  Jerry  
some information about the program |      
6.25 | Panda  
some information about the program |      
23.3 | Avatar  
some information about the program |       
7.25 |   Tom  
some information about the program |

I want to convert the above data into a form of data.frame

Date            |Program|Synopsis
2015-11-1 06:00 |Tom    | some information about the program
2015-11-1 23:30 |Jerry  | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda  | some information about the program
2015-11-2 06:00 |Jerry  | some information about the program
2015-11-2 06:25 |Panda  | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom    | some information about the program

I am thankful for any suggestions/tips regarding functions or packages I should have a look at.

850

asked Nov 15 '15 11:11

Navin MSN

2 Answers

An alternative solution with data.table:

library(data.table)
library(zoo)
library(splitstackshape)

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

wd <- levels(weekdays(1:7, abbreviate = FALSE))

DT <- DT[, temp := tv %chin% wd
         ][, day := tv[temp], by = 1:nrow(tvDT)
           ][, day := na.locf(day)
             ][, temp := NULL
               ][, idx := rleid(day)
                 ][, date := tv[2], by = idx
                   ][, .SD[-c(1,2)], by = idx]

DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]

DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)
     ][, .(datetime, Program, Info)]

The result:

> DT
              datetime Program                               Info
1: 2015-11-01 06:00:00     Tom some information about the program
2: 2015-11-01 23:30:00   Jerry some information about the program
3: 2015-11-02 05:00:00  Avatar some information about the program
4: 2015-11-02 06:00:00     Tom some information about the program
5: 2015-11-02 23:30:00   Jerry some information about the program
6: 2015-11-03 05:00:00  Avatar some information about the program

Explanation:

1: read data, convert to a data.table & remove trailing |:

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

2: extract the weekdays into a new column

wd <- levels(weekdays(1:7, abbreviate = FALSE)) # a vector with the full weekdays
DT[, temp := tv %chin% wd
   ][, day := tv[temp], by = 1:nrow(tvDT)
     ][, day := na.locf(day)
       ][, temp := NULL]

3: create an index per day & create a column with the dates

DT[, idx := rleid(day)][, date := tv[2], by = idx]

4: remove unnecessary lines

DT <- DT[, .SD[-c(1,2)], by = idx]

5: split the time and the program-name into separate rows & create a label column

DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]

6: reshape into wide format using the 'rowid' function from the development version of data.table

DT <- dcast(DT, idx + day + date + rowid(idx2) ~ idx2, value.var = "tv")[, idx2 := NULL]

7: create a dattime column & set the late night time to the next day

DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)]

8: keep the needed columns

DT <- DT[, .(datetime, Program, Info)]

111

answered Oct 02 '22 10:10

Jaap

It's a bit of a mess, but it seems to work:

df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
cat(txt)
Sys.setlocale("LC_TIME", "English") # if needed
weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
days <- split(df, cumsum(df$V1 %in% weekdays))
lapply(days, function(dayDF) {
  tmp <- cbind.data.frame(V1=dayDF[2, 1], do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2]==""))), stringsAsFactors = F)
  tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
  tmp[, 2] <- as.numeric(tmp[, 2])
  tmp[, 5] <- NULL
  idx <- c(FALSE, diff(tmp[, 2])<0)
  tmp[idx, 1] <- tmp[idx, 1] + 1
  return(tmp)
}) -> days
days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)  
names(days) <- c("Date", "Synopsis", "Program")
rownames(days) <- NULL
days[, c(1, 3, 2)]
#                  Date Program                            Synopsis
# 1 2015-11-01 06:00:00     Tom  some information about the program
# 2 2015-11-01 23:30:00   Jerry  some information about the program
# 3 2015-11-02 05:00:00  Avatar  some information about the program
# 4 2015-11-02 06:00:00     Tom  some information about the program
# 5 2015-11-02 23:30:00   Jerry  some information about the program
# 6 2015-11-03 05:00:00  Avatar  some information about the program

answered Oct 02 '22 12:10

lukeA

Related questions
                            
                                RODBC Cannot allocate memory
                            
                                rearrange data.frame to get the sequential order of products
                            
                                Which regular expression engine type does R use as a standard?
                            
                                Remove duplicated words from data frame
                            
                                sapply() with strsplit in R
                            
                                R shiny Dashboard: How to add vertical scrollbar to dashboard sidebar?
                            
                                Rearranging data frame columns in R (mutate, dplyr)
                            
                                change column names with same name in dataframe
                            
                                R: calculate the quotient of 2 columns by ID
                            
                                return multiple values from sapply/lapply
                            
                                How to prevent data.table to force numeric variables into character variables without manually specifying these?
                            
                                Retry for-loop R loop if error
                            
                                Read excel file with formulas in cells into R
                            
                                Convert character field to quarter date in R
                            
                                Error in train from Caret
                            
                                How do you use R in Python to return an R graph through Django?
                            
                                Fill in columns with 1 or 0, if that colname matches a variable found in other named columns within the same dataframe
                            
                                data.table is not displayed on first call after being modified in a function [duplicate]
                            
                                Merging a data frame from a list of data frames [duplicate]
                            
                                How can I find the pixel-wise standard deviation?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Convert unstructured csv file to a data frame

Tags:

dataframe

r

reshape

Navin MSN

People also ask

2 Answers

Jaap

lukeA

Recent Activity

Donate For Us