Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert unstructured csv file to a data frame

I am learning R for text mining. I have a TV program schedule in form of CSV. The programs usually start at 06:00 AM and goes on until 05:00 AM the next day which is called a broadcast day. For example: the programs for 15/11/2015 start at 06:00 AM and ends at 05:00 AM the next day.

Here is a sample code showing how the schedule looks like:

 read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)

whose output is as follows:

  V1|V2
Sunday |  
01-Nov-15 |       
6 | Tom  
some information about the program |       
23.3 |  Jerry  
some information about the program |       
5 | Avatar  
some information about the program |       
5.3 | Panda  
some information about the program |       
Monday  |       
02-Nov-15|       
6  Jerry  
some information about the program |      
6.25 | Panda  
some information about the program |      
23.3 | Avatar  
some information about the program |       
7.25 |   Tom  
some information about the program |      

I want to convert the above data into a form of data.frame

Date            |Program|Synopsis
2015-11-1 06:00 |Tom    | some information about the program
2015-11-1 23:30 |Jerry  | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda  | some information about the program
2015-11-2 06:00 |Jerry  | some information about the program
2015-11-2 06:25 |Panda  | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom    | some information about the program

I am thankful for any suggestions/tips regarding functions or packages I should have a look at.

like image 850
Navin MSN Avatar asked Nov 15 '15 11:11

Navin MSN


People also ask

Can we convert unstructured data into structured data?

To comprehend Unstructured Data, Unstructured Data to Structured Data Conversion is necessary. Businesses employ cutting-edge technologies like Natural Language Processing (NLP) and Artificial Intelligence (AI) to perform Unstructured Data to Structured Data Conversion.

How do you convert unstructured data to structured data in spark?

To process unstructured data either we can use spark built-in functions or need to create our own functions to transform the unstructured data into a structural form based on the requirements.

How do you read unstructured data in Python?

In the below example we take a text file and read the file segregating each of the lines in it. Next we can divide the output into further lines and words. The original file is a text file containing some paragraphs describing the python language. When we execute the above code, it produces the following result.

How do you structure unstructured data in Excel?

The Keys to parsing in unstructured data: To first Assign each row a "Record ID", that helps with how to treat each row. Get rid of the blank rows. Use the “Generate Rows” tool to put each Description and Value on a single row, when there are multiple Descriptions and Values on a single row.


2 Answers

An alternative solution with data.table:

library(data.table)
library(zoo)
library(splitstackshape)

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

wd <- levels(weekdays(1:7, abbreviate = FALSE))

DT <- DT[, temp := tv %chin% wd
         ][, day := tv[temp], by = 1:nrow(tvDT)
           ][, day := na.locf(day)
             ][, temp := NULL
               ][, idx := rleid(day)
                 ][, date := tv[2], by = idx
                   ][, .SD[-c(1,2)], by = idx]

DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]

DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)
     ][, .(datetime, Program, Info)]

The result:

> DT
              datetime Program                               Info
1: 2015-11-01 06:00:00     Tom some information about the program
2: 2015-11-01 23:30:00   Jerry some information about the program
3: 2015-11-02 05:00:00  Avatar some information about the program
4: 2015-11-02 06:00:00     Tom some information about the program
5: 2015-11-02 23:30:00   Jerry some information about the program
6: 2015-11-03 05:00:00  Avatar some information about the program

Explanation:

1: read data, convert to a data.table & remove trailing |:

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

2: extract the weekdays into a new column

wd <- levels(weekdays(1:7, abbreviate = FALSE)) # a vector with the full weekdays
DT[, temp := tv %chin% wd
   ][, day := tv[temp], by = 1:nrow(tvDT)
     ][, day := na.locf(day)
       ][, temp := NULL]

3: create an index per day & create a column with the dates

DT[, idx := rleid(day)][, date := tv[2], by = idx]

4: remove unnecessary lines

DT <- DT[, .SD[-c(1,2)], by = idx]

5: split the time and the program-name into separate rows & create a label column

DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]

6: reshape into wide format using the 'rowid' function from the development version of data.table

DT <- dcast(DT, idx + day + date + rowid(idx2) ~ idx2, value.var = "tv")[, idx2 := NULL]

7: create a dattime column & set the late night time to the next day

DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)]

8: keep the needed columns

DT <- DT[, .(datetime, Program, Info)]
like image 111
Jaap Avatar answered Oct 02 '22 10:10

Jaap


It's a bit of a mess, but it seems to work:

df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
cat(txt)
Sys.setlocale("LC_TIME", "English") # if needed
weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
days <- split(df, cumsum(df$V1 %in% weekdays))
lapply(days, function(dayDF) {
  tmp <- cbind.data.frame(V1=dayDF[2, 1], do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2]==""))), stringsAsFactors = F)
  tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
  tmp[, 2] <- as.numeric(tmp[, 2])
  tmp[, 5] <- NULL
  idx <- c(FALSE, diff(tmp[, 2])<0)
  tmp[idx, 1] <- tmp[idx, 1] + 1
  return(tmp)
}) -> days
days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)  
names(days) <- c("Date", "Synopsis", "Program")
rownames(days) <- NULL
days[, c(1, 3, 2)]
#                  Date Program                            Synopsis
# 1 2015-11-01 06:00:00     Tom  some information about the program
# 2 2015-11-01 23:30:00   Jerry  some information about the program
# 3 2015-11-02 05:00:00  Avatar  some information about the program
# 4 2015-11-02 06:00:00     Tom  some information about the program
# 5 2015-11-02 23:30:00   Jerry  some information about the program
# 6 2015-11-03 05:00:00  Avatar  some information about the program
like image 24
lukeA Avatar answered Oct 02 '22 12:10

lukeA