Sample data (emp.data
)
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
I can read it into a data.frame
using read.table
, then convert it to data.table
:
library(data.table)
df <- read.table("emp.data", col.names = c("Name", "PayRate", "HoursWorked"))
DT <- as.data.table(df, key = HoursWorked)
Calculate the pay (filter out zero hours):
DT[HoursWorked > 0, .(Name, Pay = PayRate * HoursWorked),]
Name Pay
1: Kathy 40.0
2: Mark 100.0
3: Mary 121.0
4: Susie 76.5
That works fine; however, I consider there's an extra step in converting. Since there's fread()
in data.table
, why not use it directly?
readDT <- fread("emp.data", header=FALSE, sep="\t")
V1
1: Beth 4.00 0
2: Dan 3.75 0
3: Kathy 4.00 10
4: Mark 5.00 20
5: Mary 5.50 22
6: Susie 4.25 18
str(readDT)
Classes 'data.table' and 'data.frame': 6 obs. of 1 variable:
$ V1: chr "Beth 4.00 0" "Dan 3.75 0" "Kathy 4.00 10" "Mark 5.00 20" ...
- attr(*, ".internal.selfref")=<externalptr>
The data is recognized as one column; obviously this doesn't work.
Question
How to read this data using fread()
properly? (If possible, set the column names as well.)
To read tab-separated values files with Python, we'll take advantage of the fact that they're similar to CSVs. We'll use Python's csv library and tell it to split things up with tabs instead of commas. Just set the delimiter argument to "\t" . That's it!
For files beyond 100 MB in size fread() and read_csv() can be expected to be around 5 times faster than read. csv() .
The short answer is that you can't read xlsx files with fread , xlsx files have a very different format to the text files that fread is designed for.
The fread() function returns the number of full items successfully read, which can be less than count if an error occurs, or if the end-of-file is met before reaching count. If size or count is 0, the fread() function returns zero, and the contents of the array and the state of the stream remain unchanged.
This has been fixed recently in the devel version, v1.9.5 (will be soon available on CRAN as v1.9.6):
require(data.table) # v1.9.5+
fread("~/Downloads/tmp.txt")
# V1 V2 V3
# 1: Beth 4.00 0
# 2: Dan 3.75 0
# 3: Kathy 4.00 10
# 4: Mark 5.00 20
# 5: Mary 5.50 22
# 6: Susie 4.25 18
See README.md
in the project page for more info. fread
gained strip.white
argument (amidst other functionalities / bug fixes) which is by default TRUE
.
Update: it also has col.names
argument now:
fread("~/Downloads/tmp.txt", col.names = c("Name", "PayRate", "HoursWorked"))
# Name PayRate HoursWorked
# 1: Beth 4.00 0
# 2: Dan 3.75 0
# 3: Kathy 4.00 10
# 4: Mark 5.00 20
# 5: Mary 5.50 22
# 6: Susie 4.25 18
Using awk
to remove the white spaces and then reading with fread
worked for me.
DT <- fread("awk '{$1=$1}1' emp.data")
DT
# V1 V2 V3
#1: Beth 4.00 0
#2: Dan 3.75 0
#3: Kathy 4.00 10
#4: Mark 5.00 20
#5: Mary 5.50 22
#6: Susie 4.25 18
str(DT)
#Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
#$ V1: chr "Beth" "Dan" "Kathy" "Mark" ...
#$ V2: num 4 3.75 4 5 5.5 4.25
#$ V3: int 0 0 10 20 22 18
# - attr(*, ".internal.selfref")=<externalptr>
I was able to replicate the same problem with the OP's code
fread("emp.data", header=FALSE, sep="\t")
# V1
#1: Beth 4.00 0
#2: Dan 3.75 0
#3: Kathy 4.00 10
#4: Mark 5.00 20
#5: Mary 5.50 22
#6: Susie 4.25 18
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With