Determine file type in R based on the content

Tags: file, r

In Linux, we can use the file command to determine a file's type based on its content (not its extension). Is there a similar function in R?

user1436187 asked Nov 19 '15 07:11


2 Answers

This is an old question, but it may be relevant for people getting here via Google: you can use dqmagic, a wrapper around libmagic for R, to determine the file type based on the file's content. Since file uses the same library, the results are the same, e.g.:

library(dqmagic)
file_type("DESCRIPTION")
#> [1] "ASCII text"
file_type("src/file.cpp")
#> [1] "C source, ASCII text"

vs.

$ file DESCRIPTION src/file.cpp 
DESCRIPTION:  ASCII text
src/file.cpp: C source, ASCII text

Disclaimer: I am the author of the package.
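
For readers who cannot install libmagic, the idea behind content-based detection can be sketched in base R by comparing a file's leading bytes against a few well-known signatures. The helper name guess_type_from_magic and the small signature table below are assumptions made for illustration; this is far less complete than file or dqmagic and only recognises the formats listed.

```r
# Hypothetical lookup table: MIME type -> leading "magic" bytes.
# Only a handful of well-known signatures are covered.
magic_signatures <- list(
  "image/png"        = as.raw(c(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)),
  "application/pdf"  = charToRaw("%PDF"),
  "application/zip"  = as.raw(c(0x50, 0x4B, 0x03, 0x04)),
  "application/gzip" = as.raw(c(0x1F, 0x8B)),
  "image/jpeg"       = as.raw(c(0xFF, 0xD8, 0xFF)),
  "text/rtf"         = charToRaw("{\\rtf")
)

# Hypothetical helper: guess a file type from its first bytes.
guess_type_from_magic <- function(path) {
  # Read just enough bytes to cover the longest signature.
  n_max <- max(lengths(magic_signatures))
  header <- readBin(path, what = "raw", n = n_max)
  for (type in names(magic_signatures)) {
    sig <- magic_signatures[[type]]
    if (length(header) >= length(sig) &&
        identical(header[seq_along(sig)], sig)) {
      return(type)
    }
  }
  NA_character_  # no known signature matched
}

# Usage: write a tiny fake PDF header to a temporary file and inspect it.
tmp <- tempfile()
writeBin(charToRaw("%PDF-1.4\n"), tmp)
guess_type_from_magic(tmp)
#> [1] "application/pdf"
```

Plain text files have no signature, so this sketch returns NA for them, whereas libmagic applies further heuristics to the content.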

Ralf Stubner answered Sep 22 '22 02:09


dqmagic is not on CRAN. Below is an R solution that uses Linux's file command (actually BSD's file v5.35, dated October 2018, as packaged in Ubuntu 19.04, according to the man page):

file_full_path <- "/home/user/Documents/an_RTF_document.doc"
# shQuote() protects paths that contain spaces or shell metacharacters
file_mime_type <- system2(command = "file",
  args = c("-b", "--mime-type", shQuote(file_full_path)),
  stdout = TRUE) # "text/rtf"
# Gives the list of potentially allowed extensions for this MIME type:
file_possible_ext <- system2(command = "file",
  args = c("-b", "--extension", shQuote(file_full_path)),
  stdout = TRUE) # "???". "doc/dot" for MS Word files.
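
The call above can be generalised to several files at once, since file prints one line per file it is given. The following is a minimal sketch, assuming the file utility is installed and on the PATH (which is typically not the case on Windows); the function name file_mime_types is hypothetical and not part of any package.

```r
# Hypothetical helper: MIME types for several files in one call to `file`.
# Assumes the `file` utility is installed and on the PATH.
file_mime_types <- function(paths) {
  if (!nzchar(Sys.which("file"))) {
    stop("the 'file' command was not found on the PATH")
  }
  paths <- path.expand(paths)
  not_found <- !file.exists(paths)
  if (any(not_found)) {
    stop("no such file: ", paste(paths[not_found], collapse = ", "))
  }
  # -b suppresses the file name prefix, so each output line is just the type.
  out <- system2("file",
                 args = c("-b", "--mime-type", shQuote(paths)),
                 stdout = TRUE)
  stats::setNames(out, paths)
}

# Usage: create a small text file and query its type.
tmp <- tempfile(fileext = ".txt")
writeLines("hello world", tmp)
if (nzchar(Sys.which("file"))) print(file_mime_types(tmp))
```

Checking Sys.which() first gives a clear error message instead of a cryptic failure from system2() when the utility is missing.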

It may be necessary to check that the actual extension is a valid extension for the detected MIME type (for instance, readtext::readtext() reads an RTF file but fails if it is saved as *.doc).

file.basename <- basename(file_full_path)
# Strip everything from the last dot onwards:
file.base_without_ext <- sub(pattern = "(.*)\\..*$",
  replacement = "\\1", file.basename)
file.nchar_ext <- nchar(file.basename) -
  nchar(file.base_without_ext) - 1 # 3 or 4 (doc, docx, odt...)
file_ext <- substring(file.basename, nchar(file.basename) -
  file.nchar_ext + 1) # doc, rtf...
# In some (all?) cases, 'file' outputs "???" as the allowed
# extension for an RTF MIME type, so set it by hand:
if (file_mime_type == "text/rtf") {
  file_possible_ext <- "rtf"
}

# Returns TRUE if the actual extension is known to
# be a valid extension for the given MIME type
# ('file' separates alternatives with "/", e.g. "doc/dot").
# Exact matching avoids false positives for partial or empty extensions:
tolower(file_ext) %in%
  strsplit(tolower(file_possible_ext), "/", fixed = TRUE)[[1]]

mayeulk answered Sep 22 '22 02:09