Finding and Converting HTML Files and Moving Them En-Masse

Question

I'm using Mathematica to work with a large array of website files, which I've mirrored onto my own system. They are spread across several hundred directories, with tons of sub-directories. So for example, I have:

/users/me/test/directory1
/users/me/test/directory1/subdirectory2 [times a hundred]
/users/me/test/directory2
/users/me/test/directory2/subdirectory5 [etc. etc.]

What I need to do is go into each directory, Import[] all the HTML files as "Plaintext", and then put the results in another directory elsewhere on my system, named after 'directory1'. So far, with Do[] loops, I have a rough version: the best I can manage is dumping the ".txt" files in the original directory, which isn't ideal because they're still spread all over my system.

To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];

Some additional vexing problems:

(1) Duplicates: Is there a way for Mathematica to deal with duplicates, i.e. if we run into another index_en.html, can it be renamed index_en_1.html?

(2) Directories: Because there are so many directories, my code keeps running into trouble unless I call SetDirectory and CreateDirectory over and over again.

This all seems a bit confusing. Basically, is there an efficient way for Mathematica to find a ton of HTML files spread across hundreds of directories and subdirectories, Import them as plaintext, and export them somewhere else? (It's important for me to know they came from directory1, but that's it.)

-- edited for clarity below --

Here is the code that I currently have:

SetDirectory[
  "/users/me/web/"];
dirlist = FileNames[];
directoryPrefix = 
  "/users/me/web/";
plainHTMLBucket = "";
Do[
  directory = directoryPrefix <> dirname;
  exportPrefix = 
   "/users/me/desktop/bucket/";
  SetDirectory[directory];
  allFiles = FileNames["*.htm*", {"*"}, Infinity];
  plainHTMLBucket = "";
  Do[
   plainHTML = Import[filename, "Plaintext"];
   plainHTMLBucket = plainHTMLBucket <> plainHTML <> "\n";
   , {filename, allFiles}];
  Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
  Print["We Have Reached Here"];
  , {dirname, dirlist}];

What's wrong with it from my perspective? Besides being messy, it's only a workaround: I would much rather keep the files separate than merge them into one big one, i.e. take each import and export it as its own file, in a directory called 'directory1' somewhere else. The problem comes when mirroring these directories: the target directories don't exist yet, and I am having trouble using CreateDirectory[] to create them dynamically.

My apologies for the confusion here - I know it shows in this question.

WReach · Accepted Answer

The following code might do the trick:

mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]}
  , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
  ]

htmlTreeToPlainText[source_, target_] :=
  Module[{htmlFiles, textFiles, targetDirs}
  , htmlFiles = FileNames["*.html", source, Infinity]
  ; textFiles = StringReplace[
                  mapFileNames[source, htmlFiles, target]
                  , f__~~".html"~~EndOfString :> f~~".txt"
                  ]
  ; targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
  ; If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
  ; Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
  ; Scan[
      Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
    , Transpose[{htmlFiles, textFiles}]
    ]
  ]

Example use (warning: the target directory will be deleted first!):

htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]

How It Works

The various Mathematica FileName... functions are helpful in this context. We start by defining the helper function mapFileNames, which takes a source directory, a list of file names that lie within the source directory, and a target directory. It returns a list of file paths that name the corresponding locations underneath the target directory.

mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]}
  , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
  ]

The function uses FileNameDrop to drop the leading source path elements from each filename and FileNameJoin to prepend the target path onto the front of each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.

For example:

In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
Out[83]= {"/d/x.txt", "/d/c/y.txt"}

Using this function, we can convert a list of HTML file paths under a source directory (source) into a corresponding list of text file paths under the target directory (target):

htmlFiles = FileNames["*.html", source, Infinity]

textFiles = StringReplace[
              mapFileNames[source, htmlFiles, target]
              , f__~~".html"~~EndOfString :> f~~".txt"
              ]

These statements retrieve the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. We can now extract the necessary directory names from the resulting text files:

targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]

Again FileNameDrop is used, this time to drop the filename portion from each text file's path.
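The path steps so far are pure string manipulation, so they can be tried on made-up paths before anything touches the disk. The following is a minimal sketch under a few assumptions: the sample paths and the names sampleHtml, sampleText and sampleDirs are hypothetical, and the pattern is widened to accept .htm as well as .html, because the question searched for "*.htm*" while the code above matches only .html.

```mathematica
(* Helper repeated from above so that this snippet stands alone *)
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]},
   FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames]

(* Hypothetical sample paths; nothing is read from or written to disk *)
sampleHtml = {"/a/b/index.html", "/a/b/c/page.htm", "/a/b/c/other.html"};

(* Map into the target tree, then swap a trailing .html or .htm for .txt *)
sampleText = StringReplace[
   mapFileNames["/a/b", sampleHtml, "/d"],
   (".html" | ".htm") ~~ EndOfString -> ".txt"]

(* One entry per distinct parent directory of the text files *)
sampleDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ sampleText]
```

Evaluating sampleText and sampleDirs shows exactly which files and directories a real run would create, which is a cheap check to make before running anything that deletes a directory.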

Next, we need to delete the target directory (if it already exists) and create the new required directories:

If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]

Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
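If deleting the whole target tree is too drastic, one possible alternative is to create only the directories that are missing. This is a sketch, not part of the original answer: the names base and demoDirs and the scratch location under $TemporaryDirectory are assumptions chosen for illustration.

```mathematica
(* Hypothetical scratch location under the system temp directory *)
base = FileNameJoin[{$TemporaryDirectory, "bucket-demo"}];
demoDirs = {
   FileNameJoin[{base, "directory1"}],
   FileNameJoin[{base, "directory1", "subdirectory2"}]};

(* Create only what is missing; existing directories and files are left alone *)
Scan[
  If[! DirectoryQ[#],
    CreateDirectory[#, CreateIntermediateDirectories -> True]] &,
  demoDirs]

(* True if every requested directory now exists *)
And @@ (DirectoryQ /@ demoDirs)
```

The trade-off is that text files left over from an earlier run stay in place, so this variant suits incremental runs better than a clean rebuild.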

We can now perform the HTML-to-text transformation, safe in the knowledge that the target directories already exist:

Scan[
  Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
, Transpose[{htmlFiles, textFiles}]
]
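Because the layout above mirrors the source tree, two files called index_en.html in different subdirectories never collide. If a flat folder per top-level directory is preferred instead, as point (1) of the question suggests, names have to be made unique. A minimal sketch, assuming the bucket directory already exists; uniquePath and flatExport are hypothetical helpers, not built-in functions:

```mathematica
(* uniquePath: hypothetical helper that appends _1, _2, ... to the base name
   until the resulting path does not exist yet *)
uniquePath[path_String] :=
  Module[{dir = DirectoryName[path], base = FileBaseName[path],
    ext = FileExtension[path], n = 1, candidate = path},
   While[FileExistsQ[candidate],
    (* DirectoryName keeps the trailing path separator, so plain
       string joining is enough here *)
    candidate = dir <> base <> "_" <> ToString[n] <>
      If[ext === "", "", "." <> ext];
    n++];
   candidate]

(* flatExport: hypothetical helper that writes one HTML file as plain text
   into an existing bucket directory under a non-clashing name *)
flatExport[htmlFile_String, bucket_String] :=
  Export[
   uniquePath[FileNameJoin[{bucket, FileBaseName[htmlFile] <> ".txt"}]],
   Import[htmlFile, "Plaintext"], "Text"]
```

With these definitions, exporting two different files that are both named index_en.html into the same bucket should leave index_en.txt and index_en_1.txt side by side. The numeric suffix ends up on the .txt name rather than on the .html name mentioned in the question.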

acl · Answer

To set the current directory, do something like

SetDirectory["~/Desktop/"]

Now, suppose I wish to obtain a list of all directories in the current directory. I can do

dirs=Pick[
   #,
   (FileType[#] === Directory) & /@ #
   ] &@FileNames[]

which returns a list of the names of all the directories under the current directory that you set earlier (I use nested pure functions, which may be confusing...). You can then apply a function fn to each of the dirs with Scan[fn, dirs]. So, you could assign the Pick[] construct to a function, then use it to recurse down your tree.
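The recursion could look roughly like the sketch below. It swaps the Pick[] construct for Select with DirectoryQ, which should select the same directories, and the names subDirectories and walkTree are made up for illustration.

```mathematica
(* subDirectories: hypothetical helper listing the directories
   immediately inside dir, as paths that include dir *)
subDirectories[dir_String] := Select[FileNames["*", {dir}], DirectoryQ]

(* walkTree: hypothetical helper that applies fn to dir and then
   to every directory beneath it, depth first *)
walkTree[fn_, dir_String] := (
  fn[dir];
  Scan[walkTree[fn, #] &, subDirectories[dir]])

(* Example: print every directory at or below the current one *)
walkTree[Print, Directory[]]
```

The sketch does not guard against symbolic links that point back up the tree, so it assumes a plain directory hierarchy. When only the files are needed, passing Infinity as the third argument of FileNames does the descent without any explicit recursion.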

This is straightforward but I am not sure it's what you want. Maybe you could be a little more explicit on what you're after so I/we do not sit down and code the wrong thing.

Tags:

wolfram-mathematica

canadian_scholar
