I'm using Mathematica to work with a large array of website files, which I've mirrored onto my own system. They are spread across several hundred directories, with tons of sub-directories. So for example, I have:
/users/me/test/directory1
/users/me/test/directory1/subdirectory2 [times a hundred]
/users/me/test/directory2
/users/me/test/directory2/subdirectory5 [etc. etc.]
What I need to do is to go into each directory, Import[] all the HTML files as Plaintext, and then put them in another directory elsewhere on my system named after 'directory1'.
So far, with Do[] loops, I have been able to do a rough version: the best I have right now, however, dumps the ".txt" files in the original directory, which isn't ideal as they're still spread all over my system.
To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];
Some additional vexing problems:
(1) Duplicates: Is there a way for Mathematica to deal with duplicates - i.e. if we run into another index_en.html can it be renamed as index_en_1.html?
(2) Directories: Because of all the directories, unless I have Mathematica call SetDirectory and CreateDirectory over and over again, it keeps running into trouble.
This all seems a bit confusing. Basically, is there an efficient way for Mathematica to find a ton of HTML files spread across hundreds of directories/subdirectories, Import them as plaintext, and export them somewhere else [it's important for me to know they came from directory1, but that's it].
-- edited for clarity below --
Here is the code that I currently have:
SetDirectory["/users/me/web/"];
(* Keep only directories; FileNames[] also returns plain files *)
dirlist = Select[FileNames[], DirectoryQ];
directoryPrefix = "/users/me/web/";
exportPrefix = "/users/me/desktop/bucket/";
Do[
  directory = directoryPrefix <> dirname;
  SetDirectory[directory];
  allFiles = FileNames["*.htm*", {"*"}, Infinity];
  plainHTMLBucket = "";
  Do[
    plainHTML = Import[filename, "Plaintext"];
    (* Strings are joined with <>; AppendTo only works on lists *)
    plainHTMLBucket = plainHTMLBucket <> plainHTML <> "\n";
    , {filename, allFiles}];
  Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
  Print["We Have Reached Here"];
  , {dirname, dirlist}];
What's wrong with it from my perspective? Besides being messy, it's a workaround: I would much rather have all the files separate than in one big file, i.e. take each import and export it as its own file, in a directory called 'directory1', albeit somewhere else. The problem comes when mirroring these directories: they don't exist yet, and I am having trouble using CreateDirectory[] to create them dynamically.
My apologies for the confusion here - I know it shows in this question.
The following code might do the trick:
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]}
  , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
  ]

htmlTreeToPlainText[source_, target_] :=
  Module[{htmlFiles, textFiles, targetDirs}
  , htmlFiles = FileNames["*.html", source, Infinity]
  ; textFiles = StringReplace[
      mapFileNames[source, htmlFiles, target]
    , f__~~".html"~~EndOfString :> f~~".txt"
    ]
  ; targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
  ; If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
  ; Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
  ; Scan[
      Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
    , Transpose[{htmlFiles, textFiles}]
    ]
  ]
Example use (warning: the target directory will be deleted first!):
htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]
How It Works
The various Mathematica FileName... functions are helpful in this context. First, we define the helper function mapFileNames, which takes a source directory, a list of file names that lie within the source directory, and a target directory. It returns a list of file paths that name the corresponding locations underneath the target directory.
mapFileNames[source_, filenames_, target_] :=
  Module[{depth = FileNameDepth[source]}
  , FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
  ]
The function uses FileNameDrop to drop the leading source path elements from each filename and FileNameJoin to prepend the target path onto the front of each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.
For example:
In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
Out[83]= {"/d/x.txt", "/d/c/y.txt"}
Using this function, we can convert a list of HTML file paths under a source directory (source) into the corresponding list of text file paths under the target directory (target):
htmlFiles = FileNames["*.html", source, Infinity]
textFiles = StringReplace[
    mapFileNames[source, htmlFiles, target]
  , f__~~".html"~~EndOfString :> f~~".txt"
  ]
These statements retrieve the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. We can now extract the necessary directory names from the resulting text files:
targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
Again FileNameDrop is used, this time to drop the filename portion from each text file's path.
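As a quick illustration of that step, here is a minimal sketch on made-up paths (the /d/... names are hypothetical, and nothing here touches the file system):

```mathematica
(* Hypothetical output paths, used only to show the string manipulation. *)
textFiles = {"/d/x.txt", "/d/c/y.txt", "/d/c/z.txt"};

(* Drop the last path element (the file name) from each entry, then
   collapse repeated directories so each one is created only once. *)
targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]
```

The two files under /d/c share a parent, so that directory should appear only once in the result.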
Next, we need to delete the target directory (if it already exists) and create the new required directories:
If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
We can now perform the HTML-to-text transformation, safe in the knowledge that the target directories already exist:
Scan[
    Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
  , Transpose[{htmlFiles, textFiles}]
  ]
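The code above mirrors the whole tree, so files with the same name never collide. If instead you want one flat folder per top-level directory, which is what the question describes, name clashes such as a second index_en.html have to be handled. The following is only a sketch under that assumption: uniqueName and flatExport are hypothetical helpers that are not part of the code above, and failed imports are not checked.

```mathematica
(* uniqueName (hypothetical helper): return path unchanged if nothing exists
   there yet; otherwise try base_1.ext, base_2.ext, ... until a name is free. *)
uniqueName[path_String] :=
  Module[{dir = DirectoryName[path], base = FileBaseName[path],
      ext = FileExtension[path], n = 1, candidate = path},
    While[FileExistsQ[candidate],
      (* DirectoryName keeps the trailing separator, so plain joining works *)
      candidate = dir <> base <> "_" <> ToString[n] <> "." <> ext;
      n++
    ];
    candidate
  ]

(* flatExport (hypothetical helper): write every HTML file found anywhere
   under source/<top>/ as plain text directly into target/<top>/. *)
flatExport[source_String, target_String] :=
  Module[{tops, outDir},
    tops = Select[FileNames["*", source], DirectoryQ];
    Do[
      outDir = FileNameJoin[{target, FileNameTake[top]}];
      If[! DirectoryQ[outDir],
        CreateDirectory[outDir, CreateIntermediateDirectories -> True]];
      Scan[
        Export[
          uniqueName[FileNameJoin[{outDir, FileBaseName[#] <> ".txt"}]],
          Import[#, "Plaintext"], "Text"] &,
        FileNames["*.htm*", top, Infinity]
      ],
      {top, tops}
    ]
  ]
```

A call such as flatExport["/users/me/web", "/users/me/desktop/bucket"] would then leave the source tree alone and never delete the target. Note that running it twice adds renamed copies rather than overwriting the earlier output.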
To set the current directory, do something like
SetDirectory["~/Desktop/"]
Now, suppose I wish to obtain a list of all directories in the current directory. I can do
dirs = Pick[
    #,
    (FileType[#] == Directory) & /@ #
  ] &@FileNames[]
which returns a list of the names of all the directories under the current directory that you set earlier (I use nested pure functions, which may be confusing...). You can then apply fn to each of the dirs with Scan[fn,dirs]. So, you could assign the Pick[] construct to a function, then use it to recurse down your tree.
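One way that recursion might look is sketched below. It uses Select with DirectoryQ in place of the Pick[] construct, and the names subDirs and walk are made up for this illustration:

```mathematica
(* subDirs (hypothetical name): immediate subdirectories of dir,
   defaulting to the current directory. *)
subDirs[dir_String : "."] := Select[FileNames["*", dir], DirectoryQ]

(* walk (hypothetical name): apply fn to dir itself, then recurse into
   each of its subdirectories in turn. *)
walk[fn_, dir_String] := (fn[dir]; Scan[walk[fn, #] &, subDirs[dir]])
```

For example, walk[Print, "/users/me/web"] should print every directory in the tree, parents before children. The sketch makes no attempt to detect symbolic links that point back up the tree, so such a loop would recurse indefinitely.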
This is straightforward but I am not sure it's what you want. Maybe you could be a little more explicit on what you're after so I/we do not sit down and code the wrong thing.