This question may or may not be truly Haskell-specific, but it concerns a slight annoyance that I am facing with a certain programming task.
I have written a program in Haskell which is mostly universal for the type of problem I am trying to solve, but includes two dependent components: a run-time estimation function for a script, calculated based on trial runs at a certain benchmark, and a file-name conversion function, which is tailored to the naming scheme of the files I was working with. Naturally, if I want to use the script with performances other than the benchmark, or I find that the estimates are too conservative, I would like to change the function used to estimate the run-time, and likewise I would like to be able to modify the file-name conversion function if I ever need to work with different files with different naming schemes.
However, the (remote) computer that I am running my scripts on does not have GHC or runhaskell installed, so I am having to modify, compile, and re-upload the code from my local machine, which is a bit of a hassle. My question is, is there an easy way to implement changes in some components of my code without having to recompile in order for the changes to be reflected at call-time?
I apologize if my description is vague. I have included the gory details below, kept separate so that they do not clutter the question if they turn out to be unnecessary.
I am writing this code in Haskell mainly because that is the language that I best know how to implement the methods in; while I understand that other languages might be more portable, I am not sufficiently familiar with other languages in order to implement this without having to read a lot of documentation and go through multiple revisions in order to get it to work. If achieving the flexibility I want with Haskell is impractical, I can appreciate that, but I would rather know that Haskell cannot do it than receive suggestions of other languages that can.
I am writing code to run independent jobs on a load-sharing cluster, and I therefore want to most closely estimate the time required for a particular job, without under-shooting and causing the job to be terminated, and without over-shooting and thereby lowering the priority of the jobs. I am basing my time estimate on the size of the inputs to the job program, and I have gathered enough real-world data to derive an approximate quadratic relation between size and time.
The way I am currently assigning time-estimates, and thereby establishing a job order, for the inputs is by parsing the output of du with a Haskell script, performing a computation, and writing the time results to a new file, which is then read in a loop by the job-assignment script.
The job is being run for paired files, which share a common name up to a certain point, where the last common element I wish to retain is an 's', with no further 's' characters in either name from then on. Therefore, I am traversing the names backwards and dropping until I reach an 's'. My code is included below. It is liberal with comments, which might help or might confuse. Some of them are highly specific to the task I am working with.
-- size2time.hs
-- A Haskell script to convert file sizes into job-times, based on observed job-times for
-- various file sizes
--
--
-- This file may be compiled via the following command:
-- > ghc size2time.hs
--
-- Should any edits be made, ensure that the compiled executable is updated accordingly
--
-- The executable is to be run with the following usage
--
-- > ./size2time inputfile outputfile
--
-- where inputfile is the name of a file whose first column contains the sizes, in MB, of each fq.gz
-- (including both paired-end reads), and whose second column contains the corresponding file names, as
-- generated by
--
-- > du -m $( ls DIR/*.fq.gz ) >inputfile
--
-- where DIR is the directory containing the fq.gz files
--
-- output is the name of a file that will be created by the execution of this script, whose first
-- column will contain the run-time, in minutes, of the corresponding job (the times are based on
-- jobs run on Intel CPUs with 12 cores and 2GB of RAM, and therefore will potentially be
-- inapplicable to jobs run on CPUs of different manufacturers, with different numbers of cores,
-- and/or with different allocated RAM), and whose second column contains the scrubbed names of
-- the jobs to be run. The greater time-value for any given pair is used, with only one member of
-- each pair retained, as the file-names of each member of a pair are identical after scrubbing
--
-- import modules for command line arguments, list operations, map operations
import System.Environment
import Data.List
import qualified Data.Map as Map
main = do
  args <- getArgs -- parse command line arguments: inputfile, outputfile, <ignored>
  let infile = head args
      outfile = head . tail $ args
  contents <- readFile infile -- read the inputfile
  let sf = lines contents -- split into lines
      tf = map size2time sf -- perform size2time mapping
      st = map sample tf -- scrub filename
      stu = Map.toList . Map.fromListWith (max) $ st -- take only the longer of the two times of the paired reads
      tsu = map flip2 stu -- put time first
      stsu = sort tsu -- sort by time, ascending
      tsustr = map unwords . map (\(x,y) -> [show x, y]) $ stsu -- convert back to string
      tsulns = unlines tsustr -- join individual lines
  writeFile outfile tsulns -- write to the outputfile
{- given a string, with the size of a file and the name of the file,
- returns a tuple with the estimated job-time and the unmodified name
- of the file.
-
- The size-time conversion is extrapolated from experimental data,
- with only the upper extremes considered in order to prevent timeout,
- rounding in the quadratic term, and a linear-degree time padding added
- to allow for upper extremes. If modifications are to be made to any
- coefficients, it is recommended that only linear and constant terms be increased,
- and decreases should only be made after performing sufficient alignments to collect
- enough (file size)--(actual computation time) pairs to verify that the padding is excessive,
- and to determine coefficients that more closely follow the trend of the actual data, with
- the conditions that no data point must exceed the approximation curve, and that sufficient padding
- must be provided to allow for potential inconsistency in the time required for any given size of alignment.
-}
size2time :: String -> (Int,String)
size2time sfstring = let (size:file:[]) = words sfstring -- parses out size and filename
                         x = fromIntegral (read size :: Int) -- floating point from numeric string
                         time = floor $ 0.000025 * x ^ 2 + 0.03 * x + 10 -- apply floored conversion
                         tfstring = (time,file)
                     in tfstring
{-
- removes all characters in the file-name after 's', which properly scrubs files of the format
- *--Hs--R?.fq.gz, where the ? is either 1 or 2. For filenames formatted in different ways,
- or for alternative naming of the BAM file to be generated, this function must be modified
- to suit the scenario.
-}
sample :: (a,String) -> (String,a)
sample (x,f) = let s = reverse . dropWhile (/= 's') . reverse $ f
               in (s,x)
{-
- Reverses the order of a tuple, e.g. so that a Map may be made with a key to be found in the
- current second position of the tuple.
-}
flip2 :: (a,b) -> (b,a)
flip2 (x,y) = (y,x)
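As a quick check of the scrubbing and pairing steps, here is a small self-contained sketch that applies the same "drop everything after the last 's'" rule and the same keep-the-larger-time merge to two made-up entries. The file names and times are hypothetical, and the helper names scrub and mergePairs are introduced only for this illustration; they are not part of size2time.hs.

```haskell
import qualified Data.Map as Map

-- Keep everything up to and including the last 's' in the name
-- (same rule as `sample` in size2time.hs).
scrub :: String -> String
scrub = reverse . dropWhile (/= 's') . reverse

-- Merge paired entries, keeping the larger time for each scrubbed name.
mergePairs :: [(Int, String)] -> [(String, Int)]
mergePairs = Map.toList . Map.fromListWith max . map (\(t, f) -> (scrub f, t))

main :: IO ()
main = print (mergePairs [ (13, "DIR/A--Hs--R1.fq.gz")   -- hypothetical pair
                         , (31, "DIR/A--Hs--R2.fq.gz") ])
-- prints [("DIR/A--Hs",31)]
```

One edge case worth knowing: a name that contains no 's' at all scrubs to the empty string, so every such file would collapse into a single entry.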
Open a command window and navigate to the directory where you want to keep your Haskell source files. Run Haskell by typing ghci or ghci MyFile.hs. (The "i" in "GHCi" stands for "interactive", as opposed to compiling and producing an executable file.)
If you have installed the Haskell Platform, open a terminal and type ghci (the name of the executable of the GHC interpreter) at the command prompt. Alternatively, if you are on Windows, you may choose WinGHCi in the Start menu. And you are presented with a prompt. The Haskell system now attentively awaits your input.
GHCi is the interactive interface to GHC. From the command line, enter "ghci" (or "ghci -W") followed by an optional filename to load. Note: We recommend using "ghci -W", which tells GHC to output useful warning messages in more situations. These warnings help to avoid common programming errors.
I don't think there's a clear solution to your problem.
Without an interpreter or compiler on the remote machine, it's not possible to modify your Haskell source on that machine and then convert it into a machine-readable form.
As others have said, perhaps you could implement configuration files or command line options that allow likely-to-be-modified data to be specified at run time.
Or, assuming your remote machine has gcc installed, you could have GHC compile your Haskell code into C on your local machine, transfer it to the remote machine, try your best to make sense of how it translated your code, and make changes to the C code and recompile on the remote machine.
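The configuration-file suggestion could look roughly like the sketch below. It is only one possible layout, not something taken from the original program: the file format (one "key value" pair per line), the key names quad, lin, con and cut, and the function names are all assumptions made for illustration. The defaults mirror the constants hard-coded in size2time.hs, so an empty or missing key falls back to the current behaviour.

```haskell
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)
import System.Environment (getArgs)
import Text.Read (readMaybe)

-- Hypothetical settings: coefficients of quad*x^2 + lin*x + con,
-- and the character after which the file name is cut.
data Settings = Settings
  { quad :: Double, lin :: Double, con :: Double, cutAt :: Char }
  deriving (Show, Eq)

-- Defaults taken from the constants in size2time.hs.
defaults :: Settings
defaults = Settings 0.000025 0.03 10 's'

-- Parse "key value" lines; malformed lines and unknown keys are ignored,
-- and unparsable values fall back to the defaults.
parseSettings :: String -> Settings
parseSettings txt = Settings
    { quad  = num "quad" (quad defaults)
    , lin   = num "lin"  (lin defaults)
    , con   = num "con"  (con defaults)
    , cutAt = case Map.lookup "cut" kv of
                Just [c] -> c
                _        -> cutAt defaults
    }
  where
    kv      = Map.fromList [ (k, v) | [k, v] <- map words (lines txt) ]
    num k d = fromMaybe d (Map.lookup k kv >>= readMaybe)

-- Estimated run-time in minutes for an input of the given size in MB.
estimate :: Settings -> Int -> Int
estimate s mb = floor (quad s * x ^ (2 :: Int) + lin s * x + con s)
  where x = fromIntegral mb :: Double

-- Drop everything after the last occurrence of the cut character.
scrubWith :: Char -> String -> String
scrubWith c = reverse . dropWhile (/= c) . reverse

main :: IO ()
main = do
  args <- getArgs
  case args of
    [cfgFile, sizeMb, name] -> do
      s <- parseSettings <$> readFile cfgFile
      case readMaybe sizeMb of
        Just mb -> putStrLn (unwords [show (estimate s mb), scrubWith (cutAt s) name])
        Nothing -> putStrLn "size must be an integer number of MB"
    _ -> putStrLn "usage: estimate CONFIG SIZE_MB FILENAME"
```

With something along these lines, the executable is compiled once on the local machine and uploaded once; afterwards only the plain-text configuration file needs to be edited on the remote machine when the coefficients or the naming scheme change. This only covers changes that fit the quadratic form and the single cut character; a genuinely different estimation function would still need a recompile.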