Is it possible to use CompUnit modules for collected data?

I have developed a module for processing a collection of documents.

One run of the software collects information about them. The data is stored in two structures called %processed and %symbols. The data needs to be cached for subsequent runs of the software on the same set of documents, some of which can change. (The documents are themselves cached using CompUnit modules).

Currently the data structures are stored / restored as follows:

# storing
'processed.raku'.IO.spurt: %processed.raku;
'symbols.raku'.IO.spurt: %symbols.raku;

# restoring
use MONKEY-SEE-NO-EVAL;   # EVALFILE, like EVAL, is gated behind this pragma
my %processed = EVALFILE 'processed.raku';
my %symbols = EVALFILE 'symbols.raku';

Outputting these structures to files, which can be quite large, is slow because the hashes must be walked to produce their stringified forms, and reading them back is slow because the files have to be recompiled.

It is not intended for the cached files to be inspected, only to save state between software runs.

In addition, although this is not a problem for my use case, this technique cannot be used in general because stringification (serialisation) does not work for Raku closures, as far as I know.

I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?

Is there already a way to do this?

If there isn't, is there any technical reason it might NOT be possible?

asked Feb 25 '21 by Richard Hainsworth



3 Answers

(There's a good chance that you've already tried this and/or it isn't a good fit for your use case, but I thought I'd mention it just in case it's helpful either to you or to anyone else who finds this question.)

Have you considered serializing the data to/from JSON with JSON::Fast? It has been optimized for (de)serialization speed in a way that basic stringification hasn't been / can't be. It doesn't allow for storing Blocks or other closures: as you mentioned, Raku doesn't currently have a good way to serialize them. But since you said that isn't an issue, it's possible that JSON would fit your use case.
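For the plain, string-keyed hashes described in the question, the round trip is direct. A minimal sketch (the data below is a stand-in for the real %processed):

use JSON::Fast;

# Stand-in for the question's %processed hash: string keys, simple values
my %processed = doc1 => { mtime => 1614241200, ok => True },
                doc2 => { mtime => 1614244800, ok => False };

# storing
'processed.json'.IO.spurt: to-json(%processed);

# restoring
my %restored = from-json('processed.json'.IO.slurp);
say %restored<doc1><ok>;  # True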

[EDIT: as you pointed out below, this can make support for some Raku datastructures more difficult. There are typically (but not always) ways to work around the issue by specifying the datatype as part of the serialization step:

use JSON::Fast;
my $a = <a a a b>.BagHash;
my $json = $a.&to-json;
my BagHash() $b = from-json($json);
say $a eqv $b; # OUTPUT: «True»

This gets more complicated for datastructures that are harder to represent in JSON (such as those with non-string keys). The JSON::Class module could also be helpful, but I haven't tested its speed.]

answered Oct 21 '22 by codesections


After looking at other answers and looking at the code for Precompilation, I realised my original question was based on a misconception.

The Rakudo compiler generates an intermediate "byte code", which is then used at run time. Since modules are self-contained units for compilation purposes, they can be precompiled. This intermediate result can be cached, thus significantly speeding up Raku programs.

When a Raku program uses code that has already been compiled, the compiler does not compile it again.
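You can see a trace of this machinery from inside any program. A small sketch (output and cache layout are Rakudo implementation details):

# List the compilation-unit repositories Rakudo consults; a repository added
# with `use lib 'lib'` caches its precompiled bytecode under lib/.precomp
use lib 'lib';
say $*REPO.repo-chain.map(*.path-spec).join("\n");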

I had thought of the precompilation cache as a sort of storage for the internal state of a program, but it is not quite that. That is why, I think, @raiph was confused: I was not asking the right sort of question.

My question is about the storage (and restoration) of data. JSON::Fast, as discussed by @codesections, is very fast because it is used at a low level by the Rakudo compiler and so is highly optimised. Consequently, serialising to JSON and restructuring the data on restoration will be faster than storing native Raku forms, because the slow, rate-determining step is getting the data to and from disk, which JSON does very quickly.

Interestingly, the CompUnit modules I mentioned use the same low-level JSON routines that make JSON::Fast so quick.

I am now considering other ways of storing data using optimised routines, perhaps using a compression/archiving module. It will come down to testing which is fastest. It may be that the JSON route is the fastest.
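A rough way to run that test is a sketch like the following (synthetic data; absolute timings will vary with machine and data shape):

use JSON::Fast;
use MONKEY-SEE-NO-EVAL;   # needed for EVALFILE

my %data = (1..50_000).map({ "key$_" => $_ });

my $t = now;
'data.raku'.IO.spurt: %data.raku;
my %r1 = EVALFILE 'data.raku';
say ".raku/EVALFILE round trip: {now - $t}s";

$t = now;
'data.json'.IO.spurt: to-json(%data);
my %r2 = from-json('data.json'.IO.slurp);
say "JSON::Fast round trip: {now - $t}s";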

So this question does not have a clear answer because the question itself is "incorrect".

answered Oct 21 '22 by Richard Hainsworth


Update As @RichardHainsworth notes, I was confused by their question, though I felt it would be helpful to answer as I did. Based on his reaction, and his decision not to accept @codesections' answer, which at that point was the only other answer, I concluded it was best to delete this answer to encourage others to answer. But now that Richard has provided an answer giving good resolution, I'm undeleting mine in the hope it's more useful.


TL;DR Instead of using EVALFILE, store your data in a module which you then use. There are simple ways to do this that would be minimal but useful improvements over EVALFILE. There are more complex ways that might be better.

A small improvement over EVALFILE

I've decided to first present a small improvement so you can solidify your shift in thinking from EVALFILE. It's small in two respects:

  • It should only take a few minutes to implement.

  • It only gives you a small improvement over EVALFILE.

I recommend you properly consider the rest of this answer (which describes more complex improvements with potentially bigger payoffs instead of this small one) before bothering to actually implement what I describe in this first section. I think this small improvement is likely to turn out to be redundant beyond serving as a mental bridge to later sections.


Write a program, say store.raku, that creates a module, say data.rakumod:

my %hash-to-store = :a, :b;
my $hash-as-raku-code = %hash-to-store.raku;
my $raku-code-to-store = "unit module data; our %hash = $hash-as-raku-code";
spurt 'data.rakumod', $raku-code-to-store;

(My use of .raku is of course overly simplistic. The above is just a proof of concept.)

This form of writing your data will have essentially the same performance as your current solution, so there's no gain in that respect.


Next, write another program, say, using.raku, that uses it:

use lib '.';
use data;
say %data::hash; # {a => True, b => True}

use-ing the module will entail compiling it. So the first time you use this approach for reading your data instead of EVALFILE it'll be no faster, just as it was no faster to write it. But it should be much faster for subsequent reads. (Until you next change the data and have to rebuild the data module.)
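If rebuilding the module when the data changes is a concern, a small staleness check can automate it. A sketch (the documents directory name is assumed; store.raku and data.rakumod are the files from above):

# Regenerate data.rakumod only when it is missing or older than the documents
if !'data.rakumod'.IO.e
   || 'documents'.IO.modified > 'data.rakumod'.IO.modified {
    run $*EXECUTABLE, 'store.raku';
}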

This section also doesn't deal with closure stringification, and means you're still doing a data writing stage that may not be necessary.

Stringifying closures; a hack

One can extend the approach of the previous section to include stringifications of closures.

You "just" need to access the source code containing the closures; use a regex/parse to extract the closures; and then write the matches to the data module. Easy! ;)

For now I'll skip filling in details, partly because I again think this is just a mental bridge and suggest you read on rather than try to do as I just described.
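Purely to make the shape of the hack concrete, here's a toy sketch; the marker comments are an invented convention, and a robust version would need a real parse rather than a regex:

# Toy sketch: pull marked closure source out of a file and store it as data
my $source = 'documents.raku'.IO.slurp;
my @closures = $source.match(:g, / '#CLOSURE-START' (.*?) '#CLOSURE-END' /)
                      .map({ ~.[0] });
spurt 'closures.rakumod',
    "unit module closures;\nour \@code = {@closures.map(*.raku).join(', ')};";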

Using CompUnits

Now we arrive at:

I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?

I'm a bit confused by what you're asking here for two reasons. First, I think you mean the documents ("The documents are themselves cached using CompUnit modules"), and that documents are stored as modules. Second, if you do mean the documents are stored as modules, then why wouldn't you be able to store the data you want stored in them? Are you concerned about hiding the data?

Anyhow, I will presume that you are asking about storing the data in the document modules, and that you're interested in ways to "hide" that data.


One simple option would be to write the data as I did in the first section, but insert the our %hash = $hash-as-raku-code etc. code at the end, after the actual document, rather than at the start.
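A sketch of that (the file name is assumed; :append leaves the document text itself untouched):

my %hash-to-store = :a, :b;
'document.rakumod'.IO.spurt: "\nour %hash = {%hash-to-store.raku};\n", :append;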

But perhaps that's too ugly / not "hidden" enough?


Another option might be to add Pod blocks with Pod block configuration data at the end of your document modules.

For example, putting all the code into a document module and throwing in a say just as a proof-of-concept:

# ...
# Document goes here
# ...

# At end of document:
=begin data :array<foo bar> :hash{k1 => 'v1', k2 => 'v2'} :baz :qux(3.14)
=end data

say $=pod[0].config<array>; # foo bar

That said, that's just code being executed within the module; I don't know if the compiled form of the module retains the config data. Also, you need to use a "Pod loader" (cf Access pod from another Raku file). But my guess is you know all about such things.
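On the loading side, one ecosystem option is the Pod::Load module; a sketch, assuming it fits this use (I haven't checked whether the compiled form keeps the config data, nor Pod::Load's speed):

# Sketch using the ecosystem Pod::Load module (file name is this answer's example)
use Pod::Load;
my @pod = load 'document.rakumod';   # compiles the file and returns its Pod blocks
say @pod[0].config<array>;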

Again, this might not be hidden enough, and there are constraints:

  • The data can only be literal scalars of type Str, Int, Num, or Bool, or aggregations of them in Arrays or Hashes.

  • Data can't have actual newlines in it. (You could presumably have double quoted strings with \ns in them.)

Modifying Rakudo

Aiui, presuming RakuAST lands, it'll be relatively easy to write Rakudo plugins that can do arbitrary work with a Raku module. It seems like a short hop from RakuAST macros to basic is parsed macros, which in turn seem like a short hop from extracting source code (e.g. the source of closures) as it goes through the compiler and then spitting it back out into the compiled code as data, possibly attached to Pod declarator blocks that are in turn attached to code as code.

So, perhaps just wait a year or two to see if RakuAST lands and gets the hooks you need to do what you need to do via Rakudo?

answered Oct 21 '22 by raiph