
Should I keep source files in memory while parsing?

Tags: c, parsing

I'm writing the front-end part of an interpreter, and I initially disliked the idea of just dumping all the source files into memory and then referencing that text directly. So the tokenizer reads from a char buffer and builds the token stream.

However, I have reached the parsing side of things, and it hit me that I want to output nice errors and warnings that show the malformed line of source code. I guess I could put column numbers in the tokens, but then my error messages would be like getting directions by telephone: "It's in file X, on line Y, column Z, right next to the curly brace, you know the one. If you hit the semicolon, you've gone too far."

I seem to have put myself into a situation where I want to have my cake and eat it too. I want nice messages, but I don't want to hog memory.

Is there something I'm missing? Or is loading the source in memory the way to go?

that_individual asked Jun 07 '26 16:06


2 Answers

When there's an error to report to the user, it hardly matters how long in milliseconds it takes to report it.

I'd keep your tokenized stream in memory to keep your interpreter fast. (Actually, you should switch to a threaded interpreter or even a bad one-pass compiler to improve the execution rate.)

When you encounter an error, go to the disk, fetch the line(s) of interest, and show them to the user. If he doesn't make any errors, this costs you zero. If he makes a small number of errors, that may be a tiny bit inefficient, but the user won't know. If he makes a large number of errors, the contents of the files containing errors are going to be read by the OS into its local cache, which is likely bigger than your program anyway, so access will be more efficient than if the source stayed entirely on the disk.
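As a rough sketch of the "go back to the disk" step, assuming each token already carries a 1-based line and column: the helper below re-opens the file, skips to the requested line, prints it, and puts a caret under the offending column. The name `show_source_line` and its signature are made up for illustration, and it assumes the line contains no tabs before the column (otherwise the caret drifts).

```c
#include <stdio.h>

/* Hypothetical helper (not from any library): re-read `path`, print line
 * `line` (1-based) to `out`, then a caret under column `col` (1-based).
 * Returns 0 on success, -1 if the file cannot be opened or has no such line.
 * Assumption: no tabs before `col`, so one space per column lines up. */
int show_source_line(const char *path, unsigned long line,
                     unsigned long col, FILE *out)
{
    if (line == 0 || col == 0)
        return -1;

    FILE *in = fopen(path, "r");
    if (in == NULL)
        return -1;

    /* Skip ahead to the start of the requested line. */
    unsigned long current = 1;
    int c = 0;
    while (current < line && (c = fgetc(in)) != EOF) {
        if (c == '\n')
            current++;
    }

    /* The line exists only if we reached it and there is something to read. */
    if (current != line || (c = fgetc(in)) == EOF) {
        fclose(in);
        return -1;
    }

    /* Copy the line character by character, so its length is unbounded. */
    while (c != EOF && c != '\n') {
        fputc(c, out);
        c = fgetc(in);
    }
    fputc('\n', out);
    fclose(in);

    /* Caret line: col - 1 spaces, then '^'. */
    for (unsigned long i = 1; i < col; i++)
        fputc(' ', out);
    fputs("^\n", out);
    return 0;
}
```

A parser would call something like `show_source_line(tok_file, tok_line, tok_col, stderr)` right after printing the "file X, line Y, column Z" part of the message. Nothing is read from disk unless an error is actually reported.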

Ira Baxter answered Jun 09 '26 07:06


Better idea: mmap your sources in the first place, if you can. Fall back to slurping the whole file if you're reading from a pipe or something.

After parsing, you may want to call madvise(MADV_DONTNEED) (but only if it was originally mmapped) to advise the kernel to drop it from the cache (but still keep it available for errors) ... but this is probably not necessary, and may even not be a good idea, depending on your compiler design (e.g. are identifiers still pointing into the source text, or are they interned into a single, separate allocation?).
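Here is a minimal sketch of that approach, assuming a POSIX system (the `MADV_DONTNEED` behaviour described is the Linux one). The `struct source_buf` type and the `source_load`, `source_done_parsing`, and `source_free` names are invented for this example. Regular, non-empty files are mapped read-only; anything else (pipes, empty files, a failed `mmap`) falls back to reading the whole thing into a heap buffer.

```c
#define _DEFAULT_SOURCE /* for madvise() under strict -std= modes on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type for this sketch: one loaded source file. */
struct source_buf {
    char  *data;      /* file contents, NOT NUL-terminated */
    size_t len;
    int    is_mapped; /* 1 if data came from mmap, 0 if from malloc */
};

/* Load `path`, preferring mmap. Returns 0 on success, -1 on failure. */
int source_load(const char *path, struct source_buf *src)
{
    src->data = NULL;
    src->len = 0;
    src->is_mapped = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            close(fd); /* the mapping stays valid after close */
            src->data = p;
            src->len = (size_t)st.st_size;
            src->is_mapped = 1;
            return 0;
        }
    }

    /* Fallback: slurp everything (pipes, empty files, failed mmap). */
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    for (;;) {
        if (len == cap) {
            char *grown = realloc(buf, cap * 2);
            if (grown == NULL) {
                free(buf);
                close(fd);
                return -1;
            }
            buf = grown;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0) {
            free(buf);
            close(fd);
            return -1;
        }
        if (n == 0)
            break;
        len += (size_t)n;
    }
    close(fd);
    src->data = buf;
    src->len = len;
    return 0;
}

/* Optional hint after parsing: the kernel may drop the pages now and
 * re-read them from the file if an error message touches them later.
 * Only meaningful for the mmapped case; the result is ignored on purpose. */
void source_done_parsing(struct source_buf *src)
{
    if (src->is_mapped)
        (void)madvise(src->data, src->len, MADV_DONTNEED);
}

void source_free(struct source_buf *src)
{
    if (src->is_mapped)
        munmap(src->data, src->len);
    else
        free(src->data);
    src->data = NULL;
    src->len = 0;
}
```

With this layout, tokens can store plain offsets into `src->data`, and printing the offending line is just a scan backwards and forwards for `'\n'` around that offset. Note that the buffer is not NUL-terminated, so the tokenizer has to respect `len`.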

o11c answered Jun 09 '26 05:06


