
Bash: append to a file from multiple threads

I'm working with big data and trying to parallelize my processing functions. I can use several threads and process every user in a different thread (I have 200k users).

Every thread should append the first n lines of the file it produces to an output file shared between all the threads.

I wrote a Java program that executes head -n 256 thread_processed.txt >> output (every thread will do this).

I need the output file to be written atomically.

If thread A writes lines 0 to 9 and thread B writes lines 10 to 19, the output should be [0...9 10...19]. Lines can't be interleaved; it can't be something like [0 1 2 17 18 3 4 ...].

How can I manage concurrent write access to the output file in a bash script?

Progeny asked Feb 06 '17 19:02

1 Answer

sem from GNU Parallel should be able to do it:

sem --id mylock "head -n 256 thread_processed.txt >> output"

This runs the command under a mutex named mylock, so only one command holding that id runs at a time.

If you are concerned that someone might read output while head is running:

sem --id mylock "cp output o2; head -n 256 thread_processed.txt >> o2; mv o2 output"
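
If GNU Parallel is not available, the same mutual exclusion can be sketched with flock from util-linux. This is an alternative technique, not the one used in the answer above, and it assumes flock is installed. The scratch directory, the lock file name output.lock, the helper append_head, and the five simulated workers are all hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: serializes appends with flock(1) from util-linux
# (an assumption; the answer itself uses sem from GNU Parallel).
set -euo pipefail

workdir=$(mktemp -d)              # hypothetical scratch directory
output="$workdir/output"          # shared output file
lockfile="$workdir/output.lock"   # hypothetical lock file name

# Append the first n lines of a file to the shared output while holding
# an exclusive lock, so blocks from different writers cannot interleave.
append_head() {
    local n=$1 src=$2
    (
        flock -x 9                        # wait until we own the lock on fd 9
        head -n "$n" "$src" >> "$output"
    ) 9> "$lockfile"                      # lock is released when fd 9 closes
}

# Simulate 5 concurrent workers, each with its own 10-line input file.
for w in 1 2 3 4 5; do
    seq 1 10 | sed "s/^/worker$w line /" > "$workdir/in$w.txt"
    append_head 10 "$workdir/in$w.txt" &
done
wait

# Total number of lines written by all workers.
wc -l < "$output"
```

The order in which the workers' blocks appear is not deterministic, but each block should stay contiguous. From Java, each thread could run the locked command through a shell in the same way it currently runs head.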
Ole Tange answered Oct 05 '22 06:10
