grep: match all characters up to (not including) first blank space

Tags:

regex

grep

whitespace

I have a text file that has the following format:

characters(that I want to keep) (space) characters(that I want to remove)

So for example:

foo garbagetext
hello moregarbage
keepthis removethis
(etc.)

So I was trying to use the grep command in Linux to keep only the characters in each line up to and not including the first blank space. I have tried numerous attempts such as:

grep '*[[:space:]]' text1.txt > text2.txt
grep '*[^\s]' text1.txt > text2.txt
grep '/^[^[[:space:]]]+/' text1.txt > text2.txt

trying to piece together from different examples, but I have had no luck. They all produce a blank text2.txt file. I am new to this. What am I doing wrong?

*EDIT:

The parts I want to keep include capital letters. So I want to keep any/all characters up to and not including the blank space (removing everything from the blank space onward) in each line.

**EDIT:

The garbage text (that I want to remove) can contain anything, including spaces, special characters, etc. So for example:

AA rough, cindery lava [n -S]

After running grep -o '[^ ]*' text1.txt > text2.txt, the line above becomes:

AA
rough,
cindery
lava
[n
-S]

in text2.txt. (All I want to keep is AA)

SOLUTION (provided by Rohit Jain with further input by beny23):

 grep -o '^[^ ]*' text1.txt > text2.txt

723

asked Feb 03 '13 20:02

lord_sneed

2 Answers

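You are putting the quantifier * in the wrong place. In a basic regular expression a leading * is taken literally, so your first two patterns search for an actual asterisk character, which is why nothing matched.

Try this instead:

grep -o '^[^[:space:]]*' text1.txt > text2.txt

or, with GNU grep, more concisely:

grep -o '^\S*' text1.txt > text2.txt

\S matches any non-whitespace character (a GNU extension; inside a bracket expression \s is not special, so [^\s] would only exclude a backslash and the letter s). The anchor ^ matches at the beginning of the line, and -o prints only the matched part instead of the whole line.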
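To see what the anchor changes, here is a small sketch that runs the unanchored pattern from the question and the anchored pattern from the accepted solution over the same two sample lines. It assumes a grep that supports -o (GNU and BSD grep both do); the variable names and the use of a pipe instead of text1.txt are only for illustration.

```shell
# Two sample lines taken from the question's examples.
sample=$(printf '%s\n' 'foo garbagetext' 'AA rough, cindery lava [n -S]')

# Without the anchor, -o prints every run of non-space characters,
# each on its own output line.
unanchored=$(printf '%s\n' "$sample" | grep -o '[^ ]*')

# With the anchor, only the run at the start of each line can match.
anchored=$(printf '%s\n' "$sample" | grep -o '^[^ ]*')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
```

The unanchored run splits every line into its space-separated pieces, while the anchored run keeps only foo and AA.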

186

answered Nov 07 '22 00:11

Rohit Jain

I realize this has long since been answered with the grep solution, but for future generations I'd like to note that there are at least two other solutions for this particular situation, both of which are more efficient than grep.

Since you are not doing any complex text pattern matching, just taking the first column delimited by a space, you can use some of the utilities which are column-based, such as awk or cut.

Using awk

$ awk '{print $1}' text1.txt > text2.txt

Using cut

$ cut -f1 -d' ' text1.txt > text2.txt

Benchmarks on a ~1.1MB file

$ time grep -o '^[^ ]*' text1.txt > text2.txt
real    0m0.064s
user    0m0.062s
sys     0m0.001s

$ time awk '{print $1}' text1.txt > text2.txt
real    0m0.021s
user    0m0.017s
sys     0m0.004s

$ time cut -f1 -d' ' text1.txt > text2.txt
real    0m0.007s
user    0m0.004s
sys     0m0.003s

awk is about 3x faster than grep, and cut is about 3x faster than that. Again, there's not much difference for this small file for just one run, but if you're writing a script, e.g., for re-use, or doing this often on large files, you might appreciate the extra efficiency.
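The three tools are interchangeable for input shaped like the question's, but they are not identical in general. The sketch below first checks that they agree on two sample lines from the question, then shows one case where they diverge: a line that begins with spaces. That indented line is a hypothetical example, not something from the original file, and the variable names are only for illustration.

```shell
# Sample lines taken from the question's examples.
sample=$(printf '%s\n' 'foo garbagetext' 'AA rough, cindery lava [n -S]')

via_grep=$(printf '%s\n' "$sample" | grep -o '^[^ ]*')
via_awk=$(printf '%s\n' "$sample" | awk '{print $1}')
via_cut=$(printf '%s\n' "$sample" | cut -d' ' -f1)

# Hypothetical edge case: a line that starts with spaces.
edge='  indented text'
# awk's default field splitting skips leading blanks.
edge_awk=$(printf '%s\n' "$edge" | awk '{print $1}')
# cut treats the text before the first space as field 1, which is empty here.
edge_cut=$(printf '%s\n' "$edge" | cut -d' ' -f1)

printf 'grep:\n%s\nawk:\n%s\ncut:\n%s\n' "$via_grep" "$via_awk" "$via_cut"
printf 'edge awk: [%s]\nedge cut: [%s]\n' "$edge_awk" "$edge_cut"
```

If the input may contain leading spaces or tab separators, it is worth checking which of these behaviours you actually want before swapping one tool for another.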

answered Nov 07 '22 01:11

Steve
