Huge program size C++ with std::regex

Tags: c++, regex, gcc, c++11

I want to compile a small C++ program with std::regex (standard regular expressions library).
Compiler: gcc/g++ 4.9.2 on Fedora 21.

#include <iostream>
#include <regex>
#include <string>   // std::string is declared in <string>, not <string.h>
using namespace std;
int main () {
  cout << "Content-Type: text/html\n\n";
  std::string s ("there is a subsequence in the string\n");
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning with "sub"
  std::cout << std::regex_replace (s,e,"sub-$2");
}

With this compiler, a program that uses std::regex does not compile without -std=c++11, so the command to build it in a terminal is:

g++ -std=c++11 code.cpp -o prog

My main question: the source code is very small, so why is the compiled program so large (480 kilobytes)?

Is it because of -std=c++11?

What happened and how can I reduce the size of the final binary program?

UPD1.

Using the -Os flag is a good way to reduce the size of the std::regex program from 480 KB to 100-120 KB. But it is strange that even the optimized binary with std::regex is much larger than the usual 7-12 kilobytes for C/C++ programs with only a few lines of source code. For example, the same regex replacement can be done with the RE2 regex library (on Fedora 21: "yum install re2-devel") in an 8.5 KB binary:

#include <iostream>
#include <string>   // std::string is declared in <string>, not <string.h>
#include "re2/re2.h"
using namespace std;
int main () {
  cout << "Content-Type: text/html\n\n";
  std::string s ("there is a subsequence in the string\n");
  // RE2 uses \2 (not $2) for the second capture group in the rewrite string
  RE2::GlobalReplace(&s, "\\b(sub)([^ ]*)", "sub-\\2");
  cout << s;
}

Compiled with:

g++ code.cpp -o prog -lre2

UPD2.

objdump -h section headers for the std::regex program (columns: Idx, Name, Size, VMA, LMA, File off, Algn):

0   .interp 00000013    08048154    08048154    00000154    2**0
1   .note.ABI-tag   00000020    08048168    08048168    00000168    2**2
2   .note.gnu.build-id  00000024    08048188    08048188    00000188    2**2
3   .gnu.hash   000000b0    080481ac    080481ac    000001ac    2**2
4   .dynsym 000006c0    0804825c    0804825c    0000025c    2**2
5   .dynstr 00000b36    0804891c    0804891c    0000091c    2**0
6   .gnu.version    000000d8    08049452    08049452    00001452    2**1
7   .gnu.version_r  000000d0    0804952c    0804952c    0000152c    2**2
8   .rel.dyn    00000038    080495fc    080495fc    000015fc    2**2
9   .rel.plt    000002b8    08049634    08049634    00001634    2**2
10  .init   00000023    080498ec    080498ec    000018ec    2**2
11  .plt    00000580    08049910    08049910    00001910    2**4
12  .text   0001f862    08049e90    08049e90    00001e90    2**4
13  .fini   00000014    080696f4    080696f4    000216f4    2**2
14  .rodata 00000dc8    08069740    08069740    00021740    2**6
15  .eh_frame_hdr   00003ab4    0806a508    0806a508    00022508    2**2
16  .eh_frame   0000f914    0806dfbc    0806dfbc    00025fbc    2**2
17  .gcc_except_table   00000e00    0807d8d0    0807d8d0    000358d0    2**2
18  .init_array 00000008    0807feec    0807feec    00036eec    2**2
19  .fini_array 00000004    0807fef4    0807fef4    00036ef4    2**2
20  .jcr    00000004    0807fef8    0807fef8    00036ef8    2**2
21  .dynamic    00000100    0807fefc    0807fefc    00036efc    2**2
22  .got    00000004    0807fffc    0807fffc    00036ffc    2**2
23  .got.plt    00000168    08080000    08080000    00037000    2**2
24  .data   00000240    08080180    08080180    00037180    2**6
25  .bss    000001b4    080803c0    080803c0    000373c0    2**6
26  .comment    0000002c    00000000    00000000    000373c0    2**0

objdump -h section headers for the RE2 program (same columns):

0   .interp 00000013    08048154    08048154    00000154    2**0
1   .note.ABI-tag   00000020    08048168    08048168    00000168    2**2
2   .note.gnu.build-id  00000024    08048188    08048188    00000188    2**2
3   .gnu.hash   00000034    080481ac    080481ac    000001ac    2**2
4   .dynsym 00000180    080481e0    080481e0    000001e0    2**2
5   .dynstr 00000298    08048360    08048360    00000360    2**0
6   .gnu.version    00000030    080485f8    080485f8    000005f8    2**1
7   .gnu.version_r  000000a0    08048628    08048628    00000628    2**2
8   .rel.dyn    00000010    080486c8    080486c8    000006c8    2**2
9   .rel.plt    00000090    080486d8    080486d8    000006d8    2**2
10  .init   00000023    08048768    08048768    00000768    2**2
11  .plt    00000130    08048790    08048790    00000790    2**4
12  .text   00000332    080488c0    080488c0    000008c0    2**4
13  .fini   00000014    08048bf4    08048bf4    00000bf4    2**2
14  .rodata 00000068    08048c08    08048c08    00000c08    2**2
15  .eh_frame_hdr   00000044    08048c70    08048c70    00000c70    2**2
16  .eh_frame   0000015c    08048cb4    08048cb4    00000cb4    2**2
17  .gcc_except_table   00000028    08048e10    08048e10    00000e10    2**0
18  .init_array 00000008    08049ee4    08049ee4    00000ee4    2**2
19  .fini_array 00000004    08049eec    08049eec    00000eec    2**2
20  .jcr    00000004    08049ef0    08049ef0    00000ef0    2**2
21  .dynamic    00000108    08049ef4    08049ef4    00000ef4    2**2
22  .got    00000004    08049ffc    08049ffc    00000ffc    2**2
23  .got.plt    00000054    0804a000    0804a000    00001000    2**2
24  .data   00000004    0804a054    0804a054    00001054    2**0
25  .bss    00000090    0804a080    0804a080    00001058    2**6
26  .comment    0000002c    00000000    00000000    00001058    2**0

The main difference is in section 12, .text: in the first case its size is 0x0001f862 (129122 bytes); in the second it is only 0x00000332 (818 bytes).

Сергей Иванопуло asked Mar 08 '15 19:03



1 Answer

In the case of RE2, most of the actual implementation is in a shared library, which doesn't become part of your executable file. It is loaded into memory separately when you run the program.

In the case of std::regex, the type is actually just an alias for std::basic_regex<char>, which is a template. The compiler instantiates the template and builds the resulting code directly into your program. Although it is possible for templates to be instantiated inside shared libraries, they often aren't, and std::basic_regex<char> is not instantiated in the shared library in your case.

Here is an example of how to create a separate shared library that holds the regex instantiation:

main.cpp:

#include <iostream>
#include <string>
#include "regex.hpp"

int main () {
  std::cout << "Content-Type: text/html\n\n";
  std::string s {"There is a subsequence in the string\n"};
  std::basic_regex<char> e {"\\b(sub)([^ ]*)"};
  std::cout << std::regex_replace (s,e,"sub-$2");
}

regex.hpp:

#include <regex>

extern template class std::basic_regex<char>;

extern template std::string
  std::regex_replace(
    const std::string&,
    const std::regex&,
    const char*,
    std::regex_constants::match_flag_type
  );

regex.cpp:

#include "regex.hpp"

template class std::basic_regex<char>;

template std::string 
  std::regex_replace(
    const std::string&,
    const std::regex&,
    const char*,
    std::regex_constants::match_flag_type
  );

build steps:

g++ -std=c++11 -Os -c -o main.o main.cpp
g++ -std=c++11 -Os -c -fpic regex.cpp
g++ -shared -o libregex.so regex.o
g++ -o main main.o libregex.so -Wl,-rpath,. -L. -lregex

On my system, the resulting file sizes are:

       main:  13936
libregex.so: 196936
Vaughn Cato answered Sep 28 '22 15:09
