How to normalize data efficiently while INSERTing into SQL table (Postgres)

I want to import a large log file into (Postgres-)SQL

Certain string columns are very repetitive; for example, the column 'event_type' has one of 10 different string values.

I have a rough understanding of normalizing data.

Firstly, is it correct to assume that it's beneficial (for storage size, indexing, and query speed) to store event_type in a separate table (possibly with a foreign key relation)?

In order to normalize I would have to check for the distinct values of event_type in the raw log and insert them into the event_types table.

There are many fields like event_type in the log.

So, secondly: is there a way to tell the database to create and maintain this kind of table when inserting the data?

Are there other strategies to accomplish this? I'm working with pandas.

Cilvic asked May 17 '14 06:05


1 Answer

This is a typical situation when starting to build a database from data hitherto stored otherwise, such as in a log file. There is a solution - as usual - but it is not a very fast one. Perhaps you can write a log message handler to process messages as they come in; provided the flux (messages/second) is not too large you won't notice the overhead, especially if you can forget about writing the message to a flat text file.

Firstly, on the issue of normalization: yes, you should always normalize, and to the so-called 3rd Normal Form (3NF). This basically means that any piece of real-world data (such as your event_type) is stored once and once only. (There are cases where you could relax this a little and go only to 2NF - usually when the real-world data requires very little storage, such as an ISO country code or an M/F (male/female) choice - but in most other cases 3NF will be better.)

In your specific case, let's say that your event_type is a char(20) column. Ten such event types with their corresponding int codes easily fit on a single database page (8kB by default in Postgres). If you have 1,000 log messages that each carry event_type as a char(20), you need roughly 20kB just to store that one column; stored as an int foreign key, the same information takes about 4kB. If your log messages contain other repetitive items like this, the storage reduction grows accordingly. Items such as dates or timestamps can likewise be stored in their native formats (4 and 8 bytes, respectively) for smaller storage, better performance and added functionality (such as comparing dates or selecting ranges).
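
If you want to check these numbers on your own data, Postgres has built-in size functions you can use for a quick comparison; a minimal sketch (the sample value and the table name are just illustrations):

-- Size in bytes of a padded char(20) value versus the integer code that replaces it
SELECT pg_column_size('user_login'::char(20)) AS char20_bytes,
       pg_column_size(42::integer)            AS int_bytes;   -- the integer is 4 bytes

-- After loading, the total on-disk footprint of a table (data, indexes, TOAST):
SELECT pg_size_pretty(pg_total_relation_size('log_message'));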

Secondly, you cannot tell the database to create such tables; you have to do that yourself. But once they exist, a stored procedure can parse your log messages and put the data into the right tables.

In the case of log messages, you can do something like this (assuming you want to do the parsing in the database rather than in Python):

CREATE FUNCTION ingest_log_message(mess text) RETURNS int AS $$
DECLARE
  parts  text[];
  et_id  int;
  log_id int;
BEGIN
  parts := regexp_split_to_array(mess, ','); -- Whatever your delimiter is

  -- Assuming:
  --   parts[1] is a timestamp
  --   parts[2] is your event_type
  --   parts[3] is the actual message

  -- Get the event_type identifier. If event_type is new, INSERT it, else just get the id.
  -- Do likewise with other log message parts whose unique text is located in a separate table.
  SELECT id INTO et_id
  FROM event_type
  WHERE type_text = parts[2];
  IF NOT FOUND THEN
    INSERT INTO event_type (type_text)
    VALUES (parts[2])
    RETURNING id INTO et_id;
  END IF;

  -- Now insert the log message
  INSERT INTO log_message (dt, et, msg)
  VALUES (parts[1]::timestamp, et_id, parts[3])
  RETURNING id INTO log_id;

  RETURN log_id;
END; $$ LANGUAGE plpgsql STRICT;

The tables you need for this are:

CREATE TABLE event_type (
  id        serial PRIMARY KEY,
  type_text char(20) UNIQUE   -- each distinct event type is stored only once
);

and

CREATE TABLE log_message (
  id        serial PRIMARY KEY,
  dt        timestamp,
  et        integer REFERENCES event_type,
  msg       text
);

You can then invoke this function as a simple SELECT statement, which will return the id of the newly inserted log message:

SELECT * FROM ingest_log_message(the_message);
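
For bulk-importing a whole log file, one approach (just a sketch; the staging table name and file path are assumptions) is to copy the raw lines into a staging table first and then let the function process them in a single pass:

-- Staging table holding one raw log line per row
CREATE TABLE raw_log (line text);

-- From psql, \copy reads the file on the client side:
--   \copy raw_log FROM 'mylog.txt'

-- Ingest everything in one pass
SELECT ingest_log_message(line) FROM raw_log;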

Note that the elements of the parts array are used directly in the SQL statements inside the function. PL/pgSQL passes such variables to the query as bound values rather than splicing them into the SQL text, which has two important consequences: (1) quotes inside the strings (words like "isn't") cannot break the statement; and (2) maliciously crafted log messages cannot inject SQL. There is therefore no need to escape the values manually with quote_literal().
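
As an aside, if you are on PostgreSQL 9.5 or later, the look-up-or-insert block inside the function can be written more compactly with INSERT ... ON CONFLICT. A sketch of that variant, relying on the UNIQUE constraint on type_text shown above:

-- Insert the event type if it is new, do nothing if it already exists,
-- then fetch the id either way.
INSERT INTO event_type (type_text)
VALUES (parts[2])
ON CONFLICT (type_text) DO NOTHING;

SELECT id INTO et_id
FROM event_type
WHERE type_text = parts[2];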

All of the above obviously needs to be tailored to your specific situation.

Patrick answered Oct 04 '22 20:10