Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to search file for byte patterns in Elixir

Tags:

elixir

id3

I am searching for id3 tags in a song file. A file can have id3v1, id3v1 extended tags (located at the end of the file) as well as id3v2 tags (usually located at the beginning). For the id3v1 tags, I can use File.read(song_file) and pull out the last 355 bytes (128 + 227 for the extended tag). However, for the id3v2 tags, I need to search through the file from the beginning looking for a 10 byte id3v2 pattern. I want to avoid any overhead from opening and closing the same file repeatedly as I search for the different tags, so I thought the best way would be to use File.stream!(song_file) and send the file stream to different functions to search for the different tags.

def parse(file_name) do
  file_stream = File.stream!(file_name, [], 1)
  id3v1_tags(file_stream)
  |> add_tags(id3v2_tags(file_stream))
end

def id3v1_tags(file_stream) do
  tags = Tags%{} #struct containing desired tags
  << id3_extended_tag :: binary-size(227), id3_tag :: binary-size(128) >> = Stream.take(file_stream, -355)
  id3_tag = to_string(id3_tag)
  if String.slice(id3_tag,0, 3) == "TAG" do
    Map.put(tags, :title, String.slice(id3_tag, 3, 30))
    Map.put(tags, :track_artist, String.slice(id3_tag, 33, 30))
    ...
  end
  if String.slice(id3_extended_tag, 0, 4) == "TAG+" do
    Map.put(tags, :title, tags.title <> String.slice(id3_extended_tag, 4, 60))
    Map.put(tags, :track_artist, tags.track_artist <> String.slice(id3_extended_tag, 64, 60))
    ...
  end
end

def id3v2_tags(file_stream) do
  search for pattern:
  <<0x49, 0x44, 0x33, version1, version2, flags, size1, size2, size3, size4>>
end

1) Am I saving any runtime by creating the File.stream! once and sending it to the different functions (I will be scanning tens of thousands of files, so saving a bit of time is important)? Or should I just use File.read for the id3v1 tags and File.stream! for the id3v2 tags?

2) I get an error in the line:

  << id3_extended_tag :: binary-size(227), id3_tag :: binary-size(128) >> = Stream.take(file_stream, -355)

because Stream.take(file_stream, -355) is a function, not a binary. How do I turn it into a binary that I can pattern match?

like image 881
Paul B Avatar asked May 27 '15 11:05

Paul B


1 Answers

I believe your implementation is being unnecessarily complex due to the reliance on stream. Make it work, make it pretty then make it fast (but only if necessary).

For simplicity, I would first load everything into memory. Just use File.read!/1. Then you can use the functions in :binary module to search for patterns (:binary.match/2), split it (:binary.split/2) or grab a certain part (:binary.part/3). There is no need to mix File.stream and File.read too, just read it once and pass that same binary around.

Also, very important, don't use the String module. String is meant to work UTF-8 encoded binaries. You want to use the :binary module for all byte level operation.

Finally, Stream.take/2 always returns functions as it is lazy. You want to use Enum.take/2 instead (it accepts streams as streams are also enumerables). Although, as I said, I would skip the stream stuff altogether.

like image 50
José Valim Avatar answered Oct 11 '22 00:10

José Valim