
What is the best way to handle large data with Tensorflow.js and tf.Tensor?

Question

I am using tf.Tensor and tf.concat() to handle large training data, and I found that repeatedly calling tf.concat() gets slow. What is the best way to load large data from a file into a tf.Tensor?

Background

I think the common way to handle data in JavaScript is with arrays. To achieve that, here are the rough steps:

Steps to load data from a file into an array

  1. read a line from the file
  2. parse the line into a JavaScript object
  3. add that object to an array with Array.push()
  4. after reading the last line, use the array with a for loop

So I thought I could use tf.concat() in a similar way:

Steps to load data from a file into a tf.Tensor

  1. read a line from the file
  2. parse the line into a JavaScript object
  3. convert the object to a tf.Tensor
  4. append the tensor to the original tensor with tf.concat()
  5. after reading the last line, use that tf.Tensor

Some code

Here is some code to measure the speed of both Array.push() and tf.concat():

import * as tf from "@tensorflow/tfjs"

let t = tf.tensor1d([1])
let addT = tf.tensor1d([2])

console.time()
for (let idx = 0; idx < 50000; idx++) {
    if (idx % 1000 == 0) {
        console.timeEnd()
        console.time()
        console.log(idx)
    }
    t = tf.tidy(() => t.concat(addT))
}


let arr = []
let addA = 1
console.time()
for (let idx = 0; idx < 50000; idx++) {
    if (idx % 1000 == 0) {
        console.timeEnd()
        console.time()
        console.log(idx)
    }
    arr.push(addA)
}

Measurement

Array.push() runs at a stable speed, but tf.concat() slows down as the tensor grows.

For tf.concat()

default: 0.150ms
0
default: 68.725ms
1000
default: 62.922ms
2000
default: 23.199ms
3000
default: 21.093ms
4000
default: 27.808ms
5000
default: 39.689ms
6000
default: 34.798ms
7000
default: 45.502ms
8000
default: 94.526ms
9000
default: 51.996ms
10000
default: 76.529ms
11000
default: 83.662ms
12000
default: 45.730ms
13000
default: 89.119ms
14000
default: 49.171ms
15000
default: 48.555ms
16000
default: 55.686ms
17000
default: 54.857ms
18000
default: 54.801ms
19000
default: 55.312ms
20000
default: 65.760ms

For Array.push()

default: 0.009ms
0
default: 0.388ms
1000
default: 0.340ms
2000
default: 0.333ms
3000
default: 0.317ms
4000
default: 0.330ms
5000
default: 0.289ms
6000
default: 0.299ms
7000
default: 0.291ms
8000
default: 0.320ms
9000
default: 0.284ms
10000
default: 0.343ms
11000
default: 0.327ms
12000
default: 0.317ms
13000
default: 0.329ms
14000
default: 0.307ms
15000
default: 0.218ms
16000
default: 0.193ms
17000
default: 0.234ms
18000
default: 1.943ms
19000
default: 0.164ms
20000
default: 0.148ms
asked Apr 30 '19 by ya9do



1 Answer

While tf.concat and Array.push look and behave similarly, there is one big difference:

  • tf.concat creates a new tensor from its inputs
  • Array.push appends the input to the existing array

Examples

tf.concat

const a = tf.tensor1d([1, 2]);
const b = tf.tensor1d([3]);
const c = tf.concat([a, b]);

a.print(); // Result: Tensor [1, 2]
b.print(); // Result: Tensor [3]
c.print(); // Result: Tensor [1, 2, 3]

The resulting variable c is a new Tensor while a and b are not changed.

Array.push

const a = [1,2];
a.push(3);

console.log(a); // Result: [1,2,3]

Here, the variable a is directly changed.

Impact on the runtime

For runtime speed, this means that tf.concat copies all tensor values into a new tensor before appending the input. This takes more time the bigger the tensor that needs to be copied, so concatenating in a loop costs quadratic time overall. In contrast, Array.push does not create a copy of the array, and therefore each push takes more or less the same time no matter how big the array is.

Note that this is "by design", as tensors are immutable: every operation on an existing tensor creates a new tensor. Quote from the docs:

Tensors are immutable, so all operations always return new Tensors and never modify input Tensors.

Therefore, if you need to create a large tensor from input data, it is advisable to first read all the data from your file and merge it with "vanilla" JavaScript functions before creating a single tensor from it.

Handling data too big for memory

In case you have a dataset so big that you need to handle it in chunks because of memory restrictions, you have two options:

  1. Use the trainOnBatch function
  2. Use a dataset generator

Option 1: trainOnBatch

The trainOnBatch function lets you train on a single batch of data instead of passing the full dataset. Therefore, you can split your data into reasonable batches before training, so you don't have to merge it all together at once.

Option 2: Dataset generator

The other answer already went over the basics. A dataset generator allows you to use a JavaScript generator function to prepare the data. I recommend using generator syntax instead of an iterator factory (used in the other answer), as it is the more modern JavaScript syntax.

Example (taken from the docs):

function* dataGenerator() {
  const numElements = 10;
  let index = 0;
  while (index < numElements) {
    const x = index;
    index++;
    yield x;
  }
}

const ds = tf.data.generator(dataGenerator);

You can then use the fitDataset function to train your model.

answered Nov 15 '22 by Thomas Dondorf