Optimizing a file content parser class written in TypeScript

I have a TypeScript module (used by a VS Code extension) which accepts a directory and parses the content of the files inside it. For directories containing a large number of files this parsing takes a while, so I would like some advice on how to optimize it.

I don't want to copy/paste the entire class files, so I will be using mock pseudocode containing the parts that I think are relevant.

class Parser {
    private dir: string;

    constructor(_dir: string) {
        this.dir = _dir;
    }

    async parse() {
        let tree: any = getFileTree(this.dir);

        try {
            let parsedObjects: MyDTO[] = await this.iterate(tree.children);
        } catch (err) {
            console.error(err);
        }
    }

    async iterate(children: any[]): Promise<MyDTO[]> {
        let objs: MyDTO[] = [];

        for (let i = 0; i < children.length; i++) {
            let child: any = children[i];

            if (child.type === Constants.FILE) {
                let dto: FileDTO = await this.heavyFileProcessingMethod(child.path); // this takes time; assuming child.path holds the file path
                objs.push(dto);
            } else {
                // child is a folder
                let dtos: MyDTO[] = await this.iterate(child.children); // recurse into the folder
                let dto: FolderDTO = new FolderDTO();
                dto.files = dtos.filter(item => item instanceof FileDTO);
                dto.folders = dtos.filter(item => item instanceof FolderDTO);
                objs.push(dto);
            }
        }

        return objs;
    }

    async heavyFileProcessingMethod(file: string): Promise<FileDTO> {
        let content: string = readFile(file); // util method to synchronously read file content using fs

        return new FileDTO(await this.parseFileContent(content));
    }

    async parseFileContent(content: string): Promise<any[]> {
        // parsing happens here and the file content is parsed into separate blocks
        let ast: any = await convertToAST(content); // uses an asynchronous method of an external dependency to convert content to AST
        let blocks = parseToBlocks(ast); // synchronous method called to convert AST to blocks

        return await this.processBlocks(blocks);
    }

    async processBlocks(blocks: any[]): Promise<any[]> {
        for (let i = 0; i < blocks.length; i++) {
            let block: Block = blocks[i];
            if (block.condition === true) {
                // this can take some time because if this condition is true, some external assets will be downloaded (via internet) 
                // on to the caller's machine + some additional processing takes place
                await processBlock(block);
            }
        }
        return blocks;
    }
}

I'm still sort of a beginner to TypeScript/Node.js. I am looking for a multithreading/Java-esque solution here if possible. In Java terms, this.heavyFileProcessingMethod would be an instance of Callable, and these objects would be pushed into a List<Callable> which would then be executed in parallel by an ExecutorService returning a List<Future<Object>>.

Basically I want all files to be processed in parallel, but the method must wait for all of them to finish before returning (so the entire iterate method takes only about as long as the largest file takes to parse).

I've been reading about running tasks in worker threads in Node.js; can something like this be used in TypeScript as well? If so, can it be used in this situation? If my Parser class needs to be refactored to accommodate this change (or any other suggested change), that's no issue.

EDIT: Using Promise.all

iterate(children: any[]): Promise<MyDTO>[] {
    let promises: Promise<MyDTO>[] = [];

    for (let i = 0; i < children.length; i++) {
        let child: any = children[i];

        if (child.type === Constants.FILE) {
            let promise: Promise<FileDTO> = this.heavyFileProcessingMethod(child.path); // this takes time; assuming child.path holds the file path
            promises.push(promise);
        } else {
            // child is a folder
            let dtos: Promise<MyDTO>[] = this.iterate(child.children); // recurse into the folder
            let promise: Promise<FolderDTO> = this.getFolderPromise(dtos);
            promises.push(promise);
        }
    }

    return promises;
}

async getFolderPromise(promises: Promise<MyDTO>[]): Promise<FolderDTO> {
    return Promise.all(promises).then(dtos => {
        let dto: FolderDTO = new FolderDTO();
        dto.files = dtos.filter(item => item instanceof FileDTO);
        dto.folders = dtos.filter(item => item instanceof FolderDTO);
        return dto;
    });
}
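
For completeness, here is how I assume parse() would then consume the promises. This part is not from my original class, just a sketch of the caller, assuming iterate now returns the array of promises synchronously as above:

async parse(): Promise<MyDTO[]> {
    let tree: any = getFileTree(this.dir);

    try {
        // iterate() now returns an array of promises; Promise.all waits for all of them
        return await Promise.all(this.iterate(tree.children));
    } catch (err) {
        console.error(err);
        return [];
    }
}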


1 Answer

First: TypeScript is really JavaScript

TypeScript is just JavaScript with static type checking, and those static types are erased when the TS is transpiled to JS. Since your question is about algorithms and runtime language features, TypeScript has no bearing; your question is a JavaScript one. So right off the bat, that tells us that the answer to

Been reading on running tasks in worker threads in NodeJS, can something like this be used in TypeScript as well?

is YES.

As to the second part of your question,

can it be used in this situation?

the answer is YES, but...

Second: use Worker Threads only if the task is CPU-bound

Can does not necessarily mean you should. It depends on whether your processes are IO-bound or CPU-bound. If they are IO-bound, you're most likely far better off relying on JavaScript's longstanding asynchronous programming model (callbacks, Promises). But if they are CPU-bound, then using Node's relatively new support for thread-based parallelism is more likely to result in throughput gains. See Node.js Multithreading!, though I think this one is better: Understanding Worker Threads in Node.js.

While worker threads are lighter weight than Node's previous option for parallelism (spawning child processes), they are still relatively heavyweight compared to threads in Java. Each worker runs in its own Node VM, and regular variables are not shared (you have to use special data types and/or message passing to share data). It had to be done this way because JavaScript is designed around a single-threaded programming model. It's extremely efficient within that model, but that design makes support for multithreading harder. Here's a good SO answer with useful info for you: https://stackoverflow.com/a/63225073/8910547
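
For illustration, here is a minimal worker_threads sketch of what the CPU-bound route could look like. The file names, the workerData shape and doCpuHeavyParsing are placeholders for your own parsing logic, not code from your class:

// main thread (e.g. parser.ts) -- wraps one worker run in a Promise
import { Worker } from 'worker_threads';
import * as path from 'path';

function parseFileInWorker(filePath: string): Promise<any> {
    return new Promise((resolve, reject) => {
        const worker = new Worker(path.resolve(__dirname, 'parse-worker.js'), {
            workerData: { filePath }, // plain data is copied into the worker, not shared
        });
        worker.on('message', resolve); // result posted back by the worker
        worker.on('error', reject);
        worker.on('exit', code => {
            if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
        });
    });
}

And the worker file:

// parse-worker.ts (compiled to parse-worker.js) -- runs in its own thread and Node VM
import { parentPort, workerData } from 'worker_threads';
import { readFileSync } from 'fs';

const content = readFileSync(workerData.filePath, 'utf8');
const blocks = doCpuHeavyParsing(content); // your AST/block parsing would go here
parentPort!.postMessage({ filePath: workerData.filePath, blocks });

You would still collect the results on the main thread with Promise.all, exactly like the Promise-only version, and in practice you would cap the number of workers (a small pool) rather than spawning one per file.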

My guess is that your parsing is more IO-bound, and that the overhead of spawning worker threads will outweigh any gains. But give it a go, and it will be a learning experience. :)
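
If it does turn out to be IO-bound, the async route mostly boils down to swapping your synchronous readFile for a non-blocking one and firing the reads concurrently. A rough sketch, assuming Node's built-in fs/promises and borrowing FileDTO / parseFileContent from your pseudocode:

import { readFile } from 'fs/promises';

// read and parse every file concurrently; resolves once all of them have finished
async function parseFilesConcurrently(filePaths: string[]): Promise<FileDTO[]> {
    return Promise.all(
        filePaths.map(async (filePath) => {
            const content = await readFile(filePath, 'utf8'); // non-blocking read
            return new FileDTO(await parseFileContent(content)); // your existing parsing step
        })
    );
}

The event loop overlaps the file reads and the awaited downloads in processBlocks, which is usually all the concurrency an IO-bound workload needs.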
