I know my question has a lot of answers on the internet, but it seems I can't find a good answer for it, so I will try to explain what I have and hope for the best.
What I'm trying to do is read a big JSON file that might have a more complex structure (nested objects with big arrays) than this, but as a simple example:
{
"data": {
"time": [
1,
2,
3,
4,
5,
...
],
"values": [
1,
2,
3,
4,
6,
...
]
}
}
This file might be 200 MB or more, and I'm using file_get_contents() and json_decode() to read the data from the file.
Then I put the result in a variable and loop over the time array, using the current index to get the corresponding value from the values array, and save the time and the value in the database. But this takes a lot of CPU and memory. Is there a better way to do this:
better functions to use, a better JSON structure to use, or maybe a better data format than JSON?
my code:
$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);
// The arrays live under the top-level "data" key of the example above.
foreach ($data["data"]["time"] as $timeIndex => $timeValue) {
    saveInDataBase($timeValue, $data["data"]["values"][$timeIndex]);
}
thanks in advance for any help
Update 06/29/2020:
I have another, more complex JSON structure example:
{
"data": {
"set_1": {
"sub_set_1": {
"info_1": {
"details_1": {
"data_1": [1,2,3,4,5,...],
"data_2": [1,2,3,4,5,...],
"data_3": [1,2,3,4,5,...],
"data_4": [1,2,3,4,5,...],
"data_5": 10254552
},
"details_2": [
[1,2,3,4,5,...],
[1,2,3,4,5,...],
[1,2,3,4,5,...],
]
},
"info_2": {
"details_1": {
"data_1": {
"arr_1": [1,2,3,4,5,...],
"arr_2": [1,2,3,4,5,...]
},
"data_2": {
"arr_1": [1,2,3,4,5,...],
"arr_2": [1,2,3,4,5,...]
},
"data_5": {
"text": "some text"
}
},
"details_2": [1,2,3,4,5,...]
}
}, ...
}, ...
}
}
The file size might be around 500 MB or more, and the arrays inside this JSON file might hold around 100 MB of data or more.
My question is: how can I get any piece of this data and navigate between its nodes in the most efficient way, one that will not take much RAM and CPU? I can't read the file line by line because I need to be able to get any piece of data when I have to.
Is Python, for example, more suitable and more efficient than PHP for handling data this big?
Please, if you can, provide a detailed answer; I think it will be a big help for everyone looking to do this big data stuff with PHP.
JSON is a great format and a much better alternative to XML. In the end, JSON is almost one-to-one convertible to XML and back.
Big files can get bigger, so we don't want to read everything into memory and we don't want to parse the whole file. I had the same issue with XXL-size JSON files.
I think the issue lies not in a specific programming language, but in the implementation and the specifics of the formats.
I have 3 solutions for you:
1. Use a streaming parser. There is a library, https://github.com/pcrov/JsonReader, that is almost as fast as the streamed XMLReader. Example:
use pcrov\JsonReader\JsonReader;
$reader = new JsonReader();
$reader->open("data.json");
// Advance to each node named "type" and print its value.
while ($reader->read("type")) {
    echo $reader->value(), "\n";
}
$reader->close();
This library does not read the whole file into memory or parse all of its lines. It traverses the tree of the JSON object step by step, on command.
2. Preprocess the file into a different format such as XML or CSV. There are very lightweight Node.js libraries, like https://www.npmjs.com/package/json2csv, that convert JSON to CSV.
3. Import the data into a database, for example Redis or CouchDB (the JSON file can be imported into CouchDB).
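To illustrate the preprocessing option: once the data has been converted to CSV, PHP can walk it one row at a time with fgetcsv() instead of loading everything. This is only a sketch. The function name readTimeValueRows(), the temp file, and the "time,value" column layout are assumptions for the example, not something the question's file is known to have.

```php
<?php
/**
 * Yield [time, value] pairs from a CSV file, one row at a time.
 * Assumes a header row followed by "time,value" rows (hypothetical layout).
 */
function readTimeValueRows(string $csvPath): Generator
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }
    try {
        fgetcsv($handle, 0, ',', '"', '\\'); // skip the header row
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // fgetcsv() reports a blank line as [null]
            }
            yield [(int) $row[0], (float) $row[1]];
        }
    } finally {
        fclose($handle); // runs when the generator finishes or is destroyed
    }
}

// Demo with a tiny stand-in file; a real export would be far larger.
$csvPath = tempnam(sys_get_temp_dir(), 'ts_');
file_put_contents($csvPath, "time,value\n1,1\n2,2\n3,3\n");

$rows = [];
foreach (readTimeValueRows($csvPath) as [$time, $value]) {
    $rows[] = [$time, $value]; // a call like saveInDataBase($time, $value) would go here
}
unlink($csvPath);
```

Because the generator holds only one row at a time, memory use should stay roughly flat no matter how many rows the file has.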
Your problem is basically related to the memory management performed by whichever programming language you use to access the data in a huge (storage-purpose) file.
For example, when you combine the operations in a single statement, as in the code you mentioned (below),
$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);
what happens is that the memory used by the Zend engine at runtime increases too much, because it has to allocate memory for every piece of file handling involved in the statement: not only the opened file itself but also a pointer to it, until the file is finally overwritten and the memory buffer is released (freed) again. It's no wonder that if you force the execution of both file_get_contents(), which reads the file into a string, and json_decode() in one statement, you force the interpreter to keep all 3 "things" in memory: the file itself, the reference created (the string), and the decoded structure (the JSON data).
On the contrary, if you break the statement into several, the memory held by the first data structure (the file) is unloaded once the operation of getting its content and writing it into another variable (or file) has been fully performed. As long as you don't define a variable in which to save the data, it stays in memory as a blob with no name and no storage address, just content. For this reason it is much more CPU- and RAM-effective, when working with big data, to break everything into smaller steps.
So you first have to start by simply rewriting your code as follows:
$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));
$data = json_decode($somefile, true);
When the first line has executed, the memory held for ts_big_data.json is released (think of it as being purged and made available again to other processes).
When the second line has executed, $somefile's memory buffer can be released too, although PHP keeps the string alive until the variable is unset or goes out of scope, so call unset($somefile) once the decode is done. The takeaway is that instead of always having 3 memory buffers in use just to store the data structures, you'll only have 2 at any time, ignoring of course the other memory used to actually construct the file. Not to mention that when working with arrays (and JSON files are exactly that, arrays), the dynamically allocated memory increases dramatically and not linearly as we might tend to think. The bottom line is that instead of a 50% loss in performance just on storage allocation for the files (3 big files take 50% more space than 2 of them), it is better to handle the execution of the functions 'touching' these huge files in smaller steps.
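Rather than taking the buffer counts on trust, it may be worth measuring them. Below is a minimal sketch using memory_get_usage() and memory_get_peak_usage() around the two steps. The temp file is a small stand-in for ts_big_data.json (an assumption, so the example runs anywhere). Keep in mind that while json_decode() runs, the raw string and the decoded array are both alive, so the peak is reached there whichever way the statement is written.

```php
<?php
// Small stand-in for the big file; the real path would come from storage_path().
$path = tempnam(sys_get_temp_dir(), 'json_');
file_put_contents($path, json_encode([
    'data' => ['time' => range(1, 50000), 'values' => range(1, 50000)],
]));

$before = memory_get_usage();

$somefile = file_get_contents($path);   // the raw JSON string is now in memory
$afterRead = memory_get_usage();

$data = json_decode($somefile, true);   // string and decoded array are both alive here
$afterDecode = memory_get_usage();

unset($somefile);                       // explicitly release the raw string
$afterUnset = memory_get_usage();

printf(
    "read: +%d bytes, decode: +%d bytes, freed by unset: %d bytes, peak: %d bytes\n",
    $afterRead - $before,
    $afterDecode - $afterRead,
    $afterDecode - $afterUnset,
    memory_get_peak_usage()
);

unlink($path);
```

Running this against a realistic file should show how much the decoded array costs compared with the raw string, which is usually the number that decides whether json_decode() is viable at all.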
To understand this, imagine that you access only what is needed at a given moment in time (a principle similar to YAGNI, You Aren't Gonna Need It, from Extreme Programming practices; see https://wiki.c2.com/?YouArentGonnaNeedIt), something inherited from the old C or COBOL days.
The next approach to follow is to break the file into more pieces, but in a structured way (a relational, dependent data structure), as in a database table or tables.
Obviously, you have to save the data pieces again, as blobs, in the database. The advantage is that retrieving data from a DB is much faster than from a file, thanks to the indexes the SQL engine maintains when generating and updating the tables. A table with one or two indexes can be accessed lightning fast by a structured query. Again, the indexes are pointers to the main storage of the data.
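As a rough illustration of the indexed-table idea, here is a sketch that stores time/value pairs as rows and looks one up by time. It assumes the pdo_sqlite extension is available and uses an in-memory SQLite database. The table name samples and its columns are made up for the example, and it stores plain rows rather than blobs.

```php
<?php
// In-memory SQLite database, assuming the pdo_sqlite extension is installed.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The primary key on "time" doubles as the index used for lookups.
$pdo->exec('CREATE TABLE samples (time INTEGER PRIMARY KEY, value REAL NOT NULL)');

// Insert inside one transaction; row-by-row commits are much slower.
$insert = $pdo->prepare('INSERT INTO samples (time, value) VALUES (?, ?)');
$pdo->beginTransaction();
foreach ([[1, 1], [2, 2], [3, 3], [4, 4], [5, 6]] as [$time, $value]) {
    $insert->execute([$time, $value]);
}
$pdo->commit();

// Fetch a single value by time without touching the other rows.
$select = $pdo->prepare('SELECT value FROM samples WHERE time = ?');
$select->execute([5]);
$value = $select->fetchColumn();

$rowCount = (int) $pdo->query('SELECT COUNT(*) FROM samples')->fetchColumn();
```

The same shape carries over to MySQL or PostgreSQL through PDO; only the connection string and column types would change.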
One important point, however, if you still want to work with JSON (as the content and type of data storage, instead of tables in a DB), is that you cannot update it locally without changing it globally. I am not sure what you meant by reading the time-related function values in the JSON file. Do you mean that your JSON file is continuously changing? Better to break it into several tables so each one can change without affecting the whole mega structure of the data. Easier to manage, easier to maintain, easier to locate the changes.
My understanding is that the best solution would be to split the file into several JSON files from which you strip the values you don't need. By the way, do you actually need all the stored data?
I won't come up with code now unless you explain the above issues to me (so we can have a conversation), and thereafter I will edit my answer accordingly. Yesterday I wrote a question related to handling blobs, and storing them on the server, in order to accelerate the execution of a data update on a server using a cron process. My data was about 25 MB+, not 500 MB+ as in your case; however, I must understand the use case for your situation.
One more thing: how was the file that you must process created? Why do you manage only its final form instead of intervening in how it keeps being fed with data? My opinion is that you might stop storing data in it as before (and thus stop adding to your pain), turn it from now on into historic data storage only, and store future data in something more elastic (such as MongoDB or other NoSQL databases).
Probably what you need first is not so much code as a solid and useful strategy and way of working with your data.
Programming comes last, after you have decided on the detailed architecture of your web project.
My approach would be to read the JSON file in chunks.
If these JSON objects have a consistent structure, you can easily detect when a JSON object in the file starts and ends.
Once you collect a whole object, you insert it into a DB, then go on to the next one.
There isn't much more to it. The algorithm to detect the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.
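One possible way to sketch that start/end detection is to read fixed-size chunks and track brace depth, skipping anything inside string literals. This is an illustration under assumptions, not the code the answer refers to. The function name readJsonObjects() is made up, and it only helps when the file is a sequence (or array) of many smaller objects; a file that is one giant object would still be collected whole.

```php
<?php
/**
 * Yield each complete top-level {...} object found in a file, reading it in chunks.
 */
function readJsonObjects(string $path, int $chunkSize = 8192): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $buffer = '';      // text of the object currently being collected
    $depth = 0;        // current nesting level of braces
    $inString = false; // true while inside a JSON string literal
    $escaped = false;  // true right after a backslash inside a string
    try {
        while (($chunk = fread($handle, $chunkSize)) !== false && $chunk !== '') {
            $length = strlen($chunk);
            for ($i = 0; $i < $length; $i++) {
                $char = $chunk[$i];
                if ($depth > 0) {
                    $buffer .= $char; // inside an object: keep every byte
                }
                if ($inString) {
                    if ($escaped) {
                        $escaped = false;
                    } elseif ($char === '\\') {
                        $escaped = true;
                    } elseif ($char === '"') {
                        $inString = false;
                    }
                    continue; // braces inside strings do not count
                }
                if ($char === '"') {
                    $inString = true;
                } elseif ($char === '{') {
                    if ($depth === 0) {
                        $buffer = '{'; // a new top-level object starts here
                    }
                    $depth++;
                } elseif ($char === '}' && $depth > 0) {
                    $depth--;
                    if ($depth === 0) {
                        yield json_decode($buffer, true);
                        $buffer = '';
                    }
                }
            }
        }
    } finally {
        fclose($handle);
    }
}

// Demo: a tiny chunk size forces objects to span several chunks.
$objects = [
    ['id' => 1, 'note' => 'a } in a string'],
    ['id' => 2, 'tags' => ['x' => '"q"']],
];
$path = tempnam(sys_get_temp_dir(), 'objs_');
file_put_contents($path, json_encode($objects));
$decoded = iterator_to_array(readJsonObjects($path, 7), false);
unlink($path);
```

Each yielded object can be inserted into the DB and then dropped, so only one object plus one chunk is held in memory at a time.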
Above answer is taken from => Parse large JSON file
Please see the below references, it can be helpful for your case
=> https://laracasts.com/discuss/channels/general-discussion/how-to-open-a-28-gb-json-file-in-php