How to read a huge CSV file with 29 million rows of data using .NET

I have a huge CSV file, specifically a tab-delimited .TAB file, with 29 million rows and a file size of around 600 MB. I need to read it into an IEnumerable collection.

I have tried CsvHelper, GenericParser, and a few other solutions, but I always end up with an OutOfMemoryException.

Please suggest a way to do this.

I have tried

var deliveryPoints = new List<Point>();

using (TextReader csvreader1 = File.OpenText(@"C:\testfile\Prod\PCDP1705.TAB")) //StreamReader csvreader1 = new StreamReader(@"C:\testfile\Prod\PCDP1705.TAB"))
using (var csvR1 = new CsvReader(csvreader1, csvconfig))
{
     csvR1.Configuration.RegisterClassMap<DeliveryMap>();
     deliveryPoints = csvR1.GetRecords<Point>().ToList();
}

using (GenericParser parser = new GenericParser())
{
     parser.SetDataSource(@"C:\testfile\Prod\PCDP1705.TAB");

     parser.ColumnDelimiter = '\t';
     parser.FirstRowHasHeader = false;
     //parser.SkipStartingDataRows = 10;
     //parser.MaxBufferSize = 4096;
     //parser.MaxRows = 500;
     parser.TextQualifier = '\"';

     while (parser.Read())
     {
         var address = new Point();
         address.PostCodeID = int.Parse(parser[0]);
         address.DPS = parser[1];
         address.OrganisationFlag = parser[2];
         deliveryPoints.Add(address);
     }
}

and

var deliveryPoints = new List<Point>();
using var csvreader = new StreamReader(@"C:\testfile\Prod\PCDP1705.TAB");
using var csv = new CsvReader(csvreader, csvconfig);

while (csv.Read())
{
     var address = new Point();
     address.PostCodeID = int.Parse(csv.GetField(0));
     address.DPS = csv.GetField(1);                
     deliveryPoints.Add(address);
}
Lee · asked Jun 05 '17




2 Answers

The problem is that you are loading the entire file into memory. You could compile your code for x64, which raises your program's addressable memory limit significantly, but it is better to avoid loading the whole file into memory in the first place.

Notice that calling ToList() forces CsvReader to materialize the entire file in memory at once:

csvR1.GetRecords<Point>().ToList();

But iterating the enumerable directly loads only one record at a time:

foreach(var record in csvR1.GetRecords<Point>())
{
    //do whatever with the single record
}

This way you can process files of virtually unlimited size, because memory usage stays constant regardless of row count.
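For example, a sketch of this streaming approach that processes records in fixed-size batches instead of one giant list (the exact CsvConfiguration shape varies between CsvHelper versions, and ProcessBatch is a hypothetical placeholder for whatever per-batch work you need, such as a bulk database insert):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;

class Program
{
    static void Main()
    {
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            Delimiter = "\t",        // .TAB file is tab-delimited
            HasHeaderRecord = false,
        };

        using var reader = File.OpenText(@"C:\testfile\Prod\PCDP1705.TAB");
        using var csv = new CsvReader(reader, config);

        var batch = new List<Point>(10_000);
        foreach (var record in csv.GetRecords<Point>()) // lazy: one row at a time
        {
            batch.Add(record);
            if (batch.Count == 10_000)
            {
                ProcessBatch(batch); // e.g. bulk-insert into a database
                batch.Clear();       // memory stays bounded by the batch size
            }
        }
        if (batch.Count > 0) ProcessBatch(batch); // flush the final partial batch
    }

    static void ProcessBatch(IReadOnlyCollection<Point> batch)
    {
        Console.WriteLine($"Processed {batch.Count} rows");
    }
}
```

Because GetRecords<Point>() is deferred, only the current batch ever lives on the heap at once.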

Liero · answered Sep 20 '22


No need to use third-party software; the .NET class library methods are enough:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Data;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            using (StreamReader csvreader = new StreamReader(@"C:\testfile\Prod\PCDP1705.TAB"))
            {
                string inputLine;
                while ((inputLine = csvreader.ReadLine()) != null)
                {
                    var address = new Point();
                    // the .TAB file is tab-delimited, so split on '\t', not ','
                    string[] csvArray = inputLine.Split('\t');
                    address.postCodeID = int.Parse(csvArray[0]);
                    address.DPS = csvArray[1];
                    Point.deliveryPoints.Add(address);
                }
            }

            //add data to datatable
            DataTable dt = new DataTable();
            dt.Columns.Add("Post Code", typeof(int));
            dt.Columns.Add("DPS", typeof(string));

            foreach (Point point in Point.deliveryPoints)
            {
                dt.Rows.Add(new object[] { point.postCodeID, point.DPS });
            }

        }
    }
    public class Point
    {
        public static List<Point> deliveryPoints = new List<Point>();
        public int postCodeID { get; set; }
        public string DPS { get; set; }

    }
}
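If even the intermediate List<Point> is too large for 29 million rows, File.ReadLines (unlike File.ReadAllLines) returns a lazy IEnumerable<string>, so only one line is held in memory at a time. A minimal sketch, assuming the same two-column tab-delimited layout:

```csharp
using System;
using System.IO;

class Streamer
{
    static void Main()
    {
        var count = 0;
        // File.ReadLines streams the file lazily, line by line
        foreach (var line in File.ReadLines(@"C:\testfile\Prod\PCDP1705.TAB"))
        {
            var fields = line.Split('\t');
            var postCodeID = int.Parse(fields[0]);
            var dps = fields[1];
            // process each record here instead of accumulating it in a list
            count++;
        }
        Console.WriteLine($"{count} rows processed");
    }
}
```

This keeps memory flat no matter how many rows the file contains, at the cost of doing the parsing by hand.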
jdweng · answered Sep 20 '22