 

Java code runs out of memory (heap space) on AWS but not on Mac OS X

I need another set of eyes on this.

I've written a zip file out to hundreds of gigabytes with this exact code, with no modifications, locally on Mac OS X.

With the code 100% unchanged, just deployed to an AWS instance running Ubuntu, it runs into out-of-memory errors (Java heap space).

Here's the code that's being run, streaming MyBatis results to a CSV file on disk:

File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}

String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));

// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage());
}

public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}

private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}

// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;

public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}

Here's a graph of the JVM heap space (using jvisualvm) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results. I've even written out a terabyte of data.

Notice heap looks great

^ In the above, the max heap space is set to 4 gigs, and the most it ever increases to is 1.5 gigs, before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.

However, when I run this on Ubuntu with jdk 1.8.0_111, the exact same operation will not complete; it runs out of heap space (java.lang.OutOfMemoryError: Java heap space).

I've upped the Xmx value from 8 gigs to 16 and then 25 gigs, and still run out of heap space. Meanwhile... the total size of the output file is only 10 gigs... which really perplexes me.

Here's what the JVisualVm graph looks like on the Ubuntu box:

Same code, same operation

I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (the same database server providing the same data).

The only differences I can think of at this point are:

  • Operating system - Ubuntu vs Mac OS X
  • Hosted VM in AWS vs hard metal laptop
  • Network speed is faster in AWS between database and Ubuntu server
  • JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally

Can anyone help shed any light on this problem?

Update

I've tried replacing all the 'try-with-resources' statements with explicit flush/close statements, with no luck.
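For reference, the explicit variant looked roughly like this (a sketch of what I tried, not the exact code):

FileOutputStream out = null;
try {
    out = new FileOutputStream(file);
    streamData(out, providerCode, startDate, endDate);
    out.flush(); // push any buffered bytes to disk before closing
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage(), e);
} finally {
    if (out != null) {
        try {
            out.close();
        } catch (IOException e) {
            log.error("Failed to close feed output stream", e);
        }
    }
}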

What's more, I tried to force a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect; there is definitely something stopping the heap from being collected on the Ubuntu machine... while running the exact same code on OS X let me write the full enchilada again, no problem.

Update 2

In addition to the differences in the environments above, the only other difference I can think of is if the connection between the servers in AWS is so fast that it streams the data faster than it can flush the data to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 Gigs of heap space.

Is there any likelihood of there being a bug at the Ubuntu/Java level for this?

Update 3

Tried replacing the CSVPrinter output with an entirely separate library (OpenCSV's CSVWriter in lieu of Apache Commons CSV), and the same result occurs.

As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.

I've also tried flushing the stream after every write, but had no luck with that either.
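The per-record flush was roughly this (again a sketch, using the same handler shown above):

handler.setHandler((stay) -> {
    try {
        EXPORTER.writeStay(printer, stay);
        printer.flush(); // force each record out immediately instead of buffering
    } catch (IOException e) {
        log.error("Issue with writing output: " + e.getMessage(), e);
    }
});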

Update 4

Got the heap dump to print out, and according to it I should be looking at the database driver: specifically, the InboundDataHandler in Amazon's Redshift driver.

I'm using myBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing}) and I know I'm not holding on to any references there.

Since it's the InboundDataHandler defined by AWS/Redshift... it makes me think it may be lower than the myBatis level... either:

  • Error in the SqlSessionFactory I'm setting up
  • Bug in the Redshift driver that only pops up in Ubuntu / AWS
  • Bug in the result handler I have overridden

Here's the heap dump screenshot:

Here's where I'm setting up my SqlSessionFactoryBean:

@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}

@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}

Here's the myBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:

warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});

Is there a way I can force the SQL connection to not hang on to records or something? I'll reiterate that on my local machine there is no issue with this memory leak... it only surfaces when running the code in the hosted AWS environment. And in both cases, the database driver and server are the same.

Update 6

I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.

After that, I did some research on the AWS Redshift driver, and it explicitly says that clients should specify a limit for any operations on large data. So I found out how to do that in my MyBatis configuration:

<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">        
    select distinct
        f_stay.uid,

And this did the trick.
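If anyone wants to apply this across the board rather than per statement, it looks like a default fetch size can also be set on the MyBatis Configuration (assuming MyBatis 3.3+ and mybatis-spring 1.3+; I haven't verified this variant myself), e.g. in the sqlSessionFactory() bean shown earlier:

@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());

    // Hypothetical global alternative to the per-statement fetchSize attribute
    // (requires MyBatis 3.3+ / mybatis-spring 1.3+; untested on my end).
    org.apache.ibatis.session.Configuration configuration = new org.apache.ibatis.session.Configuration();
    configuration.setDefaultFetchSize(1000);
    factoryBean.setConfiguration(configuration);

    return factoryBean;
}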

Mind you, this wasn't necessary even when handling much larger data sets downloaded remotely from AWS (database in AWS, code executing on a laptop at home), and it shouldn't be necessary at all, since I'm overriding the MyBatis ResultHandler<>, which handles each row individually and never holds on to any objects.

Yet something funky happens with the AWS Redshift JDBC driver only when it's run in AWS (database in AWS, code executing on an AWS instance), which causes this InboundDataHandler to never release its resources unless a fetchSize is specified.

Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never moving above 500 MB; after I hit 'force GC' in jvisualvm, it shows the 'used' heap at less than 100 MB:

it works

Thanks again in a huge way to all those who helped guide this!

Asked Nov 04 '16 by Cuga



1 Answer

Finally figured out a solution.

The heap dump was the biggest aid: it indicated the InboundDataHandler class of Amazon's Redshift/Postgres JDBC driver was the prime culprit.

The code to set up the SqlSession appeared legit, so traveling over to Amazon's documentation landed this gem:

To avoid client-side out-of-memory errors when retrieving large data sets using JDBC, you can enable your client to fetch data in batches by setting the JDBC fetch size parameter.

We hadn't run into this before, as we stream results with custom ResultHandlers in MyBatis... but there seems to be something different when the AWS Redshift JDBC driver is running on AWS itself vs outside AWS connecting in.
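At the plain JDBC level, the "fetch size parameter" Amazon refers to is the standard Statement fetch size. Outside of MyBatis it would look something like this (illustrative only; the connection and the query are placeholders):

try (PreparedStatement ps = connection.prepareStatement("select distinct f_stay.uid /* ... */")) {
    ps.setFetchSize(1000); // ask the driver to stream rows in batches instead of buffering them all
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // process each row without retaining a reference to it
        }
    }
}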

Taking the guidance from the documentation, we added a 'fetchSize' to our MyBatis select query:

<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">        
select distinct
    f_stay.uid,

And voila! Everything worked swimmingly. This is the only change we made and the heap never went above a couple hundred MBs.
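(Side note: for annotation-based mappers the same hint should presumably be available via @Options on the mapper method; we use XML mappers, so this variant is untested here.)

// Untested annotation-based equivalent of fetchSize="1000" in the XML mapper.
@Select("select distinct f_stay.uid /* ... */")
@Options(fetchSize = 1000)
@ResultMap("FStayResultMap")
void doForAllStaysByProvider(String providerCode, Date startDate, Date endDate, ResultHandler<FStay> handler);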

You can see in one of the graphs above where the heap goes off the charts: as soon as data started to be received on Amazon, the heap marches right up linearly and never reclaims an ounce of space once it starts.

My guess is the Redshift JDBC driver is doing something different when it's in Amazon's environment for some kind of optimization... that's all I can think of to explain the behavior.

Clearly Amazon knows what's going on since they documented it up front. I may not know the full 'why' of what's happening, but at least everything is resolved in what appears to be a satisfactory way.

Thanks to all those who helped.

Answered Oct 12 '22 by Cuga