Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Backend solution for fetching and transforming data from various third-party APIs

We are building new feature sets for one of our financial application. We have our own SQL server database and we will be calling multiple RESTful APIs that return JSON responses. For e.g. some returns news data, some returns stocks info, some returns finance data and our own sql server database has employee data. So, they all come with their own different data format. This new app we are building is going to aggregate all those data, transform it into meaningful display on web like mint.com does.

  • Web application will display analytical reports based on these data
  • There will be an option to download reports through various templates

We are completely open in terms of technology stack for our backend and middle-tier. As a first thought NoSQL like mongodb and elasticsearch for search and reporting comes to our mind. There will be a web application build on top of these data (stored or retrieved from API), most likely in Asp.net MVC.

We need your input, specially if you have experience with building similar enterprise solution.

Can you please share your opinions on,

  1. What are some good tech stack you would pick for this app?
  2. How would that scale now and in future when APIs data format changes.
  3. Performance is also important since data will be displayed on web UI.
like image 897
Priyank Avatar asked Jun 20 '15 17:06

Priyank


People also ask

What is third party APIs?

Third party APIs are APIs provided by third parties — generally companies such as Facebook, Twitter, or Google — to allow you to access their functionality via JavaScript and use it on your site. One of the most obvious examples is using mapping APIs to display custom maps on your pages.


3 Answers

We have a similar setup to what you are mentioning, using ASP.Net MVC with ElasticSearch (SQL server for relational data, periodically updating ES), aggregating data (XML/JSON) from multiple sources, although with the purpose of improving searching and filtering results instead of reporting. However, I would expect that the scenario you are looking at would also be a suitable match for ElasticSearch, depending on your specific requirements.

1) Since you are already using SQL Server (and I expect are familiar with that), I would suggest combining that with ElasticSearch - the additional mongodb layer seems unnecessary, in terms of maintenance of another technology and development to fit that integration. There is a very good C# library (two actually, ElasticSearch.Net and NEST, used together) that exposes most of the ES functionality.

2) We chose ElasticSearch for its scalability in combination with flexibility and ease-of-use. A challenge you may face could be mapping the documents from C# classes to ElasticSearch documents. In essence, it is incredibly easy to set up, however you do need to do some planning to index data the way you want to search and retrieve it. So if choosing ES as a platform, spend some time with the structure of the documents - by default, dynamic mapping is enabled, so you can pretty much throw any JSON into a document. However, for a production environment, it's better to turn that off and have one or more mappings set up, so they can be queried in a standardized way.

3) Performance is a key factor for us as well, which is why we were looking at Lucene-based engines like Solr and ElasticSearch when doing research, along with NoSQL databases. ElasticSearch outperforms SQL Server by 10 to 1 or better, in most scenarios. Solr vs. ElasticSearch performance depends on scenario, benchmarks and comparisons are around if you Google them. The exception may be if many documents should be retrieved in one query - ES (or Lucene) is not made for that use case, it's best for fast retrieval of fewer results (similar to Google's per page results count) per page. If you need 1000 documents per page/result, a NoSQL database may be a better option.

ElasticSearch is fast to get up and running - install it on a local development box and try it out, you'll get a feel for if it fits.

like image 132
Calle Avatar answered Oct 19 '22 09:10

Calle


From my experience, mongodb is the worst choice for reporting, especially for aggregation. It lacks in good aggregation functionality, has some data type conflicts (such as decimals being stored as strings, which you cannot use in it's built in aggregation framework api) and you'll probably will have to maintain map-reduce functions in javascript for most of the scenarios.

If your application's true nature is only reports, and they do not have to be updated in realtime, I would drop off the on-demand RPC calls to external APIs. I would consider copying ahead as much data as possible and storing it under a schema that is the most convenient for you to work with, and synchronising it afterwards under scheduled, predicted intervals.

I wouldn't be in a hurry making assumptions about that data to be available all the time nor it always to be in the format you expect. You also gain optimisation benefits on running your own copy of it, indexed in the way you want, instead of trying to figure which of the RPCs is your bottleneck.

As for your questions:

1) If you don't mind using Python, I would pick Django on top of PostgresSQL database. Django is a fully featured sturdy ORM + Web framework which is excellent for this kind of work. If not, just stick to a relational SQL database. I heard wonders of Cassandra but haven't tried it yet.

2 + 3) As I mentioned before, replicating the data as much as possible for your own good. After everything is "in house" you can cluster it and tweak it freely. Using a distributed cache against heavy client requests is also a good idea (such as REDIS), instead of generating those reports each time on demand.

like image 27
antonzy Avatar answered Oct 19 '22 07:10

antonzy


I've been using Jasper reports and the Jasper Reports Server to integrate into our web app. Jasper accepts many different datasource types including JSON and SQLServer. The core version is free and allows you to product html amd pdf reports of high complexity. The paid version with the server allows you to easily integrate in your web app. The core is Java spring (partially open source) running on tomcat/jboss and you can interact with it using REST web services or the visualize.js library for your web front end. It uses highcharts which can produce some beautiful results and has options for adhoc reporting and dashboards built from many reports.

See demos here: http://www.jaspersoft.com/

This has an assumed stack of your backend db's and data sources, tomcat with Java Spring, web front end HTML/Javascript.

The tool is used by many large enterprises including Amazon so scalibility so shouldn't be an issue.

If the format of your data changes you'll need to change the report. This is xml formatted editted by GUI with WYSIWYG.

like image 1
kayakpim Avatar answered Oct 19 '22 07:10

kayakpim