 

How to query the documents from couchdb and load them into pandas dataframe?

I have downloaded Twitter data to a local CouchDB server, where it is stored as JSON documents.

I use this code to access the database from Python. First, import the libraries:

import couchdb
import pandas as pd
from couchdbkit import Server
import json
import cloudant

Next, connect to the server and choose the database I want to work with:

server = couchdb.Server('http://localhost:5984')
db = server['Test']

I can create and delete databases from Python; however, I don't know how to get the data from the server into a Jupyter notebook. I would like to get the text, timestamps, and retweet counts to analyze them. So far I can only see one JSON document from Python.

If possible, I would like to load all the JSON documents in the database into a pandas DataFrame in Python, so I can also analyze the data in R.

The question is: how do I query the documents and load them into a pandas DataFrame?

asked Oct 29 '17 by Tateishi



1 Answer

All the documents in a CouchDB database can be pulled from the /{db}/_all_docs end-point with the include_docs query parameter. The response is a JSON object with all the docs listed in its rows field.

You can either use the requests package to talk to CouchDB directly and load the response into pandas, or use the couchdb package, which translates the JSON into Python objects internally, and build the DataFrame from the result, i.e. do something like this:

import couchdb
import pandas as pd

couch = couchdb.Server('http://localhost:5984')
db = couch['Test']

# _all_docs with include_docs=True returns every document in the database
rows = db.view('_all_docs', include_docs=True)
data = [row['doc'] for row in rows]
df = pd.DataFrame(data)
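The requests route could look like the sketch below. The URL and database name mirror the question's setup; `rows_to_frame` is a hypothetical helper name, and the sample payload just imitates the shape of an _all_docs response so the conversion can be shown without a live server:

```python
import pandas as pd

def rows_to_frame(payload):
    """Convert the JSON body of a /{db}/_all_docs?include_docs=true
    response into a DataFrame, one row per document."""
    return pd.DataFrame([row['doc'] for row in payload['rows']])

def fetch_all_docs(base_url, dbname):
    """Fetch every document from a CouchDB database over HTTP."""
    import requests
    resp = requests.get(f'{base_url}/{dbname}/_all_docs',
                        params={'include_docs': 'true'})
    resp.raise_for_status()
    return rows_to_frame(resp.json())

# A tiny sample payload in the _all_docs shape, to show the conversion:
sample = {'total_rows': 1, 'offset': 0,
          'rows': [{'id': 't1', 'key': 't1',
                    'doc': {'_id': 't1', 'text': 'hello', 'retweet_count': 2}}]}
print(rows_to_frame(sample))

# Against a live server it would be:
# df = fetch_all_docs('http://localhost:5984', 'Test')
```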

Please be aware that reading a complete database into memory can be resource-intensive, so you might want to look into the skip and limit query parameters of the _all_docs end-point to read the data in smaller batches.
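A batched read could be sketched like this, reusing the same couchdb view call as above. The generator name and batch size are illustrative; note that skip-based paging is simple but gets slower at large offsets:

```python
import pandas as pd

def iter_doc_batches(db, batch_size=1000):
    """Yield lists of documents from a couchdb database object,
    paging through _all_docs with skip/limit to bound memory use."""
    skip = 0
    while True:
        rows = db.view('_all_docs', include_docs=True,
                       skip=skip, limit=batch_size)
        batch = [row['doc'] for row in rows]
        if not batch:
            break
        yield batch
        skip += batch_size

# Usage against the database from the answer (assumes it exists):
# couch = couchdb.Server('http://localhost:5984')
# df = pd.concat((pd.DataFrame(b) for b in iter_doc_batches(couch['Test'])),
#                ignore_index=True)
```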

answered Nov 15 '22 by eiri