Saturday, 28 March 2015

SQLite insert performance and setting up SQLite database for cluster analysis

I'm completely new to databases. I'd like to get some advice on how to set up and use a SQLite database for cluster analysis and topic modeling tasks.


I have a 2 GB file where each line is a json object. Here is an example json object from the file:



{"body": "Heath Ledger's Joker...", "subreddit_id": "t5_2qh3s", "name": "t1_clpmhgo", "author": "l3thaln3ss", "created_utc": "1414799999", "subreddit": "movies", "parent_id": "t3_2kwdi3", "score": 1, "link_id": "t3_2kwdi3", "sub_type": "links - high"}


I have created a SQLite database like so:



import json
import sqlite3
import sys

def main(argv):
    if len(argv) != 2:
        sys.exit("Provide a database name.")

    dbName = argv[1]
    db = sqlite3.connect(dbName)

    db.execute('''CREATE TABLE IF NOT EXISTS Comments
                  (name text primary key,
                   author text,
                   body text,
                   score integer,
                   parent_id text,
                   link_id text,
                   subreddit text,
                   subreddit_id text,
                   sub_type text,
                   created_utc text,
                   foreign key (parent_id) references Comments(name));''')

    db.commit()
    db.close()

if __name__ == "__main__":
    main(sys.argv)


Question 1: is this a good initial setup for the database?
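
A related thing I'm unsure about: should I also create indexes up front for the lookups I expect to do later (grouping comments by subreddit, following parent_id chains)? Something like the sketch below is what I had in mind; the index names are just placeholders I made up, and I don't know whether adding them now is premature:


db.execute("CREATE INDEX IF NOT EXISTS idx_comments_subreddit ON Comments (subreddit);")
db.execute("CREATE INDEX IF NOT EXISTS idx_comments_parent ON Comments (parent_id);")
db.commit()
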


I am populating the database like so:



import json
import sqlite3
import sys

def main(argv):
    if len(argv) != 2:
        sys.exit("Provide a comment file (of json objects) name.")

    fname = argv[1]

    db = sqlite3.connect("commentDB")
    columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id', 'subreddit', 'subreddit_id', 'sub_type', 'created_utc']

    query = "insert or ignore into Comments values (?,?,?,?,?,?,?,?,?,?)"

    with open(fname, 'r') as infile:
        for comment in infile:
            decodedComment = json.loads(comment)
            keys = ()
            for c in columns:
                keys += (decodedComment[c],)
            print str(keys)
            print
            print
            c = db.cursor()
            c.execute(query, keys)
            c.close()
            db.commit()

    db.close()


if __name__ == "__main__":
    main(sys.argv)


Populating the database in this way seems to go very slowly. Is there anything I can do to improve its performance?
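
For what it's worth, here is a rough batched variant I've been considering: collect rows into a list, insert them with executemany, and commit once per batch instead of once per row. I haven't confirmed this is the right approach, and the batch size of 10000 is an arbitrary guess:


import json
import sqlite3

def insert_batched(db, fname, batch_size=10000):
    # same column order as the Comments table defined above
    columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id',
               'subreddit', 'subreddit_id', 'sub_type', 'created_utc']
    query = "insert or ignore into Comments values (?,?,?,?,?,?,?,?,?,?)"
    batch = []
    with open(fname, 'r') as infile:
        for line in infile:
            comment = json.loads(line)
            batch.append(tuple(comment[c] for c in columns))
            if len(batch) >= batch_size:
                db.executemany(query, batch)  # one multi-row insert per batch
                db.commit()                   # commit per batch, not per row
                batch = []
    if batch:
        db.executemany(query, batch)          # flush whatever is left over
        db.commit()
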


Ultimately, I'm going to be clustering subreddits based on shared frequent words in comments, on which users comment where, and on differences in the topic models obtained by analyzing the words found in subreddit comments. Note that I have many more 2 GB files I'd like to work with, so a solution should ideally scale reasonably well. Any general advice on how to set up the database for this sort of work (especially by improving what I have written) would be greatly appreciated.
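
To give a sense of the access pattern I have in mind, this is roughly the kind of query I expect to run when building per-subreddit word counts (just a sketch, assuming db is an open sqlite3 connection to the database above; "movies" comes from the example comment):


c = db.cursor()
# pull all comment bodies for one subreddit
c.execute("SELECT body FROM Comments WHERE subreddit = ?", ("movies",))
for (body,) in c:
    pass  # tokenize body and update per-subreddit word counts here
c.close()
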


Thanks!

