I'm trying to write large JSON files (500 MB and up) to a database. I wrote a script that works and is reasonably memory-friendly, but it is very slow. Any suggestions on how to make it more efficient?
My JSON files (remote sensing measurements extracted from Google Earth Engine) are formatted like this:
{"type":"FeatureCollection","features":[{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}
Here is the script that reads the JSON, parses it and writes to the database.
import pandas as pd
import json
import sqlite3

# Variables
JSON_file = '../data/LT5oregon.geojson'
db_src = '../data/SR_ee_samples.sqlite'
table_name = 'oregon'
chunk_size = 5000

# Read JSON file
with open(JSON_file) as data_file:
    data = json.load(data_file)

# Create database connection
con = sqlite3.connect(db_src)

# Create empty dataframe
df = pd.DataFrame()
# Initialize count for row index
count = 0

# Main loop
for feature in data['features']:
    json_feature = feature['properties']
    if 'B1' in json_feature:
        # Build metadata
        meta = feature['id'].split('_')
        meta_dict = {'scene_id': meta[0], 'feature_id': int(meta[1])}
        # Append meta data to feature data
        json_feature.update(meta_dict)
        # Append row to df
        df = df.append(pd.DataFrame(json_feature, index=[count]))
        count += 1
        if len(df) >= chunk_size:  # When df reaches a certain number of rows, empty it to db
            df.to_sql(name=table_name, con=con, if_exists='append')
            df = pd.DataFrame()

# Write remaining rows to db
df.to_sql(name=table_name, con=con, if_exists='append')
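For what it's worth, one direction I was wondering about is avoiding the per-row DataFrame.append and instead collecting plain dicts in a list, building a DataFrame only once per chunk right before each to_sql call. A minimal, untested sketch of that idea (it reuses the same data, chunk_size, table_name and con as above, and drops the DataFrame index column):

# Sketch: accumulate plain dicts and build one DataFrame per chunk
rows = []
for feature in data['features']:
    json_feature = feature['properties']
    if 'B1' in json_feature:
        meta = feature['id'].split('_')
        json_feature.update({'scene_id': meta[0], 'feature_id': int(meta[1])})
        rows.append(json_feature)
        if len(rows) >= chunk_size:
            # Build the DataFrame once per chunk instead of once per row
            pd.DataFrame(rows).to_sql(name=table_name, con=con, if_exists='append', index=False)
            rows = []

# Flush any remaining rows
if rows:
    pd.DataFrame(rows).to_sql(name=table_name, con=con, if_exists='append', index=False)

I don't know if that is the right way to go, so I'm open to other approaches as well.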
Thanks in advance for any advice.