I am trying to create a largish number of instances (10K-1M) conforming to an imported ontology. Each instance also has a large number of data properties (~100), the values of which are generated dynamically in Python.
Creating the bare instances is reasonably fast but when trying to populate them with data properties there seems to be some bottleneck that slows things down quite significantly.
The core of my approach for inserting the data property values is a loop that iterates over instances and sets a value. Like so:
    for entity in onto.Entity.instances():
        setattr(entity, prop.name, [value])
Debugging the setattr step shows that Owlready does quite a lot of work in this step (various checks, etc.), including writing to SQLite at each iteration. I wonder if there is another way to mass-insert data properties that can speed things up.
I am not 100% sure that the database write is the actual time-consuming step, but if it is, is it possible to defer the writes and do them in bulk at the end?
In the Owlready book (chapter 11) there is a section about interrogating the database directly. There is an example of selecting directly from the database, but I would imagine that adding properties while maintaining consistency at the Owlready level is not quite trivial?
A quick test showed that you can gain some performance (about 3x in my test) if you are storing the quadstore on disk. If it is in memory, the performance gain is more modest (less than 1.4x).
You can also add triples directly to the quadstore, e.g. using onto._add_data_triple_spod(), or even by executing direct SQL queries with default_world.graph.execute("INSERT INTO datas VALUES ...").
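To illustrate the idea of writing rows out in bulk rather than one at a time, here is a minimal sketch using only Python's standard sqlite3 module. It does not touch a real Owlready quadstore: the table name datas comes from the SQL above, but the column layout (c, s, p, o, d), the numeric ids, and the helper bulk_insert are assumptions made up for this example, so check the actual schema before adapting it.

```python
import sqlite3

# Stand-in database. The table name "datas" is taken from the post;
# the column layout below is an ASSUMPTION, not Owlready's confirmed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datas (c INTEGER, s INTEGER, p INTEGER, o BLOB, d INTEGER)")

def bulk_insert(conn, rows):
    """Hypothetical helper: insert all rows inside a single transaction."""
    with conn:  # commits once on success, rolls back if an error is raised
        conn.executemany("INSERT INTO datas VALUES (?, ?, ?, ?, ?)", rows)

# Hypothetical ids: ontology 1, subjects 1000..1999, property 300, datatype 43.
rows = [(1, 1000 + i, 300, float(i), 43) for i in range(1000)]
bulk_insert(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM datas").fetchone()[0]
print(count)
```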
The only drawback of writing to the quadstore directly is that it will not update the local Python copy of the instance. If the instance is NOT loaded in Python, there is no problem. If it is, you can solve the problem easily by deleting the corresponding attribute from the instance (e.g. delattr(entity, prop.name)). This will cause the attribute's values to be reloaded from the database the next time they are needed.
Of course, you may combine both approaches. If you have some performance results, I would be interested in them. I may consider adding some optimizations in a future version of Owlready (e.g. a simpler way to do bulk insertions).
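The reload-on-delete behaviour described above can be pictured with a small toy model. This is not Owlready's implementation: the LazyEntity class, the dict standing in for the database, and the has_weight property name are all hypothetical, and the sketch only shows why dropping a cached attribute forces a fresh read.

```python
class LazyEntity:
    """Toy model: attribute values are loaded from a backing store on first access."""

    def __init__(self, store):
        self._store = store  # a plain dict standing in for the database

    def __getattr__(self, name):
        # Only called when `name` is not already cached in the instance __dict__.
        if name.startswith("_"):
            raise AttributeError(name)
        values = list(self._store.get(name, []))
        self.__dict__[name] = values  # cache the loaded values on the instance
        return values

store = {"has_weight": [1.5]}
entity = LazyEntity(store)

first = entity.has_weight       # loaded from the store and cached
store["has_weight"] = [2.5]     # direct write that bypasses the Python object
stale = entity.has_weight       # the cached copy is returned, not the new value
delattr(entity, "has_weight")   # drop the cached copy
fresh = entity.has_weight       # reloaded from the store on next access
print(first, stale, fresh)
```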