Bigger projects with lots of modules and objects inside them may produce large pickle files (even 10MB can be problematic), which makes Pythoscope startup really slow. The problem with pickle is twofold: it is quite an inefficient storage format, and the whole file has to be read every time. Most of the time Pythoscope needs only a part of the information it possesses about the project. That forces us to consider more efficient types of storage.
- Analyze what information is looked up during dynamic inspection and test generation. It may be possible to still store pickled data in a database with a few columns for search purposes.
- Choose a library. SQLite seems like a good bet, since it's in Python standard library since version 2.5 and allows really easy integration with other tools and languages.
- Port the Project class operations related to storing, retrieving, and searching data.
- Make sure all tests pass.
- Compare memory footprint by running tools/memory_benchmark.py.
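The database idea from the list above (pickled blobs in SQLite, with a few plain columns kept searchable) could be sketched roughly as follows. This is only an illustration of the approach that was considered, not Pythoscope's actual schema; all names here are hypothetical.

```python
import pickle
import sqlite3

def open_store(path=":memory:"):
    # One row per module: a searchable `path` column plus an opaque
    # pickled blob, so lookups don't require unpickling everything.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS modules ("
        " path TEXT PRIMARY KEY,"
        " data BLOB NOT NULL)"
    )
    return conn

def save_module(conn, path, obj):
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute(
        "INSERT OR REPLACE INTO modules (path, data) VALUES (?, ?)",
        (path, blob),
    )
    conn.commit()

def load_module(conn, path):
    # Only the requested module's blob is read and unpickled.
    row = conn.execute(
        "SELECT data FROM modules WHERE path = ?", (path,)
    ).fetchone()
    return pickle.loads(row[0]) if row else None

conn = open_store()
save_module(conn, "pkg/mod.py", {"functions": ["foo", "bar"]})
print(load_module(conn, "pkg/mod.py"))
```

The appeal of this layout is that SQLite handles the partial-read problem (only the rows you query are touched), while pickle still handles serialization of arbitrary objects.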
I ended up not using SQLite or any other new storage mechanism besides pickle.
Most of the memory was being taken up by lib2to3 ASTs, so I pulled them out of the main Project object into separate CodeTree objects, which can be pickled and stored individually. Since we only need a single AST in memory at a time, this solves most of our problems with pickle. Now, even for bigger projects, the Project pickle is pretty small, and AST pickles can be loaded on demand.
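The per-module pickle scheme described above can be sketched like this. The class and file layout are simplified stand-ins for Pythoscope's actual CodeTree and Project classes, written only to show the save-eagerly/load-lazily pattern.

```python
import os
import pickle
import tempfile

class CodeTree(object):
    """Holds the heavyweight AST for a single module."""
    def __init__(self, module_name, ast):
        self.module_name = module_name
        self.ast = ast  # in Pythoscope this would be a lib2to3 tree

class Project(object):
    def __init__(self, pickle_dir):
        self.pickle_dir = pickle_dir
        self._code_trees = {}  # in-memory cache of loaded CodeTrees

    def _tree_path(self, module_name):
        return os.path.join(self.pickle_dir, module_name + ".pickle")

    def save_code_tree(self, tree):
        # Written out as soon as a module's inspection completes,
        # so the final Project pickle carries no ASTs at all.
        with open(self._tree_path(tree.module_name), "wb") as fd:
            pickle.dump(tree, fd)
        self._code_trees[tree.module_name] = tree

    def code_tree_of(self, module_name):
        # Load a single module's AST from disk only when it is needed.
        if module_name not in self._code_trees:
            with open(self._tree_path(module_name), "rb") as fd:
                self._code_trees[module_name] = pickle.load(fd)
        return self._code_trees[module_name]

project = Project(tempfile.mkdtemp())
project.save_code_tree(CodeTree("mymodule", ast=["placeholder"]))
```

The key property is that restoring a Project only reads the small main pickle; each AST pickle is pulled in lazily by `code_tree_of`.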
Results of the speed benchmark for the trunk and the memory-profiling branch are as follows:
| |trunk|memory-profiling branch|
|saving Project pickle time|2.98s|0.22s|
|restoring Project pickle time|0.96s|0.06s|
|size of Project pickle|4.09MB|358.47KB|
The memory-profiling branch has higher inspection times because it saves each module's AST as soon as that module's inspection completes. This is a feature though: the time to save the Project pickle at the end is greatly reduced, as is the size of the file.