
I've got a hybrid system.

Academic papers, standards documents, white papers, etc. get downloaded as PDFs, renamed according to a (human-readable) standard scheme that I use (year, title, [author(s)], [paper-type]), and stored in a 'meaningful' filesystem hierarchy. Podcasts and some videos are also stored in the same hierarchy.
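The comment gives the fields of the naming scheme but not the exact separators, so the formatting below (underscores between fields, hyphens inside them, surname-only authors) is my own guess at one workable realisation:

```python
import re

def scheme_filename(year, title, authors=None, paper_type=None):
    """Build a filename from (year, title, [authors], [paper-type]).

    Separator conventions here are assumptions, not the commenter's
    actual scheme: fields joined by '_', words inside a field by '-'.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    parts = [str(year), slug]
    if authors:
        # Keep only surnames to stay human-readable and short.
        parts.append("-".join(a.split()[-1].lower() for a in authors))
    if paper_type:
        parts.append(paper_type)
    return "_".join(parts) + ".pdf"
```

A name built this way sorts chronologically and stays greppable, e.g. `scheme_filename(2019, "Attention Is All You Need", ["Ashish Vaswani"], "paper")` gives `2019_attention-is-all-you-need_vaswani_paper.pdf`.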

Documentation for the software libraries and APIs that I use is downloaded from readthedocs (where possible) and stored in a parallel system that takes account of versioning, so I can concurrently store differently versioned copies of the documentation for a single library.

I have a simple Python script that iterates through my directory hierarchy and produces a SQLite database and a couple of xlsx files with a row per document (one spreadsheet for reporting document-management metadata and another that lets me assign labels and write precis notes). The script also extracts the content of the PDFs as plain text and feeds some NLP tools that I'm playing with.
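The core of such a script is just a directory walk feeding a database. This is a minimal stdlib-only sketch of that step, not the commenter's actual code; the table schema is my assumption, and real text extraction would use a library such as pdfminer.six or pypdf where the comment indicates:

```python
import os
import sqlite3

def index_documents(root, db_path):
    """Walk a document hierarchy and record one row per PDF in SQLite.

    Text extraction (e.g. via pdfminer.six) and xlsx export would
    slot in after the row is inserted; omitted here for brevity.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "path TEXT PRIMARY KEY, filename TEXT, size_bytes INTEGER)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            full = os.path.join(dirpath, name)
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (full, name, os.path.getsize(full)),
            )
    conn.commit()
    return conn
```

Re-running the walk is idempotent (`INSERT OR REPLACE` keyed on path), which matters when the hierarchy is constantly being reorganised.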

I use the spreadsheets to keep notes on the files, and the act of manually renaming and sorting the PDFs into the 'right' place in the hierarchy helps me to understand what's in them and remember what I've got. (I'm constantly reorganising the hierarchy as my understanding develops and evolves. My Python script keeps everything -- notes, documents, and other metadata -- in sync.)

So far this has scaled OK to around 26,000 PDFs.



Sounds good. How do you handle cases where the same document fits multiple categories (i.e. belongs in multiple locations of your hierarchy at once)?

P.S. Another question: can you say which NLP libraries have worked best for this purpose in your experience (and how do you eventually search the generated index)?


I can have multiple copies of the same document in the system. My notes are associated with documents via MD5 hash, so they'll be linked to all copies that are present. At some point I'll get the script to automatically hardlink or symlink duplicates. To be honest, though, trying to decide which location best fits a document is quite a good way to engage with its content -- so even though decisions are often suboptimal, the final hierarchy isn't really the product here; it's what goes on in my brain during the process of trying to organize the papers.

As far as the NLP libs question is concerned -- so far this is just an excuse for me to play with spaCy. I'm still at quite an early stage and haven't made much progress (my background is more machine vision than NLP, which I haven't touched since I was an undergrad 20 years ago).
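Keying notes on a content hash means duplicate copies group themselves for free. A small sketch of that idea (my illustration, not the commenter's script) -- hash every file under a root and report any hash seen more than once:

```python
import hashlib
import os
from collections import defaultdict

def md5_of(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large PDFs don't load whole."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    """Group files by MD5; notes keyed on the hash thus attach to
    every copy. Groups with more than one path are duplicates."""
    groups = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            groups[md5_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

The duplicate groups returned here are exactly what a later hardlink/symlink pass would consume.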


Very interesting, thanks for your answer. Good point, too, about using the rigidity as a forcing function for yourself.

I have quite a big system myself, using BibDesk as my interface to the filesystem, and searchability would be very nice indeed. At the moment I only use default system tools like Spotlight (macOS) or mdfind. Your post inspires me to think harder about a more custom NLP solution.


What software do you use to rename PDF files and extract their metadata automatically? I have found this difficult to accomplish, despite trying multiple different tools. PDFs from arXiv are a common problem.


So, renaming is done manually (it forces me to look at the PDF). I use Preview on OS X, or any viewer other than Acrobat Reader on Windows and Linux (Acrobat Reader locks the file, preventing renames). Several Python libraries exist for extracting metadata, and I'm trying a couple of different approaches. At the moment I just take the title of the PDF and search Google Scholar using the scholarly Python library -- but this is really very suboptimal and I want to replace it with something faster and more robust.




