
I've got a hybrid system.

Academic papers, standards documents, white papers, etc. get downloaded as PDFs, renamed according to a (human-readable) standard scheme that I use (year, title, [author(s)], [paper-type]), and stored in a 'meaningful' filesystem hierarchy. Podcasts and some videos are also stored in the same hierarchy.
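The comment gives the fields of the naming scheme but not the exact separators, so the formatting below (underscores between fields, hyphens inside them, surname-only authors) is my own guess at one workable realisation:

```python
import re

def scheme_filename(year, title, authors=None, paper_type=None):
    """Build a filename from (year, title, [authors], [paper-type]).

    Separator conventions here are assumptions, not the commenter's
    actual scheme: fields joined by '_', words inside a field by '-'.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    parts = [str(year), slug]
    if authors:
        # Keep only surnames to stay human-readable and short.
        parts.append("-".join(a.split()[-1].lower() for a in authors))
    if paper_type:
        parts.append(paper_type)
    return "_".join(parts) + ".pdf"
```

A name built this way sorts chronologically and stays greppable, e.g. `scheme_filename(2019, "Attention Is All You Need", ["Ashish Vaswani"], "paper")` gives `2019_attention-is-all-you-need_vaswani_paper.pdf`.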

Documentation for the software libraries and APIs that I use is downloaded from readthedocs (where possible) and stored in a parallel system that takes account of versioning, so I can concurrently store differently versioned copies of the documentation for a single library.

I have a simple Python script that iterates through my directory hierarchy and produces a SQLite database and a couple of xlsx files with a row per document (one spreadsheet for reporting document-management metadata and another that lets me assign labels and write precis notes). The script also extracts the content of the PDFs as plain text and feeds some NLP tools that I'm playing with.
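The core of such a script is just a directory walk feeding a database. This is a minimal stdlib-only sketch of that step, not the commenter's actual code; the table schema is my assumption, and real text extraction would use a library such as pdfminer.six or pypdf where the comment indicates:

```python
import os
import sqlite3

def index_documents(root, db_path):
    """Walk a document hierarchy and record one row per PDF in SQLite.

    Text extraction (e.g. via pdfminer.six) and xlsx export would
    slot in after the row is inserted; omitted here for brevity.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "path TEXT PRIMARY KEY, filename TEXT, size_bytes INTEGER)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            full = os.path.join(dirpath, name)
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (full, name, os.path.getsize(full)),
            )
    conn.commit()
    return conn
```

Re-running the walk is idempotent (`INSERT OR REPLACE` keyed on path), which matters when the hierarchy is constantly being reorganised.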

I use the spreadsheets to keep notes on the files, and the act of manually renaming and sorting the PDFs into the 'right' place in the hierarchy helps me to understand what's in them and remember what I've got. (I'm constantly reorganising the hierarchy as my understanding develops and evolves. My Python script keeps everything -- notes, documents, and other metadata -- in sync.)

So far this has scaled OK to around 26,000 PDFs.



Sounds good. How do you handle cases where the same document fits multiple categories (i.e. belongs in multiple locations of your hierarchy at once)?

P.S. Another question: can you say which NLP libraries have worked best for this purpose in your experience (and how do you eventually search the generated index)?


I can have multiple copies of the same document in the system. My notes are associated with documents via MD5 hash, so they'll be linked to all copies that are present. At some point I'll get the script to automatically hardlink or symlink duplicates. To be honest, though, trying to decide which location best fits a document is quite a good way to engage with its content -- so even though decisions are often suboptimal, the final hierarchy isn't really the product here; it's what goes on in my brain during the process of trying to organize the papers.

As far as the NLP libs question is concerned -- so far this is just an excuse for me to play with spaCy. I'm still at quite an early stage and haven't made much progress (my background is more machine vision than NLP, which I haven't touched since I was an undergrad 20 years ago).
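Keying notes on a content hash means duplicate copies group themselves for free. A small sketch of that idea (my illustration, not the commenter's script) -- hash every file under a root and report any hash seen more than once:

```python
import hashlib
import os
from collections import defaultdict

def md5_of(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large PDFs don't load whole."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    """Group files by MD5; notes keyed on the hash thus attach to
    every copy. Groups with more than one path are duplicates."""
    groups = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            groups[md5_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

The duplicate groups returned here are exactly what a later hardlink/symlink pass would consume.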


Very interesting, thanks for your answer. Good point, too, about using the rigidity as a forcing function for yourself.

I have quite a big system myself, using BibDesk as my interface to the filesystem, and searchability would be very nice indeed. At the moment I only use default system tools like Spotlight (macOS) or mdfind. Your post inspires me to think harder about a more custom NLP solution.


What software do you use to rename PDF files and extract their metadata automatically? I have found this difficult to accomplish, despite trying multiple different tools. PDFs from arXiv are a common problem.


So, renaming is done manually (it forces me to look at the PDF). I use Preview on OS X, or any viewer other than Acrobat Reader on Windows and Linux (Acrobat Reader locks the file, preventing renames). Several Python libraries exist for extracting metadata, and I'm trying a couple of different approaches. At the moment I just take the title of the PDF and search Google Scholar using the scholarly Python library -- but this is really very suboptimal and I want to replace it with something faster and more robust.




