Ask HN: Full-text browser history search forever?
235 points by emptysongglass on March 16, 2022 | hide | past | favorite | 83 comments
My biggest force-multiplier is my fish shell history, going on 7 years of command line history.

I want to do the same thing for my web browser. At first I looked at Memex but they disabled browser history search. You have to save or annotate an article first before it becomes searchable. My brain, naturally, does not know ahead of time what could be useful in the future.

Is there any product out there that creates a fully searchable full-text history forever with little fuss?

I'm using Firefox on Linux but could switch to another browser (but not OS) if needed.



Hey. My project Diskernet does this: full text search over browser history.

Put it in "save" mode when using Chrome (Linux is fine) and it automatically saves every page you browse (so you can read it offline), and also indexes it for full-text search. It's a work in progress and there are bugs (so my advice: initialize a git repo in your archive directory and make regular syncs to a remote in case of failure -- that also gives you a nice snapshotted archive).

Anyway, best of luck to you! :)

Diskernet: https://github.com/crisdosyago/Diskernet


>22120 is licensed under Polyform Strict License 1.0.0 (no modification, no distribution).

So this is basically the Freeware/Shareware of the old days, but with source available? i.e. not open source as defined by the OSI. Do we have a term for it? Shared Source? I know Microsoft tried to use that, but it was the early 2000s and anything M$ did at the time was extremely unpopular.

Will be trying this out soon, because I've just realised that either half of the internet from the past 10 years has eroded, or Google simply can't find things I'm sure I read about 6-7 years ago.


I've always heard this referred to as source available.


The claims are impressive: the network requests themselves are saved, not the document after the page has loaded. I will certainly give it a try!

One downside: it only works with Chrome-based browsers. Perhaps a more general way to implement this would be as a proxy server.


How does it work with a JS-heavy app?

With an SPA, the article might not appear in the initial request, and later requests would be JSON, which presumably you don't want returned in a search.


It doesn't index each resource. Only the page text of the URLs you actually navigate to in a tab.

So this is fine. It won't index a JSON response in an SPA as a search result (though it does save it), but it will index (and re-index) the actual page content as you browse it and as it updates (e.g. dynamically loads in an SPA).

Try it out, you'll have an easier time understanding how it works (as long as the current tagged release works, heh :) if not, try a previous tag).


A screencast would be tremendously helpful!


Good point.


The problem with a proxy server implementation is that modern browsers (AFAIK) are unwilling to submit HTTPS requests in the clear to a proxy server rather than use CONNECT. There's nothing about the protocol that would make that impossible or even inconvenient; browsers are just unwilling to do it. And I can see the reasoning, but if you actually want this to happen, like here, you're stuck (or have to MITM yourself, which is its own can of worms).


That's the whole purpose behind ZAP, and I use it for archiving pages all the time (they use hsqldb as the file format); it works fantastically for that purpose, but does -- as you correctly pointed out -- require MITMing the browser to trust their locally generated CA: https://github.com/zaproxy/zaproxy#readme


Thanks for the reference! I never investigated ZAP closely; for some reason it never occurred to me that it might be usable like that (if anything I'd have turned to mitmproxy, but that would require building a substantial amount of stuff to handle the actual archiving).

The problem with MITMing your own browser is (apart from the fact that it is an ugly hack in a security-critical portion of your setup) I don’t think any tool for doing that (including the one you referenced, from what I can find quickly) applies the complex set of important stuff browsers do on top of just verifying chains against a root store.

The bare minimum for me would be HSTS and the HSTS preload list, but I’d also like to see CT and Must-Staple enforcement, OneCRL support, TLD and validity term restrictions for some roots, and so on. (This is more or less what Chrome does from what I know, though I think they have their own equivalent to Mozilla’s OneCRL.)


Given what ZAP is designed to do, I'd bet $1 it will actively strip off any such headers before returning the response to the browser, since I think injecting a MITM cert is by definition the very case such stapling is designed to protect against :-)

But corporate proxies must face similar problems, since they, too, MITM things. I'm deeply thankful that I don't work in such an environment, so I can't say what the behavior is in that circumstance.

I hope this doesn't come across as glib, but ZAP is Apache licensed, so if you are able to come up with the security behavior you want, I'd bet they'd welcome any patches to help implement it


Looks awesome! I see you came up with a new file format to store responses; have you seen the webbundle spec: https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle...?


On the archival (rather than authoring) side, the usual format is WARC[1], preserving the complete content and timing of all HTTP requests and responses, but the tooling is clunky to put it mildly.

[1] https://iipc.github.io/warc-specifications/


This is a great project! I have, over the years, accumulated a bunch of bookmarks (in exported format) and haven't found an easy way to save "snapshots" of them. This fits my use case perfectly without my having to roll my own tool.


I've got a very different goal than you, but with quite a bit of overlap on saving web pages for historical lookup.

How are you pulling and storing the pages?

For context: I feel like there are quite a lot of solutions here, but few of them overlap, and they often reinvent many wheels. Memes aside (14->15 standards), I wonder if there's a way I can write my version of this so it benefits others, by using common formats we can all build on.

edit: Interesting, sounds like it's network/proxy request based. Perhaps there's minimal overlap here?


Hey. Any chance this can be used with another Chrome-based browser, such as Brave?


Should be fine. But you may need to fiddle with how the browser binary location is discovered, because we start it automatically.


Can anyone who has used this for a significant amount of time tell me how many gigabytes you've used to store your browser history, along with how long you've been using it?


Does it download any blobs (from you) while installing via npm?


I don't think so. It just installs dependencies via npm. If you noticed something amiss, please email me.


Chromium and Firefox both store all your history in a SQLite database.

I have a script to extract the last visited website from chrome for example: https://github.com/BarbUk/dotfiles/blob/master/bin/chrome_hi...

For Firefox, you can use something like:

    sqlite3 ~/.mozilla/firefox/*.[dD]efault*/places.sqlite "SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch', 'localtime'),url FROM moz_places, moz_historyvisits WHERE moz_places.id = moz_historyvisits.place_id ORDER BY visit_date;"
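If you want to see what that join gives you without poking at a real profile, here's the same query run against a tiny mock of the places.sqlite schema (the table and column names are the real ones; the single visit row is invented, and I dropped 'localtime' so the output is deterministic UTC):

```python
import sqlite3

# Mock of the two places.sqlite tables the query touches.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY,
                                    place_id INTEGER, visit_date INTEGER);
    INSERT INTO moz_places VALUES (1, 'https://news.ycombinator.com/');
    -- visit_date is microseconds since the Unix epoch
    INSERT INTO moz_historyvisits VALUES (1, 1, 1647400000000000);
""")
rows = db.execute("""
    SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch'), url
    FROM moz_places, moz_historyvisits
    WHERE moz_places.id = moz_historyvisits.place_id
    ORDER BY visit_date
""").fetchall()
print(rows)  # [('16.03.2022 03:06:40', 'https://news.ycombinator.com/')]
```

Note that it only proves you have the URLs and timestamps; the page text (what the OP wants) is not in there.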


Does this database store the full text content of the website? I believe that's what OP is asking for.


You can pipe the URLs through something like monolith[1].

https://github.com/Y2Z/monolith


TIL of monolith, thanks. It worked great for old.reddit.com but failed to capture www.reddit.com for mobile


I use Recoll[1] to search through my local files. I haven't tried it myself, but they also have an extension for Firefox[2] that does what you are looking for.

Edit: I just tested it, and it's pretty neat! However, they note in their documentation[3] that this is a web cache and not intended to be an archive. It can be turned into one, though, simply by configuring a large cache size.

[1] https://www.lesbonscomptes.com/recoll/

[2] https://addons.mozilla.org/en-US/firefox/addon/recoll-we/

[3] https://www.lesbonscomptes.com/recoll/faqsandhowtos/IndexWeb...


Not quite what you asked, but perhaps interesting either way: I recently made an extension that keeps text only copies of all visited sites for offline usage [1].

No full-text search included. Grepping through the extension data might be reasonably fast, however, even with multiple years of browsing history. I too see this data as valuable, so it's probably better to start capturing now and migrate somewhere else later, rather than wait for your desired browser to implement it.

[1] https://addons.mozilla.org/en-GB/firefox/addon/local-cache/


Not exactly what you're asking for, but you can set up SingleFile[1] to automatically save each page you visit.

Then there's also ArchiveBox[2] which can convert your browser history into various formats.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/ArchiveBox/ArchiveBox


Promnesia is meant for that (I didn't test it though): https://beepb00p.xyz/promnesia.html


Hey, author of Promnesia here! It's not doing full-text history search (at the moment, anyway); it only cares about URLs and the context around them.


Alright, thanks for the update!


Opera versions 9.5 to 12 do this out of the box. The index is stored in `$OPERA_PREFDIR/vps/0000/`.

menu Tools → Preferences… → Advanced → History → Remember visited addresses for history and autocompletion → [X] Remember content on visited pages

Search from the address field, the history panel, or about:historysearch.


I really miss that era of Opera. So many QOL features were lost in the migration to Chromium and I have never forgiven them. A friend's teen son showed me OperaGX a few weeks ago and while it looked cool, I can't even bring myself to try it.



I had a look into the extension's source: it's stored inside the web extensions `storage.local` area, as `[time]: { text }` objects. It also keeps both a time index (the timestamps of all existing entries) and a two-week "preloaded cache" for quick access, which means all visited sites from the past 14 days are permanently in your browser's memory. This could theoretically lead to memory problems, as the websites' texts are never truncated. If you search for text older than that, all entries for the specified time frame (determined using the time index) are retrieved from storage (again, a possibly large amount of memory) and then processed.
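For the curious, the scheme described boils down to something like this Python sketch (all names invented; the real extension does this in JS against `storage.local`):

```python
from bisect import insort, bisect_left, bisect_right

# timestamp -> saved page; plus a sorted timestamp list for range lookups.
store = {}
time_index = []

def save_page(ts, url, text):
    store[ts] = {"url": url, "text": text}
    insort(time_index, ts)  # keep the time index sorted on insert

def search(query, start_ts, end_ts):
    """Pull only the entries in [start_ts, end_ts] out of storage, then scan."""
    lo = bisect_left(time_index, start_ts)
    hi = bisect_right(time_index, end_ts)
    return [store[ts]["url"] for ts in time_index[lo:hi]
            if query.lower() in store[ts]["text"].lower()]

save_page(100, "https://example.com/a", "An article about SQLite indexes")
save_page(200, "https://example.com/b", "Cat pictures")
print(search("sqlite", 0, 300))  # ['https://example.com/a']
```

The time index is what keeps a search over an old, narrow window from loading the whole store; the per-window linear scan is still the weak spot.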

This is all pretty clever and a reasonable implementation imo. I'm not sure what better way there could be using web extensions (that support FTS) - the only thing I could think of is a WASM SQLite module.


Looks cool, but ... 7 users?


It was only added to the Firefox add-ons site recently, but one could always install it manually. On Chrome, it has 2,000+ users - https://chrome.google.com/webstore/detail/falcon/mmifbbohghe.... It works for me, since I only use it to search recent history.


In Chrome, you can install it from source:

https://github.com/CennoxX/falcon#transparent-installation

As for Firefox, you can right-click the "Add to Firefox" button and save and inspect the extension.


It's not that I don't trust it, but the value of this extension plays out over a long period: you need it to be maintained for months or years for it to be useful.


I see there is no easy way to back up / restore the index. Looks like uninstalling and reinstalling the extension makes it lose the history.


A bit late to the thread, but I also created something to solve this [1]. Currently it only works on macOS, so it does not solve the OP's problem. Hopefully others may find it useful.

The gist is:

- It unifies your browsing history for Firefox, Chrome, and most other browsers into one sqlite database.

- It provides quick (autocomplete-style) full-text search over that database via a UI.

[1]: https://www.browserparrot.com/
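For anyone wondering how cheap the search half of that is once everything lives in one SQLite file: this is not BrowserParrot's actual schema, just a sketch of SQLite's built-in FTS5 extension over a toy history table (assumes your Python's bundled SQLite was compiled with FTS5, which most are):

```python
import sqlite3

# A toy full-text-searchable history table using SQLite's FTS5 extension.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE history USING fts5(url, title, body)")
db.executemany("INSERT INTO history VALUES (?, ?, ?)", [
    ("https://example.com/fish", "fish shell", "history as a force multiplier"),
    ("https://example.com/vim", "vim tips", "modal editing basics"),
])
# MATCH takes FTS5 query syntax; bare terms are implicitly AND-ed.
hits = db.execute(
    "SELECT url FROM history WHERE history MATCH ? ORDER BY rank",
    ("force multiplier",),
).fetchall()
print(hits)  # [('https://example.com/fish',)]
```

The hard part, as others note in this thread, isn't the index; it's getting clean text into the `body` column.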


You can use Archivebox. Set it to grab URLs from your browser history database and it will archive them all to disk in whatever formats you want. You can then use whatever tools you want on those local files.

https://archivebox.io/


How much storage does your archivebox take up? Sounds good, but I have to be careful not to hoard data.


I don't actually use it myself.


Self-promotion on the shell history front:

I have built an app that gives you easy access to your shell history https://loshadki.app/shellhistory/ and syncs via iCloud. macOS only.


This looks very neat.


I recently started using DEVONthink (MacOS / iOS) after someone mentioned it on here and have found it to be great. Just paid for the Pro license and consider it quite the bargain.

Different model in that you have to choose what to archive (by hitting a button on a browser extension) but in practice I prefer this to the “trawl everything” model. Ymmv, of course.

But the killer feature for me is that it’s a unified “search-first” interface for _all_ your documents - not just your web browsing.


The main problem with FTS isn't the search-indexing component; it's actually the HTML content parser.

There are TONS of projects like Elasticsearch or just raw Lucene that will allow you to parse text and index it.

HTML? Not so much...

There are just too many problems.

Text ads polluting the extracted text is by far the main issue, but there are others as well, including OCR of images, AJAX-paginated pages, lazy-loaded images that might need OCR, and metadata extraction (when was the page published, who was the author, etc.).
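To see the main issue concretely, here's a deliberately naive extractor built on Python's stdlib html.parser: it drops tags and script/style content, but the nav links and the sponsored blurb land in the index right next to the article text:

```python
from html.parser import HTMLParser

class NaiveExtractor(HTMLParser):
    """Keep all visible text; skip only script/style contents."""
    def __init__(self):
        super().__init__()
        self.skip = 0       # depth inside script/style elements
        self.chunks = []    # extracted text fragments
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

p = NaiveExtractor()
p.feed('<nav>Home | Login</nav><p>The actual article text.</p>'
       '<aside>Sponsored: buy now!</aside><script>track();</script>')
print(" | ".join(p.chunks))
# Home | Login | The actual article text. | Sponsored: buy now!
```

Everything except the script survived, which is exactly why boilerplate-removal heuristics (readability-style tools) exist.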

There are some projects that take this on but Google just does an amazing job and these secondary tools are pretty limited by comparison.

95% accuracy doesn't help because that 5% usually ends up being 100% of your false positives.


I've had a lot of success running HTML pages through Mozilla's readability[0] tool (actually the Go port of it[1]) before indexing them.

[0]: https://github.com/mozilla/readability

[1]: https://github.com/go-shiori/go-readability


https://minbrowser.org/ has full-text search on every website you ever visited if I understand correctly.


Probably not helpful, but I built a setup like that for myself years ago. I had a local proxy, originally written with Python Twisted, but later ported to Go, and set my browser to use that, so all requests went through it. Every URL that the proxy saw, it logged and for text/plain or text/html, it also took a copy of the request body and posted that to a web service that I was running. The web service was a Django app with a simple model that would track the URL, timestamp, and some other basic metadata. It would also save a gzipped copy of the request body to S3, and dump it into a SOLR instance. That gave me full-text search over the content of every site I visited and a backup copy of the text in case the original ever went offline. It was incredibly useful.

That was back in the day though, before HTTPS was common (outside ecommerce sites, which I didn't care about indexing), and before so many sites were SPAs that got their content via JS APIs. As more sites went to HTTPS, I realized that I'd have to re-write my proxy to MITM certificates if I wanted it to keep working and that wasn't really something I wanted to mess with. The project was useful, but not useful enough that I was willing to dive into the world of writing a browser plugin that could scrape directly from the DOM, so I eventually abandoned it.


I make https://ibar.app/ which does something like this.


Just signed up for an invite -- looks like a great tool for a tab addict like me.


I think I'd like to approach this from a different angle: a browser extension that just sends the current URL to an http(s) endpoint of my choosing.

I use several machines, plus my phone, for browsing. As such, my (local) browser history is useless, so I tend to turn it off. Also, I am not in control of it from a privacy point of view (who knows what extensions/browser functions are doing with it?).

With my own endpoint, I can then do what I want with the URLs... put them in a database as a cross machine history index, or schedule a job to index the page contents into a personal search engine, etc.

I've never written a browser extension, but I'm guessing that...

  browser.tabs.onUpdated.addListener((tabId, info) =>
    info.url && fetch(endpointURL, { method: "POST", body: info.url }));
...can't be that hard?


Has the disadvantage that you can't index the content of sites where you need to be logged in, as you don't have access to the browser's cookies.


Valid point, but I'm unlikely to want to index those. I don't use cloud apps much in my personal/hobbyist browsing.


Not for Linux :( HistoryHound works on macOS but is paid software. https://www.stclairsoft.com/HistoryHound/


If it's just the metadata that you want, you can use ActivityWatch[1]. They have a browser plug-in.

[1] https://activitywatch.net/


OP wanted full text indexing and search, ActivityWatch does not do that.

It is designed for time tracking, and it only saves limited metadata. The browser plugin in particular: time of loading/switching, time spent on page, URL, title, whether the page is audible or incognito, and the tab count.


I go the opposite direction. I delete browser history on close. My BASH history isn’t reliable.

I am curious how having your history is your greatest force multiplier.


> I am curious how having your history is your greatest force multiplier.

Doing a ctrl+r fuzzy search of the history is very powerful. I honestly could not imagine working without it. Do you never have to type the same command twice, or re-use common commands?


I tend to just make a habit of memorizing commands that I repeatedly use. That or I make a script to automate it if I find myself repeating the same action.


I've seen somebody with a 1k+ lines long bashrc. They apparently put every useful command in there. Not sure how much better that approach is, I suppose it's similar to using bookmarks vs browsing history.


    $ wc -l .zsh_history
    71114 .zsh_history


history != config file ;)

You'd have to compare your .zshrc to their .bashrc


If it is something I used recently, I use ctrl+R or up arrow, but if it’s not there anymore, I just retype. It’s not so bad for me.


Bash history does suck, at least out of the box. My history use has become consistent and frequent after I started using fish.


To set in your ~/.bashrc or ~/.bash_profile:

    export HISTTIMEFORMAT="%h %d %H:%M:%S "
    export HISTSIZE=100000
My ~/.bash_history averages 22 bytes per line.


The Nyxt browser does this pretty well! https://nyxt.atlas.engineer


There was a Windows program out maybe 20 years ago called Elephant Tracks which did this. The developer shut the program down because of piracy.


An interesting spin on this would be to save all text from every web page you have ever viewed in a browser. I am willing to bet most of us have not viewed more than 10G of raw text in the past 10 years
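A quick back-of-the-envelope check of that bet (every number here is a guess, not data):

```python
# Assumed browsing volume: pages viewed per day, extracted text per page,
# and years of history.
pages_per_day = 100
text_bytes_per_page = 10 * 1024   # ~10 KB of extracted text per page
years = 10

total = pages_per_day * text_bytes_per_page * 365 * years
print(f"{total / 1024**3:.1f} GiB")  # 3.5 GiB
```

Even at 10x those assumptions you're at ~35 GiB of raw text, which is still trivial for a modern disk, so the bet seems safe.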


Autosaving internet browsing history is largely a waste of time.

I adopted save/print-to-PDF plus full-text PDF search in case I need something in the future.


I don't think this would be useful to you. The SNR would be too low. The fact that you were once convinced to click on a link doesn't mean the link contains anything useful.

EDIT: Perhaps ranking by how much time you've spent reading the content could help with that.

On the other hand, your shell history is already filtered by your brain, it contains mostly potentially useful things.


I think https://memex.garden/ does what you are asking for, but lately they decided to make their code source-available.


The OP brings up Memex. Says they don’t do it any more.


Indeed. Here's the deprecation notice [1]

[1] https://www.notion.so/Why-we-deprecated-the-history-search-d...


histre.com does this


Please correct me if I'm wrong, but it looks like Histre requires an explicit save action.


Histre[0] automatically saves your browsing history (customizable[1], of course). No explicit save action needed.

[0] https://histre.com/

[1] https://histre.com/articles/customize-logging/


Subscription fatigue.


Doesn't Chrome + Google do this by default?


No, because some sites simply disappear: either the search engine can't find them, or they don't renew their server/domain. For example, if HN lost its server or YC somehow decided to shut it down, everything you remember reading on HN would be gone with zero reference.


7 years of command line history?



