Ask HN: Full-text browser history search forever?
235 points by emptysongglass on March 16, 2022 | hide | past | favorite | 83 comments
My biggest force-multiplier is my fish shell history, going on 7 years of command line history.

I want to do the same thing for my web browser. At first I looked at Memex but they disabled browser history search. You have to save or annotate an article first before it becomes searchable. My brain, naturally, does not know ahead of time what could be useful in the future.

Is there any product out there that creates a fully searchable full-text history forever with little fuss?

I'm using Firefox on Linux but could switch to another browser (but not OS) if needed.



Hey. My project Diskernet does this: full text search over browser history.

Put it in "save" mode when using Chrome (Linux is fine) and it automatically saves every page you browse (so you can read it offline), and also indexes it for full-text search. It's a work in progress and there are bugs (so my advice: initialize a git repo in your archive directory and make regular syncs to a remote in case of failure -- that also gives you a nice snapshotted archive).

Anyway, best of luck to you! :)

Diskernet: https://github.com/crisdosyago/Diskernet


>22120 is licensed under Polyform Strict License 1.0.0 (no modification, no distribution).

So this is basically the Freeware/Shareware of the old days, but with source available? i.e. not open source as defined by the OSI. Do we have a term for it? Shared Source? I know Microsoft tried to use that, but it was the early 2000s and anything M$ did at the time was extremely unpopular.

Will be trying this out soon, because I've just realised that either half of the internet from the past 10 years has eroded, or Google simply can't find things I'm sure I read about 6-7 years ago.


I've always heard this referred to as source available.


The claims are impressive: the network requests themselves are saved, not the document after the page has loaded. I will certainly give it a try!

One downside: it only works with Chrome-based browsers. Perhaps a more general way to implement this would be as a proxy server.


How does it work with a JS-heavy app?

With an SPA, the article might not appear in the initial request, and later requests would be JSON, which presumably you don't want returned in a search.


It doesn't index each resource. Only the page text of the URLs you actually navigate to in a tab.

So this is fine. It won't index a JSON response in an SPA as a search result (though it does save it), but it will index (and re-index) the actual page content as you browse it and as it updates (e.g. dynamically loads in an SPA).

Try it out, you'll have an easier time understanding how it works (as long as the current tagged release works, heh :) if not, try a previous tag).


A screencast would be tremendously helpful!


Good point.


The problem with a proxy server implementation is that modern browsers (AFAIK) are unwilling to submit HTTPS requests in the clear to a proxy server rather than use CONNECT. There's nothing about the protocol that would make that impossible or even inconvenient; browsers are just unwilling to do it. And I can see the reasoning, but if you actually want this to happen, like here, you're stuck (or have to MITM yourself, which is its own can of worms).


That's the whole purpose behind ZAP, and I use it for archiving pages all the time (they use hsqldb as the file format); it works fantastically for that purpose, but does -- as you correctly pointed out -- require MITMing the browser to trust their locally generated CA: https://github.com/zaproxy/zaproxy#readme


Thanks for the reference! I never investigated ZAP closely; for some reason it never occurred to me that it might be usable like that (if anything I'd have turned to mitmproxy, but that would require building a substantial amount of stuff to handle the actual archiving).

The problem with MITMing your own browser is (apart from the fact that it is an ugly hack in a security-critical portion of your setup) I don’t think any tool for doing that (including the one you referenced, from what I can find quickly) applies the complex set of important stuff browsers do on top of just verifying chains against a root store.

The bare minimum for me would be HSTS and the HSTS preload list, but I’d also like to see CT and Must-Staple enforcement, OneCRL support, TLD and validity term restrictions for some roots, and so on. (This is more or less what Chrome does from what I know, though I think they have their own equivalent to Mozilla’s OneCRL.)


Given what ZAP is designed to do, I'd bet $1 it will actively strip off any such headers before returning the response to the browser, since I think injecting a MITM cert is by definition the very case such stapling is designed to protect against :-)

But corporate proxies must face similar problems, since they, too, MITM things. I'm deeply thankful that I don't work in such an environment, so I can't say what the behavior is in that circumstance.

I hope this doesn't come across as glib, but ZAP is Apache licensed, so if you are able to come up with the security behavior you want, I'd bet they'd welcome any patches to help implement it


Looks awesome! I see you came up with a new file format to store responses; have you seen the webbundle spec: https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle...?


On the archival (rather than authoring) side, the usual format is WARC[1], preserving the complete content and timing of all HTTP requests and responses, but the tooling is clunky to put it mildly.

[1] https://iipc.github.io/warc-specifications/


This is a great project! I have, over the years, accumulated a bunch of bookmarks (in exported format) and haven't found an easy way to save "snapshots" of them. This fits my use case perfectly without my having to roll my own tool.


I've got a very different goal than you, but with quite a bit of overlap on saving web pages for historical lookup.

How are you pulling and storing the pages?

For context: I feel like there are quite a lot of solutions here, but few of them overlap, and they often reinvent many wheels. Memes aside (14->15 standards), I wonder if there's a way I can write my version of this so it benefits others, by using common formats we can all build on.

edit: Interesting, sounds like it's network/proxy request based. Perhaps there's minimal overlap here?


Hey. Any chance this can be used with another Chrome-based browser, such as Brave?


Should be fine. But you may need to fiddle with how the browser binary location is discovered, because we start it automatically.


Can anyone who has used this for a significant amount of time tell me how many gigabytes you've used to store your browser history, along with how long you've been using it?


Does it download any blobs (from you) while installing via npm?


I don't think so. It just installs dependencies via npm. If you noticed something amiss, please email me.


Chromium and Firefox both store all your history in a SQLite database.

I have a script to extract the last visited website from chrome for example: https://github.com/BarbUk/dotfiles/blob/master/bin/chrome_hi...

For Firefox, you can use something like:

    sqlite3 ~/.mozilla/firefox/*.[dD]efault*/places.sqlite "SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch', 'localtime'),url FROM moz_places, moz_historyvisits WHERE moz_places.id = moz_historyvisits.place_id ORDER BY visit_date;"
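If you want to see what that join gives you without poking at a real profile, here's the same query run against a tiny mock of the places.sqlite schema (the table and column names are the real ones; the single visit row is invented, and I dropped 'localtime' so the output is deterministic UTC):

```python
import sqlite3

# Mock of the two places.sqlite tables the query touches.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY,
                                    place_id INTEGER, visit_date INTEGER);
    INSERT INTO moz_places VALUES (1, 'https://news.ycombinator.com/');
    -- visit_date is microseconds since the Unix epoch
    INSERT INTO moz_historyvisits VALUES (1, 1, 1647400000000000);
""")
rows = db.execute("""
    SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch'), url
    FROM moz_places, moz_historyvisits
    WHERE moz_places.id = moz_historyvisits.place_id
    ORDER BY visit_date
""").fetchall()
print(rows)  # [('16.03.2022 03:06:40', 'https://news.ycombinator.com/')]
```

Note that it only proves you have the URLs and timestamps; the page text (what the OP wants) is not in there.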


Does this database store the full text content of the website? I believe that's what OP is asking for.


You can pipe the URLs through something like monolith[1].

https://github.com/Y2Z/monolith


TIL of monolith, thanks. It worked great for old.reddit.com but failed to capture www.reddit.com for mobile


I use Recoll[1] to search through my local files. I haven't tried it myself, but they also have an extension for Firefox[2] that does what you are looking for.

Edit: I just tested it, and it's pretty neat! However, they note in their documentation[3] that this is a web cache and not intended to be an archive. It can be turned into one, though, simply by configuring a large cache size.

[1] https://www.lesbonscomptes.com/recoll/

[2] https://addons.mozilla.org/en-US/firefox/addon/recoll-we/

[3] https://www.lesbonscomptes.com/recoll/faqsandhowtos/IndexWeb...


Not quite what you asked, but perhaps interesting either way: I recently made an extension that keeps text only copies of all visited sites for offline usage [1].

No full-text search included. Grepping through the extension data might be reasonably fast, however, even with multiple years of browsing history. I too see this data as valuable, so it's probably better to start capturing now and migrate somewhere else later, rather than wait for your desired browser to implement it.

[1] https://addons.mozilla.org/en-GB/firefox/addon/local-cache/


Not exactly what you're asking for, but you can set up SingleFile[1] to automatically save each page you visit.

Then there's also ArchiveBox[2] which can convert your browser history into various formats.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/ArchiveBox/ArchiveBox


Promnesia is meant for that (I didn't test it though): https://beepb00p.xyz/promnesia.html


Hey, author of Promnesia here! It's not doing full-text history search (at the moment, anyway); it only cares about URLs and the context around them.


Alright, thanks for the update!


Opera versions 9.5 to 12 do this out of the box. The index is stored in `$OPERA_PREFDIR/vps/0000/`.

menu Tools → Preferences… → Advanced → History → Remember visited addresses for history and autocompletion → [X] Remember content on visited pages

Search from the address field, the history panel, or about:historysearch.


I really miss that era of Opera. So many QOL features were lost in the migration to Chromium and I have never forgiven them. A friend's teen son showed me OperaGX a few weeks ago and while it looked cool, I can't even bring myself to try it.



I had a look into the extension's source: it's stored inside the web extensions `storage.local` area, as `[time]: { text }` objects. It also keeps both a time index (the timestamps of all existing entries) and a two-week "preloaded cache" for quick access, which means all visited sites from the past 14 days are permanently in your browser's memory. This could theoretically lead to memory problems, as the websites' texts are never truncated. If you search for text older than that, all entries for the specified time frame (determined using the time index) are retrieved from storage (again, a possibly large amount of memory) and then processed.
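For the curious, the scheme described boils down to something like this Python sketch (all names invented; the real extension does this in JS against `storage.local`):

```python
from bisect import insort, bisect_left, bisect_right

# timestamp -> saved page; plus a sorted timestamp list for range lookups.
store = {}
time_index = []

def save_page(ts, url, text):
    store[ts] = {"url": url, "text": text}
    insort(time_index, ts)  # keep the time index sorted on insert

def search(query, start_ts, end_ts):
    """Pull only the entries in [start_ts, end_ts] out of storage, then scan."""
    lo = bisect_left(time_index, start_ts)
    hi = bisect_right(time_index, end_ts)
    return [store[ts]["url"] for ts in time_index[lo:hi]
            if query.lower() in store[ts]["text"].lower()]

save_page(100, "https://example.com/a", "An article about SQLite indexes")
save_page(200, "https://example.com/b", "Cat pictures")
print(search("sqlite", 0, 300))  # ['https://example.com/a']
```

The time index is what keeps a search over an old, narrow window from loading the whole store; the per-window linear scan is still the weak spot.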

This is all pretty clever and a reasonable implementation imo. I'm not sure what better way there could be using web extensions (that support FTS) - the only thing I could think of is a WASM SQLite module.


Looks cool, but ... 7 users?


It was only added to the Firefox add-ons site recently, but one could always install it manually. On Chrome, it has 2,000+ users - https://chrome.google.com/webstore/detail/falcon/mmifbbohghe.... It works for me, since I only use it to search recent history.


In Chrome, you can install it from source:

https://github.com/CennoxX/falcon#transparent-installation

As for Firefox, you can right-click the "Add to Firefox" button and save and inspect the extension.


It's not that I don't trust it, but the value of this extension plays out over a long period: you need it to be maintained for months or years for it to be useful.


I see there is no easy way to back up / restore the index. Looks like uninstalling and reinstalling the extension makes it lose the history.


A bit late to the thread, but I also created something to solve this [1]. Currently it only works on macOS, so it does not solve the OP's problem. Hopefully others may find it useful.

The gist is:

- It unifies your browsing history for Firefox, Chrome, and most other browsers into one sqlite database.

- It provides quick (autocomplete-style) full-text search over that database via a UI.

[1]: https://www.browserparrot.com/
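For anyone wondering how cheap the search half of that is once everything lives in one SQLite file: this is not BrowserParrot's actual schema, just a sketch of SQLite's built-in FTS5 extension over a toy history table (assumes your Python's bundled SQLite was compiled with FTS5, which most are):

```python
import sqlite3

# A toy full-text-searchable history table using SQLite's FTS5 extension.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE history USING fts5(url, title, body)")
db.executemany("INSERT INTO history VALUES (?, ?, ?)", [
    ("https://example.com/fish", "fish shell", "history as a force multiplier"),
    ("https://example.com/vim", "vim tips", "modal editing basics"),
])
# MATCH takes FTS5 query syntax; bare terms are implicitly AND-ed.
hits = db.execute(
    "SELECT url FROM history WHERE history MATCH ? ORDER BY rank",
    ("force multiplier",),
).fetchall()
print(hits)  # [('https://example.com/fish',)]
```

The hard part, as others note in this thread, isn't the index; it's getting clean text into the `body` column.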


You can use Archivebox. Set it to grab URLs from your browser history database and it will archive them all to disk in whatever formats you want. You can then use whatever tools you want on those local files.

https://archivebox.io/


How much storage does your archivebox take up? Sounds good, but I have to be careful not to hoard data.


I don't actually use it myself.


Self-promotion on the shell history front:

I have built an app that gives you easy access to your shell history https://loshadki.app/shellhistory/ and syncs via iCloud. macOS only.


This looks very neat.


I recently started using DEVONthink (MacOS / iOS) after someone mentioned it on here and have found it to be great. Just paid for the Pro license and consider it quite the bargain.

Different model in that you have to choose what to archive (by hitting a button on a browser extension) but in practice I prefer this to the “trawl everything” model. Ymmv, of course.

But the killer feature for me is that it’s a unified “search-first” interface for _all_ your documents - not just your web browsing.


The main problem with FTS isn't the search-indexing component; it's actually the HTML content parser.

There are TONS of projects like Elasticsearch or just raw Lucene that will allow you to parse text and index it.

HTML? Not so much...

There are just too many problems.

Text ads polluting the extracted text is by far the main issue, but there are others as well, including OCR of images, AJAX-paginated pages, lazy-loaded images that might need OCR, and metadata extraction (when was the page published, who was the author, etc.).
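To see the main issue concretely, here's a deliberately naive extractor built on Python's stdlib html.parser: it drops tags and script/style content, but the nav links and the sponsored blurb land in the index right next to the article text:

```python
from html.parser import HTMLParser

class NaiveExtractor(HTMLParser):
    """Keep all visible text; skip only script/style contents."""
    def __init__(self):
        super().__init__()
        self.skip = 0       # depth inside script/style elements
        self.chunks = []    # extracted text fragments
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

p = NaiveExtractor()
p.feed('<nav>Home | Login</nav><p>The actual article text.</p>'
       '<aside>Sponsored: buy now!</aside><script>track();</script>')
print(" | ".join(p.chunks))
# Home | Login | The actual article text. | Sponsored: buy now!
```

Everything except the script survived, which is exactly why boilerplate-removal heuristics (readability-style tools) exist.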

There are some projects that take this on but Google just does an amazing job and these secondary tools are pretty limited by comparison.

95% accuracy doesn't help because that 5% usually ends up being 100% of your false positives.


I've had a lot of success running HTML pages through Mozilla's readability[0] tool (actually the Go port of it[1]) before indexing them.

[0]: https://github.com/mozilla/readability

[1]: https://github.com/go-shiori/go-readability


https://minbrowser.org/ has full-text search on every website you ever visited if I understand correctly.


Probably not helpful, but I built a setup like that for myself years ago. I had a local proxy, originally written with Python Twisted, but later ported to Go, and set my browser to use that, so all requests went through it. Every URL that the proxy saw, it logged and for text/plain or text/html, it also took a copy of the request body and posted that to a web service that I was running. The web service was a Django app with a simple model that would track the URL, timestamp, and some other basic metadata. It would also save a gzipped copy of the request body to S3, and dump it into a SOLR instance. That gave me full-text search over the content of every site I visited and a backup copy of the text in case the original ever went offline. It was incredibly useful.

That was back in the day though, before HTTPS was common (outside ecommerce sites, which I didn't care about indexing), and before so many sites were SPAs that got their content via JS APIs. As more sites went to HTTPS, I realized that I'd have to re-write my proxy to MITM certificates if I wanted it to keep working and that wasn't really something I wanted to mess with. The project was useful, but not useful enough that I was willing to dive into the world of writing a browser plugin that could scrape directly from the DOM, so I eventually abandoned it.


I make https://ibar.app/ which does something like this.


Just signed up for an invite -- looks like a great tool for a tab addict like me.


I think I'd like to approach this from a different angle: a browser extension that just sends the current URL to an http(s) endpoint of my choosing.

I use several machines, plus my phone, for browsing. As such, my (local) browser history is useless, so I tend to turn it off. Also, I am not in control of it from a privacy point of view (who knows what extensions/browser functions are doing with it?).

With my own endpoint, I can then do what I want with the URLs... put them in a database as a cross machine history index, or schedule a job to index the page contents into a personal search engine, etc.

I've never written a browser extension, but I'm guessing that...

  browser.tabs.onUpdated.addListener((tabId, info) =>
    info.url && fetch(endpointURL, { method: "POST", body: info.url }));
...can't be that hard?


Has the disadvantage that you can't index the content of sites where you need to be logged in, as you don't have access to the browser's cookies.


Valid point, but I'm unlikely to want to index those. I don't use cloud apps much in my personal/hobbyist browsing.


Not for Linux :( HistoryHound works on macOS but is paid software. https://www.stclairsoft.com/HistoryHound/


If it's just the metadata that you want, you can use ActivityWatch[1]. They have a browser plug-in.

[1] https://activitywatch.net/


OP wanted full text indexing and search, ActivityWatch does not do that.

It is designed for time tracking, and it only saves limited metadata. The browser plugin in particular: time of loading/switching, time spent on page, URL, title, whether the page is audible or incognito, and the tab count.


I go the opposite direction. I delete browser history on close. My BASH history isn’t reliable.

I am curious how having your history is your greatest force multiplier.


> I am curious how having your history is your greatest force multiplier.

Doing a ctrl+r fuzzy search of the history is very powerful. I honestly could not imagine working without it. Do you never have to type the same command twice, or re-use common commands?


I tend to just make a habit of memorizing commands that I repeatedly use. That or I make a script to automate it if I find myself repeating the same action.


I've seen somebody with a 1k+ lines long bashrc. They apparently put every useful command in there. Not sure how much better that approach is, I suppose it's similar to using bookmarks vs browsing history.


    $ wc -l .zsh_history
    71114 .zsh_history


history != config file ;)

You'd have to compare your .zshrc to their .bashrc


If it is something I used recently, I use ctrl+R or up arrow, but if it’s not there anymore, I just retype. It’s not so bad for me.


Bash history does suck, at least out of the box. My history use has become consistent and frequent after I started using fish.


To set in your ~/.bashrc or ~/.bash_profile:

    export HISTTIMEFORMAT="%h %d %H:%M:%S "
    export HISTSIZE=100000
My ~/.bash_history averages 22 bytes per line.


The Nyxt browser does this pretty well! https://nyxt.atlas.engineer


There was a Windows program out maybe 20 years ago called Elephant Tracks which did this. The developer shut the program down because of piracy.


An interesting spin on this would be to save all text from every web page you have ever viewed in a browser. I am willing to bet most of us have not viewed more than 10G of raw text in the past 10 years
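A quick back-of-the-envelope check of that bet (every number here is a guess, not data):

```python
# Assumed browsing volume: pages viewed per day, extracted text per page,
# and years of history.
pages_per_day = 100
text_bytes_per_page = 10 * 1024   # ~10 KB of extracted text per page
years = 10

total = pages_per_day * text_bytes_per_page * 365 * years
print(f"{total / 1024**3:.1f} GiB")  # 3.5 GiB
```

Even at 10x those assumptions you're at ~35 GiB of raw text, which is still trivial for a modern disk, so the bet seems safe.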


Autosaving internet browsing history is largely a waste of time.

I adopted save/print-to-PDF plus full-text PDF search in case I need something in the future.


I don't think this would be useful to you. The SNR would be too low. The fact that you were once convinced to click on a link doesn't mean the link contains anything useful.

EDIT: Perhaps ranking by how much time you've spent reading the content could help with that.

On the other hand, your shell history is already filtered by your brain, it contains mostly potentially useful things.


I think https://memex.garden/ does what you are asking for, but lately they decided to make their code source-available.


The OP brings up Memex. Says they don’t do it any more.


Indeed. Here's the deprecation notice [1]

[1] https://www.notion.so/Why-we-deprecated-the-history-search-d...


histre.com does this


Please correct me if I'm wrong, but it looks like Histre requires an explicit save action.


Histre[0] automatically saves your browsing history (customizable[1], of course). No explicit save action needed.

[0] https://histre.com/

[1] https://histre.com/articles/customize-logging/


Subscription fatigue.


Doesn't Chrome + Google do this by default?


No, because some sites simply disappear: either the search engine can't find them, or they don't renew their server/domain. For example, if HN lost its server or YC somehow decided to shut it down, everything you remember reading on HN would be gone with zero reference.


7 years of command line history?



