SPyQL (
https://github.com/dcmoura/spyql) is SQL with Python in the middle, an open-source project fully written in Python for making command-line data processing more intuitive, readable and powerful. Try mixing in the same pot: a SQL SELECT for providing the structure, Python expressions for defining transformations and conditions, the essence of awk as a data-processing language, and the JSON handling capabilities of jq.
How does a SPyQL query looks like?
$ spyql “
IMPORT pendulum AS p
SELECT
(p.now() - p.from_timestamp(purchase_ts)).in_days() AS days_ago,
sum_agg(price * quantity) AS total
FROM csv
WHERE department.upper() == 'IT' and purchase_ts is not Null
GROUP BY 1
ORDER BY 1
TO json” < my_purchases.csv
In a single statement we are 1) reading a CSV (of purchases) with automatic header detection, dialect detection, type inference and casting, 2) filtering out records that do not belong to the IT department or do not have a purchase timestamp 3) summing the total purchases and grouping by how many days ago they happened, 4) sorting from the most to the least recent day and 5) writing the result in JSON format. All this without loading the full dataset into memory.
The Readme is loaded with recipes and there is also a demo video: https://vimeo.com/danielcmoura/spyqldemo
Any feedback is welcomed! Thank you.
Implementation-wise, is there a reason you wrote your own SQL parser rather than using an existing library? I ask not because I'd discourage handwritten SQL parsers but that I think the particular approach you took is going to suffer when it comes to error messages and lexical edge cases like `SELECT(SELECT 1)`.
Not to mention that it introduces a completely new SQL dialect users would have to learn. (I think even SQLite is a bit risky since most people know MySQL or PostgreSQL dialects.)
But I can appreciate you wanting your own syntax especially for nested objects. In dsq I solved that by giving nested fields a literal column name with a dot in it. But that means you have to quote the column name (e.g. `SELECT "city.address" FROM {}`).
The additional penalty you'd pay not backing this by SQLite or DuckDB or some other in-memory database is in query performance as the input grows. Among other existing tools like this octosql has its own query engine but cube2222 has put a lot of time into optimizing it and it's also written in Go.
Whatever the case, competition is good! Happy to chat more if you want. I love database projects!
[0] https://github.com/multiprocessio/dsq