Show HN: SPyQL – SQL with Python in the middle (github.com/dcmoura)
105 points by dmoura on Jan 26, 2022 | 45 comments
SPyQL (https://github.com/dcmoura/spyql) is SQL with Python in the middle, an open-source project fully written in Python for making command-line data processing more intuitive, readable and powerful. Try mixing in the same pot: a SQL SELECT for providing the structure, Python expressions for defining transformations and conditions, the essence of awk as a data-processing language, and the JSON handling capabilities of jq.

What does a SPyQL query look like?

  $ spyql "
    IMPORT pendulum AS p 
    SELECT
        (p.now() - p.from_timestamp(purchase_ts)).in_days() AS days_ago,
        sum_agg(price * quantity) AS total
    FROM csv
    WHERE department.upper() == 'IT' and purchase_ts is not Null
    GROUP BY 1
    ORDER BY 1
    TO json" < my_purchases.csv
In a single statement we are 1) reading a CSV (of purchases) with automatic header detection, dialect detection, type inference and casting, 2) filtering out records that do not belong to the IT department or do not have a purchase timestamp, 3) summing the total purchases, grouped by how many days ago they happened, 4) sorting from the most to the least recent day, and 5) writing the result in JSON format. All this without loading the full dataset into memory.
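For intuition, here is a rough plain-Python equivalent of that pipeline (a sketch only: column handling and date math are simplified, and `purchase_ts` is assumed to be a Unix timestamp):

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

def total_by_days_ago(lines, now=None):
    """Stream the CSV, keep IT-department rows that have a purchase
    timestamp, and sum price*quantity grouped by days since purchase."""
    now = now or datetime.now(timezone.utc)
    totals = defaultdict(float)
    for row in csv.DictReader(lines):  # one row at a time, never the full file
        if row["department"].upper() != "IT" or not row["purchase_ts"]:
            continue
        ts = datetime.fromtimestamp(float(row["purchase_ts"]), timezone.utc)
        totals[(now - ts).days] += float(row["price"]) * float(row["quantity"])
    return dict(sorted(totals.items()))  # most recent day first

sample = io.StringIO(
    "department,purchase_ts,price,quantity\n"
    "IT,1640995200,10.0,2\n"
    "HR,1640995200,99.0,1\n"
)
print(total_by_days_ago(sample))
```

The point of the comparison is that the SELECT statement above expresses the same streaming aggregation declaratively, without the bookkeeping.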

The Readme is loaded with recipes and there is also a demo video: https://vimeo.com/danielcmoura/spyqldemo

Any feedback is welcome! Thank you.



Nice work! I have a similar tool built on SQLite [0] and there are a number of similar tools (compared in my readme) like octosql, q, textql, datasette, etc.

Implementation-wise, is there a reason you wrote your own SQL parser rather than using an existing library? I ask not because I'd discourage handwritten SQL parsers but that I think the particular approach you took is going to suffer when it comes to error messages and lexical edge cases like `SELECT(SELECT 1)`.

Not to mention that it introduces a completely new SQL dialect users would have to learn. (I think even SQLite is a bit risky since most people know MySQL or PostgreSQL dialects.)

But I can appreciate you wanting your own syntax especially for nested objects. In dsq I solved that by giving nested fields a literal column name with a dot in it. But that means you have to quote the column name (e.g. `SELECT "city.address" FROM {}`).

The additional penalty you'd pay for not backing this with SQLite or DuckDB or some other in-memory database is in query performance as the input grows. Among existing tools like this, octosql has its own query engine, but cube2222 has put a lot of time into optimizing it, and it's also written in Go.

Whatever the case, competition is good! Happy to chat more if you want. I love database projects!

[0] https://github.com/multiprocessio/dsq


Thank you! I know DSQ, great work! Unlike q and others it manages JSONs, which is great.

Yes, there are a number of tools that use an in-memory DB. I work with JSONs and CSVs of several GBs, and loading them into SQLite is not an option: it is too slow and uses too much memory. Sometimes I just need an average or a sum of a couple of columns (possibly grouped by another column) and there is no need to load the entire dataset into memory. Also, this way you can tackle streaming data. I have compared performance with q for CSVs and spyql is several times faster. I have compared with jq for JSONs and performance is head to head, with spyql typically requiring less memory. I will be presenting SPyQL at FOSDEM22, where I will show performance comparisons against awk, jq, pandas and "pure" Python implementations from scratch (https://fosdem.org/2022/schedule/event/python_spyql/). I can put some numbers here if you are interested.

Regarding the parser: first, I am not particularly proud of the parser in spyql, it's a means to an end. I started with a standard SQL parser but it was becoming too difficult to reconcile it with the Python syntax. I could follow an AST approach, but due to lack of experience I was unable to estimate the effort and potential hurdles. I also want spyql to be compatible with different Python versions, which can have different ASTs. So I went with a basic approach based on regexes and the like. It's definitely an area for improvement and if anyone finds this challenge interesting then please hop in!!
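As a toy illustration of the regex-and-eval approach (made-up code, not SPyQL's actual parser), a WHERE clause can be nudged into a Python predicate with a couple of substitutions:

```python
import re

def compile_where(clause):
    """Toy sketch of the regex-and-eval idea (not SPyQL's parser):
    rewrite a few SQL-isms into Python, compile once, then eval per
    row with the row's columns exposed as variables."""
    expr = re.sub(r"\bNull\b", "None", clause)  # SQL Null -> Python None
    code = compile(expr, "<where>", "eval")     # parse/compile only once
    return lambda row: eval(code, {}, row)      # row is a {column: value} dict

pred = compile_where("department.upper() == 'IT' and purchase_ts is not Null")
print(pred({"department": "it", "purchase_ts": 123}))   # True
print(pred({"department": "HR", "purchase_ts": None}))  # False
```

The real thing has to handle far more (string literals, nested parentheses, SELECT expressions), which is where the error-message and edge-case concerns above come in.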

I had to avoid collisions with already existing Python functions... `sum` in Python sums lists/iterators, while SUM in SQL is an aggregation function. I am adding the `_agg` suffix to aggregations to avoid collisions, but I confess that I often forget to add it when writing my own queries...
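The collision is easy to see: Python's built-in `sum` consumes a whole iterable in one call, while a SQL-style aggregate accumulates row by row. A minimal sketch of the row-by-row shape (hypothetical, not SPyQL's implementation):

```python
# Python's built-in sum reduces an iterable in a single call:
print(sum([1, 2, 3]))  # 6

# A SQL-style aggregate instead accumulates across rows:
class SumAgg:
    def __init__(self):
        self.total = 0

    def step(self, value):    # called once per input row
        self.total += value

    def result(self):         # called once per group, at the end
        return self.total

agg = SumAgg()
for value in [1, 2, 3]:
    agg.step(value)
print(agg.result())  # 6
```

Since both would naturally be spelled `sum` in a query, the `_agg` suffix disambiguates which semantics you want.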

I have included a PARTIALS modifier that simplifies window/analytical functions based on the premise that the window is defined by the order of the input. It makes it much easier to write running sums, differences between consecutive rows, etc. This is also not standard SQL. It is also super useful for stats on streaming data.
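In plain Python, PARTIALS-style running stats over the input order look roughly like this (illustration of the concept only, not SPyQL syntax):

```python
from itertools import accumulate

prices = [10, 12, 11, 15]

# Running sum in input order (what an order-of-input window gives you):
print(list(accumulate(prices)))  # [10, 22, 33, 48]

# Difference between consecutive rows:
print([b - a for a, b in zip(prices, prices[1:])])  # [2, -1, 4]
```

Because these only need the previous partial state, they work on unbounded streams as well as files.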

There's a lot of room for debate... I am super happy to hear feedback from you! Great constructive feedback, I'd love to chat more :-) Thanks


I took a look to see how this works, expecting a super-complex parser/compiler - and it's mainly some straightforward regular expression replacements! https://github.com/dcmoura/spyql/blob/4b5ea3ff52875d2c1a544f...

I'm very impressed - this is some very neat pragmatic software design.


I am just very lazy! :-)


Nice one! I like the SQL extension to denote the output format ("TO json"), great idea! It's also interesting that you allow calling python methods on values, really neat!

If you're interested in tools like this, also make sure to check out OctoSQL[0], which has an extensible plugin architecture and lets you work with both files (JSON, CSV, ...) and databases (PostgreSQL, MySQL, ...) - even use all of them in a single query. It also supports working with streaming data sources, with Kafka available soon.

Other than that, check out all the other notable tools in this area: datafusion, datasette, dsq, q, textql.

Disclaimer: I'm the author of OctoSQL.


I know OctoSQL and found the project super interesting, especially as you can merge different data sources. SPyQL is very single-stream oriented, much in line with command-line tools (e.g. awk, grep, sed). I should soon be implementing a left-join use case based on dictionary lookups, but I am not planning to implement more complicated joins in the near term; you have OctoSQL for that :-)



Interesting hybrid.

Today I came across yet another Twitter thread asking whether you are a SQL-ista or a Pythonista: https://twitter.com/bennstancil/status/1486031369327292421

One response probably summed it best: “SQL is written for you to think in shapes while python is written for you to think in steps. SQL lets you focus on defining a shape, and the warehouse figures out how to compute it. That's a better flow & focus for data transformation work.”


Agree! There’s a lot of code you don’t see behind a simple SELECT statement. I would also say that when writing SQL you start by thinking about the end result, describing what you want to get and not so much how you want it done. This is also part of the declarative nature of SQL.


Interesting concept.

It's always intriguing to me that no major database vendors have added support for Python, Javascript, Lua, et al. for their expression and stored-procedure definition languages. It would really make things more accessible than Transact-SQL[1], Oracle's PL/SQL[2] or DB2's messy external procedure support.[3]

[1]: https://docs.microsoft.com/en-us/sql/t-sql/statements/create...

[2]: https://docs.oracle.com/cd/B19306_01/server.102/b14200/state...

[3]: https://www.ibm.com/docs/en/db2/11.5?topic=statements-create...


BigQuery supports JavaScript UDFs [1]. But I bet Python isn’t popular in a database for performance reasons.

[1]: https://cloud.google.com/bigquery/docs/reference/standard-sq...


Redshift lets you create UDFs in Python (2.7), and via Lambda functions (for any language runtime supported by AWS Lambda).

https://docs.aws.amazon.com/redshift/latest/dg/user-defined-...


You may find this interesting: https://docs.oracle.com/en/graalvm/enterprise/21/docs/exampl...

Examples show Javascript execution.


I stand corrected! Thanks for passing that along. It's been a while since I worked with Oracle (the last was an 11g instance).


Postgres has supported Python for quite a while.


Yep, unfortunately due to security concerns it is not supported everywhere (DB as a service): `ERROR: Extension "plpythonu" is not supported by Amazon RDS`


Unfortunately the Postgres implementation is marked unsafe for a reason. There's not a sufficient sandbox in place to prevent shenanigans.


If you’re writing the code and sanitizing inputs you only have yourself to blame. Just like every other app ever written?


Postgres also supports Javascript


Oracle has had java support in the RDBMS for more than 20 years, probably more than 25 years...


MongoDB supports JavaScript


C# and a few other .NET languages have been usable in SQL Server for more than a decade: https://en.m.wikipedia.org/wiki/SQL_CLR


Yes, but similar to Python on Postgresql, it comes with a bunch of security warnings attached.

I should have clarified in my original post but I'm envisioning more of: database server has a parser that understands python/javascript/c#/etc but projects a subset of library functions to avoid security dangers.

Essentially, expose the same functionality stored procedures already have but in a more accessible syntax.

What vendors mostly seem to have opted for is embedding the full language parsers directly into the database servers, which yields a lot of power and a lot of risk.


I love this tool!

I use it every day to insert monitor data from Lunar (https://lunar.fyi) into The Monitor Database (https://db.lunar.fyi)

I also use it occasionally to insert data from my Meilisearch instances into Postgres/SQLite for more accurate filtering when needed.


Thank you! Made my day :-)



PINQ is exactly the kind of Python-to-SQL translation tool I've been wishing for, for a while. Thank you!


Interesting, I assume SPyQL uses eval internally? A couple of years ago I wrote [RBQL](https://rbql.org), which is based on the exact same principle and also supports JS expressions. I would be curious to compare design decisions in SPyQL and RBQL.


Yes, it leverages eval. I did not know about RBQL... I will check it out! Also curious here, there are a lot of non-trivial decisions :-)
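The usual trick with eval-based engines is to compile the query expressions once and then evaluate them per row, rather than re-parsing for every row (a generic sketch, not SPyQL's or RBQL's actual internals):

```python
# Parse/compile the SELECT expression once...
code = compile("price * quantity", "<select>", "eval")

# ...then evaluate it per row, with the row's columns as variables:
rows = [{"price": 2.0, "quantity": 3}, {"price": 5.0, "quantity": 1}]
print([eval(code, {}, row) for row in rows])  # [6.0, 5.0]
```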


Sounds awesome. Thanks for posting. It'd be great to be able to import this into a Jupyter notebook or script rather than doing it on the command line. Don't know how useful it is as a command line tool.


Thank you! One of my main goals was making data processing in the command line more accessible and intuitive. If you use a shell you can leverage an extensive array of tools. Please take a look at the recipes in the Readme. The shell is often underrated for data processing!

Right now you can use it in Jupyter notebooks using a shell kernel like: https://github.com/takluyver/bash_kernel

In the mid-term, developing a spyql kernel is appealing because of syntax highlighting, code autocompletion, and more. But unless several people show interest in this, I should tackle other features first.


Thanks for the reply. It's your tool mate and of course by all means work on the features you want. It's a really neat tool, there is no doubt. I'm sure it'll be helpful to a lot of people. I'm a data analyst and I often find myself writing SQL statements from Python using libraries such as sqlalchemy and psycopg2 etc. I've written a few rudimentary functions that connect to my data sources, run PGSQL code and fetch the results into a Pandas dataframe before doing further analysis. When I saw your tool I got excited and was trying to figure out how to integrate it into my notebooks. hahaha. Cheers.


:-) It’s an open-source tool, so it should be community-driven. If you can share an example I can try to understand if spyql can make your life easier.


I'm happy to see the stdin -> spyql -> stdout model. Plain text is great :)

Columnar storage and processing could give you some performance benefits. SPyQL seems like it could potentially work with the column text format, which a couple of students and I developed last year. https://cran.r-project.org/web/packages/ctf/vignettes/overvi...


Yes, column storage is great, especially when you only need a small subset of columns, as there are far fewer read operations. I have experience working with Parquet files, so I can see the benefits of CTFs, even if we don't take into account summary metadata and so forth.

In its current version, SPyQL only reads from stdin, so it could only read a single column (a single file), unless you had a utility merging the required columns beforehand (which could very well eliminate any performance gains of CTFs). I am planning to support reading from files/directories (as an argument) and in that case it's a matter of extending the processor class to deal with CTFs (column names = file names, etc). To take full advantage of CTFs we would need to only open/read the required columns/files, but that seems very doable.
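Assuming a CTF-like layout of one value-per-line text file per column (my assumption for illustration, not necessarily the actual spec), reading only the needed columns could look like:

```python
import io

def read_columns(column_files, wanted):
    """Sketch: stream rows from a columnar layout where each column is
    its own file-like object, touching only the requested columns."""
    handles = [column_files[name] for name in wanted]  # open only what we need
    for values in zip(*handles):                       # line i = row i's value
        yield {name: v.strip() for name, v in zip(wanted, values)}

files = {
    "price": io.StringIO("10\n20\n"),
    "quantity": io.StringIO("1\n3\n"),
    "comment": io.StringIO("a\nb\n"),  # never read
}
for row in read_columns(files, ["price", "quantity"]):
    print(row)
```

The `comment` column is never touched, which is where the I/O savings would come from.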

How often with CTFs would you have your column files (individually) compressed? What would be the most popular compression formats? Thanks!

BTW, nice R lib, I also love R :-)


Having Python expressions within a declarative language is a really good idea because we can combine low level logic of computations of values with high level logic of set processing.

A similar approach is implemented in the Prosto data processing toolkit:

https://github.com/asavinov/prosto

Although Prosto is viewed as an alternative to Map-Reduce by relying on functions, it also supports Python User-Defined Functions in its Column-SQL:

  prosto.column_sql(
    "CALCULATE  Sales(quantity, price) -> amount",
    lambda x: x["quantity"] * x["price"]
  )
In this Column-SQL statement we define a new calculated column whose values are computed in Python (as the product of two columns). An advantage is that we can process data in multiple tables without joins or group-bys, which is much easier than in the existing set-oriented approaches. Another advantage is that we can combine many statements by defining a workflow in an Excel-like manner.


Interesting! I will take a look


I shouldn't love this frankenstein hybrid, but I have to say, it looks cool.


This is cool. The environments at my workplace have some YAML files so big they cause the IDE to chug hard, so I think I might be able to find a real use case for this!


Thank you :-) You can use https://github.com/mikefarah/yq to convert yaml to json on the fly and pipe it into spyql. There’s an example in the Readme.


The linked demo video is fantastic! I especially loved the demo of SPyQL reading and plotting numpy data generated from the linspace command.


Thank you!


Nice work. The closest tool I wrote for www.convertcsv.io is CSV Query Tool.


Do you plan on adding capability to do more DDL and DML?


The tool has a flow: stdin -> spyql -> stdout. You can choose the data format of the input and output. One of the output formats writes INSERT statements to stdout, enabling you to import data into any SQL database. Currently, spyql does not have an option to generate a `CREATE TABLE` statement, but that is planned, as well as deciding what to do with already existing data/tables (e.g. drop, truncate, ignore).
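The INSERT-output idea can be sketched as a toy generator (hypothetical code, not SPyQL's actual writer; real output needs dialect-aware quoting and escaping):

```python
def to_inserts(table, rows):
    """Toy SQL INSERT writer: one statement per row dict."""
    for row in rows:
        cols = ", ".join(row)
        vals = ", ".join(
            "NULL" if v is None
            else str(v) if isinstance(v, (int, float))
            else "'" + str(v).replace("'", "''") + "'"
            for v in row.values()
        )
        yield f"INSERT INTO {table} ({cols}) VALUES ({vals});"

for stmt in to_inserts("purchases", [{"price": 10.0, "dept": "IT"}]):
    print(stmt)  # INSERT INTO purchases (price, dept) VALUES (10.0, 'IT');
```

Piping such statements into `psql` or `sqlite3` is what makes the stdout format database-agnostic.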

I wonder if UPDATEs and UPSERTs would also be useful, but I haven't put much thought into it; any feedback here would be welcome.



