
Good question. The regex I tried is for extracting amounts in EUR and USD:

  /
  (?<=^|[ \t])
  (?<currency_prefix_with_space>
    (?<currency_prefix>
      €|EUR|\$|USD
    )
    [ \t]?
  )?
  (?<number>
    (?<integral>
      -?
      \d{1,3}
      (?:[\.,]\d{3}|\d*)
    )
    [\.,]
    (?<fraction>
      \d{2,3}
    )
  )
  (?<currency_postfix_with_space>
    [ \t]?
    (?<postfix_or_ending>
      (?<currency_postfix>
        €|EUR|\$|USD
      )
    ) | (?<ending>
          [ \t]|$|\n
        )
  )/x
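
For anyone who wants to poke at this outside a PCRE-style engine, here is a rough Python port. It is a sketch, not the original: Python's `re` needs `(?P<name>...)` for named groups and rejects the variable-width lookbehind, so the leading `(?<=^|[ \t])` is rewritten as `(?:^|(?<=[ \t]))`, and the wrapper groups are flattened. The names `AMOUNT` and `extract_amounts` are made up for illustration.

```python
import re

# Adapted port of the pattern above (assumption: multiline input, so ^ and $
# are treated as line anchors via re.MULTILINE).
AMOUNT = re.compile(
    r"""
    (?:^|(?<=[ \t]))                      # start of line, or preceded by a blank
    (?:(?P<currency_prefix>€|EUR|\$|USD)[ \t]?)?
    (?P<number>
      (?P<integral>-?\d{1,3}(?:[.,]\d{3}|\d*))
      [.,]
      (?P<fraction>\d{2,3})
    )
    (?:
      [ \t]?(?P<currency_postfix>€|EUR|\$|USD)
      | (?P<ending>[ \t]|$|\n)
    )
    """,
    re.VERBOSE | re.MULTILINE,
)

def extract_amounts(text):
    """Return (currency, number) pairs; currency is None when absent."""
    results = []
    for m in AMOUNT.finditer(text):
        currency = m.group("currency_prefix") or m.group("currency_postfix")
        results.append((currency, m.group("number")))
    return results

print(extract_amounts("Total: EUR 1.234,56 and $99.95 today"))
```

As in the original, the integral part allows only one thousands group, so something like 1.234.567,89 is not covered by this port either.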


I'd imagine that many nested named capturing groups could trip up even the best automated system! I do like the solution though.

I would've probably approached it differently, trying to first get the 'inverted' match (i.e. ignore anything that isn't a currency-like pattern) and refine from there. A bit like this one I did a while back, to parse garbled strings that may occur after OCR [0]. I imagine the approach does not translate fully, because it's pattern extraction rather than validation.

[0]: https://observablehq.com/@dleeftink/never-go-nuts
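
A minimal sketch of what the 'inverted' idea could look like in Python, under heavy assumptions: first throw away every character that cannot be part of an amount, then refine the surviving tokens with a stricter check. This is not the code from the linked notebook; the names `NOT_AMOUNT_CHARS`, `LOOKS_LIKE_AMOUNT` and `candidate_amounts` are hypothetical, and it only handles the symbol forms (€ and $), not the EUR/USD words.

```python
import re

# Step 1 (invert): any run of characters that cannot appear in an amount
# becomes a single separator.
NOT_AMOUNT_CHARS = re.compile(r"[^0-9.,€$\-]+")

# Step 2 (refine): keep only tokens that end in a separator plus 2-3 digits,
# with an optional currency symbol on either side.
LOOKS_LIKE_AMOUNT = re.compile(r"[€$]?-?\d[\d.,]*[.,]\d{2,3}[€$]?")

def candidate_amounts(text):
    """Return the tokens that survive both the inversion and the refinement."""
    candidates = NOT_AMOUNT_CHARS.sub(" ", text).split()
    return [c for c in candidates if LOOKS_LIKE_AMOUNT.fullmatch(c)]

sample = "Invoice total: €1.234,56 (incl. VAT), paid $99.95 on 2024-01-05."
print(candidate_amounts(sample))
```

Stray punctuation and the date fall out in the second step, which is where most of the refining would happen in practice.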


Thanks for sharing! I have to admit I don't have the brain cycles to spare today, but OCR processing is indeed of interest to me, and I will take a more in-depth look in the coming days.

The idea of an exclusionary approach sounds interesting as well. I'll have to think about that a bit.


Check out WordNinja in case regex doesn't cut it! [0]

[0]: https://github.com/keredson/wordninja


Will do, thanks again!



