[Feature #21943] Add StringScanner#integer_at#193
[Feature #21943] Add StringScanner#integer_at#193jinroq wants to merge 6 commits intoruby:masterfrom
StringScanner#integer_at#193Conversation
Add a method that returns a captured substring as an Integer without creating an intermediate Ruby String object. This is designed to improve performance of Date._strptime by avoiding temporary String allocation when extracting integers from regex captures. - Accept Integer index (positive, negative, zero), Symbol, or String for named capture groups, consistent with StringScanner#[] - Use rb_int_parse_cstr when available for zero-allocation parsing, with rb_str_to_inum fallback for older Ruby versions - Raise ArgumentError for non-digit characters or empty captures - Return nil when unmatched or index out of range
Allow manual CI runs from the GitHub Actions tab.
| return new_ary; | ||
| } | ||
|
|
||
| #ifdef HAVE_RB_INT_PARSE_CSTR |
.github/workflows/ci.yml
Outdated
| on: | ||
| - push | ||
| - pull_request | ||
| - workflow_dispatch |
There was a problem hiding this comment.
| - workflow_dispatch |
| have_func("onig_region_memsize(NULL)") | ||
| have_func("rb_reg_onig_match", "ruby/re.h") | ||
| have_func("rb_deprecate_constant") | ||
| have_func("rb_int_parse_cstr") |
There was a problem hiding this comment.
strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?
There was a problem hiding this comment.
rb_int_parse_cstr has been available since Ruby 2.5.0. In Ruby 2.4, it is detected using have_func, and if it is not available, it falls back to rb_str_to_inum.
There was a problem hiding this comment.
OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?
| #ifdef HAVE_RB_INT_PARSE_CSTR | ||
| VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp, | ||
| size_t *ndigits, int base, int flags); | ||
| #define RB_INT_PARSE_SIGN 0x01 |
There was a problem hiding this comment.
If ruby/ruby#16322 is merged, this will report a duplicated definition warning.
There was a problem hiding this comment.
Can we omit rb_int_parse_cstr() prototype and RB_INT_PARSE_SIGN definition entirely when Ruby provides them?
ext/strscan/strscan.c
Outdated
| rb_define_method(StringScanner, "size", strscan_size, 0); | ||
| rb_define_method(StringScanner, "captures", strscan_captures, 0); | ||
| rb_define_method(StringScanner, "values_at", strscan_values_at, -1); | ||
| rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1); |
There was a problem hiding this comment.
| rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1); | |
| rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1); |
This reverts commit d615cb7.
…tion Add "09" and "010" cases to verify integer_at always uses base 10, unlike Integer() which interprets leading zeros as octal.
|
If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it. @kou Do you know why I think it would be better to expose the MatchData object than keep defining methods similar to MatchData but with slightly different names. I think it makes it harder to learn the StringScanner API (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods). Among all StringScanner instance methods: These are just doing the same on the MatchData: And these are MatchData methods which StringScanner doesn't have: |
Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943 as it would mean an extra MatchData allocation. Lines 57 to 58 in 3592c39 BTW the presense of |
| def test_integer_at_large_number | ||
| huge = '9' * 100 | ||
| s = create_string_scanner(huge) | ||
| s.scan(/(#{huge})/) |
There was a problem hiding this comment.
| s.scan(/(#{huge})/) | |
| s.scan(/(\d+)/) |
| end | ||
|
|
||
| def test_integer_at_leading_zeros | ||
| s = create_string_scanner("007") |
There was a problem hiding this comment.
007 is not a good data for this because 007 is valid both for base=10 and base=8. Do we need this test?
| # "09" would be invalid in octal, but integer_at always uses base 10 | ||
| s = create_string_scanner("09") | ||
| s.scan(/(\d+)/) | ||
| assert_equal(9, s.integer_at(1)) | ||
|
|
||
| # "010" is 8 in octal (Integer("010")), but 10 in base 10 | ||
| s = create_string_scanner("010") | ||
| s.scan(/(\d+)/) | ||
| assert_equal(10, s.integer_at(1)) |
There was a problem hiding this comment.
Do we need both of them? Can they to catch any different problem?
| have_func("onig_region_memsize(NULL)") | ||
| have_func("rb_reg_onig_match", "ruby/re.h") | ||
| have_func("rb_deprecate_constant") | ||
| have_func("rb_int_parse_cstr") |
There was a problem hiding this comment.
OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?
| long j = 0; | ||
| if (ptr[0] == '-' || ptr[0] == '+') j = 1; | ||
| if (j >= len) { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", | ||
| (int)len, ptr); | ||
| } | ||
| for (; j < len; j++) { | ||
| if (ptr[j] < '0' || ptr[j] > '9') { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", | ||
| (int)len, ptr); | ||
| } | ||
| } | ||
| return rb_str_to_inum(rb_str_new(ptr, len), 10, 0); |
| GET_SCANNER(self, p); | ||
| if (! MATCHED_P(p)) return Qnil; | ||
|
|
||
| switch (TYPE(idx)) { | ||
| case T_SYMBOL: | ||
| idx = rb_sym2str(idx); | ||
| /* fall through */ | ||
| case T_STRING: | ||
| RSTRING_GETMEM(idx, name, i); | ||
| i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx)); | ||
| break; | ||
| default: | ||
| i = NUM2LONG(idx); | ||
| } | ||
|
|
||
| if (i < 0) | ||
| i += p->regs.num_regs; | ||
| if (i < 0) return Qnil; | ||
| if (i >= p->regs.num_regs) return Qnil; | ||
| if (p->regs.beg[i] == -1) return Qnil; |
There was a problem hiding this comment.
You copied this from strscan_aref(), right? Can we share common code with strscan_aref() and strsacn_integer_at()?
| end = adjust_register_position(p, p->regs.end[i]); | ||
| len = end - beg; | ||
|
|
||
| if (len <= 0) { |
There was a problem hiding this comment.
Can we use == 0 here?
len may be negative?
| len = end - beg; | ||
|
|
||
| if (len <= 0) { | ||
| rb_raise(rb_eArgError, "empty capture for integer conversion"); |
There was a problem hiding this comment.
| rb_raise(rb_eArgError, "empty capture for integer conversion"); | |
| rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx); |
|
|
||
| if (endp != ptr + len) { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", |
There was a problem hiding this comment.
Is there any other reason on failure?
|
|
||
| if (endp != ptr + len) { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", |
There was a problem hiding this comment.
If the target string has a trailing space, it's difficult to find a problem. How about surround the target string something like the following?
| "non-digit character in capture: %.*s", | |
| "non-digit character in capture: <%.*s>", |
No. But if we create a
Yes. But it's before FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese) |
Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
Interesting, thank you for the link. |
Yes. But it's out-of-scope of this. |
Due to a change in branch, I have taken over #192.
see: https://bugs.ruby-lang.org/issues/21943
relation: https://bugs.ruby-lang.org/issues/21932