Skip to content

[Feature #21943] Add StringScanner#integer_at#193

Open
jinroq wants to merge 6 commits intoruby:masterfrom
jinroq:add_integer_at
Open

[Feature #21943] Add StringScanner#integer_at#193
jinroq wants to merge 6 commits intoruby:masterfrom
jinroq:add_integer_at

Conversation

@jinroq
Copy link

@jinroq jinroq commented Mar 6, 2026

Due to a change in branch, I have taken over #192.

see: https://bugs.ruby-lang.org/issues/21943

relation: https://bugs.ruby-lang.org/issues/21932

jinroq added 2 commits March 6, 2026 16:14
Add a method that returns a captured substring as an Integer without
creating an intermediate Ruby String object. This is designed to
improve performance of Date._strptime by avoiding temporary String
allocation when extracting integers from regex captures.

- Accept Integer index (positive, negative, zero), Symbol, or String
  for named capture groups, consistent with StringScanner#[]
- Use rb_int_parse_cstr when available for zero-allocation parsing,
  with rb_str_to_inum fallback for older Ruby versions
- Raise ArgumentError for non-digit characters or empty captures
- Return nil when unmatched or index out of range
Allow manual CI runs from the GitHub Actions tab.
return new_ary;
}

#ifdef HAVE_RB_INT_PARSE_CSTR
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on:
- push
- pull_request
- workflow_dispatch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- workflow_dispatch

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9e0d504 fixed it.

have_func("onig_region_memsize(NULL)")
have_func("rb_reg_onig_match", "ruby/re.h")
have_func("rb_deprecate_constant")
have_func("rb_int_parse_cstr")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rb_int_parse_cstr has been available since Ruby 2.5.0. In Ruby 2.4, it is detected using have_func, and if it is not available, it falls back to rb_str_to_inum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

#ifdef HAVE_RB_INT_PARSE_CSTR
VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
size_t *ndigits, int base, int flags);
#define RB_INT_PARSE_SIGN 0x01
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ruby/ruby#16322 is merged, this will report a duplicated definition warning.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1630df8 fixed it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we omit rb_int_parse_cstr() prototype and RB_INT_PARSE_SIGN definition entirely when Ruby provides them?

rb_define_method(StringScanner, "size", strscan_size, 0);
rb_define_method(StringScanner, "captures", strscan_captures, 0);
rb_define_method(StringScanner, "values_at", strscan_values_at, -1);
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1630df8 fixed it.

@jinroq jinroq requested a review from kou March 6, 2026 16:43
@eregon
Copy link
Member

eregon commented Mar 6, 2026

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

@kou Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?
In fact on TruffleRuby StringScanner uses MatchData objects internally.

I think it would be better to expose the MatchData object than keep defining methods similar to MatchData but with slightly different names. I think it makes it harder to learn the StringScanner API (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods).

Among all StringScanner instance methods:

  <<, [], beginning_of_line?, captures, charpos, check, check_until,
  concat, eos?, exist?, fixed_anchor?, get_byte, getch, initialize_copy,
  inspect, match?, matched, matched?, matched_size, named_captures,
  peek, peek_byte, pointer, pointer=, pos, pos=, post_match, pre_match,
  reset, rest, rest_size, scan, scan_byte, scan_integer, scan_until,
  size, skip, skip_until, string, string=, terminate, unscan, values_at

These are just doing the same on the MatchData:

  [], captures
  matched, matched?,
  matched_size (same as `byteend(0) - bytebegin(0)`, named_captures
  post_match, pre_match
  size, string, values_at

And these are MatchData methods which StringScanner doesn't have:

  begin, bytebegin, byteend, byteoffset, deconstruct,
  deconstruct_keys, end, length, match,
  match_length, names, offset,
  regexp, to_a

@eregon
Copy link
Member

eregon commented Mar 6, 2026

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943 as it would mean an extra MatchData allocation.
The strscan extension seems to save the matched captures but not a MatchData object:

/* the regexp register; legal only when MATCHED_P(s) */
struct re_registers regs;

BTW the presense of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

def test_integer_at_large_number
huge = '9' * 100
s = create_string_scanner(huge)
s.scan(/(#{huge})/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.scan(/(#{huge})/)
s.scan(/(\d+)/)

end

def test_integer_at_leading_zeros
s = create_string_scanner("007")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

007 is not a good data for this because 007 is valid both for base=10 and base=8. Do we need this test?

Comment on lines +1070 to +1078
# "09" would be invalid in octal, but integer_at always uses base 10
s = create_string_scanner("09")
s.scan(/(\d+)/)
assert_equal(9, s.integer_at(1))

# "010" is 8 in octal (Integer("010")), but 10 in base 10
s = create_string_scanner("010")
s.scan(/(\d+)/)
assert_equal(10, s.integer_at(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both of them? Can they to catch any different problem?

have_func("onig_region_memsize(NULL)")
have_func("rb_reg_onig_match", "ruby/re.h")
have_func("rb_deprecate_constant")
have_func("rb_int_parse_cstr")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

Comment on lines +1944 to +1958
long j = 0;
if (ptr[0] == '-' || ptr[0] == '+') j = 1;
if (j >= len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
(int)len, ptr);
}
for (; j < len; j++) {
if (ptr[j] < '0' || ptr[j] > '9') {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
(int)len, ptr);
}
}
return rb_str_to_inum(rb_str_new(ptr, len), 10, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this accept 1_234?
See also: #192 (comment)

Comment on lines +1894 to +1913
GET_SCANNER(self, p);
if (! MATCHED_P(p)) return Qnil;

switch (TYPE(idx)) {
case T_SYMBOL:
idx = rb_sym2str(idx);
/* fall through */
case T_STRING:
RSTRING_GETMEM(idx, name, i);
i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx));
break;
default:
i = NUM2LONG(idx);
}

if (i < 0)
i += p->regs.num_regs;
if (i < 0) return Qnil;
if (i >= p->regs.num_regs) return Qnil;
if (p->regs.beg[i] == -1) return Qnil;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You copied this from strscan_aref(), right? Can we share common code with strscan_aref() and strsacn_integer_at()?

end = adjust_register_position(p, p->regs.end[i]);
len = end - beg;

if (len <= 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use == 0 here?
len may be negative?

len = end - beg;

if (len <= 0) {
rb_raise(rb_eArgError, "empty capture for integer conversion");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_raise(rb_eArgError, "empty capture for integer conversion");
rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx);


if (endp != ptr + len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other reason on failure?


if (endp != ptr + len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the target string has a trailing space, it's difficult to find a problem. How about surround the target string something like the following?

Suggested change
"non-digit character in capture: %.*s",
"non-digit character in capture: <%.*s>",

@kou
Copy link
Member

kou commented Mar 7, 2026

Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?

No. But if we create a MatchData, it causes performance overhead, right? (StringScanner doesn't use MatchData internally.) It'll reduce a merit of this optimization.

BTW the presense of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

Yes. But it's before StringScanner was imported to Ruby itself.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

@eregon
Copy link
Member

eregon commented Mar 7, 2026

No. But if we create a MatchData, it causes performance overhead, right?

Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

Interesting, thank you for the link.

@kou
Copy link
Member

kou commented Mar 7, 2026

StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

Yes. But it's out-of-scope of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants