[Feature #21943] Add `StringScanner#integer_at` by jinroq · Pull Request #193 · ruby/strscan

jinroq · 2026-03-06T08:28:39Z

This PR takes over #192 because the branch changed.

see: https://bugs.ruby-lang.org/issues/21943

relation: https://bugs.ruby-lang.org/issues/21932

Add a method that returns a captured substring as an Integer without creating an intermediate Ruby String object. This is designed to improve performance of Date._strptime by avoiding temporary String allocation when extracting integers from regex captures.

- Accept Integer index (positive, negative, zero), Symbol, or String for named capture groups, consistent with StringScanner#[]
- Use rb_int_parse_cstr when available for zero-allocation parsing, with rb_str_to_inum fallback for older Ruby versions
- Raise ArgumentError for non-digit characters or empty captures
- Return nil when unmatched or index out of range
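The behavior described above can be approximated in pure Ruby. This is only a sketch of the semantics: `integer_at_sketch` is a hypothetical helper, not the C implementation in this PR, and unlike the proposed method it does allocate an intermediate String through `StringScanner#[]`. It also assumes an optional leading sign is accepted, as the fallback path quoted later in the review suggests.

```ruby
require "strscan"

# Hypothetical pure-Ruby reference for the described behavior.
# NOTE: scanner[index] allocates a String, which the C version avoids.
def integer_at_sketch(scanner, index)
  capture = scanner[index]   # nil when unmatched or index out of range
  return nil if capture.nil?
  raise ArgumentError, "specified capture is empty: #{index}" if capture.empty?
  unless capture.match?(/\A[+-]?[0-9]+\z/)
    raise ArgumentError, "non-digit character in capture: <#{capture}>"
  end
  Integer(capture, 10)       # always base 10, so "010" parses as 10
end

scanner = StringScanner.new("2026-03-06")
scanner.scan(/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)/)

year  = integer_at_sketch(scanner, :year)  # named capture via Symbol
month = integer_at_sketch(scanner, 2)      # positive index
day   = integer_at_sketch(scanner, -1)     # negative index counts from the end
```

The three index styles mirror what `StringScanner#[]` already accepts, which is why the sketch can delegate the lookup to it.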

Allow manual CI runs from the GitHub Actions tab.

jinroq · 2026-03-06T08:29:47Z

ext/strscan/strscan.c

    return new_ary;
 }

+#ifdef HAVE_RB_INT_PARSE_CSTR


see: ruby/ruby#16322

kou · 2026-03-06T08:59:25Z

.github/workflows/ci.yml

 on:
 - push
 - pull_request
+- workflow_dispatch


Suggested change

- workflow_dispatch

9e0d504 fixed it.

kou · 2026-03-06T09:03:39Z

ext/strscan/extconf.rb

  have_func("onig_region_memsize(NULL)")
  have_func("rb_reg_onig_match", "ruby/re.h")
  have_func("rb_deprecate_constant")
+  have_func("rb_int_parse_cstr")


strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?

rb_int_parse_cstr has been available since Ruby 2.5.0. Its availability is detected with have_func; on Ruby 2.4, where it is not available, the code falls back to rb_str_to_inum.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

kou · 2026-03-06T09:04:22Z

ext/strscan/strscan.c

+#ifdef HAVE_RB_INT_PARSE_CSTR
+VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
+                        size_t *ndigits, int base, int flags);
+#define RB_INT_PARSE_SIGN 0x01


If ruby/ruby#16322 is merged, this will cause a duplicate definition warning.

1630df8 fixed it.

Can we omit the rb_int_parse_cstr() prototype and the RB_INT_PARSE_SIGN definition entirely when Ruby provides them?

kou · 2026-03-06T09:05:24Z

ext/strscan/strscan.c

    rb_define_method(StringScanner, "size",        strscan_size,        0);
    rb_define_method(StringScanner, "captures",    strscan_captures,    0);
    rb_define_method(StringScanner, "values_at",   strscan_values_at,  -1);
+    rb_define_method(StringScanner, "integer_at",     strscan_integer_at,     1);


Suggested change

rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);

rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);

1630df8 fixed it.


eregon · 2026-03-06T18:14:07Z

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

@kou Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?
In fact on TruffleRuby StringScanner uses MatchData objects internally.

I think it would be better to expose the MatchData object than keep defining methods similar to MatchData but with slightly different names. I think it makes it harder to learn the StringScanner API (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods).

Among all StringScanner instance methods:

  <<, [], beginning_of_line?, captures, charpos, check, check_until,
  concat, eos?, exist?, fixed_anchor?, get_byte, getch, initialize_copy,
  inspect, match?, matched, matched?, matched_size, named_captures,
  peek, peek_byte, pointer, pointer=, pos, pos=, post_match, pre_match,
  reset, rest, rest_size, scan, scan_byte, scan_integer, scan_until,
  size, skip, skip_until, string, string=, terminate, unscan, values_at

These just do what the equivalent MatchData operation does:

  [], captures
  matched, matched?,
  matched_size (same as `byteend(0) - bytebegin(0)`), named_captures
  post_match, pre_match
  size, string, values_at

And these are MatchData methods which StringScanner doesn't have:

  begin, bytebegin, byteend, byteoffset, deconstruct,
  deconstruct_keys, end, length, match,
  match_length, names, offset,
  regexp, to_a
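A rough way to reproduce this kind of comparison is to intersect the method tables of the two classes. The sketch below only compares method names, so it will also report names that happen to coincide without having the same meaning, and it will not pair up differently named equivalents such as `matched_size`. The exact lists depend on the Ruby and strscan versions in use, so no output is shown.

```ruby
require "strscan"

# Instance methods defined directly on each class (no inherited ones).
scanner_methods = StringScanner.instance_methods(false)
match_methods   = MatchData.instance_methods(false)

# Names present on both classes, and names only MatchData provides.
shared        = (scanner_methods & match_methods).sort
only_in_match = (match_methods - scanner_methods).sort

puts "shared with MatchData: #{shared.join(', ')}"
puts "MatchData only:        #{only_in_match.join(', ')}"
```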

eregon · 2026-03-06T18:54:23Z

> If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943 as it would mean an extra MatchData allocation.
The strscan extension seems to save the matched captures but not a MatchData object:

strscan/ext/strscan/strscan.c

Lines 57 to 58 in 3592c39

    
    /* the regexp register; legal only when MATCHED_P(s) */
    struct re_registers regs;

BTW the presence of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

kou · 2026-03-07T05:14:49Z

test/strscan/test_stringscanner.rb

+  def test_integer_at_large_number
+    huge = '9' * 100
+    s = create_string_scanner(huge)
+    s.scan(/(#{huge})/)


Suggested change

s.scan(/(#{huge})/)

s.scan(/(\d+)/)
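For this input the suggested `/(\d+)/` pattern captures the same text as interpolating the 100-digit literal into the regexp. A minimal sketch, using `Integer(scanner[1], 10)` as a stand-in for the proposed `integer_at(1)` (which is not available in released strscan versions):

```ruby
require "strscan"

huge = "9" * 100
scanner = StringScanner.new(huge)
scanner.scan(/(\d+)/)

# Stand-in for integer_at(1); this allocates a String, unlike the proposal.
value = Integer(scanner[1], 10)
```

The result is an arbitrary-precision Integer equal to `10**100 - 1`, which is the property the large-number test is meant to exercise.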

kou · 2026-03-07T05:17:22Z

test/strscan/test_stringscanner.rb

+  end
+
+  def test_integer_at_leading_zeros
+    s = create_string_scanner("007")


007 is not good test data for this because 007 is valid in both base 10 and base 8. Do we need this test?

kou · 2026-03-07T05:18:34Z

test/strscan/test_stringscanner.rb

+    # "09" would be invalid in octal, but integer_at always uses base 10
+    s = create_string_scanner("09")
+    s.scan(/(\d+)/)
+    assert_equal(9, s.integer_at(1))
+
+    # "010" is 8 in octal (Integer("010")), but 10 in base 10
+    s = create_string_scanner("010")
+    s.scan(/(\d+)/)
+    assert_equal(10, s.integer_at(1))


Do we need both of them? Can they catch different problems?
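One possible reading of how the two inputs differ, shown with Ruby's own `Integer()` rather than the C code under review: a parser that auto-detects the base fails loudly on "09" but silently returns a different value for "010". Whether that difference justifies keeping both tests is the open question above.

```ruby
# With an explicit base of 10, leading zeros are just digits.
decimal_09  = Integer("09", 10)
decimal_010 = Integer("010", 10)

# Without a base, a leading zero selects octal.
auto_010 = Integer("010")

# "09" is not valid octal, so base auto-detection raises instead.
auto_09_error =
  begin
    Integer("09")
    nil
  rescue ArgumentError => e
    e.class
  end
```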

kou · 2026-03-07T05:25:18Z

ext/strscan/extconf.rb

  have_func("onig_region_memsize(NULL)")
  have_func("rb_reg_onig_match", "ruby/re.h")
  have_func("rb_deprecate_constant")
+  have_func("rb_int_parse_cstr")


OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

kou · 2026-03-07T05:26:36Z

ext/strscan/strscan.c

+        long j = 0;
+        if (ptr[0] == '-' || ptr[0] == '+') j = 1;
+        if (j >= len) {
+            rb_raise(rb_eArgError,
+                     "non-digit character in capture: %.*s",
+                     (int)len, ptr);
+        }
+        for (; j < len; j++) {
+            if (ptr[j] < '0' || ptr[j] > '9') {
+                rb_raise(rb_eArgError,
+                         "non-digit character in capture: %.*s",
+                         (int)len, ptr);
+            }
+        }
+        return rb_str_to_inum(rb_str_new(ptr, len), 10, 0);


Does this accept 1_234?
See also: #192 (comment)
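For context on why the question matters: Ruby's own string-to-integer conversions accept underscores between digits. The sketch below shows that behavior next to a strict digit check similar in spirit to the loop quoted above, which tests each byte against '0'..'9' before converting. `STRICT_DIGITS` is a name made up for this example, and the sketch says nothing about what the `rb_int_parse_cstr` path does.

```ruby
# Ruby's built-in conversions tolerate single underscores between digits.
to_i_value    = "1_234".to_i
integer_value = Integer("1_234", 10)

# Hypothetical strict check: optional sign followed by ASCII digits only.
STRICT_DIGITS = /\A[+-]?[0-9]+\z/
strict_accepts_underscore = "1_234".match?(STRICT_DIGITS)
strict_accepts_plain      = "1234".match?(STRICT_DIGITS)
```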

kou · 2026-03-07T05:28:14Z

ext/strscan/strscan.c

+    GET_SCANNER(self, p);
+    if (! MATCHED_P(p))        return Qnil;
+
+    switch (TYPE(idx)) {
+        case T_SYMBOL:
+            idx = rb_sym2str(idx);
+            /* fall through */
+        case T_STRING:
+            RSTRING_GETMEM(idx, name, i);
+            i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx));
+            break;
+        default:
+            i = NUM2LONG(idx);
+    }
+
+    if (i < 0)
+        i += p->regs.num_regs;
+    if (i < 0)                 return Qnil;
+    if (i >= p->regs.num_regs) return Qnil;
+    if (p->regs.beg[i] == -1)  return Qnil;


You copied this from strscan_aref(), right? Can we share common code between strscan_aref() and strscan_integer_at()?
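The index handling that would be shared can be sketched in Ruby as a single helper. Everything here is hypothetical: `resolve_capture_index` and the `names` Hash are invented for illustration (the C code calls `name_to_backref_number` on the regexp), and the check for a group that did not participate in the match (`beg[i] == -1`) is not modeled.

```ruby
# Hypothetical helper mirroring the quoted index resolution.
# names is a made-up Hash standing in for the regexp's named-group table.
def resolve_capture_index(idx, num_regs, names = {})
  i =
    case idx
    when Symbol, String
      names.fetch(idx.to_s) do
        raise IndexError, "undefined group name reference: #{idx}"
      end
    else
      Integer(idx)
    end

  i += num_regs if i < 0              # negative indexes count from the end
  return nil if i < 0 || i >= num_regs
  i
end

last_group   = resolve_capture_index(-1, 3)
out_of_range = resolve_capture_index(3, 3)
by_name      = resolve_capture_index(:year, 3, { "year" => 1 })
```

With a helper like this, both the String-returning and the Integer-returning method would only differ in what they do with the resolved register.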

kou · 2026-03-07T05:29:29Z

ext/strscan/strscan.c

+    end = adjust_register_position(p, p->regs.end[i]);
+    len = end - beg;
+
+    if (len <= 0) {


Can we use == 0 here?
Can len be negative?

kou · 2026-03-07T05:36:29Z

ext/strscan/strscan.c

+    len = end - beg;
+
+    if (len <= 0) {
+        rb_raise(rb_eArgError, "empty capture for integer conversion");


Suggested change

rb_raise(rb_eArgError, "empty capture for integer conversion");

rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx);

kou · 2026-03-07T05:37:07Z

ext/strscan/strscan.c

+
+        if (endp != ptr + len) {
+            rb_raise(rb_eArgError,
+                     "non-digit character in capture: %.*s",


Is there any other possible reason for failure?

kou · 2026-03-07T05:38:15Z

ext/strscan/strscan.c

+
+        if (endp != ptr + len) {
+            rb_raise(rb_eArgError,
+                     "non-digit character in capture: %.*s",


If the target string has trailing whitespace, the problem is difficult to spot. How about surrounding the target string with delimiters, something like the following?

Suggested change

"non-digit character in capture: %.*s",

"non-digit character in capture: <%.*s>",

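The effect of the delimiters is easy to see with a capture that ends in a space. This is a Ruby illustration of the message format only, not the C `rb_raise` call:

```ruby
capture = "12 "

# Without delimiters the trailing space is invisible at the end of the message.
plain = format("non-digit character in capture: %s", capture)

# With delimiters the extent of the captured text is unambiguous.
delimited = format("non-digit character in capture: <%s>", capture)

puts plain
puts delimited
```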
kou · 2026-03-07T05:44:52Z

> Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?

No. But creating a MatchData would add performance overhead, right? (StringScanner doesn't use MatchData internally.) That would reduce the benefit of this optimization.

> BTW the presence of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

Yes, but that was before StringScanner was imported into Ruby itself.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

eregon · 2026-03-07T09:49:18Z

> No. But if we create a MatchData, it causes performance overhead, right?

Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

> FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

Interesting, thank you for the link.

kou · 2026-03-07T23:00:06Z

> StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

Yes, but that is out of scope for this PR.

jinroq added 2 commits March 6, 2026 16:14

Add workflow_dispatch trigger to CI workflow

d615cb7

Allow manual CI runs from the GitHub Actions tab.


jinroq added 4 commits March 7, 2026 00:13

Revert "Add workflow_dispatch trigger to CI workflow"

9e0d504

This reverts commit d615cb7.

Guard RB_INT_PARSE_SIGN macro with #ifndef to avoid redefinition warning

1630df8

Align integer_at method registration formatting with surrounding lines

7659fbd

Add leading zero tests that distinguish decimal from octal interpreta…

08a7405

…tion Add "09" and "010" cases to verify integer_at always uses base 10, unlike Integer() which interprets leading zeros as octal.

jinroq requested a review from kou March 6, 2026 16:43


eregon mentioned this pull request Mar 9, 2026

Implement StringScanner for TruffleRuby in pure Ruby #195

Open
