[Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly by jinroq · Pull Request #192 · ruby/strscan

jinroq · 2026-03-05T04:09:50Z

see: https://bugs.ruby-lang.org/issues/21943

relation: https://bugs.ruby-lang.org/issues/21932

This avoids intermediate String allocation when only the integer value of a captured group is needed (e.g. date parsing). The method converts the matched byte range to an Integer at the C level using rb_cstr2inum, which is semantically equivalent to `scanner[n].to_i` but without allocating a temporary String object.

kou

Can we use better name than get_int?
We already have #get_byte that returns the next byte not the data from the matched group.

kou · 2026-03-05T05:32:30Z

ext/strscan/strscan.c

+    ptr = S_PBEG(p) + beg;
+
+    {
+        char *buffer = RB_ALLOCV_N(char, buffer_v, len + 1);


Is it OK that we allocate C string?

@kou

Thank you for your feedback!

The original implementation did allocate a C string via RB_ALLOCV_N, which undermines the purpose of avoiding intermediate allocations. While RB_ALLOCV_N uses alloca (stack allocation) for small sizes and is much cheaper than a Ruby String, it is still an unnecessary allocation for the common case.

Updated in 4600b0a. For values up to 18 digits on 64-bit (9 on 32-bit), digits are now parsed directly from the source string's byte buffer with no allocation at all — neither Ruby String nor C string. The RB_ALLOCV_N + rb_cstr2inum path is only used as a fallback for bignum-range values (more than 18 digits).

In the primary use case of date component parsing (1-4 digit values), this means zero allocation of any kind.

You can use rb_int_parse_cstr.

@nobu

b12e653 fixed it.

Sorry, forgotten that is declared just in an internal header.
Maybe we should move them to a public header?

@nobu
If we've had no issues with an internal header up until now, this change alone isn't enough to justify making it a public header.

@kou
That's my opinion, but what do you think?

If we need to use rb_int_parse_cstr() in out side of Ruby (strscan), we should make it public.

Parse digits directly from the source string's byte buffer for values that fit in a long (up to 18 digits on 64-bit, 9 on 32-bit), avoiding the RB_ALLOCV_N + rb_cstr2inum overhead entirely. Fall back to rb_cstr2inum with a temporary buffer only for bignum-range values. This eliminates all allocation (both Ruby String and C buffer) in the primary use case of date component parsing (1-4 digit values).

jinroq · 2026-03-05T06:28:31Z

@kou

Can we use better name than get_int?
We already have #get_byte that returns the next byte not the data from the matched group.

Here are some alternatives to get_int:

integer_at(n)
- Symmetric with values_at. A singular form of integers_at, proposed during a Ruby Office Hour discussion.
capture_integer(n)
- Clearly retrieves from capture. Also symmetric with scan_integer (scan advances, capture retrieves from the previous match).
matched_integer(n)
- Consistent with matched / matched_size

I recommend integer_at.

kou

OK. Let's use integer_at(index).

Alternative idea: scanner[index, :integer]

kou · 2026-03-05T09:30:57Z

ext/strscan/strscan.c

+    {
+        long j = 0;
+        int negative = 0;
+        long digit_count;
+
+        if (ptr[0] == '-') { negative = 1; j = 1; }
+        else if (ptr[0] == '+') { j = 1; }
+
+        digit_count = len - j;
+
+        if (digit_count <= GET_INT_MAX_DIGITS) {
+            long result = 0;
+            for (; j < len; j++) {
+                result = result * 10 + (ptr[j] - '0');
+            }
+            return LONG2NUM(negative ? -result : result);
+        }
+    }


What does happen when matched string contains non numbers such as .?

It's a bug because it returns an unexpected value. 82b93ad fixes it.

The name "get_int" conflicts with the existing "get_byte" convention, which advances the scanner position. "integer_at" better reflects that this method reads from a capture group (like "values_at") rather than advancing the scanner.

Previously, non-digit characters in the captured substring produced silently wrong results (e.g. "1.5" returned 85). Now both the fast path and bignum path validate that all characters after an optional sign are digits, raising ArgumentError otherwise.

jinroq · 2026-03-05T10:33:43Z

@kou

OK. Let's use integer_at(index).

Thanks, I renamed it to integer_at in 46a2e1d.

On a different note, I was trying to rebase but it seems I ended up implementing it on the master branch. Can I remake the PR or should I go ahead with it?

Replace the hand-rolled fast path / slow path integer parsing with a single call to rb_int_parse_cstr, which accepts a length parameter and parses directly from the source buffer without NUL-terminated copy. This removes the INTEGER_AT_MAX_DIGITS constant, the manual digit validation loops, and the RB_ALLOCV_N buffer allocation.

tompng · 2026-03-05T11:21:46Z

ext/strscan/strscan.c

+    end = adjust_register_position(p, p->regs.end[i]);
+    len = end - beg;
+
+    if (len <= 0) return INT2FIX(0);


Looks like this integer_at parses more strict than Integer(string) that accepts "1_234".
So I think empty string case should also raise error like Integer("") raises ArgumentError.

In any case, I think it's worth adding empty string matched test case

@tompng

Thank you for your feedback. 0948c72 fixed it.

Consistent with Integer("") raising ArgumentError, empty captures now also raise instead of silently returning 0.

Cover sign-only captures (+/-), signed numbers (-42/+42), leading zeros (007), and full match containing non-digit separators (e.g. "2024-06-15" at index 0).

kou · 2026-03-06T00:58:23Z

On a different note, I was trying to rebase but it seems I ended up implementing it on the master branch. Can I remake the PR or should I go ahead with it?

You can close this PR and re-open a new PR with a topic branch.

kou · 2026-03-06T01:00:07Z

ext/strscan/strscan.c

+/* rb_int_parse_cstr is declared in internal/bignum.h which is not
+ * available to extensions. Declare it here since the symbol is
+ * exported from libruby. */
+VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
+                        size_t *ndigits, int base, int flags);
+#define RB_INT_PARSE_SIGN 0x01
+


Could you move this from between the call-seq comment and strscan_integer_at() definition for RDoc?

kou · 2026-03-06T01:04:30Z

BTW, could you enable GitHub Actions on your fork to run CI in your side too.

* The problem is otherwise it would include PR workflows from a fork branch named "master", like ruby/strscan#192.

kou reviewed Mar 5, 2026

View reviewed changes

jinroq requested a review from kou March 5, 2026 07:00

kou reviewed Mar 5, 2026

View reviewed changes

jinroq added 2 commits March 5, 2026 19:13

Rename StringScanner#get_int to #integer_at

46a2e1d

The name "get_int" conflicts with the existing "get_byte" convention, which advances the scanner position. "integer_at" better reflects that this method reads from a capture group (like "values_at") rather than advancing the scanner.

jinroq changed the title ~~[Feature #21943] Add StringScanner#get_int to extract capture group as Integer directly~~ [Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly Mar 5, 2026

tompng reviewed Mar 5, 2026

View reviewed changes

jinroq added 2 commits March 5, 2026 23:14

Raise ArgumentError for empty capture in StringScanner#integer_at

0948c72

Consistent with Integer("") raising ArgumentError, empty captures now also raise instead of silently returning 0.

Add edge case tests for StringScanner#integer_at

8726c03

Cover sign-only captures (+/-), signed numbers (-42/+42), leading zeros (007), and full match containing non-digit separators (e.g. "2024-06-15" at index 0).

kou reviewed Mar 6, 2026

View reviewed changes

jinroq mentioned this pull request Mar 6, 2026

Expose rb_int_parse_cstr() as public API ruby/ruby#16322

Open

jinroq closed this Mar 6, 2026

jinroq reopened this Mar 6, 2026

jinroq closed this Mar 6, 2026

jinroq mentioned this pull request Mar 6, 2026

[Feature #21943] Add StringScanner#integer_at #193

Open

eregon added a commit to truffleruby/truffleruby-gem-tracker that referenced this pull request Mar 7, 2026

Only consider push workflows

e9dbf46

* The problem is otherwise it would include PR workflows from a fork branch named "master", like ruby/strscan#192.

Conversation

jinroq commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jinroq commented Mar 5, 2026

Uh oh!

kou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jinroq commented Mar 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kou commented Mar 6, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kou commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jinroq commented Mar 5, 2026 •

edited

Loading